Blame - Doc/library/re.rst - platform/external/python/cpython2

Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1
				2	:mod:`re` --- Regular expression operations
				3	===========================================
				4
				5	.. module:: re
				6	:synopsis: Regular expression operations.
				7	.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
				8	.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
				9
				10
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	11	This module provides regular expression matching operations similar to
				12	those found in Perl. Both patterns and strings to be searched can be
Georg Brandl	382edff	2009-03-31 15:43:20 +0000	[diff] [blame]	13	Unicode strings as well as 8-bit strings.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	14
				15	Regular expressions use the backslash character (``'\'``) to indicate
				16	special forms or to allow special characters to be used without invoking
				17	their special meaning. This collides with Python's usage of the same
				18	character for the same purpose in string literals; for example, to match
				19	a literal backslash, one might have to write ``'\\\\'`` as the pattern
				20	string, because the regular expression must be ``\\``, and each
				21	backslash must be expressed as ``\\`` inside a regular Python string
				22	literal.
				23
				24	The solution is to use Python's raw string notation for regular expression
				25	patterns; backslashes are not handled in any special way in a string literal
				26	prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
				27	``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	28	newline. Usually patterns will be expressed in Python code using this raw
				29	string notation.
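
As a minimal sketch of the difference described above (the variable names and the sample text are illustrative, not part of the module), the following compares raw and ordinary string literals and shows that both spellings of the backslash pattern compile to the same regular expression:

```python
import re

# A raw string keeps the backslash; an ordinary literal interprets the escape.
raw_len = len(r"\n")       # backslash followed by 'n'
cooked_len = len("\n")     # a single newline character

# Both patterns are the two-character regex \\ (match one literal backslash).
text = "C:\\temp"          # the 7-character string C:\temp
with_raw = re.search(r"\\", text)
without_raw = re.search("\\\\", text)
```

Both searches locate the same backslash; the raw form is simply easier to read.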
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	30
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	31	It is important to note that most regular expression operations are available as
				32	module-level functions and :class:`RegexObject` methods. The functions are
				33	shortcuts that don't require you to compile a regex object first, but miss some
				34	fine-tuning parameters.
				35
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	36
				37	.. _re-syntax:
				38
				39	Regular Expression Syntax
				40	-------------------------
				41
				42	A regular expression (or RE) specifies a set of strings that matches it; the
				43	functions in this module let you check if a particular string matches a given
				44	regular expression (or if a given regular expression matches a particular
				45	string, which comes down to the same thing).
				46
				47	Regular expressions can be concatenated to form new regular expressions; if A
				48	and B are both regular expressions, then AB is also a regular expression.
				49	In general, if a string p matches A and another string q matches B, the
				50	string pq will match AB. This holds unless A or B contain low-precedence
				51	operations, boundary conditions between A and B, or numbered group
				52	references. Thus, complex expressions can easily be constructed from simpler
				53	primitive expressions like the ones described here. For details of the theory
				54	and implementation of regular expressions, consult the Friedl book referenced
				55	below, or almost any textbook about compiler construction.
				56
				57	A brief explanation of the format of regular expressions follows. For further
Georg Brandl	1cf0522	2008-02-05 12:01:24 +0000	[diff] [blame]	58	information and a gentler presentation, consult the :ref:`regex-howto`.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	59
				60	Regular expressions can contain both special and ordinary characters. Most
				61	ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
				62	expressions; they simply match themselves. You can concatenate ordinary
				63	characters, so ``last`` matches the string ``'last'``. (In the rest of this
				64	section, we'll write RE's in ``this special style``, usually without quotes, and
				65	strings to be matched ``'in single quotes'``.)
				66
				67	Some characters, like ``'|'`` or ``'('``, are special. Special
				68	characters either stand for classes of ordinary characters, or affect
				69	how the regular expressions around them are interpreted. Regular
				70	expression pattern strings may not contain null bytes, but can specify
				71	the null byte using the ``\number`` notation, e.g., ``'\x00'``.
				72
Martin Panter	197332a	2016-10-15 01:18:16 +0000	[diff] [blame]	73	Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc) cannot be
				74	directly nested. This avoids ambiguity with the non-greedy modifier suffix
				75	``?``, and with other modifiers in other implementations. To apply a second
				76	repetition to an inner repetition, parentheses may be used. For example,
				77	the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
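
The following sketch illustrates the rule above. The ``\Z`` anchor is added here only so that the whole string must be consumed, and the variable names are illustrative:

```python
import re

# Wrapping the inner repetition in a non-capturing group allows a second one.
multiple_of_six = re.compile(r"(?:a{6})*\Z")
ok_12 = multiple_of_six.match("a" * 12) is not None
ok_7 = multiple_of_six.match("a" * 7) is not None

# Directly nesting the qualifiers is rejected when the pattern is compiled.
try:
    re.compile(r"a{6}*")
    nested_rejected = False
except re.error:
    nested_rejected = True
```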
				78
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	79
				80	The special characters are:
				81
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	82	``'.'``
				83	(Dot.) In the default mode, this matches any character except a newline. If
				84	the :const:`DOTALL` flag has been specified, this matches any character
				85	including a newline.
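
A short sketch of the effect of :const:`DOTALL` on ``'.'`` (the sample string is illustrative):

```python
import re

# By default '.' refuses to match the newline between 'a' and 'b'.
no_flag = re.match(r"a.b", "a\nb")

# With DOTALL the same pattern matches across the newline.
with_flag = re.match(r"a.b", "a\nb", re.DOTALL)
```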
				86
				87	``'^'``
				88	(Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
				89	matches immediately after each newline.
				90
				91	``'$'``
				92	Matches the end of the string or just before the newline at the end of the
				93	string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
				94	matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
				95	only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
Amaury Forgeot d'Arc	d08a8eb	2008-01-10 21:59:42 +0000	[diff] [blame]	96	matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
				97	a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
				98	the newline, and one at the end of the string.
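
The cases described above can be checked with a sketch like this one (variable names are illustrative):

```python
import re

text = "foo1\nfoo2\n"

# Without MULTILINE, '$' only matches at (or just before the newline at)
# the end of the string, so the last line is found.
default_hit = re.search(r"foo.$", text).group()

# With MULTILINE, '$' also matches before every newline, so the first
# line is found instead.
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group()

# A lone '$' matches twice in 'foo\n': before the newline and at the end.
end_positions = [m.start() for m in re.finditer(r"$", "foo\n")]
```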
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	99
				100	``'*'``
				101	Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
				102	many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
				103	by any number of 'b's.
				104
				105	``'+'``
				106	Causes the resulting RE to match 1 or more repetitions of the preceding RE.
				107	``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
				108	match just 'a'.
				109
				110	``'?'``
				111	Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
				112	``ab?`` will match either 'a' or 'ab'.
				113
				114	``*?``, ``+?``, ``??``
				115	The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
				116	as much text as possible. Sometimes this behaviour isn't desired; if the RE
Georg Brandl	5892ab1	2016-04-12 07:51:41 +0200	[diff] [blame]	117	``<.*>`` is matched against ``<a> b <c>``, it will match the entire
				118	string, and not just ``<a>``. Adding ``?`` after the qualifier makes it
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	119	perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
Georg Brandl	5892ab1	2016-04-12 07:51:41 +0200	[diff] [blame]	120	characters as possible will be matched. Using the RE ``<.*?>`` will match
				121	only ``<a>``.
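
A brief sketch of the greedy and non-greedy behaviour described above:

```python
import re

text = "<a> b <c>"

# Greedy: '.*' runs to the last '>' in the string.
greedy = re.match(r"<.*>", text).group()

# Non-greedy: '.*?' stops at the first '>' that lets the pattern succeed.
minimal = re.match(r"<.*?>", text).group()
```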
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	122
				123	``{m}``
				124	Specifies that exactly m copies of the previous RE should be matched; fewer
				125	matches cause the entire RE not to match. For example, ``a{6}`` will match
				126	exactly six ``'a'`` characters, but not five.
				127
				128	``{m,n}``
				129	Causes the resulting RE to match from m to n repetitions of the preceding
				130	RE, attempting to match as many repetitions as possible. For example,
				131	``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting m specifies a
				132	lower bound of zero, and omitting n specifies an infinite upper bound. As an
				133	example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
				134	followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
				135	modifier would be confused with the previously described form.
				136
				137	``{m,n}?``
				138	Causes the resulting RE to match from m to n repetitions of the preceding
				139	RE, attempting to match as few repetitions as possible. This is the
				140	non-greedy version of the previous qualifier. For example, on the
				141	6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
				142	while ``a{3,5}?`` will only match 3 characters.
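
The bounded qualifiers can be compared side by side; this is a minimal sketch using the examples from the text:

```python
import re

# Greedy and non-greedy forms applied to six 'a' characters.
greedy = re.match(r"a{3,5}", "aaaaaa").group()
lazy = re.match(r"a{3,5}?", "aaaaaa").group()

# Omitting n gives an unbounded upper limit.
four_ok = re.match(r"a{4,}b", "aaaab") is not None
many_ok = re.match(r"a{4,}b", "a" * 1000 + "b") is not None
three_ok = re.match(r"a{4,}b", "aaab") is not None
```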
				143
				144	``'\'``
				145	Either escapes special characters (permitting you to match characters like
				146	``'*'``, ``'?'``, and so forth), or signals a special sequence; special
				147	sequences are discussed below.
				148
				149	If you're not using a raw string to express the pattern, remember that Python
				150	also uses the backslash as an escape sequence in string literals; if the escape
				151	sequence isn't recognized by Python's parser, the backslash and subsequent
				152	character are included in the resulting string. However, if Python would
				153	recognize the resulting sequence, the backslash should be repeated twice. This
				154	is complicated and hard to understand, so it's highly recommended that you use
				155	raw strings for all but the simplest expressions.
				156
				157	``[]``
Ezio Melotti	a195873	2011-10-20 19:31:08 +0300	[diff] [blame]	158	Used to indicate a set of characters. In a set:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	159
Ezio Melotti	a195873	2011-10-20 19:31:08 +0300	[diff] [blame]	160	* Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
				161	``'m'``, or ``'k'``.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	162
Ezio Melotti	a195873	2011-10-20 19:31:08 +0300	[diff] [blame]	163	* Ranges of characters can be indicated by giving two characters and separating
				164	them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
				165	``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
				166	``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
				167	``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
				168	it will match a literal ``'-'``.
				169
				170	* Special characters lose their special meaning inside sets. For example,
				171	``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
				172	``'*'``, or ``')'``.
				173
				174	* Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
				175	inside a set, although the characters they match depend on whether
				176	:const:`LOCALE` or :const:`UNICODE` mode is in force.
				177
				178	* Characters that are not within a range can be matched by :dfn:`complementing`
				179	the set. If the first character of the set is ``'^'``, all the characters
				180	that are not in the set will be matched. For example, ``[^5]`` will match
				181	any character except ``'5'``, and ``[^^]`` will match any character except
				182	``'^'``. ``^`` has no special meaning if it's not the first character in
				183	the set.
				184
				185	* To match a literal ``']'`` inside a set, precede it with a backslash, or
				186	place it at the beginning of the set. For example, both ``[()[\]{}]`` and
				187	``[]()[{}]`` will match a parenthesis.
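
The set rules listed above can be exercised with a sketch such as the following (the sample strings are illustrative):

```python
import re

# Individually listed characters.
listed = re.findall(r"[amk]", "make")

# Ranges: two-digit numbers from 00 to 59 (\Z forces a full match here).
in_range = re.match(r"[0-5][0-9]\Z", "59") is not None
out_of_range = re.match(r"[0-5][0-9]\Z", "60") is not None

# An escaped '-' is a literal hyphen rather than a range operator.
hyphen = re.findall(r"[a\-z]", "a-z b")

# Complemented set: everything except '5'.
not_five = re.findall(r"[^5]", "1525")

# A ']' placed first in the set is taken literally.
brackets = re.findall(r"[]()[{}]", "f(x)[0]")
```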
Mark Summerfield	700a635	2008-05-31 13:05:34 +0000	[diff] [blame]	188
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	189	``'|'``
				190	``A|B``, where A and B can be arbitrary REs, creates a regular expression that
				191	will match either A or B. An arbitrary number of REs can be separated by the
				192	``'|'`` in this way. This can be used inside groups (see below) as well. As
				193	the target string is scanned, REs separated by ``'|'`` are tried from left to
				194	right. When one pattern completely matches, that branch is accepted. This means
				195	that once ``A`` matches, ``B`` will not be tested further, even if it would
				196	produce a longer overall match. In other words, the ``'|'`` operator is never
				197	greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
				198	character class, as in ``[|]``.
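
A small sketch of the left-to-right behaviour of alternation and of the two ways to match a literal vertical bar:

```python
import re

# The first alternative that matches wins, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group()

# Reordering the alternatives changes the result.
longer_first = re.match(r"ab|a", "ab").group()

# A literal vertical bar: escaped, or inside a character class.
escaped_bar = re.findall(r"\|", "a|b")
class_bar = re.findall(r"[|]", "a|b")
```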
				199
				200	``(...)``
				201	Matches whatever regular expression is inside the parentheses, and indicates the
				202	start and end of a group; the contents of a group can be retrieved after a match
				203	has been performed, and can be matched later in the string with the ``\number``
				204	special sequence, described below. To match the literals ``'('`` or ``')'``,
				205	use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
				206
				207	``(?...)``
				208	This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
				209	otherwise). The first character after the ``'?'`` determines what the meaning
				210	and further syntax of the construct is. Extensions usually do not create a new
				211	group; ``(?P<name>...)`` is the only exception to this rule. Following are the
				212	currently supported extensions.
				213
				214	``(?iLmsux)``
				215	(One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
				216	``'u'``, ``'x'``.) The group matches the empty string; the letters
				217	set the corresponding flags: :const:`re.I` (ignore case),
				218	:const:`re.L` (locale dependent), :const:`re.M` (multi-line),
				219	:const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
				220	and :const:`re.X` (verbose), for the entire regular expression. (The
				221	flags are described in :ref:`contents-of-module-re`.) This
				222	is useful if you wish to include the flags as part of the regular
				223	expression, instead of passing a flag argument to the
Georg Brandl	74f8fc0	2009-07-26 13:36:39 +0000	[diff] [blame]	224	:func:`re.compile` function.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	225
				226	Note that the ``(?x)`` flag changes how the expression is parsed. It should be
				227	used first in the expression string, or after one or more whitespace characters.
				228	If there are non-whitespace characters before the flag, the results are
				229	undefined.
				230
				231	``(?:...)``
Georg Brandl	3b85b9b	2010-11-26 08:20:18 +0000	[diff] [blame]	232	A non-capturing version of regular parentheses. Matches whatever regular
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	233	expression is inside the parentheses, but the substring matched by the group
				234	cannot be retrieved after performing a match or referenced later in the
				235	pattern.
				236
				237	``(?P<name>...)``
				238	Similar to regular parentheses, but the substring matched by the group is
Georg Brandl	ddbdc9a	2013-10-06 12:08:14 +0200	[diff] [blame]	239	accessible via the symbolic group name name. Group names must be valid
				240	Python identifiers, and each group name must be defined only once within a
				241	regular expression. A symbolic group is also a numbered group, just as if
				242	the group were not named.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	243
Georg Brandl	ddbdc9a	2013-10-06 12:08:14 +0200	[diff] [blame]	244	Named groups can be referenced in three contexts. If the pattern is
				245	``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
				246	single or double quotes):
				247
				248	+---------------------------------------+----------------------------------+
				249	| Context of reference to group "quote" | Ways to reference it             |
				250	+=======================================+==================================+
				251	| in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
				252	|                                       | * ``\1``                         |
				253	+---------------------------------------+----------------------------------+
				254	| when processing match object ``m``    | * ``m.group('quote')``           |
				255	|                                       | * ``m.end('quote')`` (etc.)      |
				256	+---------------------------------------+----------------------------------+
				257	| in a string passed to the ``repl``    | * ``\g<quote>``                  |
				258	| argument of ``re.sub()``              | * ``\g<1>``                      |
				259	|                                       | * ``\1``                         |
				260	+---------------------------------------+----------------------------------+
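
The three contexts in the table can be tried out with a sketch like the following; the sample text and the replacement string are illustrative:

```python
import re

# Context 1: the group is referenced inside the pattern with (?P=quote).
pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
text = "say 'hello' now"

# Context 2: the group is read from the match object, by name or by number.
m = pattern.search(text)
whole = m.group()
by_name = m.group("quote")
by_number = m.group(1)
quote_end = m.end("quote")

# Context 3: the group is referenced in the repl argument of sub().
masked = pattern.sub(r"\g<quote>...\g<quote>", text)
```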
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	261
				262	``(?P=name)``
Georg Brandl	ddbdc9a	2013-10-06 12:08:14 +0200	[diff] [blame]	263	A backreference to a named group; it matches whatever text was matched by the
				264	earlier group named *name*.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	265
				266	``(?#...)``
				267	A comment; the contents of the parentheses are simply ignored.
				268
				269	``(?=...)``
				270	Matches if ``...`` matches next, but doesn't consume any of the string. This is
				271	called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
				272	``'Isaac '`` only if it's followed by ``'Asimov'``.
				273
				274	``(?!...)``
				275	Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
				276	For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's not
				277	followed by ``'Asimov'``.
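
A minimal sketch of both lookahead assertions, using the examples from the text (``'Isaac Newton'`` is an illustrative counter-example):

```python
import re

# Positive lookahead: 'Isaac ' is matched, 'Asimov' is checked but not consumed.
pos_hit = re.search(r"Isaac (?=Asimov)", "Isaac Asimov").group()
pos_miss = re.search(r"Isaac (?=Asimov)", "Isaac Newton")

# Negative lookahead: succeeds only when 'Asimov' does not follow.
neg_hit = re.search(r"Isaac (?!Asimov)", "Isaac Newton").group()
neg_miss = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")
```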
				278
				279	``(?<=...)``
				280	Matches if the current position in the string is preceded by a match for ``...``
				281	that ends at the current position. This is called a :dfn:`positive lookbehind
				282	assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
				283	lookbehind will back up 3 characters and check if the contained pattern matches.
				284	The contained pattern must only match strings of some fixed length, meaning that
Serhiy Storchaka	4809d1f	2015-02-21 12:08:36 +0200	[diff] [blame]	285	``abc`` or ``a\|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Group
				286	references are not supported even if they match strings of some fixed length.
				287	Note that
Ezio Melotti	1142773	2012-04-29 07:34:22 +0300	[diff] [blame]	288	patterns which start with positive lookbehind assertions will not match at the
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	289	beginning of the string being searched; you will most likely want to use the
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	290	:func:`search` function rather than the :func:`match` function:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	291
				292	>>> import re
				293	>>> m = re.search('(?<=abc)def', 'abcdef')
				294	>>> m.group(0)
				295	'def'
				296
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	297	This example looks for a word following a hyphen:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	298
				299	>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
				300	>>> m.group(0)
				301	'egg'
				302
				303	``(?<!...)``
				304	Matches if the current position in the string is not preceded by a match for
				305	``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
				306	positive lookbehind assertions, the contained pattern must only match strings of
Serhiy Storchaka	4809d1f	2015-02-21 12:08:36 +0200	[diff] [blame]	307	some fixed length and shouldn't contain group references.
				308	Patterns which start with negative lookbehind assertions may
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	309	match at the beginning of the string being searched.
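
As an illustrative sketch (the pattern and sample strings are assumptions chosen for the example), a negative lookbehind can skip numbers that are preceded by a minus sign, and it is still able to match at the very start of a string:

```python
import re

# Only '34' qualifies: '12' is preceded by '-', and the \b prevents a
# match from starting in the middle of '12'.
unsigned = re.findall(r"(?<!-)\b\d+", "-12 34")

# Nothing precedes position 0, so the assertion succeeds there.
at_start = re.match(r"(?<!-)\d+", "42").group()
```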
				310
				311	``(?(id/name)yes-pattern|no-pattern)``
				312	Will try to match with ``yes-pattern`` if the group with given id or name
				313	exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
				314	can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
				315	matching pattern, which will match with ``'<user@host.com>'`` as well as
				316	``'user@host.com'``, but not with ``'<user@host.com'``.
				317
				318	.. versionadded:: 2.4
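
The conditional pattern above can be tried as follows. This sketch uses :func:`match`, which anchors the attempt at the start of the string; the variable names are illustrative:

```python
import re

# Group 1 is the optional '<'; the closing '>' is required only if it matched.
email = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)")

bracketed = email.match("<user@host.com>").group()
bare = email.match("user@host.com").group()
unbalanced = email.match("<user@host.com")
```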
				319
				320	The special sequences consist of ``'\'`` and a character from the list below.
				321	If the ordinary character is not on the list, then the resulting RE will match
				322	the second character. For example, ``\$`` matches the character ``'$'``.
				323
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	324	``\number``
				325	Matches the contents of the group of the same number. Groups are numbered
				326	starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
Georg Brandl	980db0a	2013-10-06 12:58:20 +0200	[diff] [blame]	327	but not ``'thethe'`` (note the space after the group). This special sequence
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	328	can only be used to match one of the first 99 groups. If the first digit of
				329	number is 0, or number is 3 octal digits long, it will not be interpreted as
				330	a group match, but as the character with octal value number. Inside the
				331	``'['`` and ``']'`` of a character class, all numeric escapes are treated as
				332	characters.
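
A short sketch of a numbered backreference, using the examples from the text:

```python
import re

# \1 must repeat exactly what group 1 captured, after a single space.
repeated = re.compile(r"(.+) \1")

words = repeated.match("the the").group(1)
digits = repeated.match("55 55").group(1)
no_space = repeated.match("thethe")
```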
				333
				334	``\A``
				335	Matches only at the start of the string.
				336
				337	``\b``
				338	Matches the empty string, but only at the beginning or end of a word. A word is
				339	defined as a sequence of alphanumeric or underscore characters, so the end of a
				340	word is indicated by whitespace or a non-alphanumeric, non-underscore character.
Ezio Melotti	38ae5b2	2012-02-29 11:40:00 +0200	[diff] [blame]	341	Note that formally, ``\b`` is defined as the boundary between a ``\w`` and
				342	a ``\W`` character (or vice versa), or between ``\w`` and the beginning/end
				343	of the string, so the precise set of characters deemed to be alphanumeric
				344	depends on the values of the ``UNICODE`` and ``LOCALE`` flags.
				345	For example, ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
				346	``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
Georg Brandl	ddbdc9a	2013-10-06 12:08:14 +0200	[diff] [blame]	347	Inside a character range, ``\b`` represents the backspace character, for
				348	compatibility with Python's string literals.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	349
				350	``\B``
				351	Matches the empty string, but only when it is not at the beginning or end of a
Ezio Melotti	38ae5b2	2012-02-29 11:40:00 +0200	[diff] [blame]	352	word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``, ``'py2'``,
				353	but not ``'py'``, ``'py.'``, or ``'py!'``.
				354	``\B`` is just the opposite of ``\b``, so is also subject to the settings
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	355	of ``LOCALE`` and ``UNICODE``.
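
The word-boundary examples above can be verified with a sketch like this one (the helper lists are illustrative):

```python
import re

word = re.compile(r"\bfoo\b")
candidates = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]
# Keep only the strings in which 'foo' appears as a whole word.
whole_word = [s for s in candidates if word.search(s)]

inside = re.compile(r"py\B")
samples = ["python", "py3", "py2", "py", "py.", "py!"]
# Keep only the strings in which 'py' is followed by another word character.
not_at_end = [s for s in samples if inside.search(s)]

# Inside a character class, \b is the backspace character.
backspace = re.match(r"[\b]", "\x08") is not None
```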
				356
				357	``\d``
				358	When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
				359	is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
Mark Dickinson	fe67bd9	2009-07-28 20:35:03 +0000	[diff] [blame]	360	whatever is classified as a decimal digit in the Unicode character properties
				361	database.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	362
				363	``\D``
				364	When the :const:`UNICODE` flag is not specified, matches any non-digit
				365	character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
				366	will match any character not marked as a digit in the Unicode
				367	character properties database.
				368
				369	``\s``
Senthil Kumaran	dc0b324	2012-04-11 03:22:58 +0800	[diff] [blame]	370	When the :const:`UNICODE` flag is not specified, it matches any whitespace
				371	character; this is equivalent to the set ``[ \t\n\r\f\v]``. The
				372	:const:`LOCALE` flag has no extra effect on matching of the space.
				373	If :const:`UNICODE` is set, this will match the characters ``[ \t\n\r\f\v]``
				374	plus whatever is classified as space in the Unicode character properties
				375	database.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	376
				377	``\S``
Benjamin Peterson	72275ef	2014-11-25 14:54:45 -0600	[diff] [blame]	378	When the :const:`UNICODE` flag is not specified, matches any non-whitespace
Senthil Kumaran	dc0b324	2012-04-11 03:22:58 +0800	[diff] [blame]	379	character; this is equivalent to the set ``[^ \t\n\r\f\v]``. The
				380	:const:`LOCALE` flag has no extra effect on non-whitespace match. If
				381	:const:`UNICODE` is set, then any character not marked as space in the
				382	Unicode character properties database is matched.
				383
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	384
				385	``\w``
				386	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				387	any alphanumeric character and the underscore; this is equivalent to the set
				388	``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
				389	whatever characters are defined as alphanumeric for the current locale. If
				390	:const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
				391	is classified as alphanumeric in the Unicode character properties database.
				392
				393	``\W``
				394	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				395	any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
				396	With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
				397	not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
Zachary Ware	7ca2a90	2014-10-19 01:06:58 -0500	[diff] [blame]	398	this will match anything other than ``[0-9_]`` plus characters classified as
Senthil Kumaran	15b6f3f	2012-03-11 20:37:39 -0700	[diff] [blame]	399	not alphanumeric in the Unicode character properties database.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	400
				401	``\Z``
				402	Matches only at the end of the string.
				403
Senthil Kumaran	15b6f3f	2012-03-11 20:37:39 -0700	[diff] [blame]	404	If both the :const:`LOCALE` and :const:`UNICODE` flags are included for a
				405	particular sequence, then the :const:`LOCALE` flag takes effect first, followed
				406	by the :const:`UNICODE` flag.
				407
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	408	Most of the standard escapes supported by Python string literals are also
				409	accepted by the regular expression parser::
				410
				411	\a \b \f \n
				412	\r \t \v \x
				413	\\
				414
Ezio Melotti	48d886b	2012-04-29 04:46:34 +0300	[diff] [blame]	415	(Note that ``\b`` is used to represent word boundaries, and means "backspace"
				416	only inside character classes.)
				417
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	418	Octal escapes are included in a limited form: If the first digit is a 0, or if
				419	there are three octal digits, it is considered an octal escape. Otherwise, it is
				420	a group reference. As for string literals, octal escapes are always at most
				421	three digits in length.
				422
Georg Brandl	ae4ca79	2014-10-28 21:41:51 +0100	[diff] [blame]	423	.. seealso::
				424
				425	Mastering Regular Expressions
				426	Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
				427	second edition of the book no longer covers Python at all, but the first
				428	edition covered writing good regular expression patterns in great detail.
				429
				430
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	431
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	432	.. _contents-of-module-re:
				433
				434	Module Contents
				435	---------------
				436
				437	The module defines several functions, constants, and an exception. Some of the
				438	functions are simplified versions of the full featured methods for compiled
				439	regular expressions. Most non-trivial applications use the compiled
				440	form.
				441
				442
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	443	.. function:: compile(pattern, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	444
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	445	Compile a regular expression pattern into a regular expression object, which
Ezio Melotti	33b810d	2014-06-20 00:47:11 +0300	[diff] [blame]	446	can be used for matching using its :func:`~RegexObject.match` and
				447	:func:`~RegexObject.search` methods, described below.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	448
				449	The expression's behaviour can be modified by specifying a flags value.
				450	Values can be any of the following variables, combined using bitwise OR (the
				451	``|`` operator).
				452
				453	The sequence ::
				454
Gregory P. Smith	0261e5d	2009-03-02 04:53:24 +0000	[diff] [blame]	455	prog = re.compile(pattern)
				456	result = prog.match(string)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	457
				458	is equivalent to ::
				459
Gregory P. Smith	0261e5d	2009-03-02 04:53:24 +0000	[diff] [blame]	460	result = re.match(pattern, string)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	461
Georg Brandl	74f8fc0	2009-07-26 13:36:39 +0000	[diff] [blame]	462	but using :func:`re.compile` and saving the resulting regular expression
				463	object for reuse is more efficient when the expression will be used several
				464	times in a single program.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	465
Gregory P. Smith	0261e5d	2009-03-02 04:53:24 +0000	[diff] [blame]	466	.. note::
				467
				468	The compiled versions of the most recent patterns passed to
				469	:func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
				470	programs that use only a few regular expressions at a time needn't worry
				471	about compiling regular expressions.
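
A minimal sketch of the equivalence described above, plus flags combined with bitwise OR; the pattern and sample strings are illustrative:

```python
import re

pattern = r"\d+"
sample = "2024-01"

# Compile once and reuse the regular expression object...
prog = re.compile(pattern)
via_compiled = prog.match(sample).group()

# ...or call the module-level shortcut, which gives the same result.
via_module = re.match(pattern, sample).group()

# Flags are combined with the | operator.
b_words = re.compile(r"^b\w+", re.IGNORECASE | re.MULTILINE)
found = b_words.findall("apple\nBanana\nberry")
```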
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	472
				473
Sandro Tosi	e827c13	2012-01-01 12:52:24 +0100	[diff] [blame]	474	.. data:: DEBUG
				475
				476	Display debug information about the compiled expression.
				477
				478
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	479	.. data:: I
				480	IGNORECASE
				481
				482	Perform case-insensitive matching; expressions like ``[A-Z]`` will match
				483	lowercase letters, too. This is not affected by the current locale.
				484
				485
				486	.. data:: L
				487	LOCALE
				488
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	489	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
				490	current locale.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	491
				492
				493	.. data:: M
				494	MULTILINE
				495
				496	When specified, the pattern character ``'^'`` matches at the beginning of the
				497	string and at the beginning of each line (immediately following each newline);
				498	and the pattern character ``'$'`` matches at the end of the string and at the
				499	end of each line (immediately preceding each newline). By default, ``'^'``
				500	matches only at the beginning of the string, and ``'$'`` only at the end of the
				501	string and immediately before the newline (if any) at the end of the string.
				502
				503
				504	.. data:: S
				505	DOTALL
				506
				507	Make the ``'.'`` special character match any character at all, including a
				508	newline; without this flag, ``'.'`` will match anything except a newline.
				509
				510
				511	.. data:: U
				512	UNICODE
				513
				514	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
				515	on the Unicode character properties database.
				516
				517	.. versionadded:: 2.0
				518
				519
				520	.. data:: X
				521	VERBOSE
				522
Zachary Ware	77d61d4	2015-11-11 23:32:14 -0600	[diff] [blame]	523	This flag allows you to write regular expressions that look nicer and are
				524	more readable by allowing you to visually separate logical sections of the
				525	pattern and add comments. Whitespace within the pattern is ignored, except
				526	when in a character class or when preceded by an unescaped backslash.
				527	When a line contains a ``#`` that is not in a character class and is not
				528	preceded by an unescaped backslash, all characters from the leftmost such
				529	``#`` through the end of the line are ignored.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	530
Zachary Ware	77d61d4	2015-11-11 23:32:14 -0600	[diff] [blame]	531	This means that the two following regular expression objects that match a
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	532	decimal number are functionally equal::
				533
				534	a = re.compile(r"""\d + # the integral part
				535	\. # the decimal point
				536	\d * # some fractional digits""", re.X)
				537	b = re.compile(r"\d+\.\d*")
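One way to check that the two forms behave alike is to run both against a few inputs. The sample strings and the ``spaced`` pattern are assumptions chosen for illustration; ``spaced`` shows that an escaped space and a ``#`` inside a character class stay significant in verbose mode:

```python
import re

verbose = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
compact = re.compile(r"\d+\.\d*")

samples = ["3.14", "42.", "abc", ".5"]
# Each entry records whether the verbose and compact forms matched.
results = [(s, bool(verbose.match(s)), bool(compact.match(s))) for s in samples]

# A backslash-escaped space and a '#' in a character class are kept;
# the unescaped spaces around them are ignored.
spaced = re.compile(r"a\ b [#] c", re.X)
spaced_match = spaced.match("a b#c")

print(results)
print(spaced_match is not None)  # True
```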
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	538
				539
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	540	.. function:: search(pattern, string, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	541
Terry Jan Reedy	9f7f62f	2014-05-30 16:19:50 -0400	[diff] [blame]	542	Scan through string looking for the first location where the regular expression
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	543	pattern produces a match, and return a corresponding :class:`MatchObject`
				544	instance. Return ``None`` if no position in the string matches the pattern; note
				545	that this is different from finding a zero-length match at some point in the
				546	string.
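The distinction between a zero-length match and no match at all is easy to miss, so here is a short sketch (the patterns are illustrative):

```python
import re

# 'x*' can match zero characters, so search succeeds at index 0
# with an empty match.
empty = re.search(r"x*", "abc")

# 'x+' needs at least one 'x', so there is no match anywhere.
missing = re.search(r"x+", "abc")

print(empty.span())    # (0, 0)
print(empty.group(0))  # prints an empty line
print(missing)         # None
```

Testing the result with ``is None`` (or a plain ``if``) distinguishes the two cases, because a match object is true even when the matched text is empty.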
				547
				548
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	549	.. function:: match(pattern, string, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	550
				551	If zero or more characters at the beginning of string match the regular
				552	expression pattern, return a corresponding :class:`MatchObject` instance.
				553	Return ``None`` if the string does not match the pattern; note that this is
				554	different from a zero-length match.
				555
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	556	Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
				557	at the beginning of the string and not at the beginning of each line.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	558
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	559	If you want to locate a match anywhere in string, use :func:`search`
				560	instead (see also :ref:`search-vs-match`).
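The following sketch contrasts the two functions in :const:`MULTILINE` mode; the text is a made-up example:

```python
import re

text = "alpha\nbeta"

# match() only tries position 0, even though '^' could anchor at line 2.
at_start = re.match(r"^beta", text, re.MULTILINE)

# search() tries every position, so it finds the second line.
anywhere = re.search(r"^beta", text, re.MULTILINE)

print(at_start)          # None
print(anywhere.start())  # 6
```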
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	561
				562
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	563	.. function:: split(pattern, string, maxsplit=0, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	564
				565	Split string by the occurrences of pattern. If capturing parentheses are
used in pattern, then the text of all groups in the pattern is also returned
				567	as part of the resulting list. If maxsplit is nonzero, at most maxsplit
				568	splits occur, and the remainder of the string is returned as the final element
				569	of the list. (Incompatibility note: in the original Python 1.5 release,
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	570	maxsplit was ignored. This has been fixed in later releases.)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	571
				572	>>> re.split('\W+', 'Words, words, words.')
				573	['Words', 'words', 'words', '']
				574	>>> re.split('(\W+)', 'Words, words, words.')
				575	['Words', ', ', 'words', ', ', 'words', '.', '']
				576	>>> re.split('\W+', 'Words, words, words.', 1)
				577	['Words', 'words, words.']
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	578	>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
				579	['0', '3', '9']
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	580
Georg Brandl	70992c3	2008-03-06 07:19:15 +0000	[diff] [blame]	581	If there are capturing groups in the separator and it matches at the start of
				582	the string, the result will start with an empty string. The same holds for
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	583	the end of the string:
Georg Brandl	70992c3	2008-03-06 07:19:15 +0000	[diff] [blame]	584
				585	>>> re.split('(\W+)', '...words, words...')
				586	['', '...', 'words', ', ', 'words', '...', '']
				587
				588	That way, separator components are always found at the same relative
				589	indices within the result list (e.g., if there's one capturing group
				590	in the separator, the 0th, the 2nd and so forth).
				591
Skip Montanaro	222907d	2007-09-01 17:40:03 +0000	[diff] [blame]	592	Note that split will never split a string on an empty pattern match.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	593	For example:
Skip Montanaro	222907d	2007-09-01 17:40:03 +0000	[diff] [blame]	594
				595	>>> re.split('x*', 'foo')
				596	['foo']
				597	>>> re.split("(?m)^$", "foo\n\nbar\n")
				598	['foo\n\nbar\n']
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	599
Ezio Melotti	1e5d318	2010-11-26 09:30:44 +0000	[diff] [blame]	600	.. versionchanged:: 2.7
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	601	Added the optional flags argument.
				602
Georg Brandl	70992c3	2008-03-06 07:19:15 +0000	[diff] [blame]	603
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	604	.. function:: findall(pattern, string, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	605
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	606	Return all non-overlapping matches of pattern in string, as a list of
Georg Brandl	b46d6ff	2008-07-19 13:48:44 +0000	[diff] [blame]	607	strings. The string is scanned left-to-right, and matches are returned in
				608	the order found. If one or more groups are present in the pattern, return a
				609	list of groups; this will be a list of tuples if the pattern has more than
				610	one group. Empty matches are included in the result unless they touch the
				611	beginning of another match.
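How the shape of the result depends on the number of groups can be sketched as follows (the input string is an illustrative assumption):

```python
import re

text = "width=10, height=20"

# No groups: a list of the full matches.
whole = re.findall(r"\w+=\d+", text)

# One group: a list of that group's text.
one_group = re.findall(r"\w+=(\d+)", text)

# Two groups: a list of tuples, one tuple per match.
pairs = re.findall(r"(\w+)=(\d+)", text)

print(whole)      # ['width=10', 'height=20']
print(one_group)  # ['10', '20']
print(pairs)      # [('width', '10'), ('height', '20')]
```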
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	612
				613	.. versionadded:: 1.5.2
				614
				615	.. versionchanged:: 2.4
				616	Added the optional flags argument.
				617
				618
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	619	.. function:: finditer(pattern, string, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	620
Georg Brandl	e7a0990	2007-10-21 12:10:28 +0000	[diff] [blame]	621	Return an :term:`iterator` yielding :class:`MatchObject` instances over all
Georg Brandl	b46d6ff	2008-07-19 13:48:44 +0000	[diff] [blame]	622	non-overlapping matches for the RE pattern in string. The string is
				623	scanned left-to-right, and matches are returned in the order found. Empty
				624	matches are included in the result unless they touch the beginning of another
				625	match.
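Because :func:`finditer` yields match objects rather than strings, positions are available for each hit. A small sketch with made-up input:

```python
import re

text = "cat bat rat"

# Collect the matched text together with its (start, end) span.
hits = [(m.group(0), m.span()) for m in re.finditer(r"\w+at", text)]

print(hits)  # [('cat', (0, 3)), ('bat', (4, 7)), ('rat', (8, 11))]
```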
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	626
				627	.. versionadded:: 2.2
				628
				629	.. versionchanged:: 2.4
				630	Added the optional flags argument.
				631
				632
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	633	.. function:: sub(pattern, repl, string, count=0, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	634
				635	Return the string obtained by replacing the leftmost non-overlapping occurrences
				636	of pattern in string by the replacement repl. If the pattern isn't found,
				637	string is returned unchanged. repl can be a string or a function; if it is
				638	a string, any backslash escapes in it are processed. That is, ``\n`` is
Sandro Tosi	a7eb3c8	2011-08-19 22:54:33 +0200	[diff] [blame]	639	converted to a single newline character, ``\r`` is converted to a carriage return, and
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	640	so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
				641	as ``\6``, are replaced with the substring matched by group 6 in the pattern.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	642	For example:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	643
>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
				647	'static PyObject*\npy_myfunc(void)\n{'
				648
				649	If repl is a function, it is called for every non-overlapping occurrence of
				650	pattern. The function takes a single match object argument, and returns the
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	651	replacement string. For example:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	652
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
				656	>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
				657	'pro--gram files'
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	658	>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
				659	'Baked Beans & Spam'
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	660
Georg Brandl	04fd324	2009-08-13 07:48:05 +0000	[diff] [blame]	661	The pattern may be a string or an RE object.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	662
				663	The optional argument count is the maximum number of pattern occurrences to be
				664	replaced; count must be a non-negative integer. If omitted or zero, all
				665	occurrences will be replaced. Empty matches for the pattern are replaced only
				666	when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
				667	``'-a-b-c-'``.
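A brief sketch of the *count* argument and of the empty-match rule; the strings are illustrative:

```python
import re

text = "a-b-c-d"

# count omitted (or zero): every occurrence is replaced.
all_replaced = re.sub(r"-", "+", text)

# count=2: only the two leftmost occurrences are replaced.
first_two = re.sub(r"-", "+", text, count=2)

# 'x*' matches the empty string before each character and at the end.
empties = re.sub(r"x*", "-", "abc")

print(all_replaced)  # a+b+c+d
print(first_two)     # a+b+c-d
print(empties)       # -a-b-c-
```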
				668
Georg Brandl	ddbdc9a	2013-10-06 12:08:14 +0200	[diff] [blame]	669	In string-type repl arguments, in addition to the character escapes and
				670	backreferences described above,
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	671	``\g<name>`` will use the substring matched by the group named ``name``, as
				672	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
				673	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
				674	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
				675	reference to group 20, not a reference to group 2 followed by the literal
				676	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
				677	substring matched by the RE.
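The ``\g<...>`` forms can be sketched like this. The date pattern and sample strings are assumptions made for the example:

```python
import re

date_pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"

# Named references reorder the captured fields.
by_name = re.sub(date_pattern, r"\g<day>/\g<month>/\g<year>", "2015-11-23")

# \g<2>0 means "group 2, then a literal 0".
unambiguous = re.sub(r"(a)(b)", r"\g<2>0", "ab")

# \20 is read as group 20, which this pattern does not define.
try:
    re.sub(r"(a)(b)", r"\20", "ab")
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True

# \g<0> stands for the whole match.
wrapped = re.sub(r"\d+", r"<\g<0>>", "a1b22")

print(by_name)           # 23/11/2015
print(unambiguous)       # b0
print(ambiguous_failed)  # True
print(wrapped)           # a<1>b<22>
```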
				678
Ezio Melotti	1e5d318	2010-11-26 09:30:44 +0000	[diff] [blame]	679	.. versionchanged:: 2.7
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	680	Added the optional flags argument.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	681
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	682
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	683	.. function:: subn(pattern, repl, string, count=0, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	684
				685	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
				686	number_of_subs_made)``.
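For instance, collapsing runs of whitespace while counting how many runs were changed might look like this (illustrative input):

```python
import re

# subn returns the new string and the number of substitutions made.
new_text, replacements = re.subn(r"\s+", " ", "too   many    spaces")

print(new_text)      # too many spaces
print(replacements)  # 2
```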
				687
Ezio Melotti	1e5d318	2010-11-26 09:30:44 +0000	[diff] [blame]	688	.. versionchanged:: 2.7
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	689	Added the optional flags argument.
				690
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	691
				692	.. function:: escape(string)
				693
				694	Return string with all non-alphanumerics backslashed; this is useful if you
				695	want to match an arbitrary literal string that may have regular expression
				696	metacharacters in it.
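A sketch of why escaping matters when the text to find contains metacharacters. The literal and the haystack are made up; the exact escaped form is deliberately not shown, since which characters get a backslash may differ between Python versions:

```python
import re

literal = "1+1=2 (really?)"
haystack = "He said 1+1=2 (really?) twice."

# Escaped: '+', '(', ')' and '?' are matched literally.
found = re.search(re.escape(literal), haystack)

# Unescaped: '+' and '?' act as quantifiers and the parentheses
# form a group, so the literal text is not found.
unescaped = re.search(literal, haystack)

print(found.group(0))  # 1+1=2 (really?)
print(unescaped)       # None
```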
				697
				698
R. David Murray	a63f9b6	2010-07-10 14:25:18 +0000	[diff] [blame]	699	.. function:: purge()
				700
				701	Clear the regular expression cache.
				702
				703
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	704	.. exception:: error
				705
				706	Exception raised when a string passed to one of the functions here is not a
				707	valid regular expression (for example, it might contain unmatched parentheses)
				708	or when some other error occurs during compilation or matching. It is never an
				709	error if a string contains no match for a pattern.
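A minimal sketch of handling the exception when compiling patterns from untrusted input. The helper name ``safe_compile`` is hypothetical, not part of the module:

```python
import re

def safe_compile(source):
    """Return a compiled pattern, or None if source is not a valid RE."""
    try:
        return re.compile(source)
    except re.error:
        return None

bad = safe_compile("(unclosed")   # unmatched parenthesis
good = safe_compile("(closed)")

# A failed search is not an error; it simply returns None.
no_match = good.search("nothing here")

print(bad)       # None
print(no_match)  # None
```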
				710
				711
				712	.. _re-objects:
				713
				714	Regular Expression Objects
				715	--------------------------
				716
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	717	.. class:: RegexObject
				718
				719	The :class:`RegexObject` class supports the following methods and attributes:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	720
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	721	.. method:: RegexObject.search(string[, pos[, endpos]])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	722
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	723	Scan through string looking for a location where this regular expression
				724	produces a match, and return a corresponding :class:`MatchObject` instance.
				725	Return ``None`` if no position in the string matches the pattern; note that this
				726	is different from finding a zero-length match at some point in the string.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	727
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	728	The optional second parameter pos gives an index in the string where the
				729	search is to start; it defaults to ``0``. This is not completely equivalent to
				730	slicing the string; the ``'^'`` pattern character matches at the real beginning
				731	of the string and at positions just after a newline, but not necessarily at the
				732	index where the search is to start.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	733
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	734	The optional parameter endpos limits how far the string will be searched; it
				735	will be as if the string is endpos characters long, so only the characters
from pos to ``endpos - 1`` will be searched for a match. If endpos is less
than pos, no match will be found; otherwise, if rx is a compiled regular
expression object, ``rx.search(string, 0, 50)`` is equivalent to
``rx.search(string[:50], 0)``.
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	740
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	741	>>> pattern = re.compile("d")
				742	>>> pattern.search("dog") # Match at index 0
				743	<_sre.SRE_Match object at ...>
				744	>>> pattern.search("dog", 1) # No match; search doesn't include the "d"
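The remarks about *pos* versus slicing and about *endpos* can be sketched as follows; the patterns and strings are illustrative assumptions:

```python
import re

caret = re.compile(r"^b")
text = "ab"

# With pos=1 the search starts at 'b', but '^' still refers to the real
# beginning of the string, so nothing matches.
with_pos = caret.search(text, 1)

# Slicing creates a new string whose beginning is 'b'.
sliced = caret.search(text[1:])

digits = re.compile(r"\d+")
s = "abc123def456"

# endpos=5 makes the search behave as if the string were s[:5].
limited = digits.search(s, 0, 5)
same = digits.search(s[:5], 0)

print(with_pos)         # None
print(sliced.group(0))  # b
print(limited.group(0), same.group(0))  # 12 12
```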
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	745
				746
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	747	.. method:: RegexObject.match(string[, pos[, endpos]])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	748
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	749	If zero or more characters at the beginning of string match this regular
				750	expression, return a corresponding :class:`MatchObject` instance. Return
				751	``None`` if the string does not match the pattern; note that this is different
				752	from a zero-length match.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	753
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	754	The optional pos and endpos parameters have the same meaning as for the
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	755	:meth:`~RegexObject.search` method.
				756
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	757	>>> pattern = re.compile("o")
				758	>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
				759	>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
				760	<_sre.SRE_Match object at ...>
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	761
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	762	If you want to locate a match anywhere in string, use
				763	:meth:`~RegexObject.search` instead (see also :ref:`search-vs-match`).
				764
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	765
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	766	.. method:: RegexObject.split(string, maxsplit=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	767
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	768	Identical to the :func:`split` function, using the compiled pattern.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	769
				770
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	771	.. method:: RegexObject.findall(string[, pos[, endpos]])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	772
Georg Brandl	f93ce0c	2010-05-22 08:17:23 +0000	[diff] [blame]	773	Similar to the :func:`findall` function, using the compiled pattern, but
				774	also accepts optional pos and endpos parameters that limit the search
region in the same way as for :meth:`~RegexObject.match`.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	776
				777
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	778	.. method:: RegexObject.finditer(string[, pos[, endpos]])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	779
Georg Brandl	f93ce0c	2010-05-22 08:17:23 +0000	[diff] [blame]	780	Similar to the :func:`finditer` function, using the compiled pattern, but
				781	also accepts optional pos and endpos parameters that limit the search
region in the same way as for :meth:`~RegexObject.match`.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	783
				784
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	785	.. method:: RegexObject.sub(repl, string, count=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	786
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	787	Identical to the :func:`sub` function, using the compiled pattern.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	788
				789
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	790	.. method:: RegexObject.subn(repl, string, count=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	791
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	792	Identical to the :func:`subn` function, using the compiled pattern.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	793
				794
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	795	.. attribute:: RegexObject.flags
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	796
Georg Brandl	94a1057	2012-03-17 17:31:32 +0100	[diff] [blame]	797	The regex matching flags. This is a combination of the flags given to
				798	:func:`.compile` and any ``(?...)`` inline flags in the pattern.
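A sketch of inspecting the combined flags. Because the attribute may contain additional bits set by the implementation, the example tests individual flags with a bit mask rather than comparing the whole value:

```python
import re

# MULTILINE comes from the inline '(?m)', IGNORECASE from the argument.
pattern = re.compile(r"(?m)^spam", re.IGNORECASE)

has_ignorecase = bool(pattern.flags & re.IGNORECASE)
has_multiline = bool(pattern.flags & re.MULTILINE)
has_dotall = bool(pattern.flags & re.DOTALL)

print(has_ignorecase, has_multiline, has_dotall)  # True True False
```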
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	799
				800
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	801	.. attribute:: RegexObject.groups
Georg Brandl	b46f0d7	2008-12-05 07:49:49 +0000	[diff] [blame]	802
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	803	The number of capturing groups in the pattern.
Georg Brandl	b46f0d7	2008-12-05 07:49:49 +0000	[diff] [blame]	804
				805
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	806	.. attribute:: RegexObject.groupindex
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	807
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	808	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
				809	numbers. The dictionary is empty if no symbolic groups were used in the
				810	pattern.
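The relationship between ``groups`` and ``groupindex`` can be sketched with a pattern that mixes named and unnamed groups (the pattern is an illustrative assumption):

```python
import re

pattern = re.compile(r"(?P<key>\w+)=(?P<value>\w+)(;)?")

# Three capturing groups in total, two of which have names.
group_count = pattern.groups
# Copy into a plain dict for printing and comparison.
names = dict(pattern.groupindex)

unnamed = re.compile(r"(\d+)-(\d+)")
no_names = dict(unnamed.groupindex)

print(group_count)    # 3
print(sorted(names.items()))  # [('key', 1), ('value', 2)]
print(no_names)       # {}
```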
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	811
				812
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	813	.. attribute:: RegexObject.pattern
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	814
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	815	The pattern string from which the RE object was compiled.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	816
				817
				818	.. _match-objects:
				819
				820	Match Objects
				821	-------------
				822
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	823	.. class:: MatchObject
				824
Ezio Melotti	51c374d	2012-11-04 06:46:28 +0200	[diff] [blame]	825	Match objects always have a boolean value of ``True``.
Since :meth:`~RegexObject.match` and :meth:`~RegexObject.search` return ``None``
				827	when there is no match, you can test whether there was a match with a simple
				828	``if`` statement::
				829
match = re.search(pattern, string)
if match:
    process(match)
				833
				834	Match objects support the following methods and attributes:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	835
				836
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	837	.. method:: MatchObject.expand(template)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	838
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	839	Return the string obtained by doing backslash substitution on the template
				840	string template, as done by the :meth:`~RegexObject.sub` method. Escapes
				841	such as ``\n`` are converted to the appropriate characters, and numeric
				842	backreferences (``\1``, ``\2``) and named backreferences (``\g<1>``,
				843	``\g<name>``) are replaced by the contents of the corresponding group.
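A short sketch of :meth:`expand` with both numeric and named references; the names and input are illustrative:

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# Mix a named reference with a numeric one.
formatted = m.expand(r"\g<last>, \1")

# The two-character escape '\n' in the template becomes a real newline.
two_lines = m.expand(r"\1\n\2")

print(formatted)        # Newton, Isaac
print(repr(two_lines))  # 'Isaac\nNewton'
```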
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	844
				845
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	846	.. method:: MatchObject.group([group1, ...])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	847
Return one or more subgroups of the match. If there is a single argument, the
				849	result is a single string; if there are multiple arguments, the result is a
				850	tuple with one item per argument. Without arguments, group1 defaults to zero
				851	(the whole match is returned). If a groupN argument is zero, the corresponding
				852	return value is the entire matching string; if it is in the inclusive range
				853	[1..99], it is the string matching the corresponding parenthesized group. If a
				854	group number is negative or larger than the number of groups defined in the
				855	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
				856	part of the pattern that did not match, the corresponding result is ``None``.
				857	If a group is contained in a part of the pattern that matched multiple times,
				858	the last match is returned.
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	859
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	860	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
				861	>>> m.group(0) # The entire match
				862	'Isaac Newton'
				863	>>> m.group(1) # The first parenthesized subgroup.
				864	'Isaac'
				865	>>> m.group(2) # The second parenthesized subgroup.
				866	'Newton'
				867	>>> m.group(1, 2) # Multiple arguments give us a tuple.
				868	('Isaac', 'Newton')
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	869
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	870	If the regular expression uses the ``(?P<name>...)`` syntax, the groupN
				871	arguments may also be strings identifying groups by their group name. If a
				872	string argument is not used as a group name in the pattern, an :exc:`IndexError`
				873	exception is raised.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	874
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	875	A moderately complicated example:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	876
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	877	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				878	>>> m.group('first_name')
				879	'Malcolm'
				880	>>> m.group('last_name')
				881	'Reynolds'
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	882
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	883	Named groups can also be referred to by their index:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	884
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	885	>>> m.group(1)
				886	'Malcolm'
				887	>>> m.group(2)
				888	'Reynolds'
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	889
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	890	If a group matches multiple times, only the last match is accessible:
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	891
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	892	>>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times.
				893	>>> m.group(1) # Returns only the last match.
				894	'c3'
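The two failure modes of :meth:`group`, a group that exists but did not match versus a group that does not exist, can be sketched as follows (illustrative pattern):

```python
import re

m = re.match(r"(a)|(b)", "a")

# Group 2 is defined in the pattern but took no part in the match.
unmatched = m.group(2)
both = m.group(1, 2)

# Group 3 and the name 'missing' are not defined at all.
try:
    m.group(3)
    bad_number_raises = False
except IndexError:
    bad_number_raises = True

try:
    m.group('missing')
    bad_name_raises = False
except IndexError:
    bad_name_raises = True

print(unmatched)          # None
print(both)               # ('a', None)
print(bad_number_raises)  # True
print(bad_name_raises)    # True
```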
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	895
				896
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	897	.. method:: MatchObject.groups([default])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	898
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	899	Return a tuple containing all the subgroups of the match, from 1 up to however
				900	many groups are in the pattern. The default argument is used for groups that
				901	did not participate in the match; it defaults to ``None``. (Incompatibility
				902	note: in the original Python 1.5 release, if the tuple was one element long, a
				903	string would be returned instead. In later versions (from 1.5.1 on), a
				904	singleton tuple is returned in such cases.)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	905
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	906	For example:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	907
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	908	>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
				909	>>> m.groups()
				910	('24', '1632')
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	911
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	912	If we make the decimal place and everything after it optional, not all groups
				913	might participate in the match. These groups will default to ``None`` unless
				914	the default argument is given:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	915
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	916	>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
				917	>>> m.groups() # Second group defaults to None.
				918	('24', None)
				919	>>> m.groups('0') # Now, the second group defaults to '0'.
				920	('24', '0')
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	921
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	922
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	923	.. method:: MatchObject.groupdict([default])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	924
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	925	Return a dictionary containing all the named subgroups of the match, keyed by
				926	the subgroup name. The default argument is used for groups that did not
				927	participate in the match; it defaults to ``None``. For example:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	928
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	929	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				930	>>> m.groupdict()
				931	{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
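The *default* argument matters when a named group is optional. A sketch with hypothetical group names ``int`` and ``frac``:

```python
import re

# The fractional part, including the dot, is optional.
m = re.match(r"(?P<int>\d+)(?:\.(?P<frac>\d+))?", "24")

plain = m.groupdict()
with_default = m.groupdict('0')

print(plain['frac'])         # None
print(with_default['frac'])  # 0
```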
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	932
				933
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	934	.. method:: MatchObject.start([group])
				935	MatchObject.end([group])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	936
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	937	Return the indices of the start and end of the substring matched by group;
				938	group defaults to zero (meaning the whole matched substring). Return ``-1`` if
				939	group exists but did not contribute to the match. For a match object m, and
				940	a group g that did contribute to the match, the substring matched by group g
				941	(equivalent to ``m.group(g)``) is ::
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	942
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	943	m.string[m.start(g):m.end(g)]
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	944
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	945	Note that ``m.start(group)`` will equal ``m.end(group)`` if group matched a
				946	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
				947	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
				948	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	949
An example that removes ``remove_this`` from an email address:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	951
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	952	>>> email = "tony@tiremove_thisger.net"
				953	>>> m = re.search("remove_this", email)
				954	>>> email[:m.start()] + email[m.end():]
				955	'tony@tiger.net'
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	956
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	957
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	958	.. method:: MatchObject.span([group])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	959
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	960	For :class:`MatchObject` m, return the 2-tuple ``(m.start(group),
				961	m.end(group))``. Note that if group did not contribute to the match, this is
				962	``(-1, -1)``. group defaults to zero, the entire match.
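A sketch of the three cases for :meth:`span`: the whole match, a group that matched the empty string, and a group that did not contribute. The pattern is an illustrative variation on the one used for :meth:`start` above:

```python
import re

m = re.search(r"b(c?)(d)?", "abz")

whole = m.span()         # the entire match, 'b'
empty_group = m.span(1)  # (c?) matched the empty string after 'b'
unused_group = m.span(2) # (d)? did not contribute to the match

print(whole)         # (1, 2)
print(empty_group)   # (2, 2)
print(unused_group)  # (-1, -1)
```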
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	963
				964
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	965	.. attribute:: MatchObject.pos
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	966
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	967	The value of pos which was passed to the :meth:`~RegexObject.search` or
				968	:meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
				969	index into the string at which the RE engine started looking for a match.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	970
				971
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	972	.. attribute:: MatchObject.endpos
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	973
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	974	The value of endpos which was passed to the :meth:`~RegexObject.search` or
				975	:meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
				976	index into the string beyond which the RE engine will not go.
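The ``pos`` and ``endpos`` attributes simply record the search window. A small sketch with made-up input:

```python
import re

pattern = re.compile(r"\d+")
text = "abc123def456"

# Search only text[6:11], i.e. 'def45'.
windowed = pattern.search(text, 6, 11)

# Without explicit bounds, the search starts at index 0.
default = pattern.search(text)

print(windowed.pos, windowed.endpos)  # 6 11
print(windowed.group(0))              # 45
print(default.pos)                    # 0
```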
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	977
				978
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	979	.. attribute:: MatchObject.lastindex
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	980
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	981	The integer index of the last matched capturing group, or ``None`` if no group
				982	was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
				983	``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
				984	the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
				985	string.
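The four expressions mentioned above can be checked directly. This is only a sketch; the ``results`` dictionary and the ``no_groups`` name are introduced for illustration.

```python
import re

patterns = [r"(a)b", r"((a)(b))", r"((ab))", r"(a)(b)"]

# Map each pattern to the lastindex of its match against 'ab'.
results = dict((p, re.match(p, "ab").lastindex) for p in patterns)

# A pattern without any capturing group reports None.
no_groups = re.match(r"ab", "ab").lastindex
```

The first three patterns give ``1`` and ``(a)(b)`` gives ``2``; ``no_groups`` is ``None``.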
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	986
				987
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	988	.. attribute:: MatchObject.lastgroup
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	989
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	990	The name of the last matched capturing group, or ``None`` if the group didn't
				991	have a name, or if no group was matched at all.
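A short sketch contrasting named and unnamed groups; the group names ``first`` and ``last`` and the sample string are assumptions made for this example.

```python
import re

# Both groups are named, so lastgroup is the name of the last one matched.
named = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Ada Lovelace")

# The same pattern without names: lastindex is set, lastgroup is not.
unnamed = re.match(r"(\w+) (\w+)", "Ada Lovelace")

last_name = named.lastgroup
last_unnamed = unnamed.lastgroup
```

``last_name`` is ``'last'``, while ``last_unnamed`` is ``None`` even though ``unnamed.lastindex`` is ``2``.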
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	992
				993
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	994	.. attribute:: MatchObject.re
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	995
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	996	The regular expression object whose :meth:`~RegexObject.match` or
				997	:meth:`~RegexObject.search` method produced this :class:`MatchObject`
				998	instance.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	999
				1000
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	1001	.. attribute:: MatchObject.string
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1002
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	1003	The string passed to :meth:`~RegexObject.match` or
				1004	:meth:`~RegexObject.search`.
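The ``re`` and ``string`` attributes make a match object self-describing. The following sketch uses an invented pattern and input to show this:

```python
import re

pattern = re.compile(r"(\w+)@(\w+)")
m = pattern.search("contact: user@example")

same_object = m.re is pattern   # the very pattern object that produced m
original = m.string             # the full string that was searched

# Slicing the original string with start() and end() recovers the match.
matched = m.string[m.start():m.end()]
```

``matched`` equals ``m.group()``, here ``'user@example'``.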
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1005
				1006
				1007	Examples
				1008	--------
				1009
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1010
				1011	Checking For a Pair
				1012	^^^^^^^^^^^^^^^^^^^
				1013
				1014	In this example, we'll use the following helper function to display match
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1015	objects a little more gracefully:
				1016
Georg Brandl	838b4b0	2008-03-22 13:07:06 +0000	[diff] [blame]	1017	.. testcode::
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1018
				1019	def displaymatch(match):
				1020	    if match is None:
				1021	        return None
				1022	    return '<Match: %r, groups=%r>' % (match.group(), match.groups())
				1023
				1024	Suppose you are writing a poker program where a player's hand is represented as
				1025	a 5-character string with each character representing a card, "a" for ace, "k"
Ezio Melotti	13c82d0	2011-12-17 01:17:17 +0200	[diff] [blame]	1026	for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1027	representing the card with that value.
				1028
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1029	To see if a given string is a valid hand, one could do the following:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1030
Ezio Melotti	13c82d0	2011-12-17 01:17:17 +0200	[diff] [blame]	1031	>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
				1032	>>> displaymatch(valid.match("akt5q")) # Valid.
				1033	"<Match: 'akt5q', groups=()>"
				1034	>>> displaymatch(valid.match("akt5e")) # Invalid.
				1035	>>> displaymatch(valid.match("akt")) # Invalid.
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1036	>>> displaymatch(valid.match("727ak")) # Valid.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1037	"<Match: '727ak', groups=()>"
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1038
				1039	That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1040	To match this with a regular expression, one could use backreferences as such:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1041
				1042	>>> pair = re.compile(r".*(.).*\1")
				1043	>>> displaymatch(pair.match("717ak")) # Pair of 7s.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1044	"<Match: '717', groups=('7',)>"
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1045	>>> displaymatch(pair.match("718ak")) # No pairs.
				1046	>>> displaymatch(pair.match("354aa")) # Pair of aces.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1047	"<Match: '354aa', groups=('a',)>"
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1048
Georg Brandl	74f8fc0	2009-07-26 13:36:39 +0000	[diff] [blame]	1049	To find out what card the pair consists of, one could use the
				1050	:meth:`~MatchObject.group` method of :class:`MatchObject` in the following
				1051	manner:
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1052
Georg Brandl	838b4b0	2008-03-22 13:07:06 +0000	[diff] [blame]	1053	.. doctest::
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1054
				1055	>>> pair.match("717ak").group(1)
				1056	'7'
Georg Brandl	c62ef8b	2009-01-03 20:55:06 +0000	[diff] [blame]	1057
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1058	# Error because match() returns None, which doesn't have a group() method:
				1059	>>> pair.match("718ak").group(1)
				1060	Traceback (most recent call last):
				1061	  File "<pyshell#23>", line 1, in <module>
				1062	    pair.match("718ak").group(1)
				1063	AttributeError: 'NoneType' object has no attribute 'group'
Georg Brandl	c62ef8b	2009-01-03 20:55:06 +0000	[diff] [blame]	1064
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1065	>>> pair.match("354aa").group(1)
				1066	'a'
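One way to avoid the :exc:`AttributeError` is to test the result against ``None`` before calling :meth:`~MatchObject.group`. The helper name ``pair_card`` below is hypothetical and not part of the module; it is only a sketch of the check.

```python
import re

pair = re.compile(r".*(.).*\1")

def pair_card(hand):
    """Return the card that forms a pair in hand, or None if there is no pair."""
    m = pair.match(hand)
    if m is None:
        # No pair: match() returned None, so there is no group() to call.
        return None
    return m.group(1)

sevens = pair_card("717ak")
nothing = pair_card("718ak")
aces = pair_card("354aa")
```

``sevens`` is ``'7'``, ``aces`` is ``'a'``, and ``nothing`` is ``None`` instead of raising an exception.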
				1067
				1068
				1069	Simulating scanf()
				1070	^^^^^^^^^^^^^^^^^^
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1071
				1072	.. index:: single: scanf()
				1073
Sandro Tosi	98ed08f	2012-01-14 16:42:02 +0100	[diff] [blame]	1074	Python does not currently have an equivalent to :c:func:`scanf`. Regular
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1075	expressions are generally more powerful, though also more verbose, than
Sandro Tosi	98ed08f	2012-01-14 16:42:02 +0100	[diff] [blame]	1076	:c:func:`scanf` format strings. The table below offers some more-or-less
				1077	equivalent mappings between :c:func:`scanf` format tokens and regular
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1078	expressions.
				1079
				1080	+--------------------------------+---------------------------------------------+
Sandro Tosi	98ed08f	2012-01-14 16:42:02 +0100	[diff] [blame]	1081	| :c:func:`scanf` Token          | Regular Expression                          |
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1082	+================================+=============================================+
				1083	| ``%c``                         | ``.``                                       |
				1084	+--------------------------------+---------------------------------------------+
				1085	| ``%5c``                        | ``.{5}``                                    |
				1086	+--------------------------------+---------------------------------------------+
				1087	| ``%d``                         | ``[-+]?\d+``                                |
				1088	+--------------------------------+---------------------------------------------+
				1089	| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
				1090	+--------------------------------+---------------------------------------------+
				1091	| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
				1092	+--------------------------------+---------------------------------------------+
Ezio Melotti	8950019	2012-04-29 11:47:28 +0300	[diff] [blame]	1093	| ``%o``                         | ``[-+]?[0-7]+``                             |
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1094	+--------------------------------+---------------------------------------------+
				1095	| ``%s``                         | ``\S+``                                     |
				1096	+--------------------------------+---------------------------------------------+
				1097	| ``%u``                         | ``\d+``                                     |
				1098	+--------------------------------+---------------------------------------------+
Ezio Melotti	8950019	2012-04-29 11:47:28 +0300	[diff] [blame]	1099	| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1100	+--------------------------------+---------------------------------------------+
				1101
				1102	To extract the filename and numbers from a string like ::
				1103
				1104	/usr/sbin/sendmail - 0 errors, 4 warnings
				1105
Sandro Tosi	98ed08f	2012-01-14 16:42:02 +0100	[diff] [blame]	1106	you would use a :c:func:`scanf` format like ::
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1107
				1108	%s - %d errors, %d warnings
				1109
				1110	The equivalent regular expression would be ::
				1111
				1112	(\S+) - (\d+) errors, (\d+) warnings
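Applied from Python, the expression above might be used as in the following sketch. The variable names are chosen for this example, and the conversion with :func:`int` is needed because, unlike :c:func:`scanf`, the groups are always returned as strings.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

# Groups are strings; convert the numeric fields explicitly.
filename = m.group(1)
error_count = int(m.group(2))
warning_count = int(m.group(3))
```

This yields ``'/usr/sbin/sendmail'``, ``0`` and ``4`` respectively.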
				1113
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1114
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1115	.. _search-vs-match:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1116
				1117	search() vs. match()
				1118	^^^^^^^^^^^^^^^^^^^^
				1119
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1120	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1121
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1122	Python offers two different primitive operations based on regular expressions:
				1123	:func:`re.match` checks for a match only at the beginning of the string, while
				1124	:func:`re.search` checks for a match anywhere in the string (this is what Perl
				1125	does by default).
				1126
				1127	For example::
				1128
Serhiy Storchaka	12d547a	2016-05-10 13:45:32 +0300	[diff] [blame]	1129	>>> re.match("c", "abcdef") # No match
				1130	>>> re.search("c", "abcdef") # Match
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1131	<_sre.SRE_Match object at ...>
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1132
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1133	Regular expressions beginning with ``'^'`` can be used with :func:`search` to
				1134	restrict the match to the beginning of the string::
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1135
Serhiy Storchaka	12d547a	2016-05-10 13:45:32 +0300	[diff] [blame]	1136	>>> re.match("c", "abcdef") # No match
				1137	>>> re.search("^c", "abcdef") # No match
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1138	>>> re.search("^a", "abcdef") # Match
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1139	<_sre.SRE_Match object at ...>
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1140
				1141	Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
				1142	beginning of the string, whereas using :func:`search` with a regular expression
				1143	beginning with ``'^'`` will match at the beginning of each line.
				1144
				1145	>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
				1146	>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
				1147	<_sre.SRE_Match object at ...>
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1148
				1149
				1150	Making a Phonebook
				1151	^^^^^^^^^^^^^^^^^^
				1152
Georg Brandl	c62ef8b	2009-01-03 20:55:06 +0000	[diff] [blame]	1153	:func:`split` splits a string into a list, using the passed pattern as the delimiter. This
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1154	function is invaluable for converting textual data into data structures that can be
				1155	easily read and modified by Python, as demonstrated in the following example that
				1156	creates a phonebook.
				1157
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1158	First, here is the input. Normally it may come from a file; here we are using
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1159	triple-quoted string syntax:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1160
Georg Brandl	5a607b0	2012-03-17 17:26:27 +0100	[diff] [blame]	1161	>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
Georg Brandl	c62ef8b	2009-01-03 20:55:06 +0000	[diff] [blame]	1162	...
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1163	... Ronald Heathmore: 892.345.3428 436 Finley Avenue
				1164	... Frank Burger: 925.541.7625 662 South Dogwood Way
				1165	...
				1166	...
				1167	... Heather Albrecht: 548.326.4584 919 Park Place"""
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1168
				1169	The entries are separated by one or more newlines. Now we convert the string
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1170	into a list with each nonempty line having its own entry:
				1171
Georg Brandl	838b4b0	2008-03-22 13:07:06 +0000	[diff] [blame]	1172	.. doctest::
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1173	:options: +NORMALIZE_WHITESPACE
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1174
Georg Brandl	5a607b0	2012-03-17 17:26:27 +0100	[diff] [blame]	1175	>>> entries = re.split("\n+", text)
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1176	>>> entries
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1177	['Ross McFluff: 834.345.1254 155 Elm Street',
				1178	'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
				1179	'Frank Burger: 925.541.7625 662 South Dogwood Way',
				1180	'Heather Albrecht: 548.326.4584 919 Park Place']
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1181
				1182	Finally, split each entry into a list with first name, last name, telephone
Georg Brandl	907a720	2008-02-22 12:31:45 +0000	[diff] [blame]	1183	number, and address. We use the ``maxsplit`` parameter of :func:`split`
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1184	because the address itself contains spaces, which are our splitting pattern:
				1185
Georg Brandl	838b4b0	2008-03-22 13:07:06 +0000	[diff] [blame]	1186	.. doctest::
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1187	:options: +NORMALIZE_WHITESPACE
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1188
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1189	>>> [re.split(":? ", entry, 3) for entry in entries]
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1190	[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
				1191	['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
				1192	['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
				1193	['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
				1194
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1195	The ``:?`` pattern matches the colon after the last name, so that it does not
Georg Brandl	907a720	2008-02-22 12:31:45 +0000	[diff] [blame]	1196	occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1197	house number from the street name:
				1198
Georg Brandl	838b4b0	2008-03-22 13:07:06 +0000	[diff] [blame]	1199	.. doctest::
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1200	:options: +NORMALIZE_WHITESPACE
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1201
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1202	>>> [re.split(":? ", entry, 4) for entry in entries]
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1203	[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
				1204	['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
				1205	['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
				1206	['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
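The split entries can then be stored in whatever structure suits the program. As one possible sketch, the code below builds a dictionary keyed by ``(last name, first name)``; this layout and the ``phonebook`` name are assumptions made for illustration.

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

phonebook = {}
for entry in re.split("\n+", text):
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(":? ", entry, maxsplit=3)
    phonebook[(last, first)] = {"phone": phone, "address": address}

frank_phone = phonebook[("Burger", "Frank")]["phone"]
```

``frank_phone`` is ``'925.541.7625'``, and the dictionary holds four entries.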
				1207
				1208
				1209	Text Munging
				1210	^^^^^^^^^^^^
				1211
				1212	:func:`sub` replaces every occurrence of a pattern with a string or the
				1213	result of a function. This example demonstrates using :func:`sub` with
				1214	a function to "munge" text, or randomize the order of all the characters
				1215	in each word of a sentence except for the first and last characters::
				1216
					>>> import random
				1217	>>> def repl(m):
Serhiy Storchaka	12d547a	2016-05-10 13:45:32 +0300	[diff] [blame]	1218	...     inner_word = list(m.group(2))
				1219	...     random.shuffle(inner_word)
				1220	...     return m.group(1) + "".join(inner_word) + m.group(3)
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1221	>>> text = "Professor Abdolmalek, please report your absences promptly."
Georg Brandl	e0289a3	2010-08-01 21:44:38 +0000	[diff] [blame]	1222	>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1223	'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
Georg Brandl	e0289a3	2010-08-01 21:44:38 +0000	[diff] [blame]	1224	>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1225	'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
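Because the shuffle is random, the exact output differs from run to run. What stays fixed is that each word keeps its first and last characters and the same set of letters. The sketch below checks that property; the ``preserved`` name is introduced for this example only.

```python
import random
import re

def repl(m):
    inner_word = list(m.group(2))
    random.shuffle(inner_word)
    return m.group(1) + "".join(inner_word) + m.group(3)

text = "Professor Abdolmalek, please report your absences promptly."
munged = re.sub(r"(\w)(\w+)(\w)", repl, text)

original_words = re.findall(r"\w+", text)
munged_words = re.findall(r"\w+", munged)

# Every word keeps its first and last character and its multiset of letters.
preserved = all(
    a[0] == b[0] and a[-1] == b[-1] and sorted(a) == sorted(b)
    for a, b in zip(original_words, munged_words)
)
```

``preserved`` is ``True`` regardless of how the inner characters were shuffled.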
				1226
				1227
				1228	Finding all Adverbs
				1229	^^^^^^^^^^^^^^^^^^^
				1230
Georg Brandl	907a720	2008-02-22 12:31:45 +0000	[diff] [blame]	1231	:func:`findall` matches all occurrences of a pattern, not just the first
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1232	one as :func:`search` does. For example, a writer who wanted to
				1233	find all of the adverbs in some text might use :func:`findall` in
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1234	the following manner:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1235
				1236	>>> text = "He was carefully disguised but captured quickly by police."
				1237	>>> re.findall(r"\w+ly", text)
				1238	['carefully', 'quickly']
				1239
				1240
				1241	Finding all Adverbs and their Positions
				1242	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				1243
				1244	If one wants more information about all matches of a pattern than the matched
				1245	text, :func:`finditer` is useful as it provides instances of
				1246	:class:`MatchObject` instead of strings. Continuing with the previous example,
				1247	a writer who wanted to find all of the adverbs and their positions
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1248	in some text would use :func:`finditer` in the following manner:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1249
				1250	>>> text = "He was carefully disguised but captured quickly by police."
				1251	>>> for m in re.finditer(r"\w+ly", text):
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1252	...     print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1253	07-16: carefully
				1254	40-47: quickly
				1255
				1256
				1257	Raw String Notation
				1258	^^^^^^^^^^^^^^^^^^^
				1259
				1260	Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
				1261	every backslash (``'\'``) in a regular expression would have to be prefixed with
				1262	another one to escape it. For example, the two following lines of code are
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1263	functionally identical:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1264
				1265	>>> re.match(r"\W(.)\1\W", " ff ")
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1266	<_sre.SRE_Match object at ...>
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1267	>>> re.match("\\W(.)\\1\\W", " ff ")
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1268	<_sre.SRE_Match object at ...>
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1269
				1270	When one wants to match a literal backslash, it must be escaped in the regular
				1271	expression. With raw string notation, this means ``r"\\"``. Without raw string
				1272	notation, one must use ``"\\\\"``, making the following lines of code
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1273	functionally identical:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1274
				1275	>>> re.match(r"\\", r"\\")
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1276	<_sre.SRE_Match object at ...>
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1277	>>> re.match("\\\\", r"\\")
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1278	<_sre.SRE_Match object at ...>
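The equivalence of the two notations can also be checked without the regular expression engine, since both spellings produce the same Python string. A minimal sketch, with variable names invented for this example:

```python
import re

raw_len = len(r"\n")      # two characters: a backslash and "n"
cooked_len = len("\n")    # one character: a newline

# Both literals denote the same two-character pattern string.
same_pattern = r"\\" == "\\\\"

# That pattern matches one literal backslash at the start of the subject.
m = re.match(r"\\", "\\ rest")
matched = m.group()
```

``raw_len`` is ``2``, ``cooked_len`` is ``1``, ``same_pattern`` is ``True``, and ``matched`` is a single backslash.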