Blame - Doc/library/re.rst - platform/external/python/cpython2

blob: b353c4c67e3c53723f466ce11db5140eca4c298b [file] [log] [blame]

Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1
				2	:mod:`re` --- Regular expression operations
				3	===========================================
				4
				5	.. module:: re
				6	:synopsis: Regular expression operations.
				7	.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
				8	.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
				9
				10
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	11	This module provides regular expression matching operations similar to
				12	those found in Perl. Both patterns and strings to be searched can be
Georg Brandl	382edff	2009-03-31 15:43:20 +0000	[diff] [blame]	13	Unicode strings as well as 8-bit strings.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	14
				15	Regular expressions use the backslash character (``'\'``) to indicate
				16	special forms or to allow special characters to be used without invoking
				17	their special meaning. This collides with Python's usage of the same
				18	character for the same purpose in string literals; for example, to match
				19	a literal backslash, one might have to write ``'\\\\'`` as the pattern
				20	string, because the regular expression must be ``\\``, and each
				21	backslash must be expressed as ``\\`` inside a regular Python string
				22	literal.
				23
				24	The solution is to use Python's raw string notation for regular expression
				25	patterns; backslashes are not handled in any special way in a string literal
				26	prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
				27	``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	28	newline. Usually patterns will be expressed in Python code using this raw
				29	string notation.
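
The difference between the two notations can be checked directly. The following is a minimal sketch; the sample text ``C:\temp`` and the variable names are only illustrative.

```python
import re

# A raw string keeps the backslash; a regular string interprets the escape.
raw = r"\n"      # two characters: backslash and 'n'
cooked = "\n"    # one character: newline
lengths = (len(raw), len(cooked))

# Matching one literal backslash: the regular expression is \\ either way,
# but a regular string literal needs each backslash doubled once more.
text = "C:\\temp"                        # the text C:\temp
found_raw = re.search(r"\\", text)       # pattern written as a raw string
found_cooked = re.search("\\\\", text)   # same pattern, regular string
same_pattern = (r"\\" == "\\\\")
```

Both searches locate the same backslash, because both literals produce the same two-character pattern.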
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	30
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	31	It is important to note that most regular expression operations are available as
				32	module-level functions and :class:`RegexObject` methods. The functions are
				33	shortcuts that don't require you to compile a regex object first, but miss some
				34	fine-tuning parameters.
				35
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	36
				37	.. _re-syntax:
				38
				39	Regular Expression Syntax
				40	-------------------------
				41
				42	A regular expression (or RE) specifies a set of strings that matches it; the
				43	functions in this module let you check if a particular string matches a given
				44	regular expression (or if a given regular expression matches a particular
				45	string, which comes down to the same thing).
				46
				47	Regular expressions can be concatenated to form new regular expressions; if A
				48	and B are both regular expressions, then AB is also a regular expression.
				49	In general, if a string p matches A and another string q matches B, the
				50	string pq will match AB. This holds unless A or B contain low-precedence
				51	operations, boundary conditions between A and B, or numbered group
				52	references. Thus, complex expressions can easily be constructed from simpler
				53	primitive expressions like the ones described here. For details of the theory
				54	and implementation of regular expressions, consult the Friedl book referenced
				55	below, or almost any textbook about compiler construction.
				56
				57	A brief explanation of the format of regular expressions follows. For further
Georg Brandl	1cf0522	2008-02-05 12:01:24 +0000	[diff] [blame]	58	information and a gentler presentation, consult the :ref:`regex-howto`.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	59
				60	Regular expressions can contain both special and ordinary characters. Most
				61	ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
				62	expressions; they simply match themselves. You can concatenate ordinary
				63	characters, so ``last`` matches the string ``'last'``. (In the rest of this
				64	section, we'll write REs in ``this special style``, usually without quotes, and
				65	strings to be matched ``'in single quotes'``.)
				66
				67	Some characters, like ``'|'`` or ``'('``, are special. Special
				68	characters either stand for classes of ordinary characters, or affect
				69	how the regular expressions around them are interpreted. Regular
				70	expression pattern strings may not contain null bytes, but can specify
				71	the null byte using the ``\number`` notation, e.g., ``'\x00'``.
				72
				73
				74	The special characters are:
				75
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	76	``'.'``
				77	(Dot.) In the default mode, this matches any character except a newline. If
				78	the :const:`DOTALL` flag has been specified, this matches any character
				79	including a newline.
				80
				81	``'^'``
				82	(Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
				83	matches immediately after each newline.
				84
				85	``'$'``
				86	Matches the end of the string or just before the newline at the end of the
				87	string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
				88	matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
				89	only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
Amaury Forgeot d'Arc	d08a8eb	2008-01-10 21:59:42 +0000	[diff] [blame]	90	matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
				91	a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
				92	the newline, and one at the end of the string.
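
The ``$`` cases described above can be reproduced with a short sketch (variable names are illustrative):

```python
import re

text = 'foo1\nfoo2\n'

# Default mode: $ matches only at the very end, or just before the final newline.
default_match = re.search(r'foo.$', text).group()

# MULTILINE mode: $ also matches before every newline.
multiline_match = re.search(r'foo.$', text, re.M).group()

# A lone $ finds two empty matches in 'foo\n':
# one before the trailing newline and one at the end of the string.
dollar_positions = [m.start() for m in re.finditer(r'$', 'foo\n')]
```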
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	93
				94	``'*'``
				95	Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
				96	many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
				97	by any number of 'b's.
				98
				99	``'+'``
				100	Causes the resulting RE to match 1 or more repetitions of the preceding RE.
				101	``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
				102	match just 'a'.
				103
				104	``'?'``
				105	Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
				106	``ab?`` will match either 'a' or 'ab'.
				107
				108	``*?``, ``+?``, ``??``
				109	The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
				110	as much text as possible. Sometimes this behaviour isn't desired; if the RE
				111	``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
				112	string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
				113	perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
				114	characters as possible will be matched. Using ``.*?`` in the previous
				115	expression will match only ``'<H1>'``.
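
A small sketch of the greedy and non-greedy behaviour just described, using the same sample string:

```python
import re

html = '<H1>title</H1>'

# Greedy: .* runs to the last '>' in the string.
greedy = re.match(r'<.*>', html).group()

# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.match(r'<.*?>', html).group()
```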
				116
				117	``{m}``
				118	Specifies that exactly m copies of the previous RE should be matched; fewer
				119	matches cause the entire RE not to match. For example, ``a{6}`` will match
				120	exactly six ``'a'`` characters, but not five.
				121
				122	``{m,n}``
				123	Causes the resulting RE to match from m to n repetitions of the preceding
				124	RE, attempting to match as many repetitions as possible. For example,
				125	``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting m specifies a
				126	lower bound of zero, and omitting n specifies an infinite upper bound. As an
				127	example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
				128	followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
				129	modifier would be confused with the previously described form.
				130
				131	``{m,n}?``
				132	Causes the resulting RE to match from m to n repetitions of the preceding
				133	RE, attempting to match as few repetitions as possible. This is the
				134	non-greedy version of the previous qualifier. For example, on the
				135	6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
				136	while ``a{3,5}?`` will only match 3 characters.
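
The counted qualifiers can be compared side by side. This sketch reuses the strings from the descriptions above:

```python
import re

# Greedy form takes as many repetitions as allowed (5 of the 6 available).
greedy = re.match(r'a{3,5}', 'aaaaaa').group()

# Non-greedy form stops as soon as the minimum (3) is satisfied.
lazy = re.match(r'a{3,5}?', 'aaaaaa').group()

# Omitting n leaves the upper bound open: at least four 'a' characters, then 'b'.
enough = re.match(r'a{4,}b', 'aaaab')
too_few = re.match(r'a{4,}b', 'aaab')
```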
				137
				138	``'\'``
				139	Either escapes special characters (permitting you to match characters like
				140	``'*'``, ``'?'``, and so forth), or signals a special sequence; special
				141	sequences are discussed below.
				142
				143	If you're not using a raw string to express the pattern, remember that Python
				144	also uses the backslash as an escape sequence in string literals; if the escape
				145	sequence isn't recognized by Python's parser, the backslash and subsequent
				146	character are included in the resulting string. However, if Python would
				147	recognize the resulting sequence, the backslash should be repeated twice. This
				148	is complicated and hard to understand, so it's highly recommended that you use
				149	raw strings for all but the simplest expressions.
				150
				151	``[]``
Ezio Melotti	a195873	2011-10-20 19:31:08 +0300	[diff] [blame]	152	Used to indicate a set of characters. In a set:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	153
Ezio Melotti	a195873	2011-10-20 19:31:08 +0300	[diff] [blame]	154	* Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
				155	``'m'``, or ``'k'``.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	156
Ezio Melotti	a195873	2011-10-20 19:31:08 +0300	[diff] [blame]	157	* Ranges of characters can be indicated by giving two characters and separating
				158	them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
				159	``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
				160	``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
				161	``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
				162	it will match a literal ``'-'``.
				163
				164	* Special characters lose their special meaning inside sets. For example,
				165	``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
				166	``'*'``, or ``')'``.
				167
				168	* Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
				169	inside a set, although the characters they match depend on whether
				170	:const:`LOCALE` or :const:`UNICODE` mode is in force.
				171
				172	* Characters that are not within a range can be matched by :dfn:`complementing`
				173	the set. If the first character of the set is ``'^'``, all the characters
				174	that are not in the set will be matched. For example, ``[^5]`` will match
				175	any character except ``'5'``, and ``[^^]`` will match any character except
				176	``'^'``. ``^`` has no special meaning if it's not the first character in
				177	the set.
				178
				179	* To match a literal ``']'`` inside a set, precede it with a backslash, or
				180	place it at the beginning of the set. For example, both ``[()[\]{}]`` and
				181	``[]()[{}]`` will match a parenthesis.
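
The set rules listed above can be exercised with ``re.findall``. This is only a sketch, and the sample strings are made up for illustration:

```python
import re

# Individually listed characters.
listed = re.findall(r'[amk]', 'make')

# Ranges: two-digit numbers from 00 to 59.
minutes = re.findall(r'[0-5][0-9]', '07 59 60')

# An escaped '-' is a literal hyphen, not a range operator.
hyphen = re.findall(r'[a\-z]', 'a-mz')

# Special characters lose their special meaning inside a set.
specials = re.findall(r'[(+*)]', '(1+2)*3')

# A leading '^' complements the set.
not_five = re.findall(r'[^5]', '1525')

# A literal ']' inside a set, escaped with a backslash.
brackets = re.findall(r'[()[\]{}]', 'f(x)[0]')
```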
Mark Summerfield	700a635	2008-05-31 13:05:34 +0000	[diff] [blame]	182
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	183	``'|'``
				184	``A|B``, where A and B can be arbitrary REs, creates a regular expression that
				185	will match either A or B. An arbitrary number of REs can be separated by the
				186	``'|'`` in this way. This can be used inside groups (see below) as well. As
				187	the target string is scanned, REs separated by ``'|'`` are tried from left to
				188	right. When one pattern completely matches, that branch is accepted. This means
				189	that once ``A`` matches, ``B`` will not be tested further, even if it would
				190	produce a longer overall match. In other words, the ``'|'`` operator is never
				191	greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
				192	character class, as in ``[|]``.
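
The left-to-right behaviour of alternation is easy to observe. In this sketch the order of the branches decides which one wins:

```python
import re

# The first branch that matches is accepted, even though the second
# branch would produce a longer match.
first_wins = re.match(r'a|ab', 'ab').group()

# Reordering the branches changes the result.
longer_first = re.match(r'ab|a', 'ab').group()

# Two ways to match a literal vertical bar.
split_escaped = re.split(r'\|', 'a|b|c')
split_in_class = re.split(r'[|]', 'a|b|c')
```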
				193
				194	``(...)``
				195	Matches whatever regular expression is inside the parentheses, and indicates the
				196	start and end of a group; the contents of a group can be retrieved after a match
				197	has been performed, and can be matched later in the string with the ``\number``
				198	special sequence, described below. To match the literals ``'('`` or ``')'``,
				199	use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
				200
				201	``(?...)``
				202	This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
				203	otherwise). The first character after the ``'?'`` determines what the meaning
				204	and further syntax of the construct is. Extensions usually do not create a new
				205	group; ``(?P<name>...)`` is the only exception to this rule. Following are the
				206	currently supported extensions.
				207
				208	``(?iLmsux)``
				209	(One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
				210	``'u'``, ``'x'``.) The group matches the empty string; the letters
				211	set the corresponding flags: :const:`re.I` (ignore case),
				212	:const:`re.L` (locale dependent), :const:`re.M` (multi-line),
				213	:const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
				214	and :const:`re.X` (verbose), for the entire regular expression. (The
				215	flags are described in :ref:`contents-of-module-re`.) This
				216	is useful if you wish to include the flags as part of the regular
				217	expression, instead of passing a flag argument to the
Georg Brandl	74f8fc0	2009-07-26 13:36:39 +0000	[diff] [blame]	218	:func:`re.compile` function.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	219
				220	Note that the ``(?x)`` flag changes how the expression is parsed. It should be
				221	used first in the expression string, or after one or more whitespace characters.
				222	If there are non-whitespace characters before the flag, the results are
				223	undefined.
				224
				225	``(?:...)``
Georg Brandl	3b85b9b	2010-11-26 08:20:18 +0000	[diff] [blame]	226	A non-capturing version of regular parentheses. Matches whatever regular
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	227	expression is inside the parentheses, but the substring matched by the group
				228	cannot be retrieved after performing a match or referenced later in the
				229	pattern.
				230
				231	``(?P<name>...)``
				232	Similar to regular parentheses, but the substring matched by the group is
Georg Brandl	ddbdc9a	2013-10-06 12:08:14 +0200	[diff] [blame]	233	accessible via the symbolic group name *name*. Group names must be valid
				234	Python identifiers, and each group name must be defined only once within a
				235	regular expression. A symbolic group is also a numbered group, just as if
				236	the group were not named.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	237
Georg Brandl	ddbdc9a	2013-10-06 12:08:14 +0200	[diff] [blame]	238	Named groups can be referenced in three contexts. If the pattern is
				239	``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
				240	single or double quotes):
				241
				242	+---------------------------------------+----------------------------------+
				243	| Context of reference to group "quote" | Ways to reference it             |
				244	+=======================================+==================================+
				245	| in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
				246	|                                       | * ``\1``                         |
				247	+---------------------------------------+----------------------------------+
				248	| when processing match object ``m``    | * ``m.group('quote')``           |
				249	|                                       | * ``m.end('quote')`` (etc.)      |
				250	+---------------------------------------+----------------------------------+
				251	| in a string passed to the ``repl``    | * ``\g<quote>``                  |
				252	| argument of ``re.sub()``              | * ``\g<1>``                      |
				253	|                                       | * ``\1``                         |
				254	+---------------------------------------+----------------------------------+
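
The three contexts in the table can be tried out as follows. This is a sketch; the sample sentence and the replacement text ``X`` are invented for illustration.

```python
import re

# Context 1: reference inside the pattern itself, by name or by number.
by_name = r'''(?P<quote>['"]).*?(?P=quote)'''
by_number = r'''(?P<quote>['"]).*?\1'''

text = 'say "hi" now'

# Context 2: reference when processing the match object.
m = re.search(by_name, text)
matched = m.group()
quote_char = m.group('quote')
quote_end = m.end('quote')

# Context 3: reference in the repl argument of re.sub().
replaced = re.sub(by_name, r'\g<quote>X\g<quote>', text)

same_match = re.search(by_number, text).group()
```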
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	255
				256	``(?P=name)``
Georg Brandl	ddbdc9a	2013-10-06 12:08:14 +0200	[diff] [blame]	257	A backreference to a named group; it matches whatever text was matched by the
				258	earlier group named *name*.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	259
				260	``(?#...)``
				261	A comment; the contents of the parentheses are simply ignored.
				262
				263	``(?=...)``
				264	Matches if ``...`` matches next, but doesn't consume any of the string. This is
				265	called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
				266	``'Isaac '`` only if it's followed by ``'Asimov'``.
				267
				268	``(?!...)``
				269	Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
				270	For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's not
				271	followed by ``'Asimov'``.
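
Both lookahead assertions can be compared on the same input. A minimal sketch (``'Isaac Newton'`` is an added sample string):

```python
import re

# Positive lookahead: 'Isaac ' matches, 'Asimov' is checked but not consumed.
positive = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov').group()

# Negative lookahead: fails when 'Asimov' follows, succeeds otherwise.
rejected = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')
accepted = re.search(r'Isaac (?!Asimov)', 'Isaac Newton').group()
```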
				272
				273	``(?<=...)``
				274	Matches if the current position in the string is preceded by a match for ``...``
				275	that ends at the current position. This is called a :dfn:`positive lookbehind
				276	assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
				277	lookbehind will back up 3 characters and check if the contained pattern matches.
				278	The contained pattern must only match strings of some fixed length, meaning that
Serhiy Storchaka	4809d1f	2015-02-21 12:08:36 +0200	[diff] [blame]	279	``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Group
				280	references are not supported even if they match strings of some fixed length.
				281	Note that
Ezio Melotti	1142773	2012-04-29 07:34:22 +0300	[diff] [blame]	282	patterns which start with positive lookbehind assertions will not match at the
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	283	beginning of the string being searched; you will most likely want to use the
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	284	:func:`search` function rather than the :func:`match` function:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	285
				286	>>> import re
				287	>>> m = re.search('(?<=abc)def', 'abcdef')
				288	>>> m.group(0)
				289	'def'
				290
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	291	This example looks for a word following a hyphen:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	292
				293	>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
				294	>>> m.group(0)
				295	'egg'
				296
				297	``(?<!...)``
				298	Matches if the current position in the string is not preceded by a match for
				299	``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
				300	positive lookbehind assertions, the contained pattern must only match strings of
Serhiy Storchaka	4809d1f	2015-02-21 12:08:36 +0200	[diff] [blame]	301	some fixed length and shouldn't contain group references.
				302	Patterns which start with negative lookbehind assertions may
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	303	match at the beginning of the string being searched.
				304
				305	``(?(id/name)yes-pattern|no-pattern)``
				306	Will try to match with ``yes-pattern`` if the group with given id or name
				307	exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
				308	can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
				309	matching pattern, which will match with ``'<user@host.com>'`` as well as
				310	``'user@host.com'``, but not with ``'<user@host.com'``.
				311
				312	.. versionadded:: 2.4
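
The conditional pattern from the example above can be tested as shown in this sketch. The trailing ``$`` is an addition made here so that the whole string has to match; it is not part of the pattern quoted in the text.

```python
import re

# (?(1)>) requires a closing '>' only if group 1 (the opening '<') matched.
# The final $ is an assumption of this sketch, added to anchor the end.
email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)$')

bracketed = email.match('<user@host.com>')
bare = email.match('user@host.com')
unbalanced = email.match('<user@host.com')
```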
				313
				314	The special sequences consist of ``'\'`` and a character from the list below.
				315	If the ordinary character is not on the list, then the resulting RE will match
				316	the second character. For example, ``\$`` matches the character ``'$'``.
				317
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	318	``\number``
				319	Matches the contents of the group of the same number. Groups are numbered
				320	starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
Georg Brandl	980db0a	2013-10-06 12:58:20 +0200	[diff] [blame]	321	but not ``'thethe'`` (note the space after the group). This special sequence
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	322	can only be used to match one of the first 99 groups. If the first digit of
				323	*number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
				324	a group match, but as the character with octal value *number*. Inside the
				325	``'['`` and ``']'`` of a character class, all numeric escapes are treated as
				326	characters.
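
A short sketch of a numbered backreference, using the strings mentioned above:

```python
import re

# \1 must repeat exactly what group 1 matched, after a single space.
repeated = re.compile(r'(.+) \1')

candidates = ['the the', '55 55', 'thethe']
matching = [s for s in candidates if repeated.match(s)]
```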
				327
				328	``\A``
				329	Matches only at the start of the string.
				330
				331	``\b``
				332	Matches the empty string, but only at the beginning or end of a word. A word is
				333	defined as a sequence of alphanumeric or underscore characters, so the end of a
				334	word is indicated by whitespace or a non-alphanumeric, non-underscore character.
Ezio Melotti	38ae5b2	2012-02-29 11:40:00 +0200	[diff] [blame]	335	Note that formally, ``\b`` is defined as the boundary between a ``\w`` and
				336	a ``\W`` character (or vice versa), or between ``\w`` and the beginning/end
				337	of the string, so the precise set of characters deemed to be alphanumeric
				338	depends on the values of the ``UNICODE`` and ``LOCALE`` flags.
				339	For example, ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
				340	``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
Georg Brandl	ddbdc9a	2013-10-06 12:08:14 +0200	[diff] [blame]	341	Inside a character range, ``\b`` represents the backspace character, for
				342	compatibility with Python's string literals.
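
The word-boundary examples above can be verified with a sketch like this one:

```python
import re

word = re.compile(r'\bfoo\b')

samples = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']
with_boundary = [s for s in samples if word.search(s)]

# Inside a character class, \b stands for the backspace character instead.
backspace = re.search(r'[\b]', 'a\bb')
```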
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	343
				344	``\B``
				345	Matches the empty string, but only when it is not at the beginning or end of a
Ezio Melotti	38ae5b2	2012-02-29 11:40:00 +0200	[diff] [blame]	346	word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``, ``'py2'``,
				347	but not ``'py'``, ``'py.'``, or ``'py!'``.
				348	``\B`` is just the opposite of ``\b``, so is also subject to the settings
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	349	of ``LOCALE`` and ``UNICODE``.
				350
				351	``\d``
				352	When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
				353	is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
Mark Dickinson	fe67bd9	2009-07-28 20:35:03 +0000	[diff] [blame]	354	whatever is classified as a decimal digit in the Unicode character properties
				355	database.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	356
				357	``\D``
				358	When the :const:`UNICODE` flag is not specified, matches any non-digit
				359	character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
				360	will match anything other than characters marked as digits in the Unicode
				361	character properties database.
				362
				363	``\s``
Senthil Kumaran	dc0b324	2012-04-11 03:22:58 +0800	[diff] [blame]	364	When the :const:`UNICODE` flag is not specified, it matches any whitespace
				365	character; this is equivalent to the set ``[ \t\n\r\f\v]``. The
				366	:const:`LOCALE` flag has no extra effect on matching of the space.
				367	If :const:`UNICODE` is set, this will match the characters ``[ \t\n\r\f\v]``
				368	plus whatever is classified as space in the Unicode character properties
				369	database.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	370
				371	``\S``
Benjamin Peterson	72275ef	2014-11-25 14:54:45 -0600	[diff] [blame]	372	When the :const:`UNICODE` flag is not specified, matches any non-whitespace
Senthil Kumaran	dc0b324	2012-04-11 03:22:58 +0800	[diff] [blame]	373	character; this is equivalent to the set ``[^ \t\n\r\f\v]``. The
				374	:const:`LOCALE` flag has no extra effect on non-whitespace match. If
				375	:const:`UNICODE` is set, then any character not marked as space in the
				376	Unicode character properties database is matched.
				377
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	378
				379	``\w``
				380	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				381	any alphanumeric character and the underscore; this is equivalent to the set
				382	``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
				383	whatever characters are defined as alphanumeric for the current locale. If
				384	:const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
				385	is classified as alphanumeric in the Unicode character properties database.
				386
				387	``\W``
				388	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				389	any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
				390	With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
				391	not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
Zachary Ware	7ca2a90	2014-10-19 01:06:58 -0500	[diff] [blame]	392	this will match any character that is neither in ``[0-9_]`` nor classified as
Senthil Kumaran	15b6f3f	2012-03-11 20:37:39 -0700	[diff] [blame]	393	alphanumeric in the Unicode character properties database.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	394
				395	``\Z``
				396	Matches only at the end of the string.
				397
Senthil Kumaran	15b6f3f	2012-03-11 20:37:39 -0700	[diff] [blame]	398	If both :const:`LOCALE` and :const:`UNICODE` flags are included for a
				399	particular sequence, then the :const:`LOCALE` flag takes effect first, followed by
				400	:const:`UNICODE`.
				401
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	402	Most of the standard escapes supported by Python string literals are also
				403	accepted by the regular expression parser::
				404
				405	\a \b \f \n
				406	\r \t \v \x
				407	\\
				408
Ezio Melotti	48d886b	2012-04-29 04:46:34 +0300	[diff] [blame]	409	(Note that ``\b`` is used to represent word boundaries, and means "backspace"
				410	only inside character classes.)
				411
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	412	Octal escapes are included in a limited form: If the first digit is a 0, or if
				413	there are three octal digits, it is considered an octal escape. Otherwise, it is
				414	a group reference. As for string literals, octal escapes are always at most
				415	three digits in length.
				416
Georg Brandl	ae4ca79	2014-10-28 21:41:51 +0100	[diff] [blame]	417	.. seealso::
				418
				419	Mastering Regular Expressions
				420	Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
				421	second edition of the book no longer covers Python at all, but the first
				422	edition covered writing good regular expression patterns in great detail.
				423
				424
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	425
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	426	.. _contents-of-module-re:
				427
				428	Module Contents
				429	---------------
				430
				431	The module defines several functions, constants, and an exception. Some of the
				432	functions are simplified versions of the full featured methods for compiled
				433	regular expressions. Most non-trivial applications use the compiled
				434	form.
				435
				436
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	437	.. function:: compile(pattern, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	438
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	439	Compile a regular expression pattern into a regular expression object, which
Ezio Melotti	33b810d	2014-06-20 00:47:11 +0300	[diff] [blame]	440	can be used for matching using its :func:`~RegexObject.match` and
				441	:func:`~RegexObject.search` methods, described below.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	442
				443	The expression's behaviour can be modified by specifying a flags value.
				444	Values can be any of the following variables, combined using bitwise OR (the
				445	``|`` operator).
				446
				447	The sequence ::
				448
Gregory P. Smith	0261e5d	2009-03-02 04:53:24 +0000	[diff] [blame]	449	prog = re.compile(pattern)
				450	result = prog.match(string)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	451
				452	is equivalent to ::
				453
Gregory P. Smith	0261e5d	2009-03-02 04:53:24 +0000	[diff] [blame]	454	result = re.match(pattern, string)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	455
Georg Brandl	74f8fc0	2009-07-26 13:36:39 +0000	[diff] [blame]	456	but using :func:`re.compile` and saving the resulting regular expression
				457	object for reuse is more efficient when the expression will be used several
				458	times in a single program.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	459
Gregory P. Smith	0261e5d	2009-03-02 04:53:24 +0000	[diff] [blame]	460	.. note::
				461
				462	The compiled versions of the most recent patterns passed to
				463	:func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
				464	programs that use only a few regular expressions at a time needn't worry
				465	about compiling regular expressions.
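
The equivalence between the compiled and module-level forms can be sketched as follows. The pattern ``\d+`` and the input lines are illustrative only.

```python
import re

# Compile once, then reuse the pattern object for several inputs.
prog = re.compile(r'\d+')
lines = ['42 apples', 'no number', '7 pears']
via_object = [m.group() for m in map(prog.match, lines) if m]

# The module-level function gives the same result for a single call.
via_module = re.match(r'\d+', '42 apples').group()
```

Keeping the compiled object around mainly pays off when the same pattern is applied many times.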
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	466
				467
Sandro Tosi	e827c13	2012-01-01 12:52:24 +0100	[diff] [blame]	468	.. data:: DEBUG
				469
				470	Display debug information about the compiled expression.
				471
				472
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	473	.. data:: I
				474	IGNORECASE
				475
				476	Perform case-insensitive matching; expressions like ``[A-Z]`` will match
				477	lowercase letters, too. This is not affected by the current locale.
				478
				479
				480	.. data:: L
				481	LOCALE
				482
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	483	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
				484	current locale.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	485
				486
				487	.. data:: M
				488	MULTILINE
				489
				490	When specified, the pattern character ``'^'`` matches at the beginning of the
				491	string and at the beginning of each line (immediately following each newline);
				492	and the pattern character ``'$'`` matches at the end of the string and at the
				493	end of each line (immediately preceding each newline). By default, ``'^'``
				494	matches only at the beginning of the string, and ``'$'`` only at the end of the
				495	string and immediately before the newline (if any) at the end of the string.
				496
				497
				498	.. data:: S
				499	DOTALL
				500
				501	Make the ``'.'`` special character match any character at all, including a
				502	newline; without this flag, ``'.'`` will match anything except a newline.
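
A minimal sketch of the effect of :const:`DOTALL`, also showing flags combined with bitwise OR:

```python
import re

text = 'a\nb'

# Without the flag, '.' refuses to match the newline.
default_dot = re.match(r'a.b', text)

# With re.S (DOTALL), '.' matches the newline as well.
dotall = re.match(r'a.b', text, re.S)

# Flags are combined with the | operator.
combined = re.match(r'A.B', text, re.I | re.S)
```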
				503
				504
				505	.. data:: U
				506	UNICODE
				507
				508	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
				509	on the Unicode character properties database.
				510
				511	.. versionadded:: 2.0
				512
				513
				514	.. data:: X
				515	VERBOSE
				516
				517	This flag allows you to write regular expressions that look nicer. Whitespace
				518	within the pattern is ignored, except when in a character class or preceded by
				519	an unescaped backslash. When a line contains a ``'#'`` that is neither in a
				520	character class nor preceded by an unescaped backslash, all characters from the
				521	leftmost such ``'#'`` through the end of the line are ignored.
				522
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	523	That means that the two following regular expression objects that match a
				524	decimal number are functionally equal::
				525
				526	a = re.compile(r"""\d + # the integral part
				527	\. # the decimal point
				528	\d * # some fractional digits""", re.X)
				529	b = re.compile(r"\d+\.\d*")
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	530
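The two exceptions mentioned above (character classes and backslash escapes) can be sketched as follows. The pattern is a made-up example, not one taken from the module:

```python
import re

# Hypothetical pattern: digits, one literal space, then a literal '#'.
tagged = re.compile(r"""
    \d+     # one or more digits
    [ ]     # whitespace inside a character class is kept
    \#      # an escaped '#' is a literal, not the start of a comment
""", re.VERBOSE)

# The same expression written without VERBOSE.
compact = re.compile(r"\d+ #")

verbose_result = tagged.match("12 #").group()   # '12 #'
compact_result = compact.match("12 #").group()  # '12 #'
no_space = tagged.match("12#")                  # None: the space is required
```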
				531
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	532	.. function:: search(pattern, string, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	533
Terry Jan Reedy	9f7f62f	2014-05-30 16:19:50 -0400	[diff] [blame]	534	Scan through string looking for the first location where the regular expression
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	535	pattern produces a match, and return a corresponding :class:`MatchObject`
				536	instance. Return ``None`` if no position in the string matches the pattern; note
				537	that this is different from finding a zero-length match at some point in the
				538	string.
				539
				540
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	541	.. function:: match(pattern, string, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	542
				543	If zero or more characters at the beginning of string match the regular
				544	expression pattern, return a corresponding :class:`MatchObject` instance.
				545	Return ``None`` if the string does not match the pattern; note that this is
				546	different from a zero-length match.
				547
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	548	Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
				549	at the beginning of the string and not at the beginning of each line.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	550
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	551	If you want to locate a match anywhere in string, use :func:`search`
				552	instead (see also :ref:`search-vs-match`).
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	553
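A short sketch of that distinction, using an illustrative two-line string:

```python
import re

text = "alpha\nbeta"  # illustrative input

# match() is tried only at index 0, even with MULTILINE set.
anchored = re.match(r"beta", text, re.MULTILINE)  # None
# search() scans forward, and '^' can match after the newline.
found = re.search(r"^beta", text, re.MULTILINE)   # match starting at index 6
found_start = found.start()
```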
				554
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	555	.. function:: split(pattern, string, maxsplit=0, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	556
				557	Split string by the occurrences of pattern. If capturing parentheses are
				558	used in pattern, then the text of all groups in the pattern is also returned
				559	as part of the resulting list. If maxsplit is nonzero, at most maxsplit
				560	splits occur, and the remainder of the string is returned as the final element
				561	of the list. (Incompatibility note: in the original Python 1.5 release,
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	562	maxsplit was ignored. This has been fixed in later releases.)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	563
				564	>>> re.split(r'\W+', 'Words, words, words.')
				565	['Words', 'words', 'words', '']
				566	>>> re.split(r'(\W+)', 'Words, words, words.')
				567	['Words', ', ', 'words', ', ', 'words', '.', '']
				568	>>> re.split(r'\W+', 'Words, words, words.', 1)
				569	['Words', 'words, words.']
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	570	>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
				571	['0', '3', '9']
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	572
Georg Brandl	70992c3	2008-03-06 07:19:15 +0000	[diff] [blame]	573	If there are capturing groups in the separator and it matches at the start of
				574	the string, the result will start with an empty string. The same holds for
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	575	the end of the string:
Georg Brandl	70992c3	2008-03-06 07:19:15 +0000	[diff] [blame]	576
				577	>>> re.split(r'(\W+)', '...words, words...')
				578	['', '...', 'words', ', ', 'words', '...', '']
				579
				580	That way, separator components are always found at the same relative
				581	indices within the result list (e.g., if there's one capturing group
				582	in the separator, the 0th, the 2nd and so forth).
				583
Skip Montanaro	222907d	2007-09-01 17:40:03 +0000	[diff] [blame]	584	Note that split will never split a string on an empty pattern match.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	585	For example:
Skip Montanaro	222907d	2007-09-01 17:40:03 +0000	[diff] [blame]	586
				587	>>> re.split('x*', 'foo')
				588	['foo']
				589	>>> re.split("(?m)^$", "foo\n\nbar\n")
				590	['foo\n\nbar\n']
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	591
Ezio Melotti	1e5d318	2010-11-26 09:30:44 +0000	[diff] [blame]	592	.. versionchanged:: 2.7
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	593	Added the optional flags argument.
				594
Georg Brandl	70992c3	2008-03-06 07:19:15 +0000	[diff] [blame]	595
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	596	.. function:: findall(pattern, string, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	597
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	598	Return all non-overlapping matches of pattern in string, as a list of
Georg Brandl	b46d6ff	2008-07-19 13:48:44 +0000	[diff] [blame]	599	strings. The string is scanned left-to-right, and matches are returned in
				600	the order found. If one or more groups are present in the pattern, return a
				601	list of groups; this will be a list of tuples if the pattern has more than
				602	one group. Empty matches are included in the result unless they touch the
				603	beginning of another match.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	604
				605	.. versionadded:: 1.5.2
				606
				607	.. versionchanged:: 2.4
				608	Added the optional flags argument.
				609
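How the result shape depends on the number of groups can be sketched like this (the sample text is illustrative):

```python
import re

text = "3 apples, 12 pears"  # illustrative input

# No groups: a list of the full matches.
whole = re.findall(r"\d+", text)          # ['3', '12']
# One group: a list of that group's text.
one_group = re.findall(r"(\d+) \w+", text)    # ['3', '12']
# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\d+) (\w+)", text)  # [('3', 'apples'), ('12', 'pears')]
```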
				610
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	611	.. function:: finditer(pattern, string, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	612
Georg Brandl	e7a0990	2007-10-21 12:10:28 +0000	[diff] [blame]	613	Return an :term:`iterator` yielding :class:`MatchObject` instances over all
Georg Brandl	b46d6ff	2008-07-19 13:48:44 +0000	[diff] [blame]	614	non-overlapping matches for the RE pattern in string. The string is
				615	scanned left-to-right, and matches are returned in the order found. Empty
				616	matches are included in the result unless they touch the beginning of another
				617	match.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	618
				619	.. versionadded:: 2.2
				620
				621	.. versionchanged:: 2.4
				622	Added the optional flags argument.
				623
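Because each item is a match object, positions are available as well as text. A minimal sketch with an illustrative input:

```python
import re

# Collect the matched text and its (start, end) span for every run of digits.
found = [(m.group(), m.span()) for m in re.finditer(r"\d+", "a1bc23")]
# found == [('1', (1, 2)), ('23', (4, 6))]
```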
				624
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	625	.. function:: sub(pattern, repl, string, count=0, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	626
				627	Return the string obtained by replacing the leftmost non-overlapping occurrences
				628	of pattern in string by the replacement repl. If the pattern isn't found,
				629	string is returned unchanged. repl can be a string or a function; if it is
				630	a string, any backslash escapes in it are processed. That is, ``\n`` is
Sandro Tosi	a7eb3c8	2011-08-19 22:54:33 +0200	[diff] [blame]	631	converted to a single newline character, ``\r`` is converted to a carriage return, and
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	632	so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
				633	as ``\6``, are replaced with the substring matched by group 6 in the pattern.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	634	For example:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	635
				636	>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
				637	... r'static PyObject*\npy_\1(void)\n{',
				638	... 'def myfunc():')
				639	'static PyObject*\npy_myfunc(void)\n{'
				640
				641	If repl is a function, it is called for every non-overlapping occurrence of
				642	pattern. The function takes a single match object argument, and returns the
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	643	replacement string. For example:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	644
				645	>>> def dashrepl(matchobj):
				646	...     if matchobj.group(0) == '-': return ' '
				647	...     else: return '-'
				648	>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
				649	'pro--gram files'
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	650	>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
				651	'Baked Beans & Spam'
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	652
Georg Brandl	04fd324	2009-08-13 07:48:05 +0000	[diff] [blame]	653	The pattern may be a string or an RE object.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	654
				655	The optional argument count is the maximum number of pattern occurrences to be
				656	replaced; count must be a non-negative integer. If omitted or zero, all
				657	occurrences will be replaced. Empty matches for the pattern are replaced only
				658	when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
				659	``'-a-b-c-'``.
				660
Georg Brandl	ddbdc9a	2013-10-06 12:08:14 +0200	[diff] [blame]	661	In string-type repl arguments, in addition to the character escapes and
				662	backreferences described above,
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	663	``\g<name>`` will use the substring matched by the group named ``name``, as
				664	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
				665	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
				666	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
				667	reference to group 20, not a reference to group 2 followed by the literal
				668	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
				669	substring matched by the RE.
				670
Ezio Melotti	1e5d318	2010-11-26 09:30:44 +0000	[diff] [blame]	671	.. versionchanged:: 2.7
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	672	Added the optional flags argument.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	673
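The ``\g<...>`` forms can be sketched as follows; the patterns and inputs are illustrative:

```python
import re

# Named groups referenced by name in the replacement.
swapped = re.sub(r"(?P<first>\w+) (?P<last>\w+)",
                 r"\g<last>, \g<first>",
                 "Isaac Newton")                 # 'Newton, Isaac'

# \g<1>0 is group 1 followed by a literal '0'; \10 would mean group 10.
padded = re.sub(r"(\d)", r"\g<1>0", "7")         # '70'

# \g<0> stands for the entire match.
wrapped = re.sub(r"\w+", r"<\g<0>>", "hi there")  # '<hi> <there>'
```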
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	674
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	675	.. function:: subn(pattern, repl, string, count=0, flags=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	676
				677	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
				678	number_of_subs_made)``.
				679
Ezio Melotti	1e5d318	2010-11-26 09:30:44 +0000	[diff] [blame]	680	.. versionchanged:: 2.7
Gregory P. Smith	ae91d09	2009-03-02 05:13:57 +0000	[diff] [blame]	681	Added the optional flags argument.
				682
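For example, with an illustrative input string:

```python
import re

# Replace every 'a' and report how many replacements were made.
all_replaced = re.subn(r"a", "o", "banana")            # ('bonono', 3)
# Limit the number of replacements with count.
two_replaced = re.subn(r"a", "o", "banana", count=2)   # ('bonona', 2)
```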
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	683
				684	.. function:: escape(string)
				685
				686	Return string with all non-alphanumerics backslashed; this is useful if you
				687	want to match an arbitrary literal string that may have regular expression
				688	metacharacters in it.
				689
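A small sketch of why escaping matters. The exact escaped text may differ between Python versions, so only the matching behaviour is shown; the literal is illustrative:

```python
import re

literal = "a.b*c"  # illustrative text containing metacharacters

escaped = re.escape(literal)

# The escaped pattern matches only the literal text.
hit = re.search(escaped, "xa.b*cx")    # matches 'a.b*c'
miss = re.search(escaped, "aXbbc")     # None
# Unescaped, '.' and '*' keep their special meaning and match other text.
loose = re.search(literal, "aXbbc")    # matches 'aXbbc'
```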
				690
R. David Murray	a63f9b6	2010-07-10 14:25:18 +0000	[diff] [blame]	691	.. function:: purge()
				692
				693	Clear the regular expression cache.
				694
				695
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	696	.. exception:: error
				697
				698	Exception raised when a string passed to one of the functions here is not a
				699	valid regular expression (for example, it might contain unmatched parentheses)
				700	or when some other error occurs during compilation or matching. It is never an
				701	error if a string contains no match for a pattern.
				702
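A minimal sketch of handling an invalid pattern, using a deliberately malformed expression:

```python
import re

try:
    re.compile(r"(unbalanced")   # missing closing parenthesis
    compiled = True
except re.error as exc:
    compiled = False
    message = str(exc)           # human-readable description of the problem

# A pattern that simply fails to match raises nothing; the result is None.
no_match = re.search(r"xyz", "abc")
```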
				703
				704	.. _re-objects:
				705
				706	Regular Expression Objects
				707	--------------------------
				708
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	709	.. class:: RegexObject
				710
				711	The :class:`RegexObject` class supports the following methods and attributes:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	712
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	713	.. method:: RegexObject.search(string[, pos[, endpos]])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	714
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	715	Scan through string looking for a location where this regular expression
				716	produces a match, and return a corresponding :class:`MatchObject` instance.
				717	Return ``None`` if no position in the string matches the pattern; note that this
				718	is different from finding a zero-length match at some point in the string.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	719
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	720	The optional second parameter pos gives an index in the string where the
				721	search is to start; it defaults to ``0``. This is not completely equivalent to
				722	slicing the string; the ``'^'`` pattern character matches at the real beginning
				723	of the string and at positions just after a newline, but not necessarily at the
				724	index where the search is to start.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	725
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	726	The optional parameter endpos limits how far the string will be searched; it
				727	will be as if the string is endpos characters long, so only the characters
				728	from pos to ``endpos - 1`` will be searched for a match. If endpos is less
				729	than pos, no match will be found; otherwise, if rx is a compiled regular
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	730	expression object, ``rx.search(string, 0, 50)`` is equivalent to
				731	``rx.search(string[:50], 0)``.
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	732
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	733	>>> pattern = re.compile("d")
				734	>>> pattern.search("dog") # Match at index 0
				735	<_sre.SRE_Match object at ...>
				736	>>> pattern.search("dog", 1) # No match; search doesn't include the "d"
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	737
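The difference between passing pos and slicing, and the effect of endpos, can be sketched as follows (the inputs are illustrative):

```python
import re

caret = re.compile(r"^b")
# '^' does not match at index 1, because that is not the real start.
from_pos = caret.search("ab", 1)       # None
# After slicing, 'b' really is at the start of the (new) string.
from_slice = caret.search("ab"[1:])    # matches 'b'

digits = re.compile(r"\d+")
# With endpos=5 only "abc12" is visible to the engine.
limited = digits.search("abc123def", 0, 5)
sliced = digits.search("abc123def"[:5], 0)
```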
				738
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	739	.. method:: RegexObject.match(string[, pos[, endpos]])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	740
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	741	If zero or more characters at the beginning of string match this regular
				742	expression, return a corresponding :class:`MatchObject` instance. Return
				743	``None`` if the string does not match the pattern; note that this is different
				744	from a zero-length match.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	745
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	746	The optional pos and endpos parameters have the same meaning as for the
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	747	:meth:`~RegexObject.search` method.
				748
Georg Brandl	b1a1405	2010-06-01 07:25:23 +0000	[diff] [blame]	749	>>> pattern = re.compile("o")
				750	>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
				751	>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
				752	<_sre.SRE_Match object at ...>
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	753
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	754	If you want to locate a match anywhere in string, use
				755	:meth:`~RegexObject.search` instead (see also :ref:`search-vs-match`).
				756
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	757
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	758	.. method:: RegexObject.split(string, maxsplit=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	759
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	760	Identical to the :func:`split` function, using the compiled pattern.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	761
				762
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	763	.. method:: RegexObject.findall(string[, pos[, endpos]])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	764
Georg Brandl	f93ce0c	2010-05-22 08:17:23 +0000	[diff] [blame]	765	Similar to the :func:`findall` function, using the compiled pattern, but
				766	also accepts optional pos and endpos parameters that limit the search
				767	region in the same way as for :meth:`match`.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	768
				769
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	770	.. method:: RegexObject.finditer(string[, pos[, endpos]])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	771
Georg Brandl	f93ce0c	2010-05-22 08:17:23 +0000	[diff] [blame]	772	Similar to the :func:`finditer` function, using the compiled pattern, but
				773	also accepts optional pos and endpos parameters that limit the search
				774	region in the same way as for :meth:`match`.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	775
				776
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	777	.. method:: RegexObject.sub(repl, string, count=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	778
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	779	Identical to the :func:`sub` function, using the compiled pattern.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	780
				781
Eli Bendersky	eb71138	2011-11-14 01:02:20 +0200	[diff] [blame]	782	.. method:: RegexObject.subn(repl, string, count=0)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	783
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	784	Identical to the :func:`subn` function, using the compiled pattern.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	785
				786
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	787	.. attribute:: RegexObject.flags
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	788
Georg Brandl	94a1057	2012-03-17 17:31:32 +0100	[diff] [blame]	789	The regex matching flags. This is a combination of the flags given to
				790	:func:`.compile` and any ``(?...)`` inline flags in the pattern.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	791
				792
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	793	.. attribute:: RegexObject.groups
Georg Brandl	b46f0d7	2008-12-05 07:49:49 +0000	[diff] [blame]	794
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	795	The number of capturing groups in the pattern.
Georg Brandl	b46f0d7	2008-12-05 07:49:49 +0000	[diff] [blame]	796
				797
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	798	.. attribute:: RegexObject.groupindex
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	799
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	800	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
				801	numbers. The dictionary is empty if no symbolic groups were used in the
				802	pattern.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	803
				804
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	805	.. attribute:: RegexObject.pattern
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	806
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	807	The pattern string from which the RE object was compiled.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	808
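The attributes above can be inspected on any compiled pattern. A brief sketch with a hypothetical date pattern:

```python
import re

# Hypothetical pattern with two named groups.
date_rx = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})", re.IGNORECASE)

group_count = date_rx.groups                    # 2
name_to_index = dict(date_rx.groupindex)        # {'year': 1, 'month': 2}
source = date_rx.pattern                        # the original pattern string
ignores_case = bool(date_rx.flags & re.IGNORECASE)  # True
```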
				809
				810	.. _match-objects:
				811
				812	Match Objects
				813	-------------
				814
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	815	.. class:: MatchObject
				816
Ezio Melotti	51c374d	2012-11-04 06:46:28 +0200	[diff] [blame]	817	Match objects always have a boolean value of ``True``.
				818	Since :meth:`~RegexObject.match` and :meth:`~RegexObject.search` return ``None``
				819	when there is no match, you can test whether there was a match with a simple
				820	``if`` statement::
				821
				822	match = re.search(pattern, string)
				823	if match:
				824	    process(match)
				825
				826	Match objects support the following methods and attributes:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	827
				828
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	829	.. method:: MatchObject.expand(template)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	830
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	831	Return the string obtained by doing backslash substitution on the template
				832	string template, as done by the :meth:`~RegexObject.sub` method. Escapes
				833	such as ``\n`` are converted to the appropriate characters, and numeric
				834	backreferences (``\1``, ``\2``) and named backreferences (``\g<1>``,
				835	``\g<name>``) are replaced by the contents of the corresponding group.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	836
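For example, with an illustrative match:

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# Numeric and named backreferences are both substituted.
by_number = m.expand(r"\2, \1")                 # 'Newton, Isaac'
by_name = m.expand(r"\g<last>, \g<first>")      # 'Newton, Isaac'
```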
				837
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	838	.. method:: MatchObject.group([group1, ...])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	839
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	840	Returns one or more subgroups of the match. If there is a single argument, the
				841	result is a single string; if there are multiple arguments, the result is a
				842	tuple with one item per argument. Without arguments, group1 defaults to zero
				843	(the whole match is returned). If a groupN argument is zero, the corresponding
				844	return value is the entire matching string; if it is in the inclusive range
				845	[1..99], it is the string matching the corresponding parenthesized group. If a
				846	group number is negative or larger than the number of groups defined in the
				847	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
				848	part of the pattern that did not match, the corresponding result is ``None``.
				849	If a group is contained in a part of the pattern that matched multiple times,
				850	the last match is returned.
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	851
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	852	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
				853	>>> m.group(0) # The entire match
				854	'Isaac Newton'
				855	>>> m.group(1) # The first parenthesized subgroup.
				856	'Isaac'
				857	>>> m.group(2) # The second parenthesized subgroup.
				858	'Newton'
				859	>>> m.group(1, 2) # Multiple arguments give us a tuple.
				860	('Isaac', 'Newton')
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	861
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	862	If the regular expression uses the ``(?P<name>...)`` syntax, the groupN
				863	arguments may also be strings identifying groups by their group name. If a
				864	string argument is not used as a group name in the pattern, an :exc:`IndexError`
				865	exception is raised.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	866
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	867	A moderately complicated example:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	868
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	869	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				870	>>> m.group('first_name')
				871	'Malcolm'
				872	>>> m.group('last_name')
				873	'Reynolds'
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	874
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	875	Named groups can also be referred to by their index:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	876
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	877	>>> m.group(1)
				878	'Malcolm'
				879	>>> m.group(2)
				880	'Reynolds'
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	881
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	882	If a group matches multiple times, only the last match is accessible:
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	883
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	884	>>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times.
				885	>>> m.group(1) # Returns only the last match.
				886	'c3'
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	887
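The ``None`` and :exc:`IndexError` cases described above can be sketched with an illustrative alternation:

```python
import re

m = re.match(r"(a)|(b)", "a")

# Group 2 exists but sits in the branch that did not match.
unused = m.group(2)              # None

# Group 3 is not defined in the pattern at all.
try:
    m.group(3)
    bad_number = False
except IndexError:
    bad_number = True

# A name that is not a group name in the pattern fails the same way.
try:
    m.group("missing")
    bad_name = False
except IndexError:
    bad_name = True
```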
				888
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	889	.. method:: MatchObject.groups([default])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	890
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	891	Return a tuple containing all the subgroups of the match, from 1 up to however
				892	many groups are in the pattern. The default argument is used for groups that
				893	did not participate in the match; it defaults to ``None``. (Incompatibility
				894	note: in the original Python 1.5 release, if the tuple was one element long, a
				895	string would be returned instead. In later versions (from 1.5.1 on), a
				896	singleton tuple is returned in such cases.)
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	897
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	898	For example:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	899
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	900	>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
				901	>>> m.groups()
				902	('24', '1632')
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	903
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	904	If we make the decimal place and everything after it optional, not all groups
				905	might participate in the match. These groups will default to ``None`` unless
				906	the default argument is given:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	907
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	908	>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
				909	>>> m.groups() # Second group defaults to None.
				910	('24', None)
				911	>>> m.groups('0') # Now, the second group defaults to '0'.
				912	('24', '0')
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	913
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	914
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	915	.. method:: MatchObject.groupdict([default])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	916
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	917	Return a dictionary containing all the named subgroups of the match, keyed by
				918	the subgroup name. The default argument is used for groups that did not
				919	participate in the match; it defaults to ``None``. For example:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	920
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	921	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				922	>>> m.groupdict()
				923	{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	924
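The default argument works as it does for ``groups()``. A sketch with hypothetical group names:

```python
import re

# The fractional part is optional, so 'frac' may not participate.
m = re.match(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?", "24")

plain = m.groupdict()        # {'whole': '24', 'frac': None}
filled = m.groupdict('0')    # {'whole': '24', 'frac': '0'}
```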
				925
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	926	.. method:: MatchObject.start([group])
				927	MatchObject.end([group])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	928
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	929	Return the indices of the start and end of the substring matched by group;
				930	group defaults to zero (meaning the whole matched substring). Return ``-1`` if
				931	group exists but did not contribute to the match. For a match object m, and
				932	a group g that did contribute to the match, the substring matched by group g
				933	(equivalent to ``m.group(g)``) is ::
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	934
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	935	m.string[m.start(g):m.end(g)]
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	936
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	937	Note that ``m.start(group)`` will equal ``m.end(group)`` if group matched a
				938	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
				939	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
				940	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	941
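These rules can be checked directly. The second and third patterns below are illustrative additions:

```python
import re

m = re.search('b(c?)', 'cba')
whole = (m.start(0), m.end(0))   # (1, 2)
empty = (m.start(1), m.end(1))   # (2, 2): group 1 matched a null string

# A group that exists but did not contribute reports -1.
alt = re.match(r"(a)|(b)", "a")
skipped = (alt.start(2), alt.end(2))   # (-1, -1)

# Slicing the original string with start/end reproduces group(g).
mail = re.search(r"(\w+)@(\w+)", "mail tony@tiger now")
domain = mail.string[mail.start(2):mail.end(2)]   # 'tiger'
```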
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	942	An example that will remove ``remove_this`` from email addresses:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	943
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	944	>>> email = "tony@tiremove_thisger.net"
				945	>>> m = re.search("remove_this", email)
				946	>>> email[:m.start()] + email[m.end():]
				947	'tony@tiger.net'
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	948
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	949
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	950	.. method:: MatchObject.span([group])
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	951
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	952	For :class:`MatchObject` m, return the 2-tuple ``(m.start(group),
				953	m.end(group))``. Note that if group did not contribute to the match, this is
				954	``(-1, -1)``. group defaults to zero, the entire match.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	955
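A brief sketch, using illustrative inputs:

```python
import re

m = re.search(r"(\d+)-(\d+)", "range 10-25")
full_span = m.span()      # (6, 11)
first_span = m.span(1)    # (6, 8)
second_span = m.span(2)   # (9, 11)

# A group that did not contribute to the match yields (-1, -1).
alt = re.match(r"(a)|(b)", "a")
unused_span = alt.span(2)
```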
				956
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	957	.. attribute:: MatchObject.pos
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	958
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	959	The value of pos which was passed to the :meth:`~RegexObject.search` or
				960	:meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
				961	index into the string at which the RE engine started looking for a match.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	962
				963
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	964	.. attribute:: MatchObject.endpos
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	965
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	966	The value of endpos which was passed to the :meth:`~RegexObject.search` or
				967	:meth:`~RegexObject.match` method of the :class:`RegexObject`. This is the
				968	index into the string beyond which the RE engine will not go.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	969
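Both attributes simply record the search window. A sketch with an illustrative string:

```python
import re

digits = re.compile(r"\d+")

# Explicit window: only indices 2..4 ("c12") are searched.
m = digits.search("abc123", 2, 5)
window = (m.pos, m.endpos)       # (2, 5)
text = m.group()                 # '12'

# When omitted, pos is 0 and endpos is the length of the string.
d = digits.search("abc123")
defaults = (d.pos, d.endpos)     # (0, 6)
```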
				970
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	971	.. attribute:: MatchObject.lastindex
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	972
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	973	The integer index of the last matched capturing group, or ``None`` if no group
				974	was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
				975	``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
				976	the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
				977	string.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	978
				979
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	980	.. attribute:: MatchObject.lastgroup
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	981
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	982	The name of the last matched capturing group, or ``None`` if the group didn't
				983	have a name, or if no group was matched at all.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	984
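The ``lastindex`` cases listed above, together with ``lastgroup``, can be verified as follows. The named-group pattern is an illustrative addition:

```python
import re

# The examples from the lastindex description, applied to 'ab'.
single = re.match(r"(a)b", "ab").lastindex        # 1
nested = re.match(r"((a)(b))", "ab").lastindex    # 1: the outer group closes last
doubled = re.match(r"((ab))", "ab").lastindex     # 1
sequence = re.match(r"(a)(b)", "ab").lastindex    # 2

# lastgroup gives the name of that group, or None if it is unnamed.
named = re.match(r"(?P<x>a)(?P<y>b)", "ab").lastgroup   # 'y'
unnamed = re.match(r"(a)", "a").lastgroup               # None
no_groups = re.match(r"a", "a").lastindex               # None
```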
				985
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	986	.. attribute:: MatchObject.re
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	987
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	988	The regular expression object whose :meth:`~RegexObject.match` or
				989	:meth:`~RegexObject.search` method produced this :class:`MatchObject`
				990	instance.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	991
				992
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	993	.. attribute:: MatchObject.string
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	994
Brian Curtin	fbe5199	2010-03-25 23:48:54 +0000	[diff] [blame]	995	The string passed to :meth:`~RegexObject.match` or
				996	:meth:`~RegexObject.search`.
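
The two attributes above can be inspected as follows. This is only a sketch; ``word`` and ``subject`` are hypothetical names chosen for the example.

```python
import re

word = re.compile(r"\w+")
subject = "hello world"
m = word.search(subject)

# m.re is the compiled pattern object whose search() produced the match.
same_pattern = m.re is word

# m.string is the whole subject string, not just the matched portion.
same_string = m.string == subject
matched = m.group()
```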
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	997
				998
				999	Examples
				1000	--------
				1001
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1002
				1003	Checking For a Pair
				1004	^^^^^^^^^^^^^^^^^^^
				1005
				1006	In this example, we'll use the following helper function to display match
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1007	objects a little more gracefully:
				1008
Georg Brandl	838b4b0	2008-03-22 13:07:06 +0000	[diff] [blame]	1009	.. testcode::
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1010
				1011	def displaymatch(match):
				1012	    if match is None:
				1013	        return None
				1014	    return '<Match: %r, groups=%r>' % (match.group(), match.groups())
				1015
				1016	Suppose you are writing a poker program where a player's hand is represented as
				1017	a 5-character string with each character representing a card, "a" for ace, "k"
Ezio Melotti	13c82d0	2011-12-17 01:17:17 +0200	[diff] [blame]	1018	for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1019	representing the card with that value.
				1020
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1021	To see if a given string is a valid hand, one could do the following:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1022
Ezio Melotti	13c82d0	2011-12-17 01:17:17 +0200	[diff] [blame]	1023	>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
				1024	>>> displaymatch(valid.match("akt5q")) # Valid.
				1025	"<Match: 'akt5q', groups=()>"
				1026	>>> displaymatch(valid.match("akt5e")) # Invalid.
				1027	>>> displaymatch(valid.match("akt")) # Invalid.
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1028	>>> displaymatch(valid.match("727ak")) # Valid.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1029	"<Match: '727ak', groups=()>"
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1030
				1031	That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1032	To match this with a regular expression, one could use backreferences as such:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1033
				1034	>>> pair = re.compile(r".*(.).*\1")
				1035	>>> displaymatch(pair.match("717ak")) # Pair of 7s.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1036	"<Match: '717', groups=('7',)>"
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1037	>>> displaymatch(pair.match("718ak")) # No pairs.
				1038	>>> displaymatch(pair.match("354aa")) # Pair of aces.
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1039	"<Match: '354aa', groups=('a',)>"
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1040
Georg Brandl	74f8fc0	2009-07-26 13:36:39 +0000	[diff] [blame]	1041	To find out what card the pair consists of, one could use the
				1042	:meth:`~MatchObject.group` method of :class:`MatchObject` in the following
				1043	manner:
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1044
Georg Brandl	838b4b0	2008-03-22 13:07:06 +0000	[diff] [blame]	1045	.. doctest::
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1046
				1047	>>> pair.match("717ak").group(1)
				1048	'7'
Georg Brandl	c62ef8b	2009-01-03 20:55:06 +0000	[diff] [blame]	1049
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1050	# Error because pair.match() returns None, which doesn't have a group() method:
				1051	>>> pair.match("718ak").group(1)
				1052	Traceback (most recent call last):
				1053	  File "<pyshell#23>", line 1, in <module>
				1054	    pair.match("718ak").group(1)
				1055	AttributeError: 'NoneType' object has no attribute 'group'
Georg Brandl	c62ef8b	2009-01-03 20:55:06 +0000	[diff] [blame]	1056
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1057	>>> pair.match("354aa").group(1)
				1058	'a'
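
One way to avoid the :exc:`AttributeError` shown above is to test the result for ``None`` before calling :meth:`~MatchObject.group`. The helper ``pair_card`` below is hypothetical, and the sketch assumes the pattern ``.*(.).*\1``:

```python
import re

# Assumed pattern: any card that occurs again later in the hand.
pair = re.compile(r".*(.).*\1")

def pair_card(hand):
    """Return the card that forms a pair in hand, or None if there is no pair."""
    match = pair.match(hand)
    if match is None:
        # match() found nothing, so there is no group() to call.
        return None
    return match.group(1)
```

Returning ``None`` for the no-pair case mirrors what :meth:`~RegexObject.match` itself does, so callers can use a single ``is None`` check.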
				1059
				1060
				1061	Simulating scanf()
				1062	^^^^^^^^^^^^^^^^^^
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1063
				1064	.. index:: single: scanf()
				1065
Sandro Tosi	98ed08f	2012-01-14 16:42:02 +0100	[diff] [blame]	1066	Python does not currently have an equivalent to :c:func:`scanf`. Regular
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1067	expressions are generally more powerful, though also more verbose, than
Sandro Tosi	98ed08f	2012-01-14 16:42:02 +0100	[diff] [blame]	1068	:c:func:`scanf` format strings. The table below offers some more-or-less
				1069	equivalent mappings between :c:func:`scanf` format tokens and regular
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1070	expressions.
				1071
				1072	+--------------------------------+---------------------------------------------+
Sandro Tosi	98ed08f	2012-01-14 16:42:02 +0100	[diff] [blame]	1073	| :c:func:`scanf` Token          | Regular Expression                          |
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1074	+================================+=============================================+
				1075	| ``%c``                         | ``.``                                       |
				1076	+--------------------------------+---------------------------------------------+
				1077	| ``%5c``                        | ``.{5}``                                    |
				1078	+--------------------------------+---------------------------------------------+
				1079	| ``%d``                         | ``[-+]?\d+``                                |
				1080	+--------------------------------+---------------------------------------------+
				1081	| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
				1082	+--------------------------------+---------------------------------------------+
				1083	| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
				1084	+--------------------------------+---------------------------------------------+
Ezio Melotti	8950019	2012-04-29 11:47:28 +0300	[diff] [blame]	1085	| ``%o``                         | ``[-+]?[0-7]+``                             |
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1086	+--------------------------------+---------------------------------------------+
				1087	| ``%s``                         | ``\S+``                                     |
				1088	+--------------------------------+---------------------------------------------+
				1089	| ``%u``                         | ``\d+``                                     |
				1090	+--------------------------------+---------------------------------------------+
Ezio Melotti	8950019	2012-04-29 11:47:28 +0300	[diff] [blame]	1091	| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1092	+--------------------------------+---------------------------------------------+
				1093
				1094	To extract the filename and numbers from a string like ::
				1095
				1096	/usr/sbin/sendmail - 0 errors, 4 warnings
				1097
Sandro Tosi	98ed08f	2012-01-14 16:42:02 +0100	[diff] [blame]	1098	you would use a :c:func:`scanf` format like ::
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1099
				1100	%s - %d errors, %d warnings
				1101
				1102	The equivalent regular expression would be ::
				1103
				1104	(\S+) - (\d+) errors, (\d+) warnings
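
Applied from Python, the regular expression above might be used as in the following sketch. The variable names are illustrative. Unlike :c:func:`scanf`, the captured groups are returned as strings, so the numeric fields are converted with :func:`int` here.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
pattern = re.compile(r"(\S+) - (\d+) errors, (\d+) warnings")

m = pattern.match(line)
# Groups are always strings; convert the numeric ones explicitly.
filename = m.group(1)
errors = int(m.group(2))
warnings = int(m.group(3))
```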
				1105
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1106
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1107	.. _search-vs-match:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1108
				1109	search() vs. match()
				1110	^^^^^^^^^^^^^^^^^^^^
				1111
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1112	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1113
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1114	Python offers two different primitive operations based on regular expressions:
				1115	:func:`re.match` checks for a match only at the beginning of the string, while
				1116	:func:`re.search` checks for a match anywhere in the string (this is what Perl
				1117	does by default).
				1118
				1119	For example::
				1120
				1121	>>> re.match("c", "abcdef") # No match
				1122	>>> re.search("c", "abcdef") # Match
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1123	<_sre.SRE_Match object at ...>
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1124
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1125	Regular expressions beginning with ``'^'`` can be used with :func:`search` to
				1126	restrict the match to the beginning of the string::
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1127
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1128	>>> re.match("c", "abcdef") # No match
				1129	>>> re.search("^c", "abcdef") # No match
				1130	>>> re.search("^a", "abcdef") # Match
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1131	<_sre.SRE_Match object at ...>
Ezio Melotti	d9de93e	2012-02-29 13:37:07 +0200	[diff] [blame]	1132
				1133	Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
				1134	beginning of the string, whereas using :func:`search` with a regular expression
				1135	beginning with ``'^'`` will match at the beginning of each line.
				1136
				1137	>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
				1138	>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
				1139	<_sre.SRE_Match object at ...>
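
The difference can also be checked with match positions. The sketch below additionally uses ``\A``, which is not discussed in this section; it anchors at the start of the string regardless of :const:`MULTILINE`, so :func:`search` with ``\A`` behaves like :func:`match` in this respect.

```python
import re

text = "A\nB\nX"

# match() is tied to the start of the string, even in MULTILINE mode.
at_start = re.match("X", text, re.MULTILINE)

# '^' may match at the start of any line in MULTILINE mode.
line_start = re.search("^X", text, re.MULTILINE)

# '\A' only ever matches at the very start of the string.
string_start = re.search(r"\AX", text, re.MULTILINE)
```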
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1140
				1141
				1142	Making a Phonebook
				1143	^^^^^^^^^^^^^^^^^^
				1144
Georg Brandl	c62ef8b	2009-01-03 20:55:06 +0000	[diff] [blame]	1145	:func:`split` splits a string into a list delimited by the passed pattern. The
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1146	method is invaluable for converting textual data into data structures that can be
				1147	easily read and modified by Python as demonstrated in the following example that
				1148	creates a phonebook.
				1149
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1150	First, here is the input. Normally it may come from a file; here we are using
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1151	triple-quoted string syntax:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1152
Georg Brandl	5a607b0	2012-03-17 17:26:27 +0100	[diff] [blame]	1153	>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
Georg Brandl	c62ef8b	2009-01-03 20:55:06 +0000	[diff] [blame]	1154	...
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1155	... Ronald Heathmore: 892.345.3428 436 Finley Avenue
				1156	... Frank Burger: 925.541.7625 662 South Dogwood Way
				1157	...
				1158	...
				1159	... Heather Albrecht: 548.326.4584 919 Park Place"""
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1160
				1161	The entries are separated by one or more newlines. Now we convert the string
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1162	into a list with each nonempty line having its own entry:
				1163
Georg Brandl	838b4b0	2008-03-22 13:07:06 +0000	[diff] [blame]	1164	.. doctest::
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1165	:options: +NORMALIZE_WHITESPACE
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1166
Georg Brandl	5a607b0	2012-03-17 17:26:27 +0100	[diff] [blame]	1167	>>> entries = re.split("\n+", text)
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1168	>>> entries
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1169	['Ross McFluff: 834.345.1254 155 Elm Street',
				1170	'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
				1171	'Frank Burger: 925.541.7625 662 South Dogwood Way',
				1172	'Heather Albrecht: 548.326.4584 919 Park Place']
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1173
				1174	Finally, split each entry into a list with first name, last name, telephone
Georg Brandl	907a720	2008-02-22 12:31:45 +0000	[diff] [blame]	1175	number, and address. We use the ``maxsplit`` parameter of :func:`split`
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1176	because the address itself contains spaces, which our splitting pattern matches:
				1177
Georg Brandl	838b4b0	2008-03-22 13:07:06 +0000	[diff] [blame]	1178	.. doctest::
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1179	:options: +NORMALIZE_WHITESPACE
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1180
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1181	>>> [re.split(":? ", entry, 3) for entry in entries]
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1182	[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
				1183	['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
				1184	['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
				1185	['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
				1186
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1187	The ``:?`` pattern matches the colon after the last name, so that it does not
Georg Brandl	907a720	2008-02-22 12:31:45 +0000	[diff] [blame]	1188	occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1189	house number from the street name:
				1190
Georg Brandl	838b4b0	2008-03-22 13:07:06 +0000	[diff] [blame]	1191	.. doctest::
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1192	:options: +NORMALIZE_WHITESPACE
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1193
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame]	1194	>>> [re.split(":? ", entry, 4) for entry in entries]
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1195	[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
				1196	['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
				1197	['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
				1198	['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
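
Putting the steps together, the split results could be collected into a dictionary. This is one possible sketch, not the only reasonable layout. The name ``phonebook`` and the choice of the full name as key are assumptions made for the example.

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(r":? ", entry, maxsplit=3)
    phonebook[first + " " + last] = (phone, address)
```

Passing ``maxsplit`` by keyword makes the intent of the third argument explicit.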
				1199
				1200
				1201	Text Munging
				1202	^^^^^^^^^^^^
				1203
				1204	:func:`sub` replaces every occurrence of a pattern with a string or the
				1205	result of a function. This example demonstrates using :func:`sub` with
				1206	a function to "munge" text, or randomize the order of all the characters
				1207	in each word of a sentence except for the first and last characters::
				1208
				1209	>>> def repl(m):
				1210	...     inner_word = list(m.group(2))
				1211	...     random.shuffle(inner_word)
				1212	...     return m.group(1) + "".join(inner_word) + m.group(3)
				1213	>>> text = "Professor Abdolmalek, please report your absences promptly."
Georg Brandl	e0289a3	2010-08-01 21:44:38 +0000	[diff] [blame]	1214	>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1215	'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
Georg Brandl	e0289a3	2010-08-01 21:44:38 +0000	[diff] [blame]	1216	>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1217	'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
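
A self-contained version of the example needs ``import random``. Because the shuffle is random, the output differs between runs. The sketch below therefore checks properties of the result rather than a fixed string. The seed value is arbitrary and only there to make a run repeatable.

```python
import random
import re

def repl(m):
    # Shuffle only the inner characters; keep the first and last in place.
    inner_word = list(m.group(2))
    random.shuffle(inner_word)
    return m.group(1) + "".join(inner_word) + m.group(3)

random.seed(0)  # arbitrary seed, an assumption for repeatability
text = "Professor Abdolmalek, please report your absences promptly."
munged = re.sub(r"(\w)(\w+)(\w)", repl, text)

# Compare the words before and after munging.
original_words = re.findall(r"\w+", text)
munged_words = re.findall(r"\w+", munged)
```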
				1218
				1219
				1220	Finding all Adverbs
				1221	^^^^^^^^^^^^^^^^^^^
				1222
Georg Brandl	907a720	2008-02-22 12:31:45 +0000	[diff] [blame]	1223	:func:`findall` matches all occurrences of a pattern, not just the first
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1224	one as :func:`search` does. For example, if a writer wanted to
				1225	find all of the adverbs in some text, they might use :func:`findall` in
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1226	the following manner:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1227
				1228	>>> text = "He was carefully disguised but captured quickly by police."
				1229	>>> re.findall(r"\w+ly", text)
				1230	['carefully', 'quickly']
				1231
				1232
				1233	Finding all Adverbs and their Positions
				1234	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				1235
				1236	If one wants more information about all matches of a pattern than the matched
				1237	text, :func:`finditer` is useful as it provides instances of
				1238	:class:`MatchObject` instead of strings. Continuing with the previous example,
				1239	if a writer wanted to find all of the adverbs and their positions
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1240	in some text, they would use :func:`finditer` in the following manner:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1241
				1242	>>> text = "He was carefully disguised but captured quickly by police."
				1243	>>> for m in re.finditer(r"\w+ly", text):
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1244	...     print '%02d-%02d: %s' % (m.start(), m.end(), m.group(0))
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1245	07-16: carefully
				1246	40-47: quickly
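
If the positions are needed as data rather than printed output, the same loop can build a list. A minimal sketch, with ``positions`` as an illustrative name:

```python
import re

text = "He was carefully disguised but captured quickly by police."

# Each match object carries its own start/end offsets into the string.
positions = [(m.start(), m.end(), m.group(0))
             for m in re.finditer(r"\w+ly", text)]

# findall() returns only the matched text, without positions.
words = re.findall(r"\w+ly", text)
```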
				1247
				1248
				1249	Raw String Notation
				1250	^^^^^^^^^^^^^^^^^^^
				1251
				1252	Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
				1253	every backslash (``'\'``) in a regular expression would have to be prefixed with
				1254	another one to escape it. For example, the two following lines of code are
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1255	functionally identical:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1256
				1257	>>> re.match(r"\W(.)\1\W", " ff ")
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1258	<_sre.SRE_Match object at ...>
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1259	>>> re.match("\\W(.)\\1\\W", " ff ")
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1260	<_sre.SRE_Match object at ...>
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1261
				1262	When one wants to match a literal backslash, it must be escaped in the regular
				1263	expression. With raw string notation, this means ``r"\\"``. Without raw string
				1264	notation, one must use ``"\\\\"``, making the following lines of code
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1265	functionally identical:
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1266
				1267	>>> re.match(r"\\", r"\\")
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1268	<_sre.SRE_Match object at ...>
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1269	>>> re.match("\\\\", r"\\")
Georg Brandl	6199e32	2008-03-22 12:04:26 +0000	[diff] [blame]	1270	<_sre.SRE_Match object at ...>
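
The equivalences above can be verified by comparing the string literals themselves. This is a small sketch with illustrative variable names:

```python
import re

# The raw and non-raw spellings denote the same two-character pattern.
same_pattern = (r"\\" == "\\\\")
pattern_length = len(r"\\")

# r"\n" is a backslash followed by 'n'; "\n" is a single newline character.
raw_len, cooked_len = len(r"\n"), len("\n")

# Either spelling matches one literal backslash in the subject string.
m = re.match("\\\\", r"\\")
matched = m.group()
```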