
:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.

.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>

**Source code:** :source:`Lib/re.py`

--------------

This module provides regular expression matching operations similar to
those found in Perl.

Both patterns and strings to be searched can be Unicode strings (:class:`str`)
as well as 8-bit strings (:class:`bytes`).
However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a byte pattern or
vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
string notation.
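
As a brief illustration of the difference, the following sketch compares the escaped and raw spellings of the same pattern. The variable names and the sample path are chosen only for this example.

```python
import re

# Without a raw string, a literal backslash needs four backslashes:
# two for the regular expression, each doubled for the string literal.
escaped_pattern = '\\\\'
raw_pattern = r'\\'
same_pattern = escaped_pattern == raw_pattern

# Either spelling matches one literal backslash in the target string.
found = re.search(raw_pattern, r'C:\temp').group(0)

# r"\n" holds two characters; "\n" holds one (a newline).
lengths = (len(r"\n"), len("\n"))
```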

It is important to note that most regular expression operations are available as
module-level functions and methods on
:ref:`compiled regular expressions <re-objects>`. The functions are shortcuts
that don't require you to compile a regex object first, but they lack some
fine-tuning parameters.
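
A minimal sketch of the two styles follows. The sample text is made up for illustration; the optional start position shown here is one of the parameters accepted by the compiled pattern's ``search()`` method but not by the module-level function.

```python
import re

text = 'spam and eggs'

# Module-level shortcut: no explicit compilation step is needed.
shortcut = re.search(r'\w+', text)

# A compiled pattern can be reused, and its methods accept extra
# arguments such as the index at which to start searching.
word = re.compile(r'\w+')
from_start = word.search(text)
from_pos = word.search(text, 5)
```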

.. seealso::

   The third-party `regex <https://pypi.python.org/pypi/regex/>`_ module,
   which has an API compatible with the standard library :mod:`re` module,
   but offers additional functionality and more thorough Unicode support.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match AB. This holds unless *A* or *B* contains low-precedence
operations, there are boundary conditions between *A* and *B*, or either has
numbered group references. Thus, complex expressions can easily be constructed
from simpler primitive expressions like the ones described here. For details of
the theory and implementation of regular expressions, consult the Friedl book
[Frie09]_, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the :ref:`regex-howto`.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.

Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
directly nested. This avoids ambiguity with the non-greedy modifier suffix
``?``, and with other modifiers in other implementations. To apply a second
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
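
The nesting rule can be checked with a short sketch. The variable names are illustrative, and ``fullmatch()`` is used so that the whole string must consist of the repeated group.

```python
import re

multiple_of_six = re.compile(r'(?:a{6})*')

twelve = multiple_of_six.fullmatch('a' * 12)   # 12 is a multiple of six
seven = multiple_of_six.fullmatch('a' * 7)     # 7 is not

# Nesting the qualifiers directly is rejected at compile time.
try:
    re.compile(r'a{6}*')
    nested_ok = True
except re.error:
    nested_ok = False
```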


The special characters are:

``.``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

``^``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

``$``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
   only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
   matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
   a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
   the newline, and one at the end of the string.
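
The behaviour of ``$`` described above can be reproduced with the following sketch, which uses the same sample strings as the text.

```python
import re

text = 'foo1\nfoo2\n'

# Default mode: '$' only matches at the end (or before a final newline).
default_hit = re.search(r'foo.$', text).group()

# MULTILINE mode: '$' also matches before every newline.
multiline_hit = re.search(r'foo.$', text, re.MULTILINE).group()

# A bare '$' finds two empty matches in 'foo\n'.
empty_matches = re.findall(r'$', 'foo\n')
```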

``*``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
   by any number of 'b's.

``+``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
   match just 'a'.

``?``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either 'a' or 'ab'.

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
   string, and not just ``'<a>'``. Adding ``?`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
   characters as possible will be matched. Using the RE ``<.*?>`` will match
   only ``'<a>'``.
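
A short sketch of greedy versus non-greedy matching, using the sample string from the text:

```python
import re

text = '<a> b <c>'

# Greedy: '.*' consumes as much as possible before the final '>'.
greedy = re.match(r'<.*>', text).group()

# Non-greedy: '.*?' stops at the first '>' that lets the match succeed.
minimal = re.match(r'<.*?>', text).group()
```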

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
   followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as few repetitions as possible. This is the
   non-greedy version of the previous qualifier. For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.
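
The counted qualifiers can be tried out as follows; this is only a sketch, and the variable names are illustrative.

```python
import re

# Greedy and non-greedy counted repetition on the same input.
greedy = re.match(r'a{3,5}', 'aaaaaa').group()
minimal = re.match(r'a{3,5}?', 'aaaaaa').group()

# An omitted upper bound means "no limit".
open_ended = re.compile(r'a{4,}b')
four = open_ended.fullmatch('aaaab')
thousand = open_ended.fullmatch('a' * 1000 + 'b')
three = open_ended.fullmatch('aaab')
```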

``\``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be repeated twice. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

``[]``
   Used to indicate a set of characters. In a set:

   * Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
     ``'m'``, or ``'k'``.

   * Ranges of characters can be indicated by giving two characters and separating
     them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
     ``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
     ``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
     ``[a\-z]``) or if it's placed as the first or last character
     (e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.

   * Special characters lose their special meaning inside sets. For example,
     ``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
     ``'*'``, or ``')'``.

   * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
     inside a set, although the characters they match depend on whether
     :const:`ASCII` or :const:`LOCALE` mode is in force.

   * Characters that are not within a range can be matched by :dfn:`complementing`
     the set. If the first character of the set is ``'^'``, all the characters
     that are *not* in the set will be matched. For example, ``[^5]`` will match
     any character except ``'5'``, and ``[^^]`` will match any character except
     ``'^'``. ``^`` has no special meaning if it's not the first character in
     the set.

   * To match a literal ``']'`` inside a set, precede it with a backslash, or
     place it at the beginning of the set. For example, ``[()[\]{}]`` and
     ``[]()[{}]`` will both match a parenthesis.

   * Support of nested sets and set operations as in `Unicode Technical
     Standard #18`_ might be added in the future. This would change the
     syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
     in ambiguous cases for the time being.
     That includes sets starting with a literal ``'['`` or containing literal
     character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``. To
     avoid a warning, escape them with a backslash.

   .. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/

   .. versionchanged:: 3.7
      :exc:`FutureWarning` is raised if a character set contains constructs
      that will change semantically in the future.
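
The set rules listed above can be exercised with a few small patterns. This is a sketch; the sample strings are made up for illustration.

```python
import re

# Characters listed individually.
listed = re.findall(r'[amk]', 'makefile')

# Two ranges side by side: two-digit numbers from 00 to 59.
minutes = re.fullmatch(r'[0-5][0-9]', '59')
too_big = re.fullmatch(r'[0-5][0-9]', '60')

# An escaped '-' is a literal hyphen rather than a range operator.
literal_hyphen = re.findall(r'[a\-z]', 'a-z b')

# A leading '^' complements the set.
complemented = re.findall(r'[^5]', '4556')

# An escaped ']' can appear anywhere inside the set.
brackets = re.findall(r'[()[\]{}]', 'f(x)[0]{}')
```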

``|``
   ``A|B``, where *A* and *B* can be arbitrary REs, creates a regular expression that
   will match either *A* or *B*. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once *A* matches, *B* will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.
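
The left-to-right behaviour of alternation is easy to observe in a small sketch; the patterns and strings below are chosen only for illustration.

```python
import re

# The first branch that matches wins, even if a later one is longer.
first_branch = re.match(r'a|ab', 'ab').group()

# Reordering the branches changes the result.
reordered = re.match(r'ab|a', 'ab').group()

# Two ways to match a literal '|'.
escaped = re.findall(r'\|', 'a|b|c')
in_class = re.findall(r'[|]', 'a|b|c')
```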

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below. To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?aiLmsux)``
   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
   letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
   :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
   for the entire regular expression.
   (The flags are described in :ref:`contents-of-module-re`.)
   This is useful if you wish to include the flags as part of the
   regular expression, instead of passing a *flag* argument to the
   :func:`re.compile` function. Flags should be used first in the
   expression string.

``(?:...)``
   A non-capturing version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.
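
The difference between capturing and non-capturing parentheses shows up in the groups reported by the match object. A minimal sketch, with an illustrative input string:

```python
import re

# Both parenthesised parts are captured.
capturing = re.match(r'(\d+)-(\d+)', '10-20').groups()

# The first part still has to match, but it is not stored as a group.
non_capturing = re.match(r'(?:\d+)-(\d+)', '10-20').groups()
```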

``(?aiLmsux-imsx:...)``
   (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
   ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
   one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
   The letters set or remove the corresponding flags:
   :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
   and :const:`re.X` (verbose), for the part of the expression.
   (The flags are described in :ref:`contents-of-module-re`.)

   The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
   as inline flags, so they can't be combined or follow ``'-'``. Instead,
   when one of them appears in an inline group, it overrides the matching mode
   in the enclosing group. In Unicode patterns ``(?a:...)`` switches to
   ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
   (default). In bytes patterns ``(?L:...)`` switches to locale-dependent
   matching, and ``(?a:...)`` switches to ASCII-only matching (default).
   This override is only in effect for the narrow inline group, and the
   original matching mode is restored outside of the group.

   .. versionadded:: 3.6

   .. versionchanged:: 3.7
      The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.
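
A sketch of scoped inline flags follows. It assumes Python 3.6 or later, where this syntax is available, and the patterns are illustrative only.

```python
import re

# Case is ignored only inside the group; 'def' stays case-sensitive.
scoped_on = re.compile(r'(?i:abc)def')
mixed_case = scoped_on.fullmatch('ABCdef')
all_upper = scoped_on.fullmatch('ABCDEF')

# The reverse: IGNORECASE applies globally but is removed inside the group.
scoped_off = re.compile(r'(?-i:abc)def', re.IGNORECASE)
lower_then_upper = scoped_off.fullmatch('abcDEF')
upper_then_lower = scoped_off.fullmatch('ABCdef')
```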

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid
   Python identifiers, and each group name must be defined only once within a
   regular expression. A symbolic group is also a numbered group, just as if
   the group were not named.

   Named groups can be referenced in three contexts. If the pattern is
   ``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
   single or double quotes):

   +---------------------------------------+----------------------------------+
   | Context of reference to group "quote" | Ways to reference it             |
   +=======================================+==================================+
   | in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
   | when processing match object *m*      | * ``m.group('quote')``           |
   |                                       | * ``m.end('quote')`` (etc.)      |
   +---------------------------------------+----------------------------------+
   | in a string passed to the *repl*      | * ``\g<quote>``                  |
   | argument of ``re.sub()``              | * ``\g<1>``                      |
   |                                       | * ``\1``                         |
   +---------------------------------------+----------------------------------+
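
The three contexts in the table can be seen together in one sketch. The sample sentence and the replacement text are invented for this example.

```python
import re

quoted = re.compile(r'''(?P<quote>['"]).*?(?P=quote)''')

# Context 1: (?P=quote) inside the pattern requires the same closing quote.
m = quoted.search('say "hello"')
whole = m.group(0)

# Context 2: the match object exposes the group by name or by number.
by_name = m.group('quote')
by_number = m.group(1)
end_of_quote = m.end('quote')

# Context 3: \g<quote> refers to the group in a replacement string.
swapped = quoted.sub(r'\g<quote>...\g<quote>', 'say "hello"')
```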

``(?P=name)``
   A backreference to a named group; it matches whatever text was matched by the
   earlier group named *name*.

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a :dfn:`lookahead assertion`. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

``(?!...)``
   Matches if ``...`` doesn't match next. This is a :dfn:`negative lookahead assertion`.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.
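
Both lookahead forms can be tried with the patterns from the text; the second name is only a contrasting sample input.

```python
import re

# Positive lookahead: 'Asimov' must follow, but is not part of the match.
positive_hit = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
positive_miss = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')

# Negative lookahead: 'Asimov' must not follow.
negative_hit = re.search(r'Isaac (?!Asimov)', 'Isaac Newton')
negative_miss = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')
```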

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
   patterns which start with positive lookbehind assertions will not match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function:

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen:

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

   .. versionchanged:: 3.5
      Added support for group references of fixed length.

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length. Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.
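
A negative lookbehind can be sketched in the same spirit as the hyphen example above. The sample string is made up for illustration.

```python
import re

# Find words that do not directly follow a hyphen.
unhyphenated = re.findall(r'(?<!-)\b\w+', 'spam-egg ham')

# Unlike a positive lookbehind, this can succeed at the very start
# of the string, so match() works as well.
at_start = re.match(r'(?<!-)\w+', 'spam-egg').group()
```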

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or
   *name* exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
   optional and can be omitted. For example,
   ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
   will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
   not with ``'<user@host.com'`` nor ``'user@host.com>'``.
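
The conditional pattern from the text can be checked against its four sample inputs. This is a sketch; the dictionary name is illustrative.

```python
import re

email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

# If group 1 (the opening '<') matched, a closing '>' is required;
# otherwise the address must run to the end of the string.
results = {
    text: email.match(text) is not None
    for text in ('<user@host.com>', 'user@host.com',
                 '<user@host.com', 'user@host.com>')
}
```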


The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character. For example, ``\$`` matches the
character ``'$'``.

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'thethe'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.
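
A short sketch of numbered backreferences and the octal interpretation; the octal example (``\101`` for ``'A'``) is an added illustration, not taken from the text.

```python
import re

repeated = re.compile(r'(.+) \1')

words = repeated.search('the the')
digits = repeated.search('55 55')
no_space = repeated.search('thethe')   # the space after the group is required

# Three octal digits denote a character, not a group: 0o101 is 'A'.
octal_char = re.fullmatch(r'\101', 'A')
```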

``\A``
   Matches only at the start of the string.

``\b``
   Matches the empty string, but only at the beginning or end of a word.
   A word is defined as a sequence of word characters. Note that formally,
   ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
   (or vice versa), or between ``\w`` and the beginning/end of the string.
   This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
   ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.

   By default Unicode alphanumerics are the ones used in Unicode patterns, but
   this can be changed by using the :const:`ASCII` flag. Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.
   Inside a character range, ``\b`` represents the backspace character, for
   compatibility with Python's string literals.
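
The word-boundary examples from the text can be verified with a small sketch; the backspace check at the end uses an illustrative string.

```python
import re

word = re.compile(r'\bfoo\b')

matching = [s for s in ('foo', 'foo.', '(foo)', 'bar foo baz')
            if word.search(s)]
rejected = [s for s in ('foobar', 'foo3') if word.search(s)]

# Inside a character set, \b means the backspace character instead.
backspace = re.search(r'[\b]', 'a\bc')
```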

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end
   of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
   ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
   ``\B`` is just the opposite of ``\b``, so word characters in Unicode
   patterns are Unicode alphanumerics or the underscore, although this can
   be changed by using the :const:`ASCII` flag. Word boundaries are
   determined by the current locale if the :const:`LOCALE` flag is used.
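
The same kind of check works for ``\B``, using the sample strings from the text:

```python
import re

inside_word = re.compile(r'py\B')

matching = [s for s in ('python', 'py3', 'py2') if inside_word.search(s)]
rejected = [s for s in ('py', 'py.', 'py!') if inside_word.search(s)]
```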

``\d``
   For Unicode (str) patterns:
      Matches any Unicode decimal digit (that is, any character in
      Unicode character category [Nd]). This includes ``[0-9]``, and
      also many other digit characters. If the :const:`ASCII` flag is
      used only ``[0-9]`` is matched.

   For 8-bit (bytes) patterns:
      Matches any decimal digit; this is equivalent to ``[0-9]``.
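
The effect of the :const:`ASCII` flag on ``\d`` can be sketched as follows. The sample text mixes ASCII digits with two Arabic-Indic digits (written as escapes), chosen here purely as an example of non-ASCII decimal digits.

```python
import re

text = 'room 42, \u0664\u0662'   # ASCII "42" and Arabic-Indic "42"

# Unicode pattern: every decimal digit in category Nd matches.
unicode_digits = re.findall(r'\d+', text)

# With the ASCII flag, only 0-9 match.
ascii_digits = re.findall(r'\d+', text, re.ASCII)

# Bytes pattern: always equivalent to [0-9].
bytes_digits = re.findall(rb'\d+', b'room 42')
```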

``\D``
   Matches any character which is not a decimal digit. This is
   the opposite of ``\d``. If the :const:`ASCII` flag is used this
   becomes the equivalent of ``[^0-9]``.

``\s``
   For Unicode (str) patterns:
      Matches Unicode whitespace characters (which includes
      ``[ \t\n\r\f\v]``, and also many other characters, for example the
      non-breaking spaces mandated by typography rules in many
      languages). If the :const:`ASCII` flag is used, only
      ``[ \t\n\r\f\v]`` is matched.

   For 8-bit (bytes) patterns:
      Matches characters considered whitespace in the ASCII character set;
      this is equivalent to ``[ \t\n\r\f\v]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	438
				439	``\S``
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	440	Matches any character which is not a whitespace character. This is
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	441	the opposite of ``\s``. If the :const:`ASCII` flag is used this
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	442	becomes the equivalent of ``[^ \t\n\r\f\v]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	443
				444	``\w``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	445	For Unicode (str) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	446	Matches Unicode word characters; this includes most characters
				447	that can be part of a word in any language, as well as numbers and
				448	the underscore. If the :const:`ASCII` flag is used, only
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	449	``[a-zA-Z0-9_]`` is matched.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	450
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	451	For 8-bit (bytes) patterns:
				452	Matches characters considered alphanumeric in the ASCII character set;
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	453	this is equivalent to ``[a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
				454	used, matches characters considered alphanumeric in the current locale
				455	and the underscore.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	456
				457	``\W``
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	458	Matches any character which is not a word character. This is
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	459	the opposite of ``\w``. If the :const:`ASCII` flag is used this
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	460	becomes the equivalent of ``[^a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	461	used, matches characters which are neither alphanumeric in the current locale
				462	nor the underscore.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	463
				464	``\Z``
				465	Matches only at the end of the string.
				466
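As a minimal sketch of how the character-class escapes above differ between Unicode matching and the :const:`ASCII` flag, the sample strings below are illustrative assumptions and not part of the original text:

```python
import re

# U+0663 is ARABIC-INDIC DIGIT THREE, a decimal digit in category Nd.
text = "room 42 / \u0663"
unicode_digits = re.findall(r"\d", text)                # str pattern: any Nd digit
ascii_digits = re.findall(r"\d", text, flags=re.ASCII)  # only [0-9]

# U+00A0 (no-break space) counts as whitespace unless re.ASCII is set.
nbsp_default = re.search(r"\s", "\u00a0") is not None
nbsp_ascii = re.search(r"\s", "\u00a0", flags=re.ASCII) is not None

# \w accepts non-ASCII letters by default and only [a-zA-Z0-9_] with re.ASCII.
sample = "na\u00efve caf\u00e9"
words_default = re.findall(r"\w+", sample)
words_ascii = re.findall(r"\w+", sample, flags=re.ASCII)

print(unicode_digits, ascii_digits)
print(nbsp_default, nbsp_ascii)
print(words_default, words_ascii)
```

With :const:`ASCII`, the non-ASCII letters act as separators, so the words are split around them.
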
				467	Most of the standard escapes supported by Python string literals are also
				468	accepted by the regular expression parser::
				469
				470	\a \b \f \n
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	471	\r \t \u \U
				472	\v \x \\
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	473
Ezio Melotti	285e51b	2012-04-29 04:52:30 +0300	[diff] [blame]	474	(Note that ``\b`` is used to represent word boundaries, and means "backspace"
				475	only inside character classes.)
				476
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	477	``'\u'`` and ``'\U'`` escape sequences are only recognized in Unicode
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	478	patterns. In bytes patterns they are errors.
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	479
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	480	Octal escapes are included in a limited form. If the first digit is a 0, or if
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	481	there are three octal digits, it is considered an octal escape. Otherwise, it is
				482	a group reference. As for string literals, octal escapes are always at most
				483	three digits in length.
				484
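The octal-versus-group-reference rule can be illustrated with a short sketch; the patterns and strings here are made up for demonstration:

```python
import re

# A single digit after the backslash is a group reference.
group_ref = re.fullmatch(r"(ab)\1", "abab")

# Three octal digits form an octal escape: \141 is chr(0o141) == 'a'.
three_digit_octal = re.fullmatch(r"(x)\141", "xa")

# A leading 0 also marks an octal escape: \07 is the BEL character.
leading_zero_octal = re.fullmatch(r"(x)\07", "x\x07")

print(group_ref.group(0), three_digit_octal.group(0), repr(leading_zero_octal.group(0)))
```
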
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	485	.. versionchanged:: 3.3
				486	The ``'\u'`` and ``'\U'`` escape sequences have been added.
				487
Serhiy Storchaka	9bd85b8	2016-06-11 19:15:00 +0300	[diff] [blame]	488	.. versionchanged:: 3.6
Martin Panter	98e9051	2016-06-12 06:17:29 +0000	[diff] [blame]	489	Unknown escapes consisting of ``'\'`` and an ASCII letter are now errors.
Serhiy Storchaka	a54aae0	2015-03-24 22:58:14 +0200	[diff] [blame]	490
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	491
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	492
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	493	.. _contents-of-module-re:
				494
				495	Module Contents
				496	---------------
				497
				498	The module defines several functions, constants, and an exception. Some of the
				499	functions are simplified versions of the full-featured methods for compiled
				500	regular expressions. Most non-trivial applications use the compiled
				501	form.
				502
Ethan Furman	c88c80b	2016-11-21 08:29:31 -0800	[diff] [blame]	503	.. versionchanged:: 3.6
				504	Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
				505	:class:`enum.IntFlag`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	506
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	507	.. function:: compile(pattern, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	508
Henk-Jaap Wagenaar	ed94a8b	2017-08-28 06:41:20 +0100	[diff] [blame]	509	Compile a regular expression pattern into a :ref:`regular expression object
				510	<re-objects>`, which can be used for matching using its
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	511	:func:`~Pattern.match`, :func:`~Pattern.search` and other methods, described
Henk-Jaap Wagenaar	ed94a8b	2017-08-28 06:41:20 +0100	[diff] [blame]	512	below.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	513
				514	The expression's behaviour can be modified by specifying a flags value.
				515	Values can be any of the following variables, combined using bitwise OR (the
				516	``|`` operator).
				517
				518	The sequence ::
				519
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	520	prog = re.compile(pattern)
				521	result = prog.match(string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	522
				523	is equivalent to ::
				524
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	525	result = re.match(pattern, string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	526
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	527	but using :func:`re.compile` and saving the resulting regular expression
				528	object for reuse is more efficient when the expression will be used several
				529	times in a single program.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	530
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	531	.. note::
				532
				533	The compiled versions of the most recent patterns passed to
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	534	:func:`re.compile` and the module-level matching functions are cached, so
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	535	programs that use only a few regular expressions at a time needn't worry
				536	about compiling regular expressions.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	537
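A small sketch of compiling once with combined flags and reusing the resulting object; the pattern and input text are illustrative assumptions:

```python
import re

# Flags are combined with bitwise OR.
word_at_line_start = re.compile(r"^spam\b", re.IGNORECASE | re.MULTILINE)

lines = "Spam and eggs\nham\nSPAM again"

# Reuse the compiled object as often as needed.
hits = [m.group(0) for m in word_at_line_start.finditer(lines)]

# The module-level function gives the same result for a one-off call.
same = re.findall(r"^spam\b", lines, re.IGNORECASE | re.MULTILINE)

print(hits, same)
```
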
				538
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	539	.. data:: A
				540	ASCII
				541
Georg Brandl	4049ce0	2009-06-08 07:49:54 +0000	[diff] [blame]	542	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
				543	perform ASCII-only matching instead of full Unicode matching. This is only
				544	meaningful for Unicode patterns, and is ignored for byte patterns.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	545	Corresponds to the inline flag ``(?a)``.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	546
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	547	Note that for backward compatibility, the :const:`re.U` flag still
				548	exists (as well as its synonym :const:`re.UNICODE` and its embedded
Georg Brandl	ebeb44d	2010-07-29 11:15:36 +0000	[diff] [blame]	549	counterpart ``(?u)``), but these are redundant in Python 3 since
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	550	matches are Unicode by default for strings (and Unicode matching
				551	isn't allowed for bytes).
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	552
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	553
Sandro Tosi	da785fd	2012-01-01 12:55:20 +0100	[diff] [blame]	554	.. data:: DEBUG
				555
				556	Display debug information about the compiled expression.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	557	No corresponding inline flag.
Sandro Tosi	da785fd	2012-01-01 12:55:20 +0100	[diff] [blame]	558
				559
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	560	.. data:: I
				561	IGNORECASE
				562
Brian Ward	c9d6dbc	2017-05-24 00:03:38 -0700	[diff] [blame]	563	Perform case-insensitive matching; expressions like ``[A-Z]`` will also
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	564	match lowercase letters. Full Unicode matching (such as ``Ü`` matching
				565	``ü``) also works unless the :const:`re.ASCII` flag is used to disable
				566	non-ASCII matches. The current locale does not change the effect of this
				567	flag unless the :const:`re.LOCALE` flag is also used.
				568	Corresponds to the inline flag ``(?i)``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	569
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	570	Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
				571	combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
				572	letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
				573	letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
				574	'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
				575	If the :const:`ASCII` flag is used, only letters 'a' to 'z'
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	576	and 'A' to 'Z' are matched.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	577
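The interaction between :const:`IGNORECASE` and :const:`ASCII` described above can be checked with a sketch like the following; the variable names are hypothetical:

```python
import re

# The four non-ASCII letters named above: U+0130, U+0131, U+017F, U+212A.
extra_letters = ["\u0130", "\u0131", "\u017f", "\u212a"]

# Unicode case-insensitive matching lets [a-z] match these letters.
unicode_hits = [c for c in extra_letters
                if re.fullmatch(r"[a-z]", c, re.IGNORECASE)]

# Adding re.ASCII restricts [a-z] to the 52 ASCII letters.
ascii_hits = [c for c in extra_letters
              if re.fullmatch(r"[a-z]", c, re.IGNORECASE | re.ASCII)]

print(len(unicode_hits), len(ascii_hits))
```
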
				578	.. data:: L
				579	LOCALE
				580
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	581	Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
				582	dependent on the current locale. This flag can be used only with bytes
				583	patterns. The use of this flag is discouraged: the locale mechanism
				584	is very unreliable, it handles only one "culture" at a time, and it
				585	works only with 8-bit locales. Unicode matching is already enabled by default
				586	in Python 3 for Unicode (str) patterns, and it is able to handle different
				587	locales/languages.
				588	Corresponds to the inline flag ``(?L)``.
Serhiy Storchaka	22a309a	2014-12-01 11:50:07 +0200	[diff] [blame]	589
Serhiy Storchaka	9bd85b8	2016-06-11 19:15:00 +0300	[diff] [blame]	590	.. versionchanged:: 3.6
				591	:const:`re.LOCALE` can be used only with bytes patterns and is
				592	not compatible with :const:`re.ASCII`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	593
Serhiy Storchaka	898ff03	2017-05-05 08:53:40 +0300	[diff] [blame]	594	.. versionchanged:: 3.7
				595	Compiled regular expression objects with the :const:`re.LOCALE` flag no
				596	longer depend on the locale at compile time. Only the locale at
				597	matching time affects the result of matching.
				598
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	599
				600	.. data:: M
				601	MULTILINE
				602
				603	When specified, the pattern character ``'^'`` matches at the beginning of the
				604	string and at the beginning of each line (immediately following each newline);
				605	and the pattern character ``'$'`` matches at the end of the string and at the
				606	end of each line (immediately preceding each newline). By default, ``'^'``
				607	matches only at the beginning of the string, and ``'$'`` only at the end of the
				608	string and immediately before the newline (if any) at the end of the string.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	609	Corresponds to the inline flag ``(?m)``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	610
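To make the effect of :const:`MULTILINE` on ``'^'`` and ``'$'`` concrete, here is a minimal sketch with an assumed two-line input:

```python
import re

text = "first line\nsecond line\n"

# Default: '^' matches only at the very start of the string.
starts_default = re.findall(r"^\w+", text)
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)

# Default: '$' matches at the end and just before a trailing newline.
ends_default = re.findall(r"\w+$", text)
ends_multiline = re.findall(r"\w+$", text, re.MULTILINE)

print(starts_default, starts_multiline)
print(ends_default, ends_multiline)
```
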
				611
				612	.. data:: S
				613	DOTALL
				614
				615	Make the ``'.'`` special character match any character at all, including a
				616	newline; without this flag, ``'.'`` will match anything except a newline.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	617	Corresponds to the inline flag ``(?s)``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	618
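A short sketch of the difference :const:`DOTALL` makes; the markup-like sample string is only an illustration:

```python
import re

# Without DOTALL, '.' stops at a newline.
no_flag = re.match(r"a.b", "a\nb")
with_flag = re.match(r"a.b", "a\nb", re.DOTALL)

# DOTALL lets '.*' span several lines.
body = re.search(r"<p>(.*)</p>", "<p>one\ntwo</p>", re.DOTALL).group(1)

print(no_flag, with_flag is not None, repr(body))
```
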
				619
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	620	.. data:: X
				621	VERBOSE
				622
Zachary Ware	71a0b43	2015-11-11 23:32:14 -0600	[diff] [blame]	623	This flag allows you to write regular expressions that look nicer and are
				624	more readable by allowing you to visually separate logical sections of the
				625	pattern and add comments. Whitespace within the pattern is ignored, except
Serhiy Storchaka	b0b44b4	2017-11-14 17:21:26 +0200	[diff] [blame]	626	when in a character class, or when preceded by an unescaped backslash,
				627	or within tokens like ``*?``, ``(?:`` or ``(?P<...>``.
Zachary Ware	71a0b43	2015-11-11 23:32:14 -0600	[diff] [blame]	628	When a line contains a ``#`` that is not in a character class and is not
				629	preceded by an unescaped backslash, all characters from the leftmost such
				630	``#`` through the end of the line are ignored.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	631
Zachary Ware	71a0b43	2015-11-11 23:32:14 -0600	[diff] [blame]	632	This means that the two following regular expression objects that match a
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	633	decimal number are functionally equal::
Georg Brandl	81ac1ce	2007-08-31 17:17:17 +0000	[diff] [blame]	634
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	635	a = re.compile(r"""\d + # the integral part
				636	\. # the decimal point
				637	\d * # some fractional digits""", re.X)
				638	b = re.compile(r"\d+\.\d*")
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	639
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	640	Corresponds to the inline flag ``(?x)``.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	641
				642
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	643	.. function:: search(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	644
Terry Jan Reedy	0edb5c1	2014-05-30 16:19:59 -0400	[diff] [blame]	645	Scan through string looking for the first location where the regular expression
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	646	pattern produces a match, and return a corresponding :ref:`match object
				647	<match-objects>`. Return ``None`` if no position in the string matches the
				648	pattern; note that this is different from finding a zero-length match at some
				649	point in the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	650
				651
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	652	.. function:: match(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	653
				654	If zero or more characters at the beginning of string match the regular
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	655	expression pattern, return a corresponding :ref:`match object
				656	<match-objects>`. Return ``None`` if the string does not match the pattern;
				657	note that this is different from a zero-length match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	658
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	659	Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
				660	at the beginning of the string and not at the beginning of each line.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	661
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	662	If you want to locate a match anywhere in string, use :func:`search`
				663	instead (see also :ref:`search-vs-match`).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	664
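The following sketch contrasts :func:`match` and :func:`search`, including the :const:`MULTILINE` case and a zero-length match; the input is an assumed example:

```python
import re

text = "alpha\nbeta"

# match() only tries at the beginning of the string.
match_plain = re.match(r"beta", text)
search_plain = re.search(r"beta", text)

# Even with MULTILINE, match() does not try the start of later lines.
match_multiline = re.match(r"^beta", text, re.MULTILINE)
search_multiline = re.search(r"^beta", text, re.MULTILINE)

# A zero-length match is a match object, not None.
empty_match = re.match(r"x*", text)

print(match_plain, search_plain.span(), match_multiline, search_multiline.start())
print(repr(empty_match.group(0)))
```
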
				665
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	666	.. function:: fullmatch(pattern, string, flags=0)
				667
				668	If the whole string matches the regular expression pattern, return a
				669	corresponding :ref:`match object <match-objects>`. Return ``None`` if the
				670	string does not match the pattern; note that this is different from a
				671	zero-length match.
				672
				673	.. versionadded:: 3.4
				674
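As a brief sketch of how :func:`fullmatch` differs from :func:`match`, using made-up inputs:

```python
import re

# match() accepts a matching prefix; fullmatch() requires the whole string.
prefix_only = re.match(r"\d+", "123abc")
whole_fails = re.fullmatch(r"\d+", "123abc")
whole_ok = re.fullmatch(r"\d+", "123")

# An empty string can still match fully; this zero-length match is not None.
empty_ok = re.fullmatch(r"\d*", "")

print(prefix_only.group(0), whole_fails, whole_ok.group(0), empty_ok is not None)
```
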
				675
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	676	.. function:: split(pattern, string, maxsplit=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	677
				678	Split string by the occurrences of pattern. If capturing parentheses are
				679	used in pattern, then the text of all groups in the pattern are also returned
				680	as part of the resulting list. If maxsplit is nonzero, at most maxsplit
				681	splits occur, and the remainder of the string is returned as the final element
Georg Brandl	9647389	2008-03-06 07:09:43 +0000	[diff] [blame]	682	of the list. ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	683
Serhiy Storchaka	c615be5	2017-11-28 22:51:38 +0200	[diff] [blame]	684	>>> re.split(r'\W+', 'Words, words, words.')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	685	['Words', 'words', 'words', '']
Serhiy Storchaka	c615be5	2017-11-28 22:51:38 +0200	[diff] [blame]	686	>>> re.split(r'(\W+)', 'Words, words, words.')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	687	['Words', ', ', 'words', ', ', 'words', '.', '']
Serhiy Storchaka	c615be5	2017-11-28 22:51:38 +0200	[diff] [blame]	688	>>> re.split(r'\W+', 'Words, words, words.', 1)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	689	['Words', 'words, words.']
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	690	>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
				691	['0', '3', '9']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	692
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	693	If there are capturing groups in the separator and it matches at the start of
				694	the string, the result will start with an empty string. The same holds for
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	695	the end of the string::
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	696
Serhiy Storchaka	c615be5	2017-11-28 22:51:38 +0200	[diff] [blame]	697	>>> re.split(r'(\W+)', '...words, words...')
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	698	['', '...', 'words', ', ', 'words', '...', '']
				699
				700	That way, separator components are always found at the same relative
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	701	indices within the result list.
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	702
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	703	Empty matches for the pattern split the string only when not adjacent
				704	to a previous empty match.
Thomas Wouters	89d996e	2007-09-08 17:39:28 +0000	[diff] [blame]	705
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	706	>>> re.split(r'\b', 'Words, words, words.')
				707	['', 'Words', ', ', 'words', ', ', 'words', '.']
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	708	>>> re.split(r'\W*', '...words...')
				709	['', '', 'w', 'o', 'r', 'd', 's', '', '']
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	710	>>> re.split(r'(\W*)', '...words...')
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	711	['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	712
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	713	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	714	Added the optional flags argument.
				715
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	716	.. versionchanged:: 3.7
				717	Added support for splitting on a pattern that could match an empty string.
				718
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	719
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	720	.. function:: findall(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	721
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	722	Return all non-overlapping matches of pattern in string, as a list of
Georg Brandl	3dbca81	2008-07-23 16:10:53 +0000	[diff] [blame]	723	strings. The string is scanned left-to-right, and matches are returned in
				724	the order found. If one or more groups are present in the pattern, return a
				725	list of groups; this will be a list of tuples if the pattern has more than
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	726	one group. Empty matches are included in the result.
				727
				728	.. versionchanged:: 3.7
				729	Non-empty matches can now start just after a previous empty match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	730
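The shape of the :func:`findall` result depends on how many groups the pattern has. A minimal sketch, with an assumed sample string:

```python
import re

text = "set width=20 and height=10"

# No groups: a list of whole matches.
no_groups = re.findall(r"\d+", text)

# One group: a list of that group's text.
one_group = re.findall(r"\w+=(\d+)", text)

# Several groups: a list of tuples, one per match.
two_groups = re.findall(r"(\w+)=(\d+)", text)

print(no_groups, one_group, two_groups)
```
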
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	731
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	732	.. function:: finditer(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	733
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	734	Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
				735	all non-overlapping matches for the RE pattern in string. The string
				736	is scanned left-to-right, and matches are returned in the order found. Empty
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	737	matches are included in the result.
				738
				739	.. versionchanged:: 3.7
				740	Non-empty matches can now start just after a previous empty match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	741
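Because :func:`finditer` yields match objects, positions are available along with the matched text. A small illustrative sketch:

```python
import re

# Collect each match together with its (start, end) span.
spans = [(m.group(0), m.span())
         for m in re.finditer(r"\d+", "a1 bb22 ccc333")]

print(spans)
```
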
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	742
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	743	.. function:: sub(pattern, repl, string, count=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	744
				745	Return the string obtained by replacing the leftmost non-overlapping occurrences
				746	of pattern in string by the replacement repl. If the pattern isn't found,
				747	string is returned unchanged. repl can be a string or a function; if it is
				748	a string, any backslash escapes in it are processed. That is, ``\n`` is
Sandro Tosi	6a633bb	2011-08-19 22:54:50 +0200	[diff] [blame]	749	converted to a single newline character, ``\r`` is converted to a carriage return, and
Serhiy Storchaka	a54aae0	2015-03-24 22:58:14 +0200	[diff] [blame]	750	so forth. Unknown escapes such as ``\&`` are left alone. Backreferences, such
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	751	as ``\6``, are replaced with the substring matched by group 6 in the pattern.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	752	For example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	753
				754	>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
				755	... r'static PyObject*\npy_\1(void)\n{',
				756	... 'def myfunc():')
				757	'static PyObject*\npy_myfunc(void)\n{'
				758
				759	If repl is a function, it is called for every non-overlapping occurrence of
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	760	pattern. The function takes a single :ref:`match object <match-objects>`
				761	argument, and returns the replacement string. For example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	762
				763	>>> def dashrepl(matchobj):
				764	...     if matchobj.group(0) == '-': return ' '
				765	...     else: return '-'
				766	>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
				767	'pro--gram files'
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	768	>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
				769	'Baked Beans & Spam'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	770
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	771	The pattern may be a string or a :ref:`pattern object <re-objects>`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	772
				773	The optional argument count is the maximum number of pattern occurrences to be
				774	replaced; count must be a non-negative integer. If omitted or zero, all
				775	occurrences will be replaced. Empty matches for the pattern are replaced only
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	776	when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
				777	``'-a-b--d-'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	778
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	779	In string-type repl arguments, in addition to the character escapes and
				780	backreferences described above,
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	781	``\g<name>`` will use the substring matched by the group named ``name``, as
				782	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
				783	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
				784	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
				785	reference to group 20, not a reference to group 2 followed by the literal
				786	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
				787	substring matched by the RE.
				788
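The ``\g<...>`` forms described above can be sketched as follows; the patterns and input strings are illustrative assumptions:

```python
import re

# Named groups referenced with \g<name>.
swapped = re.sub(r"(?P<year>\d{4})-(?P<month>\d{2})",
                 r"\g<month>/\g<year>", "2017-10")

# \g<2>0 is group 2 followed by a literal '0'.
group_then_zero = re.sub(r"(a)(b)", r"\g<2>0", "ab")

# \g<0> stands for the entire match.
bracketed = re.sub(r"\w+", r"[\g<0>]", "ab cd")

# \20 is read as group 20, which does not exist in this pattern.
try:
    re.sub(r"(a)(b)", r"\20", "ab")
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True

print(swapped, group_then_zero, bracketed, ambiguous_failed)
```
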
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	789	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	790	Added the optional flags argument.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	791
Serhiy Storchaka	7438e4b	2014-10-10 11:06:31 +0300	[diff] [blame]	792	.. versionchanged:: 3.5
				793	Unmatched groups are replaced with an empty string.
				794
Serhiy Storchaka	9bd85b8	2016-06-11 19:15:00 +0300	[diff] [blame]	795	.. versionchanged:: 3.6
Serhiy Storchaka	53c53ea	2016-12-06 19:15:29 +0200	[diff] [blame]	796	Unknown escapes in pattern consisting of ``'\'`` and an ASCII letter
				797	are now errors.
				798
Serhiy Storchaka	ff3dbe9	2016-12-06 19:25:19 +0200	[diff] [blame]	799	.. versionchanged:: 3.7
				800	Unknown escapes in repl consisting of ``'\'`` and an ASCII letter
				801	are now errors.
Serhiy Storchaka	a54aae0	2015-03-24 22:58:14 +0200	[diff] [blame]	802
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	803	Empty matches for the pattern are replaced when adjacent to a previous
				804	non-empty match.
				805
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	806
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	807	.. function:: subn(pattern, repl, string, count=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	808
				809	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
				810	number_of_subs_made)``.
				811
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	812	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	813	Added the optional flags argument.
				814
Serhiy Storchaka	7438e4b	2014-10-10 11:06:31 +0300	[diff] [blame]	815	.. versionchanged:: 3.5
				816	Unmatched groups are replaced with an empty string.
				817
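A short sketch of the tuple returned by :func:`subn`, using a made-up input string:

```python
import re

# Replace every occurrence and report how many were replaced.
all_replaced = re.subn(r"-", "+", "a-b-c")

# Limit the number of replacements with count.
first_only = re.subn(r"-", "+", "a-b-c", count=1)

# No match: the string is unchanged and the count is 0.
untouched = re.subn(r"x", "y", "abc")

print(all_replaced, first_only, untouched)
```
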
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	818
Serhiy Storchaka	8fc7bc2	2017-04-13 19:17:36 +0300	[diff] [blame]	819	.. function:: escape(pattern)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	820
Serhiy Storchaka	5908300	2017-04-13 21:06:43 +0300	[diff] [blame]	821	Escape special characters in pattern.
Ezio Melotti	88fdeb4	2011-04-10 12:59:16 +0300	[diff] [blame]	822	This is useful if you want to match an arbitrary literal string that may
Serhiy Storchaka	8fc7bc2	2017-04-13 19:17:36 +0300	[diff] [blame]	823	have regular expression metacharacters in it. For example::
				824
				825	>>> print(re.escape('python.exe'))
				826	python\.exe
				827
				828	>>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
				829	>>> print('[%s]+' % re.escape(legal_chars))
Serhiy Storchaka	05cb728	2017-11-16 12:38:26 +0200	[diff] [blame]	830	[abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+
Serhiy Storchaka	8fc7bc2	2017-04-13 19:17:36 +0300	[diff] [blame]	831
				832	>>> operators = ['+', '-', '*', '/', '**']
				833	>>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
Serhiy Storchaka	5908300	2017-04-13 21:06:43 +0300	[diff] [blame]	834	/|\-|\+|\*\*|\*
Ezio Melotti	88fdeb4	2011-04-10 12:59:16 +0300	[diff] [blame]	835
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	836	This function must not be used for the replacement string in :func:`sub`
				837	and :func:`subn`; only backslashes should be escaped. For example::
				838
				839	>>> digits_re = r'\d+'
				840	>>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
				841	>>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
				842	/usr/sbin/sendmail - \d+ errors, \d+ warnings
				843
Ezio Melotti	88fdeb4	2011-04-10 12:59:16 +0300	[diff] [blame]	844	.. versionchanged:: 3.3
				845	The ``'_'`` character is no longer escaped.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	846
Serhiy Storchaka	5908300	2017-04-13 21:06:43 +0300	[diff] [blame]	847	.. versionchanged:: 3.7
				848	Only characters that can have special meaning in a regular expression
				849	are escaped.
				850
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	851
R. David Murray	522c32a	2010-07-10 14:23:36 +0000	[diff] [blame]	852	.. function:: purge()
				853
				854	Clear the regular expression cache.
				855
				856
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	857	.. exception:: error(msg, pattern=None, pos=None)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	858
				859	Exception raised when a string passed to one of the functions here is not a
				860	valid regular expression (for example, it might contain unmatched parentheses)
				861	or when some other error occurs during compilation or matching. It is never an
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	862	error if a string contains no match for a pattern. The error instance has
				863	the following additional attributes:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	864
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	865	.. attribute:: msg
				866
				867	The unformatted error message.
				868
				869	.. attribute:: pattern
				870
				871	The regular expression pattern.
				872
				873	.. attribute:: pos
				874
Serhiy Storchaka	12d6b5d	2017-05-27 16:12:48 +0300	[diff] [blame]	875	The index in pattern where compilation failed (may be ``None``).
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	876
				877	.. attribute:: lineno
				878
Serhiy Storchaka	12d6b5d	2017-05-27 16:12:48 +0300	[diff] [blame]	879	The line corresponding to pos (may be ``None``).
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	880
				881	.. attribute:: colno
				882
Serhiy Storchaka	12d6b5d	2017-05-27 16:12:48 +0300	[diff] [blame]	883	The column corresponding to pos (may be ``None``).
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	884
				885	.. versionchanged:: 3.5
				886	Added additional attributes.
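A minimal sketch of how these attributes might be inspected. The invalid pattern ``(unclosed`` and the name ``details`` are illustrative choices only::

```python
import re

try:
    re.compile(r"(unclosed")  # deliberately invalid: unbalanced parenthesis
except re.error as exc:
    # Copy the attributes out, because ``exc`` is unbound after the block.
    details = {
        "msg": exc.msg,          # unformatted message, without the position
        "pattern": exc.pattern,  # the pattern that failed to compile
        "pos": exc.pos,          # index in the pattern where compilation failed
        "lineno": exc.lineno,    # line corresponding to pos
        "colno": exc.colno,      # column corresponding to pos
    }

print(details)
```

Any of ``pos``, ``lineno`` and ``colno`` may be ``None``, so code that reports them should allow for that.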
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	887
				888	.. _re-objects:
				889
				890	Regular Expression Objects
				891	--------------------------
				892
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	893	Compiled regular expression objects support the following methods and
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	894	attributes:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	895
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	896	.. method:: Pattern.search(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	897
Berker Peksag	84f387d	2016-06-08 14:56:56 +0300	[diff] [blame]	898	Scan through *string* looking for the first location where this regular
				899	expression produces a match, and return a corresponding :ref:`match object
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	900	<match-objects>`. Return ``None`` if no position in the string matches the
				901	pattern; note that this is different from finding a zero-length match at some
				902	point in the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	903
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	904	The optional second parameter *pos* gives an index in the string where the
				905	search is to start; it defaults to ``0``. This is not completely equivalent to
				906	slicing the string; the ``'^'`` pattern character matches at the real beginning
				907	of the string and at positions just after a newline, but not necessarily at the
				908	index where the search is to start.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	909
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	910	The optional parameter *endpos* limits how far the string will be searched; it
				911	will be as if the string is *endpos* characters long, so only the characters
				912	from *pos* to ``endpos - 1`` will be searched for a match. If *endpos* is less
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	913	than *pos*, no match will be found; otherwise, if *rx* is a compiled regular
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	914	expression object, ``rx.search(string, 0, 50)`` is equivalent to
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	915	``rx.search(string[:50], 0)``. ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	916
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	917	>>> pattern = re.compile("d")
				918	>>> pattern.search("dog") # Match at index 0
				919	<re.Match object; span=(0, 1), match='d'>
				920	>>> pattern.search("dog", 1) # No match; search doesn't include the "d"
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	921
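The difference between passing a start index and slicing can be sketched as follows. The names ``anchored``, ``with_pos``, ``with_slice`` and ``limited`` are illustrative only::

```python
import re

anchored = re.compile(r"^b")
text = "ab"

# Starting the search at index 1 does not turn index 1 into the
# beginning of the string, so '^' cannot match there.
with_pos = anchored.search(text, 1)

# Slicing builds a new string whose real beginning is "b".
with_slice = anchored.search(text[1:])

# With endpos=2 the string is treated as if it were 2 characters long,
# so the "c" at index 2 is never examined.
limited = re.compile("c").search("abc", 0, 2)

print(with_pos, with_slice, limited)
```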
				922
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	923	.. method:: Pattern.match(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	924
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	925	If zero or more characters at the beginning of *string* match this regular
				926	expression, return a corresponding :ref:`match object <match-objects>`.
				927	Return ``None`` if the string does not match the pattern; note that this is
				928	different from a zero-length match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	929
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	930	The optional pos and endpos parameters have the same meaning as for the
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	931	:meth:`~Pattern.search` method. ::
Benjamin Peterson	d7c3ed5	2010-06-27 22:32:30 +0000	[diff] [blame]	932
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	933	>>> pattern = re.compile("o")
				934	>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
				935	>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
				936	<re.Match object; span=(1, 2), match='o'>
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	937
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	938	If you want to locate a match anywhere in *string*, use
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	939	:meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	940
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	941
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	942	.. method:: Pattern.fullmatch(string[, pos[, endpos]])
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	943
				944	If the whole string matches this regular expression, return a corresponding
				945	:ref:`match object <match-objects>`. Return ``None`` if the string does not
				946	match the pattern; note that this is different from a zero-length match.
				947
				948	The optional pos and endpos parameters have the same meaning as for the
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	949	:meth:`~Pattern.search` method. ::
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	950
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	951	>>> pattern = re.compile("o[gh]")
				952	>>> pattern.fullmatch("dog") # No match as "o" is not at the start of "dog".
				953	>>> pattern.fullmatch("ogre") # No match as not the full string matches.
				954	>>> pattern.fullmatch("doggie", 1, 3) # Matches within given limits.
				955	<re.Match object; span=(1, 3), match='og'>
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	956
				957	.. versionadded:: 3.4
				958
				959
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	960	.. method:: Pattern.split(string, maxsplit=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	961
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	962	Identical to the :func:`split` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	963
				964
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	965	.. method:: Pattern.findall(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	966
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	967	Similar to the :func:`findall` function, using the compiled pattern, but
				968	also accepts optional pos and endpos parameters that limit the search
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	969	region like for :meth:`search`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	970
				971
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	972	.. method:: Pattern.finditer(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	973
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	974	Similar to the :func:`finditer` function, using the compiled pattern, but
				975	also accepts optional pos and endpos parameters that limit the search
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	976	region like for :meth:`search`.
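A short sketch of how the optional start and end indices narrow the region scanned by the compiled-pattern versions of ``findall`` and ``finditer``. The sample text and variable names are made up for illustration::

```python
import re

digits = re.compile(r"\d+")
text = "a1 b22 c333"

all_numbers = digits.findall(text)          # whole string
from_index_2 = digits.findall(text, 2)      # skips the "1" at index 1
up_to_index_9 = digits.findall(text, 2, 9)  # string treated as 9 chars long

# finditer yields match objects, so positions are available as well.
spans = [m.span() for m in digits.finditer(text, 2)]

print(all_numbers, from_index_2, up_to_index_9, spans)
```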
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	977
				978
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	979	.. method:: Pattern.sub(repl, string, count=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	980
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	981	Identical to the :func:`sub` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	982
				983
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	984	.. method:: Pattern.subn(repl, string, count=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	985
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	986	Identical to the :func:`subn` function, using the compiled pattern.
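As an illustration of the three methods above, a compiled pattern can be reused for splitting and substitution. The separator pattern and sample text here are arbitrary examples::

```python
import re

comma = re.compile(r"\s*,\s*")  # a comma with optional surrounding spaces
text = "a , b,c"

parts = comma.split(text)                    # split on every separator
first_split = comma.split(text, maxsplit=1)  # split at most once
joined = comma.sub("; ", text)               # replace every separator
once = comma.subn("; ", text, count=1)       # (new string, replacements made)

print(parts, first_split, joined, once)
```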
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	987
				988
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	989	.. attribute:: Pattern.flags
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	990
Georg Brandl	3a19e54	2012-03-17 17:29:27 +0100	[diff] [blame]	991	The regex matching flags. This is a combination of the flags given to
				992	:func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
				993	flags such as :data:`UNICODE` if the pattern is a Unicode string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	994
				995
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	996	.. attribute:: Pattern.groups
Georg Brandl	af265f4	2008-12-07 15:06:20 +0000	[diff] [blame]	997
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	998	The number of capturing groups in the pattern.
Georg Brandl	af265f4	2008-12-07 15:06:20 +0000	[diff] [blame]	999
				1000
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1001	.. attribute:: Pattern.groupindex
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1002
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1003	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
				1004	numbers. The dictionary is empty if no symbolic groups were used in the
				1005	pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1006
				1007
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1008	.. attribute:: Pattern.pattern
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1009
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1010	The pattern string from which the pattern object was compiled.
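The attributes above can be inspected on any compiled pattern. The following sketch uses a made-up date pattern, and all variable names are illustrative::

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})", re.IGNORECASE)

group_count = date.groups        # named and unnamed capturing groups
names = dict(date.groupindex)    # only the named groups appear here
has_ignorecase = bool(date.flags & re.IGNORECASE)  # flag given to compile()
has_unicode = bool(date.flags & re.UNICODE)        # implicit for str patterns
source = date.pattern            # the original pattern string

print(group_count, names, has_ignorecase, has_unicode, source)
```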
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1011
				1012
Serhiy Storchaka	fdbd011	2017-04-16 10:16:03 +0300	[diff] [blame]	1013	.. versionchanged:: 3.7
				1014	Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Compiled
				1015	regular expression objects are considered atomic.
				1016
				1017
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1018	.. _match-objects:
				1019
				1020	Match Objects
				1021	-------------
				1022
Ezio Melotti	b87f82f	2012-11-04 06:59:22 +0200	[diff] [blame]	1023	Match objects always have a boolean value of ``True``.
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1024	Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
Ezio Melotti	b87f82f	2012-11-04 06:59:22 +0200	[diff] [blame]	1025	when there is no match, you can test whether there was a match with a simple
				1026	``if`` statement::
				1027
				1028	match = re.search(pattern, string)
				1029	if match:
				1030	    process(match)
				1031
				1032	Match objects support the following methods and attributes:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1033
				1034
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1035	.. method:: Match.expand(template)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1036
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1037	Return the string obtained by doing backslash substitution on the template
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1038	string *template*, as done by the :meth:`~Pattern.sub` method.
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1039	Escapes such as ``\n`` are converted to the appropriate characters,
				1040	and numeric backreferences (``\1``, ``\2``) and named backreferences
				1041	(``\g<1>``, ``\g<name>``) are replaced by the contents of the
				1042	corresponding group.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1043
Serhiy Storchaka	7438e4b	2014-10-10 11:06:31 +0300	[diff] [blame]	1044	.. versionchanged:: 3.5
				1045	Unmatched groups are replaced with an empty string.
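A brief sketch of template expansion with numeric and named backreferences. The group names ``first`` and ``last`` are chosen for this example only::

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

by_number = m.expand(r"\2, \1")                # numeric backreferences
by_name = m.expand(r"\g<last>, \g<first>")     # named backreferences
with_newline = m.expand(r"\1\n\2")             # \n becomes a real newline

print(by_number, by_name, repr(with_newline))
```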
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1046
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1047	.. method:: Match.group([group1, ...])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1048
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1049	Returns one or more subgroups of the match. If there is a single argument, the
				1050	result is a single string; if there are multiple arguments, the result is a
				1051	tuple with one item per argument. Without arguments, *group1* defaults to zero
				1052	(the whole match is returned). If a *groupN* argument is zero, the corresponding
				1053	return value is the entire matching string; if it is in the inclusive range
				1054	[1..99], it is the string matching the corresponding parenthesized group. If a
				1055	group number is negative or larger than the number of groups defined in the
				1056	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
				1057	part of the pattern that did not match, the corresponding result is ``None``.
				1058	If a group is contained in a part of the pattern that matched multiple times,
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1059	the last match is returned. ::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1060
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1061	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
				1062	>>> m.group(0) # The entire match
				1063	'Isaac Newton'
				1064	>>> m.group(1) # The first parenthesized subgroup.
				1065	'Isaac'
				1066	>>> m.group(2) # The second parenthesized subgroup.
				1067	'Newton'
				1068	>>> m.group(1, 2) # Multiple arguments give us a tuple.
				1069	('Isaac', 'Newton')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1070
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1071	If the regular expression uses the ``(?P<name>...)`` syntax, the *groupN*
				1072	arguments may also be strings identifying groups by their group name. If a
				1073	string argument is not used as a group name in the pattern, an :exc:`IndexError`
				1074	exception is raised.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1075
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1076	A moderately complicated example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1077
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1078	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				1079	>>> m.group('first_name')
				1080	'Malcolm'
				1081	>>> m.group('last_name')
				1082	'Reynolds'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1083
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1084	Named groups can also be referred to by their index::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1085
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1086	>>> m.group(1)
				1087	'Malcolm'
				1088	>>> m.group(2)
				1089	'Reynolds'
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1090
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1091	If a group matches multiple times, only the last match is accessible::
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1092
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1093	>>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times.
				1094	>>> m.group(1) # Returns only the last match.
				1095	'c3'
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1096
Brian Curtin	48f16f9	2010-04-08 13:55:29 +0000	[diff] [blame]	1097
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1098	.. method:: Match.__getitem__(g)
Eric V. Smith	605bdae	2016-09-11 08:55:43 -0400	[diff] [blame]	1099
				1100	This is identical to ``m.group(g)``. This allows easier access to
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1101	an individual group from a match::
Eric V. Smith	605bdae	2016-09-11 08:55:43 -0400	[diff] [blame]	1102
				1103	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
				1104	>>> m[0] # The entire match
				1105	'Isaac Newton'
				1106	>>> m[1] # The first parenthesized subgroup.
				1107	'Isaac'
				1108	>>> m[2] # The second parenthesized subgroup.
				1109	'Newton'
				1110
				1111	.. versionadded:: 3.6
				1112
				1113
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1114	.. method:: Match.groups(default=None)
Brian Curtin	48f16f9	2010-04-08 13:55:29 +0000	[diff] [blame]	1115
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1116	Return a tuple containing all the subgroups of the match, from 1 up to however
				1117	many groups are in the pattern. The *default* argument is used for groups that
				1118	did not participate in the match; it defaults to ``None``.
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1119
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1120	For example::
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1121
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1122	>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
				1123	>>> m.groups()
				1124	('24', '1632')
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1125
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1126	If we make the decimal place and everything after it optional, not all groups
				1127	might participate in the match. These groups will default to ``None`` unless
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1128	the *default* argument is given::
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1129
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1130	>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
				1131	>>> m.groups() # Second group defaults to None.
				1132	('24', None)
				1133	>>> m.groups('0') # Now, the second group defaults to '0'.
				1134	('24', '0')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1135
				1136
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1137	.. method:: Match.groupdict(default=None)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1138
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1139	Return a dictionary containing all the named subgroups of the match, keyed by
				1140	the subgroup name. The *default* argument is used for groups that did not
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1141	participate in the match; it defaults to ``None``. For example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1142
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1143	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				1144	>>> m.groupdict()
				1145	{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1146
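The effect of the default value on a named group that does not participate in the match can be sketched like this. The group names ``whole`` and ``fraction`` are illustrative::

```python
import re

# The fractional part is optional, so "fraction" may not participate.
m = re.match(r"(?P<whole>\d+)(?:\.(?P<fraction>\d+))?", "24")

without_default = m.groupdict()     # missing group maps to None
with_default = m.groupdict("0")     # missing group maps to "0"

print(without_default, with_default)
```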
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1147
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1148	.. method:: Match.start([group])
				1149	Match.end([group])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1150
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1151	Return the indices of the start and end of the substring matched by *group*;
				1152	*group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
				1153	*group* exists but did not contribute to the match. For a match object *m*, and
				1154	a group *g* that did contribute to the match, the substring matched by group *g*
				1155	(equivalent to ``m.group(g)``) is ::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1156
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1157	m.string[m.start(g):m.end(g)]
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1158
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1159	Note that ``m.start(group)`` will equal ``m.end(group)`` if *group* matched a
				1160	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
				1161	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
				1162	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
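The cases just described can be checked directly. This sketch also adds a second, made-up pattern to show the ``-1`` result for a group that exists but did not contribute to the match::

```python
import re

m = re.search("b(c?)", "cba")
whole = (m.start(0), m.end(0))        # the match "b" at index 1
empty_group = (m.start(1), m.end(1))  # group 1 matched the empty string

try:
    m.start(2)                        # the pattern has no group 2
    missing_group_raises = False
except IndexError:
    missing_group_raises = True

# Group 2 exists in this pattern but is not used when "a" matches.
unused = re.match(r"(a)|(b)", "a")
unused_start = unused.start(2)

print(whole, empty_group, missing_group_raises, unused_start)
```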
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1163
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1164	An example that will remove *remove_this* from email addresses::
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1165
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1166	>>> email = "tony@tiremove_thisger.net"
				1167	>>> m = re.search("remove_this", email)
				1168	>>> email[:m.start()] + email[m.end():]
				1169	'tony@tiger.net'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1170
				1171
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1172	.. method:: Match.span([group])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1173
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1174	For a match *m*, return the 2-tuple ``(m.start(group), m.end(group))``. Note
				1175	that if *group* did not contribute to the match, this is ``(-1, -1)``.
				1176	*group* defaults to zero, the entire match.
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1177
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1178
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1179	.. attribute:: Match.pos
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1180
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1181	The value of pos which was passed to the :meth:`~Pattern.search` or
				1182	:meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
Georg Brandl	69c7a69	2012-03-14 08:02:43 +0100	[diff] [blame]	1183	the index into the string at which the RE engine started looking for a match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1184
				1185
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1186	.. attribute:: Match.endpos
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1187
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1188	The value of endpos which was passed to the :meth:`~Pattern.search` or
				1189	:meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
Georg Brandl	69c7a69	2012-03-14 08:02:43 +0100	[diff] [blame]	1190	the index into the string beyond which the RE engine will not go.
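A small sketch of how the ``pos`` and ``endpos`` attributes reflect the search limits. The sample text and variable names are illustrative; when the limits are omitted, the values observed are ``0`` and the length of the string::

```python
import re

pattern = re.compile("o")
text = "doggo"

limited = pattern.search(text, 2, 5)  # explicit start and end indices
default = pattern.search(text)        # no limits given

print(limited.pos, limited.endpos, limited.span())
print(default.pos, default.endpos, default.span())
```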
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1191
				1192
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1193	.. attribute:: Match.lastindex
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1194
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1195	The integer index of the last matched capturing group, or ``None`` if no group
				1196	was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
				1197	``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
				1198	the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
				1199	string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1200
				1201
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1202	.. attribute:: Match.lastgroup
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1203
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1204	The name of the last matched capturing group, or ``None`` if the group didn't
				1205	have a name, or if no group was matched at all.
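The ``lastindex`` cases listed above, together with ``lastgroup``, can be verified with a short sketch. The ``token`` pattern and its group names are made up for illustration::

```python
import re

# lastindex for each expression mentioned above, applied to 'ab'.
patterns = (r"(a)b", r"((a)(b))", r"((ab))", r"(a)(b)")
last_indexes = {p: re.match(p, "ab").lastindex for p in patterns}

# lastgroup reports the name of the last matched group, if it has one.
token = re.compile(r"(?P<word>[a-z]+)|(?P<number>\d+)")
kind = token.match("42").lastgroup

unnamed = re.match(r"(a)b", "ab").lastgroup  # group 1 has no name
no_groups = re.match(r"ab", "ab")            # pattern without groups

print(last_indexes, kind, unnamed, no_groups.lastindex, no_groups.lastgroup)
```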
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1206
				1207
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1208	.. attribute:: Match.re
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1209
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1210	The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1211	:meth:`~Pattern.search` method produced this match instance.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1212
				1213
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1214	.. attribute:: Match.string
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1215
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1216	The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1217
				1218
Serhiy Storchaka	fdbd011	2017-04-16 10:16:03 +0300	[diff] [blame]	1219	.. versionchanged:: 3.7
				1220	Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Match objects
				1221	are considered atomic.
				1222
				1223
Raymond Hettinger	1fa7682	2010-12-06 23:31:36 +0000	[diff] [blame]	1224	.. _re-examples:
				1225
				1226	Regular Expression Examples
				1227	---------------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1228
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1229
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	1230	Checking for a Pair
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1231	^^^^^^^^^^^^^^^^^^^
				1232
				1233	In this example, we'll use the following helper function to display match
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1234	objects a little more gracefully:
				1235
				1236	.. testcode::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1237
				1238	def displaymatch(match):
				1239	    if match is None:
				1240	        return None
				1241	    return '<Match: %r, groups=%r>' % (match.group(), match.groups())
				1242
				1243	Suppose you are writing a poker program where a player's hand is represented as
				1244	a 5-character string with each character representing a card, "a" for ace, "k"
Ezio Melotti	e5b2ac8	2011-12-17 01:17:17 +0200	[diff] [blame]	1245	for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1246	representing the card with that value.
				1247
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1248	To see if a given string is a valid hand, one could do the following::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1249
Ezio Melotti	e5b2ac8	2011-12-17 01:17:17 +0200	[diff] [blame]	1250	>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
				1251	>>> displaymatch(valid.match("akt5q")) # Valid.
				1252	"<Match: 'akt5q', groups=()>"
				1253	>>> displaymatch(valid.match("akt5e")) # Invalid.
				1254	>>> displaymatch(valid.match("akt")) # Invalid.
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1255	>>> displaymatch(valid.match("727ak")) # Valid.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1256	"<Match: '727ak', groups=()>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1257
				1258	That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1259	To match this with a regular expression, one could use backreferences as such::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1260
				1261	>>> pair = re.compile(r".*(.).*\1")
				1262	>>> displaymatch(pair.match("717ak")) # Pair of 7s.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1263	"<Match: '717', groups=('7',)>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1264	>>> displaymatch(pair.match("718ak")) # No pairs.
				1265	>>> displaymatch(pair.match("354aa")) # Pair of aces.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1266	"<Match: '354aa', groups=('a',)>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1267
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	1268	To find out what card the pair consists of, one could use the
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1269	:meth:`~Match.group` method of the match object in the following manner:
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1270
				1271	.. doctest::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1272
				1273	>>> pair.match("717ak").group(1)
				1274	'7'
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1275
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1276	# Error because re.match() returns None, which doesn't have a group() method:
				1277	>>> pair.match("718ak").group(1)
				1278	Traceback (most recent call last):
				1279	  File "<pyshell#23>", line 1, in <module>
				1280	    re.match(r".*(.).*\1", "718ak").group(1)
				1281	AttributeError: 'NoneType' object has no attribute 'group'
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1282
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1283	>>> pair.match("354aa").group(1)
				1284	'a'
				1285
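One way to avoid the :exc:`AttributeError` shown above is to test the result before calling ``group()``. The helper ``pair_card`` below is a hypothetical name used only for this sketch::

```python
import re

pair = re.compile(r".*(.).*\1")

def pair_card(hand):
    """Return the card that forms a pair in hand, or None if there is none."""
    match = pair.match(hand)
    # match is None when no card repeats, so guard before calling group().
    return match.group(1) if match else None

results = [pair_card(hand) for hand in ("717ak", "718ak", "354aa")]
print(results)
```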
				1286
				1287	Simulating scanf()
				1288	^^^^^^^^^^^^^^^^^^
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1289
				1290	.. index:: single: scanf()
				1291
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1292	Python does not currently have an equivalent to :c:func:`scanf`. Regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1293	expressions are generally more powerful, though also more verbose, than
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1294	:c:func:`scanf` format strings. The table below offers some more-or-less
				1295	equivalent mappings between :c:func:`scanf` format tokens and regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1296	expressions.
				1297
				1298	+--------------------------------+---------------------------------------------+
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1299	| :c:func:`scanf` Token          | Regular Expression                          |
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1300	+================================+=============================================+
				1301	| ``%c``                         | ``.``                                       |
				1302	+--------------------------------+---------------------------------------------+
				1303	| ``%5c``                        | ``.{5}``                                    |
				1304	+--------------------------------+---------------------------------------------+
				1305	| ``%d``                         | ``[-+]?\d+``                                |
				1306	+--------------------------------+---------------------------------------------+
				1307	| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
				1308	+--------------------------------+---------------------------------------------+
				1309	| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
				1310	+--------------------------------+---------------------------------------------+
Ezio Melotti	a0b1d1e	2012-04-29 11:47:28 +0300	[diff] [blame]	1311	| ``%o``                         | ``[-+]?[0-7]+``                             |
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1312	+--------------------------------+---------------------------------------------+
				1313	| ``%s``                         | ``\S+``                                     |
				1314	+--------------------------------+---------------------------------------------+
				1315	| ``%u``                         | ``\d+``                                     |
				1316	+--------------------------------+---------------------------------------------+
Ezio Melotti	a0b1d1e	2012-04-29 11:47:28 +0300	[diff] [blame]	1317	| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1318	+--------------------------------+---------------------------------------------+
				1319
				1320	To extract the filename and numbers from a string like ::
				1321
				1322	/usr/sbin/sendmail - 0 errors, 4 warnings
				1323
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1324	you would use a :c:func:`scanf` format like ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1325
				1326	%s - %d errors, %d warnings
				1327
				1328	The equivalent regular expression would be ::
				1329
				1330	(\S+) - (\d+) errors, (\d+) warnings
				1331
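As a minimal sketch of how that expression might be applied, the snippet below
runs it with :func:`re.search` and converts the numeric groups with
:func:`int`, much as ``%d`` would. The variable names are illustrative
assumptions, not part of the module.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"

# Groups mirror the scanf tokens: %s -> (\S+), %d -> (\d+).
m = re.search(r"(\S+) - (\d+) errors, (\d+) warnings", line)
if m is None:
    raise ValueError("line does not have the expected format")

filename = m.group(1)
error_count = int(m.group(2))    # groups are strings; convert explicitly
warning_count = int(m.group(3))
print(filename, error_count, warning_count)
```

Unlike :c:func:`scanf`, the regular expression returns every group as a string,
so any numeric conversion has to be done by the caller.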
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1332
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1333	.. _search-vs-match:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1334
				1335	search() vs. match()
				1336	^^^^^^^^^^^^^^^^^^^^
				1337
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1338	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1339
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1340	Python offers two different primitive operations based on regular expressions:
				1341	:func:`re.match` checks for a match only at the beginning of the string, while
				1342	:func:`re.search` checks for a match anywhere in the string (this is what Perl
				1343	does by default).
				1344
				1345	For example::
				1346
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	1347	>>> re.match("c", "abcdef") # No match
				1348	>>> re.search("c", "abcdef") # Match
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1349	<re.Match object; span=(2, 3), match='c'>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1350
Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1353
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	1354	>>> re.match("c", "abcdef") # No match
				1355	>>> re.search("^c", "abcdef") # No match
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1356	>>> re.search("^a", "abcdef") # Match
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1357	<re.Match object; span=(0, 1), match='a'>
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1358
				1359	Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
				1360	beginning of the string, whereas using :func:`search` with a regular expression
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1361	beginning with ``'^'`` will match at the beginning of each line. ::
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1362
				1363	>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
				1364	>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1365	<re.Match object; span=(4, 5), match='X'>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1366
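The following sketch contrasts the three behaviours side by side. It uses the
``\A`` anchor, which matches only at the start of the string regardless of
flags, as an assumed stand-in for what :func:`match` does implicitly; the
variable names are illustrative.

```python
import re

text = "A\nB\nX"

# '^' honours MULTILINE and can match at the start of any line.
caret = re.search(r"^X", text, re.MULTILINE)
# '\A' matches only at the very start of the string, whatever the flags.
anchored = re.search(r"\AX", text, re.MULTILINE)
# match() is likewise tied to the start of the string.
matched = re.match(r"X", text, re.MULTILINE)

print(caret.span(), anchored, matched)
```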
				1367
				1368	Making a Phonebook
				1369	^^^^^^^^^^^^^^^^^^
				1370
:func:`split` splits a string into a list delimited by the passed pattern. The
function is invaluable for converting textual data into data structures that
Python can easily read and modify, as demonstrated in the following example
that creates a phonebook.
				1375
First, here is the input. Normally it would come from a file; here we use
triple-quoted string syntax::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1378
Georg Brandl	557a3ec	2012-03-17 17:26:27 +0100	[diff] [blame]	1379	>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1380	...
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1381	... Ronald Heathmore: 892.345.3428 436 Finley Avenue
				1382	... Frank Burger: 925.541.7625 662 South Dogwood Way
				1383	...
				1384	...
				1385	... Heather Albrecht: 548.326.4584 919 Park Place"""
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1386
				1387	The entries are separated by one or more newlines. Now we convert the string
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1388	into a list with each nonempty line having its own entry:
				1389
				1390	.. doctest::
				1391	:options: +NORMALIZE_WHITESPACE
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1392
Georg Brandl	557a3ec	2012-03-17 17:26:27 +0100	[diff] [blame]	1393	>>> entries = re.split("\n+", text)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1394	>>> entries
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1395	['Ross McFluff: 834.345.1254 155 Elm Street',
				1396	'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
				1397	'Frank Burger: 925.541.7625 662 South Dogwood Way',
				1398	'Heather Albrecht: 548.326.4584 919 Park Place']
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1399
Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
because the address contains spaces, which are also our splitting pattern:
				1403
				1404	.. doctest::
				1405	:options: +NORMALIZE_WHITESPACE
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1406
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1407	>>> [re.split(":? ", entry, 3) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1408	[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
				1409	['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
				1410	['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
				1411	['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
				1412
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1413	The ``:?`` pattern matches the colon after the last name, so that it does not
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1414	occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1415	house number from the street name:
				1416
				1417	.. doctest::
				1418	:options: +NORMALIZE_WHITESPACE
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1419
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1420	>>> [re.split(":? ", entry, 4) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1421	[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
				1422	['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
				1423	['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
				1424	['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
				1425
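One possible next step, sketched here under the assumption that a name-keyed
dictionary is the desired data structure, is to unpack each split entry into
named fields. The ``phonebook`` layout is hypothetical and only meant to show
how the split results could be consumed.

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(r":? ", entry, maxsplit=3)
    phonebook[f"{first} {last}"] = {"phone": phone, "address": address}

print(phonebook["Frank Burger"]["phone"])
```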
				1426
				1427	Text Munging
				1428	^^^^^^^^^^^^
				1429
				1430	:func:`sub` replaces every occurrence of a pattern with a string or the
				1431	result of a function. This example demonstrates using :func:`sub` with
				1432	a function to "munge" text, or randomize the order of all the characters
				1433	in each word of a sentence except for the first and last characters::
				1434
   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
				1444
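Because the shuffle is random, the output differs from run to run. The sketch
below is one way to make it repeatable and to check the stated invariant; the
seeded :class:`random.Random` instance and the seed value are assumptions made
for illustration.

```python
import random
import re

rng = random.Random(0)  # fixed seed: an assumption made for repeatability

def repl(m):
    inner_word = list(m.group(2))
    rng.shuffle(inner_word)
    return m.group(1) + "".join(inner_word) + m.group(3)

text = "Professor Abdolmalek, please report your absences promptly."
munged = re.sub(r"(\w)(\w+)(\w)", repl, text)

# Each word keeps its first character and its overall multiset of characters.
for before, after in zip(text.split(), munged.split()):
    assert before[0] == after[0]
    assert sorted(before) == sorted(after)
print(munged)
```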
				1445
				1446	Finding all Adverbs
				1447	^^^^^^^^^^^^^^^^^^^
				1448
:func:`findall` matches all occurrences of a pattern, not just the first
one as :func:`search` does. For example, a writer who wanted to find all of
the adverbs in some text might use :func:`findall` in the following
manner::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1453
				1454	>>> text = "He was carefully disguised but captured quickly by police."
				1455	>>> re.findall(r"\w+ly", text)
				1456	['carefully', 'quickly']
				1457
				1458
				1459	Finding all Adverbs and their Positions
				1460	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				1461
If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :ref:`match objects
<match-objects>` instead of strings. Continuing with the previous example, a
writer who wanted to find all of the adverbs and their positions in some text
would use :func:`finditer` in the following manner::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1467
				1468	>>> text = "He was carefully disguised but captured quickly by police."
				1469	>>> for m in re.finditer(r"\w+ly", text):
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1470	... print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1471	07-16: carefully
				1472	40-47: quickly
				1473
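When the positions are needed as data rather than printed, the match objects
can be collected into a list. This is a small sketch using
:meth:`~re.Match.span`; the ``positions`` name is illustrative.

```python
import re

text = "He was carefully disguised but captured quickly by police."

# m.span() returns (start, end); slicing the text with it recovers the match.
positions = [(m.span(), m.group(0)) for m in re.finditer(r"\w+ly", text)]
print(positions)
```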
				1474
				1475	Raw String Notation
				1476	^^^^^^^^^^^^^^^^^^^
				1477
				1478	Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
				1479	every backslash (``'\'``) in a regular expression would have to be prefixed with
				1480	another one to escape it. For example, the two following lines of code are
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1481	functionally identical::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1482
				1483	>>> re.match(r"\W(.)\1\W", " ff ")
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1484	<re.Match object; span=(0, 4), match=' ff '>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1485	>>> re.match("\\W(.)\\1\\W", " ff ")
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1486	<re.Match object; span=(0, 4), match=' ff '>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1487
				1488	When one wants to match a literal backslash, it must be escaped in the regular
				1489	expression. With raw string notation, this means ``r"\\"``. Without raw string
				1490	notation, one must use ``"\\\\"``, making the following lines of code
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1491	functionally identical::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1492
				1493	>>> re.match(r"\\", r"\\")
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1494	<re.Match object; span=(0, 1), match='\\'>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1495	>>> re.match("\\\\", r"\\")
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1496	<re.Match object; span=(0, 1), match='\\'>
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1497
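The equivalence can be checked directly by comparing the two spellings as
ordinary strings. The sketch below assumes a subject containing one literal
backslash; the variable names are illustrative.

```python
import re

raw = r"\\"        # two characters: backslash, backslash
escaped = "\\\\"   # the same two characters, written without the r prefix
assert raw == escaped and len(raw) == 2

# The subject contains a single backslash between 'a' and 'b'.
subject = "a\\b"
m = re.search(raw, subject)
print(m.span())
```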
				1498
				1499	Writing a Tokenizer
				1500	^^^^^^^^^^^^^^^^^^^
				1501
Georg Brandl	5d94134	2016-02-26 19:37:12 +0100	[diff] [blame]	1502	A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1503	analyzes a string to categorize groups of characters. This is a useful first
				1504	step in writing a compiler or interpreter.
				1505
				1506	The text categories are specified with regular expressions. The technique is
				1507	to combine those into a single master regular expression and to loop over
				1508	successive matches::
				1509
    import collections
    import re

    Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

    def tokenize(code):
        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
        token_specification = [
            ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
            ('ASSIGN',   r':='),           # Assignment operator
            ('END',      r';'),            # Statement terminator
            ('ID',       r'[A-Za-z]+'),    # Identifiers
            ('OP',       r'[+\-*/]'),      # Arithmetic operators
            ('NEWLINE',  r'\n'),           # Line endings
            ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
            ('MISMATCH', r'.'),            # Any other character
        ]
        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
        line_num = 1
        line_start = 0
        for mo in re.finditer(tok_regex, code):
            kind = mo.lastgroup
            value = mo.group(kind)
            if kind == 'NEWLINE':
                line_start = mo.end()
                line_num += 1
            elif kind == 'SKIP':
                pass
            elif kind == 'MISMATCH':
                raise RuntimeError(f'{value!r} unexpected on line {line_num}')
            else:
                if kind == 'ID' and value in keywords:
                    kind = value
                column = mo.start() - line_start
                yield Token(kind, value, line_num, column)

    statements = '''
        IF quantity THEN
            total := total + price * quantity;
            tax := price * 0.05;
        ENDIF;
    '''

    for token in tokenize(statements):
        print(token)
				1555
				1556	The tokenizer produces the following output::
Raymond Hettinger	9c47d77	2011-05-13 01:03:50 -0700	[diff] [blame]	1557
Raymond Hettinger	c566431	2014-08-03 23:38:54 -0700	[diff] [blame]	1558	Token(typ='IF', value='IF', line=2, column=4)
				1559	Token(typ='ID', value='quantity', line=2, column=7)
				1560	Token(typ='THEN', value='THEN', line=2, column=16)
				1561	Token(typ='ID', value='total', line=3, column=8)
				1562	Token(typ='ASSIGN', value=':=', line=3, column=14)
				1563	Token(typ='ID', value='total', line=3, column=17)
				1564	Token(typ='OP', value='+', line=3, column=23)
				1565	Token(typ='ID', value='price', line=3, column=25)
				1566	Token(typ='OP', value='*', line=3, column=31)
				1567	Token(typ='ID', value='quantity', line=3, column=33)
				1568	Token(typ='END', value=';', line=3, column=41)
				1569	Token(typ='ID', value='tax', line=4, column=8)
				1570	Token(typ='ASSIGN', value=':=', line=4, column=12)
				1571	Token(typ='ID', value='price', line=4, column=15)
				1572	Token(typ='OP', value='*', line=4, column=21)
				1573	Token(typ='NUMBER', value='0.05', line=4, column=23)
				1574	Token(typ='END', value=';', line=4, column=27)
				1575	Token(typ='ENDIF', value='ENDIF', line=5, column=4)
				1576	Token(typ='END', value=';', line=5, column=9)
Miss Islington (bot)	67d3f8b	2018-03-23 08:55:26 -0700	[diff] [blame^]	1577
				1578
				1579	.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
				1580	Media, 2009. The third edition of the book no longer covers Python at all,
				1581	but the first edition covered writing good regular expression patterns in
				1582	great detail.