Blame - Doc/library/re.rst - platform/external/python/cpython3

Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1	:mod:`re` --- Regular expression operations
				2	===========================================
				3
				4	.. module:: re
				5	:synopsis: Regular expression operations.
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	6
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	7	.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
				8	.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
				9
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	10	Source code: :source:`Lib/re.py`
				11
				12	--------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	13
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	14	This module provides regular expression matching operations similar to
Georg Brandl	ed2a1db	2009-06-08 07:48:27 +0000	[diff] [blame]	15	those found in Perl.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	16
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	17	Both patterns and strings to be searched can be Unicode strings (:class:`str`)
				18	as well as 8-bit strings (:class:`bytes`).
				19	However, Unicode strings and 8-bit strings cannot be mixed:
Martin Panter	6245cb3	2016-04-15 02:14:19 +0000	[diff] [blame]	20	that is, you cannot match a Unicode string with a byte pattern or
Georg Brandl	ae2dbe2	2009-03-13 19:04:40 +0000	[diff] [blame]	21	vice-versa; similarly, when asking for a substitution, the replacement
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	22	string must be of the same type as both the pattern and the search string.
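A minimal sketch of the type rule above; the sample text and variable names are illustrative assumptions, not part of the module:

```python
import re

# A str pattern searches a str; a bytes pattern searches bytes.
text_match = re.search(r"\d+", "order 42")
bytes_match = re.search(rb"\d+", b"order 42")

# Mixing the two types is rejected with a TypeError.
try:
    re.search(rb"\d+", "order 42")
    mixed_error = None
except TypeError as exc:
    mixed_error = exc
```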
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	23
				24	Regular expressions use the backslash character (``'\'``) to indicate
				25	special forms or to allow special characters to be used without invoking
				26	their special meaning. This collides with Python's usage of the same
				27	character for the same purpose in string literals; for example, to match
				28	a literal backslash, one might have to write ``'\\\\'`` as the pattern
				29	string, because the regular expression must be ``\\``, and each
				30	backslash must be expressed as ``\\`` inside a regular Python string
				31	literal.
				32
				33	The solution is to use Python's raw string notation for regular expression
				34	patterns; backslashes are not handled in any special way in a string literal
				35	prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
				36	``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	37	newline. Usually patterns will be expressed in Python code using this raw
				38	string notation.
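The following sketch compares the two spellings of the same pattern; the Windows-style path is only an illustrative assumption:

```python
import re

# A regular literal needs four backslashes to produce the regex \\ ...
plain = "\\\\"
# ... while a raw literal needs only two.
raw = r"\\"
same_pattern = plain == raw

path = "C:\\temp"  # the text C:\temp, which contains one backslash
found = re.search(raw, path)

raw_n_length = len(r"\n")    # backslash followed by 'n'
plain_n_length = len("\n")   # a single newline character
```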
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	39
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	40	It is important to note that most regular expression operations are available as
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	41	module-level functions and methods on
				42	:ref:`compiled regular expressions <re-objects>`. The functions are shortcuts
				43	that don't require you to compile a regex object first, but miss some
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	44	fine-tuning parameters.
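As one example of such a fine-tuning parameter, the ``search()`` method of a compiled pattern accepts an optional starting position that the module-level function does not. A small sketch with illustrative names:

```python
import re

word = re.compile(r"\w+")

# Module-level shortcut: always scans from the start of the string.
first = re.search(r"\w+", "alpha beta")

# Compiled-pattern method: the optional pos argument starts the scan at index 6.
second = word.search("alpha beta", 6)
```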
				45
Marco Buttu	ed6795e	2017-02-26 16:26:23 +0100	[diff] [blame]	46	.. seealso::
				47
Stéphane Wirtel	19177fb	2018-05-15 20:58:35 +0200	[diff] [blame]	48	The third-party `regex <https://pypi.org/project/regex/>`_ module,
Marco Buttu	ed6795e	2017-02-26 16:26:23 +0100	[diff] [blame]	49	which has an API compatible with the standard library :mod:`re` module,
				50	but offers additional functionality and more thorough Unicode support.
				51
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	52
				53	.. _re-syntax:
				54
				55	Regular Expression Syntax
				56	-------------------------
				57
				58	A regular expression (or RE) specifies a set of strings that matches it; the
				59	functions in this module let you check if a particular string matches a given
				60	regular expression (or if a given regular expression matches a particular
				61	string, which comes down to the same thing).
				62
				63	Regular expressions can be concatenated to form new regular expressions; if A
				64	and B are both regular expressions, then AB is also a regular expression.
				65	In general, if a string p matches A and another string q matches B, the
				66	string pq will match AB. This holds unless A or B contains low-precedence
				67	operations, there are boundary conditions between A and B, or either has numbered
				68	group references. Thus, complex expressions can easily be constructed from simpler
				69	primitive expressions like the ones described here. For details of the theory
Berker Peksag	a0a42d2	2018-03-23 16:46:52 +0300	[diff] [blame]	70	and implementation of regular expressions, consult the Friedl book [Frie09]_,
				71	or almost any textbook about compiler construction.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	72
				73	A brief explanation of the format of regular expressions follows. For further
Christian Heimes	2202f87	2008-02-06 14:31:34 +0000	[diff] [blame]	74	information and a gentler presentation, consult the :ref:`regex-howto`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	75
				76	Regular expressions can contain both special and ordinary characters. Most
				77	ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
				78	expressions; they simply match themselves. You can concatenate ordinary
				79	characters, so ``last`` matches the string ``'last'``. (In the rest of this
				80	section, we'll write REs in ``this special style``, usually without quotes, and
				81	strings to be matched ``'in single quotes'``.)
				82
				83	Some characters, like ``'|'`` or ``'('``, are special. Special
				84	characters either stand for classes of ordinary characters, or affect
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	85	how the regular expressions around them are interpreted.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	86
Martin Panter	684340e	2016-10-15 01:18:16 +0000	[diff] [blame]	87	Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
				88	directly nested. This avoids ambiguity with the non-greedy modifier suffix
				89	``?``, and with other modifiers in other implementations. To apply a second
				90	repetition to an inner repetition, parentheses may be used. For example,
				91	the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
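A short sketch of the nesting rule; the variable names are illustrative:

```python
import re

# A repetition applied directly to another repetition is rejected.
try:
    re.compile(r"a{6}*")
    nested_error = None
except re.error as exc:
    nested_error = exc

# Wrapping the inner repetition in a non-capturing group is accepted.
sixes = re.compile(r"(?:a{6})*")
twelve = sixes.fullmatch("a" * 12)  # two groups of six
seven = sixes.fullmatch("a" * 7)    # not a multiple of six
```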
				92
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	93
				94	The special characters are:
				95
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	96	``.``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	97	(Dot.) In the default mode, this matches any character except a newline. If
				98	the :const:`DOTALL` flag has been specified, this matches any character
				99	including a newline.
				100
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	101	``^``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	102	(Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
				103	matches immediately after each newline.
				104
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	105	``$``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	106	Matches the end of the string or just before the newline at the end of the
				107	string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
				108	matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
				109	only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
Christian Heimes	25bb783	2008-01-11 16:17:00 +0000	[diff] [blame]	110	matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
				111	a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
				112	the newline, and one at the end of the string.
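The ``$`` cases described above can be checked with a sketch like this one (variable names are illustrative):

```python
import re

text = "foo1\nfoo2\n"

# Default mode: $ matches only at the end, or just before the final newline.
default_hit = re.search(r"foo.$", text).group()

# MULTILINE mode: $ also matches before every newline.
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group()

# A lone $ finds two empty matches in 'foo\n'.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]
```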
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	113
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	114	``*``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	115	Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
				116	many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
				117	by any number of 'b's.
				118
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	119	``+``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	120	Causes the resulting RE to match 1 or more repetitions of the preceding RE.
				121	``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
				122	match just 'a'.
				123
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	124	``?``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	125	Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
				126	``ab?`` will match either 'a' or 'ab'.
				127
				128	``*?``, ``+?``, ``??``
				129	The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
				130	as much text as possible. Sometimes this behaviour isn't desired; if the RE
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	131	``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
				132	string, and not just ``'<a>'``. Adding ``?`` after the qualifier makes it
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	133	perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
Georg Brandl	7ff033b	2016-04-12 07:51:41 +0200	[diff] [blame]	134	characters as possible will be matched. Using the RE ``<.*?>`` will match
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	135	only ``'<a>'``.
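A minimal sketch contrasting the greedy and non-greedy forms on the string used above:

```python
import re

text = "<a> b <c>"

greedy = re.match(r"<.*>", text).group()   # runs to the last '>'
lazy = re.match(r"<.*?>", text).group()    # stops at the first '>'
```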
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	136
				137	``{m}``
				138	Specifies that exactly m copies of the previous RE should be matched; fewer
				139	matches cause the entire RE not to match. For example, ``a{6}`` will match
				140	exactly six ``'a'`` characters, but not five.
				141
				142	``{m,n}``
				143	Causes the resulting RE to match from m to n repetitions of the preceding
				144	RE, attempting to match as many repetitions as possible. For example,
				145	``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting m specifies a
				146	lower bound of zero, and omitting n specifies an infinite upper bound. As an
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	147	example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
				148	followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	149	modifier would be confused with the previously described form.
				150
				151	``{m,n}?``
				152	Causes the resulting RE to match from m to n repetitions of the preceding
				153	RE, attempting to match as few repetitions as possible. This is the
				154	non-greedy version of the previous qualifier. For example, on the
				155	6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
				156	while ``a{3,5}?`` will only match 3 characters.
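The bounded forms can be compared the same way; this sketch reuses the strings from the descriptions above:

```python
import re

greedy = re.match(r"a{3,5}", "aaaaaa").group()   # takes as many as allowed
lazy = re.match(r"a{3,5}?", "aaaaaa").group()    # takes as few as allowed

# An omitted upper bound means "no limit".
open_ok = re.fullmatch(r"a{4,}b", "aaaab")
open_short = re.fullmatch(r"a{4,}b", "aaab")
```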
				157
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	158	``\``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	159	Either escapes special characters (permitting you to match characters like
				160	``'*'``, ``'?'``, and so forth), or signals a special sequence; special
				161	sequences are discussed below.
				162
				163	If you're not using a raw string to express the pattern, remember that Python
				164	also uses the backslash as an escape sequence in string literals; if the escape
				165	sequence isn't recognized by Python's parser, the backslash and subsequent
				166	character are included in the resulting string. However, if Python would
				167	recognize the resulting sequence, the backslash should be repeated twice. This
				168	is complicated and hard to understand, so it's highly recommended that you use
				169	raw strings for all but the simplest expressions.
				170
				171	``[]``
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	172	Used to indicate a set of characters. In a set:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	173
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	174	* Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
				175	``'m'``, or ``'k'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	176
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	177	* Ranges of characters can be indicated by giving two characters and separating
				178	them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
				179	``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
				180	``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	181	``[a\-z]``) or if it's placed as the first or last character
				182	(e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	183
				184	* Special characters lose their special meaning inside sets. For example,
				185	``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
				186	``'*'``, or ``')'``.
				187
				188	* Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
				189	inside a set, although the characters they match depend on whether
				190	:const:`ASCII` or :const:`LOCALE` mode is in force.
				191
				192	* Characters that are not within a range can be matched by :dfn:`complementing`
				193	the set. If the first character of the set is ``'^'``, all the characters
				194	that are not in the set will be matched. For example, ``[^5]`` will match
				195	any character except ``'5'``, and ``[^^]`` will match any character except
				196	``'^'``. ``^`` has no special meaning if it's not the first character in
				197	the set.
				198
				199	* To match a literal ``']'`` inside a set, precede it with a backslash, or
				200	place it at the beginning of the set. For example, ``[()[\]{}]`` and
				201	``[]()[{}]`` will both match a parenthesis, bracket, or brace.
Mark Summerfield	9e670c2	2008-05-31 13:05:34 +0000	[diff] [blame]	202
Serhiy Storchaka	05cb728	2017-11-16 12:38:26 +0200	[diff] [blame]	203	* Support of nested sets and set operations as in `Unicode Technical
				204	Standard #18`_ might be added in the future. This would change the
				205	syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
				206	in ambiguous cases for the time being.
Andrés Delfino	7dfbd49	2018-10-06 16:48:30 -0300	[diff] [blame]	207	That includes sets starting with a literal ``'['`` or containing literal
Serhiy Storchaka	05cb728	2017-11-16 12:38:26 +0200	[diff] [blame]	208	character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``. To
				209	avoid a warning, escape them with a backslash.
				210
				211	.. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/
				212
				213	.. versionchanged:: 3.7
				214	:exc:`FutureWarning` is raised if a character set contains constructs
				215	that will change semantically in the future.
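The set rules listed above can be sketched as follows; the sample strings are illustrative assumptions:

```python
import re

# A range plus individual characters.
hex_digits = re.findall(r"[0-9A-Fa-f]", "0x1Fg")

# An escaped '-' is a literal hyphen, not a range operator.
literal_hyphen = re.findall(r"[a\-z]", "a-z b")

# Special characters lose their meaning inside a set.
specials = re.findall(r"[(+*)]", "(1+2)*3")

# A leading '^' complements the set.
not_five = re.findall(r"[^5]", "1525")

# A literal ']' is escaped with a backslash.
brackets = re.findall(r"[()[\]{}]", "f(x)[0]{}")
```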
				216
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	217	``|``
				218	``A|B``, where A and B can be arbitrary REs, creates a regular expression that
				219	will match either A or B. An arbitrary number of REs can be separated by the
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	220	``'|'`` in this way. This can be used inside groups (see below) as well. As
				221	the target string is scanned, REs separated by ``'|'`` are tried from left to
				222	right. When one pattern completely matches, that branch is accepted. This means
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	223	that once A matches, B will not be tested further, even if it would
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	224	produce a longer overall match. In other words, the ``'|'`` operator is never
				225	greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
				226	character class, as in ``[|]``.
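A brief sketch of the left-to-right behaviour of alternation, using illustrative patterns:

```python
import re

# The first branch that matches wins, even if a later one would match more.
first_wins = re.match(r"a|ab", "ab").group()
longer_first = re.match(r"ab|a", "ab").group()

# Two ways to match a literal vertical bar.
escaped_bar = re.findall(r"\|", "a|b|c")
class_bar = re.findall(r"[|]", "a|b|c")
```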
				227
				228	``(...)``
				229	Matches whatever regular expression is inside the parentheses, and indicates the
				230	start and end of a group; the contents of a group can be retrieved after a match
				231	has been performed, and can be matched later in the string with the ``\number``
				232	special sequence, described below. To match the literals ``'('`` or ``')'``,
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	233	use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	234
				235	``(?...)``
				236	This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
				237	otherwise). The first character after the ``'?'`` determines what the meaning
				238	and further syntax of the construct is. Extensions usually do not create a new
				239	group; ``(?P<name>...)`` is the only exception to this rule. Following are the
				240	currently supported extensions.
				241
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	242	``(?aiLmsux)``
				243	(One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
				244	``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
Andrew M. Kuchling	1c50e86	2009-06-01 00:11:36 +0000	[diff] [blame]	245	letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	246	:const:`re.I` (ignore case), :const:`re.L` (locale dependent),
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	247	:const:`re.M` (multi-line), :const:`re.S` (dot matches all),
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	248	:const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
				249	for the entire regular expression.
				250	(The flags are described in :ref:`contents-of-module-re`.)
				251	This is useful if you wish to include the flags as part of the
				252	regular expression, instead of passing a flag argument to the
Serhiy Storchaka	bd48d27	2016-09-11 12:50:02 +0300	[diff] [blame]	253	:func:`re.compile` function. Flags should be used first in the
				254	expression string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	255
				256	``(?:...)``
Georg Brandl	3122ce3	2010-10-29 06:17:38 +0000	[diff] [blame]	257	A non-capturing version of regular parentheses. Matches whatever regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	258	expression is inside the parentheses, but the substring matched by the group
				259	cannot be retrieved after performing a match or referenced later in the
				260	pattern.
				261
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	262	``(?aiLmsux-imsx:...)``
				263	(Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
				264	``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
				265	one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
				266	The letters set or remove the corresponding flags:
				267	:const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
				268	:const:`re.L` (locale dependent), :const:`re.M` (multi-line),
				269	:const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
				270	and :const:`re.X` (verbose), for the part of the expression.
				271	(The flags are described in :ref:`contents-of-module-re`.)
				272
				273	The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
				274	as inline flags, so they can't be combined or follow ``'-'``. Instead,
				275	when one of them appears in an inline group, it overrides the matching mode
				276	in the enclosing group. In Unicode patterns ``(?a:...)`` switches to
				277	ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
				278	(default). In bytes patterns ``(?L:...)`` switches to locale dependent
				279	matching, and ``(?a:...)`` switches to ASCII-only matching (default).
				280	This override is only in effect for the narrow inline group, and the
				281	original matching mode is restored outside of the group.
Serhiy Storchaka	be9a4e5	2016-09-10 00:57:55 +0300	[diff] [blame]	282
Zachary Ware	c307672	2016-09-09 15:47:05 -0700	[diff] [blame]	283	.. versionadded:: 3.6
Serhiy Storchaka	be9a4e5	2016-09-10 00:57:55 +0300	[diff] [blame]	284
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	285	.. versionchanged:: 3.7
				286	The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.
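A sketch of flags scoped to part of an expression; the greeting text is an illustrative assumption:

```python
import re

# (?i:...) makes only "hello" case-insensitive.
scoped = re.compile(r"(?i:hello) World")
scoped_ok = scoped.fullmatch("HELLO World")
scoped_fail = scoped.fullmatch("HELLO world")

# (?-i:...) switches the flag off again inside a globally case-insensitive pattern.
removed = re.compile(r"(?i)hello (?-i:World)")
removed_ok = removed.fullmatch("HELLO World")
removed_fail = removed.fullmatch("hello world")
```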
				287
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	288	``(?P<name>...)``
				289	Similar to regular parentheses, but the substring matched by the group is
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	290	accessible via the symbolic group name *name*. Group names must be valid
				291	Python identifiers, and each group name must be defined only once within a
				292	regular expression. A symbolic group is also a numbered group, just as if
				293	the group were not named.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	294
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	295	Named groups can be referenced in three contexts. If the pattern is
				296	``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
				297	single or double quotes):
				298
				299	+---------------------------------------+----------------------------------+
				300	| Context of reference to group "quote" | Ways to reference it             |
				301	+=======================================+==================================+
				302	| in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
				303	|                                       | * ``\1``                         |
				304	+---------------------------------------+----------------------------------+
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	305	| when processing match object m        | * ``m.group('quote')``           |
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	306	|                                       | * ``m.end('quote')`` (etc.)      |
				307	+---------------------------------------+----------------------------------+
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	308	| in a string passed to the repl        | * ``\g<quote>``                  |
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	309	| argument of ``re.sub()``              | * ``\g<1>``                      |
				310	|                                       | * ``\1``                         |
				311	+---------------------------------------+----------------------------------+
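The three reference contexts can be sketched with the pattern given above; the sample sentence is an illustrative assumption:

```python
import re

quoted = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")

# Context 1: (?P=quote) inside the pattern requires the same closing quote.
m = quoted.search("say 'hi' now")

# Context 2: the match object exposes the group by name.
whole = m.group(0)
quote_char = m.group("quote")
quote_end = m.end("quote")

# Context 3: the replacement string can use \g<quote> or \g<1>.
replaced = quoted.sub(r"\g<quote>X\g<1>", "say 'hi' now")
```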
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	312
				313	``(?P=name)``
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	314	A backreference to a named group; it matches whatever text was matched by the
				315	earlier group named *name*.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	316
				317	``(?#...)``
				318	A comment; the contents of the parentheses are simply ignored.
				319
				320	``(?=...)``
				321	Matches if ``...`` matches next, but doesn't consume any of the string. This is
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	322	called a :dfn:`lookahead assertion`. For example, ``Isaac (?=Asimov)`` will match
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	323	``'Isaac '`` only if it's followed by ``'Asimov'``.
				324
				325	``(?!...)``
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	326	Matches if ``...`` doesn't match next. This is a :dfn:`negative lookahead assertion`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	327	For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's not
				328	followed by ``'Asimov'``.
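Both lookahead forms, sketched with the examples from the text (the second name is an illustrative assumption):

```python
import re

followed = re.compile(r"Isaac (?=Asimov)")
not_followed = re.compile(r"Isaac (?!Asimov)")

# The lookahead is checked but not consumed, so only 'Isaac ' is matched.
pos_hit = followed.search("Isaac Asimov")
pos_miss = followed.search("Isaac Newton")
neg_hit = not_followed.search("Isaac Newton")
neg_miss = not_followed.search("Isaac Asimov")
```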
				329
				330	``(?<=...)``
				331	Matches if the current position in the string is preceded by a match for ``...``
				332	that ends at the current position. This is called a :dfn:`positive lookbehind
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	333	assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	334	lookbehind will back up 3 characters and check if the contained pattern matches.
				335	The contained pattern must only match strings of some fixed length, meaning that
				336	``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
Ezio Melotti	0a6b541	2012-04-29 07:34:46 +0300	[diff] [blame]	337	patterns which start with positive lookbehind assertions will not match at the
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	338	beginning of the string being searched; you will most likely want to use the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	339	:func:`search` function rather than the :func:`match` function:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	340
				341	>>> import re
				342	>>> m = re.search('(?<=abc)def', 'abcdef')
				343	>>> m.group(0)
				344	'def'
				345
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	346	This example looks for a word following a hyphen:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	347
Cheryl Sabella	6677142	2018-02-02 16:16:27 -0500	[diff] [blame]	348	>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	349	>>> m.group(0)
				350	'egg'
				351
Georg Brandl	8c16cb9	2016-02-25 20:17:45 +0100	[diff] [blame]	352	.. versionchanged:: 3.5
Serhiy Storchaka	4eea62f	2015-02-21 10:07:35 +0200	[diff] [blame]	353	Added support for group references of fixed length.
				354
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	355	``(?<!...)``
				356	Matches if the current position in the string is not preceded by a match for
				357	``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
				358	positive lookbehind assertions, the contained pattern must only match strings of
				359	some fixed length. Patterns which start with negative lookbehind assertions may
				360	match at the beginning of the string being searched.
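A sketch of a negative lookbehind and of the fixed-length restriction; the price text is an illustrative assumption:

```python
import re

# Whole numbers that are not immediately preceded by a dollar sign.
plain_numbers = re.findall(r"(?<!\$)\b\d+", "cost $30 or 45 units")

# A negative lookbehind can succeed at the very start of the string.
at_start = re.match(r"(?<!-)\w+", "egg")

# A lookbehind whose contents can vary in length is rejected.
try:
    re.compile(r"(?<=a*)b")
    width_error = None
except re.error as exc:
    width_error = exc
```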
				361
				362	``(?(id/name)yes-pattern|no-pattern)``
orsenthil@gmail.com	476021b	2011-03-12 10:46:25 +0800	[diff] [blame]	363	Will try to match with ``yes-pattern`` if the group with given id or
				364	name exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
				365	optional and can be omitted. For example,
				366	``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
				367	will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
Serhiy Storchaka	a4d170d	2013-12-23 18:20:51 +0200	[diff] [blame]	368	not with ``'<user@host.com'`` nor ``'user@host.com>'``.
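The conditional pattern above can be exercised with a sketch like this (variable names are illustrative):

```python
import re

# If group 1 (the opening '<') matched, require '>'; otherwise require the end.
address = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

bracketed = address.match("<user@host.com>")
bare = address.match("user@host.com")
missing_close = address.match("<user@host.com")
stray_close = address.match("user@host.com>")
```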
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	369
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	370
				371	The special sequences consist of ``'\'`` and a character from the list below.
Martin Panter	98e9051	2016-06-12 06:17:29 +0000	[diff] [blame]	372	If the ordinary character is not an ASCII digit or an ASCII letter, then the
Serhiy Storchaka	9bd85b8	2016-06-11 19:15:00 +0300	[diff] [blame]	373	resulting RE will match the second character. For example, ``\$`` matches the
				374	character ``'$'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	375
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	376	``\number``
				377	Matches the contents of the group of the same number. Groups are numbered
				378	starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
Georg Brandl	2070e83	2013-10-06 12:58:20 +0200	[diff] [blame]	379	but not ``'thethe'`` (note the space after the group). This special sequence
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	380	can only be used to match one of the first 99 groups. If the first digit of
				381	number is 0, or number is 3 octal digits long, it will not be interpreted as
				382	a group match, but as the character with octal value number. Inside the
				383	``'['`` and ``']'`` of a character class, all numeric escapes are treated as
				384	characters.
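A small sketch of numbered backreferences and the octal interpretation:

```python
import re

doubled = re.compile(r"(.+) \1")

repeated_word = doubled.fullmatch("the the")
repeated_number = doubled.fullmatch("55 55")
no_space = doubled.search("thethe")  # the pattern requires the space

# Three octal digits are read as a character code, not a group number.
octal_a = re.fullmatch(r"\101", "A")
```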
				385
				386	``\A``
				387	Matches only at the start of the string.
				388
				389	``\b``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	390	Matches the empty string, but only at the beginning or end of a word.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	391	A word is defined as a sequence of word characters. Note that formally,
Ezio Melotti	5a045b9	2012-02-29 11:48:44 +0200	[diff] [blame]	392	``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
				393	(or vice versa), or between ``\w`` and the beginning/end of the string.
				394	This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
				395	``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
				396
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	397	By default Unicode alphanumerics are the ones used in Unicode patterns, but
				398	this can be changed by using the :const:`ASCII` flag. Word boundaries are
				399	determined by the current locale if the :const:`LOCALE` flag is used.
				400	Inside a character range, ``\b`` represents the backspace character, for
				401	compatibility with Python's string literals.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	402
				403	``\B``
Ezio Melotti	5a045b9	2012-02-29 11:48:44 +0200	[diff] [blame]	404	Matches the empty string, but only when it is not at the beginning or end
				405	of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
				406	``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	407	``\B`` is just the opposite of ``\b``, so word characters in Unicode
				408	patterns are Unicode alphanumerics or the underscore, although this can
				409	be changed by using the :const:`ASCII` flag. Word boundaries are
				410	determined by the current locale if the :const:`LOCALE` flag is used.
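The word-boundary examples above, sketched as filters over the listed strings:

```python
import re

whole_word = re.compile(r"\bfoo\b")
candidates = ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"]
boundary_hits = [s for s in candidates if whole_word.search(s)]

inside_word = re.compile(r"py\B")
samples = ["python", "py3", "py2", "py", "py.", "py!"]
non_boundary_hits = [s for s in samples if inside_word.search(s)]

# Inside a character set, \b means the backspace character.
backspace = re.fullmatch(r"[\b]", "\x08")
```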
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	411
				412	``\d``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	413	For Unicode (str) patterns:
Mark Dickinson	1f26828	2009-07-28 17:22:36 +0000	[diff] [blame]	414	Matches any Unicode decimal digit (that is, any character in
				415	Unicode character category [Nd]). This includes ``[0-9]``, and
				416	also many other digit characters. If the :const:`ASCII` flag is
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	417	used only ``[0-9]`` is matched.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	418
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	419	For 8-bit (bytes) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	420	Matches any decimal digit; this is equivalent to ``[0-9]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	421
				422	``\D``
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	423	Matches any character which is not a decimal digit. This is
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	424	the opposite of ``\d``. If the :const:`ASCII` flag is used this
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	425	becomes the equivalent of ``[^0-9]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	426
				427	``\s``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	428	For Unicode (str) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	429	Matches Unicode whitespace characters (which includes
				430	``[ \t\n\r\f\v]``, and also many other characters, for example the
				431	non-breaking spaces mandated by typography rules in many
				432	languages). If the :const:`ASCII` flag is used, only
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	433	``[ \t\n\r\f\v]`` is matched.
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	434
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	435	For 8-bit (bytes) patterns:
				436	Matches characters considered whitespace in the ASCII character set;
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	437	this is equivalent to ``[ \t\n\r\f\v]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	438
				439	``\S``
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	440	Matches any character which is not a whitespace character. This is
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	441	the opposite of ``\s``. If the :const:`ASCII` flag is used this
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	442	becomes the equivalent of ``[^ \t\n\r\f\v]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	443
				444	``\w``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	445	For Unicode (str) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	446	Matches Unicode word characters; this includes most characters
				447	that can be part of a word in any language, as well as numbers and
				448	the underscore. If the :const:`ASCII` flag is used, only
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	449	``[a-zA-Z0-9_]`` is matched.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	450
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	451	For 8-bit (bytes) patterns:
				452	Matches characters considered alphanumeric in the ASCII character set;
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	453	this is equivalent to ``[a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
				454	used, matches characters considered alphanumeric in the current locale
				455	and the underscore.
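
As a rough illustration of how the :const:`ASCII` flag narrows ``\w``, the following sketch uses a made-up sample string containing two non-ASCII letters:

```python
import re

# "naïve_café 42", written with escapes so the code points are unambiguous.
text = "na\u00efve_caf\u00e9 42"

unicode_words = re.findall(r"\w+", text)            # accented letters are word characters
ascii_words = re.findall(r"\w+", text, re.ASCII)    # only [a-zA-Z0-9_] count

print(unicode_words)
print(ascii_words)
```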
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	456
				457	``\W``
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	458	Matches any character which is not a word character. This is
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	459	the opposite of ``\w``. If the :const:`ASCII` flag is used this
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	460	becomes the equivalent of ``[^a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	461	used, matches characters which are neither alphanumeric in the current
				462	locale nor the underscore.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	463
				464	``\Z``
				465	Matches only at the end of the string.
				466
				467	Most of the standard escapes supported by Python string literals are also
				468	accepted by the regular expression parser::
				469
				470	\a \b \f \n
Serhiy Storchaka	a445feb	2018-02-10 00:08:17 +0200	[diff] [blame]	471	\N \r \t \u
				472	\U \v \x \\
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	473
Ezio Melotti	285e51b	2012-04-29 04:52:30 +0300	[diff] [blame]	474	(Note that ``\b`` is used to represent word boundaries, and means "backspace"
				475	only inside character classes.)
				476
Serhiy Storchaka	a445feb	2018-02-10 00:08:17 +0200	[diff] [blame]	477	``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	478	patterns. In bytes patterns they are errors.
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	479
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	480	Octal escapes are included in a limited form. If the first digit is a 0, or if
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	481	there are three octal digits, it is considered an octal escape. Otherwise, it is
				482	a group reference. As for string literals, octal escapes are always at most
				483	three digits in length.
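
The rule can be sketched as follows. The patterns and variable names are chosen purely for illustration:

```python
import re

# Three octal digits: an octal escape (0o101 == 65 == ord("A")).
octal_three = re.fullmatch(r"\101", "A")

# A leading 0: also an octal escape, here the NUL character.
octal_zero = re.fullmatch(r"\0", "\x00")

# One digit that is not 0: a reference to group 1.
doubled = re.search(r"(\w)\1", "hello")

print(octal_three is not None, octal_zero is not None, doubled.group())
```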
				484
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	485	.. versionchanged:: 3.3
				486	The ``'\u'`` and ``'\U'`` escape sequences have been added.
				487
Serhiy Storchaka	9bd85b8	2016-06-11 19:15:00 +0300	[diff] [blame]	488	.. versionchanged:: 3.6
Martin Panter	98e9051	2016-06-12 06:17:29 +0000	[diff] [blame]	489	Unknown escapes consisting of ``'\'`` and an ASCII letter now are errors.
Serhiy Storchaka	a54aae0	2015-03-24 22:58:14 +0200	[diff] [blame]	490
Serhiy Storchaka	a445feb	2018-02-10 00:08:17 +0200	[diff] [blame]	491	.. versionchanged:: 3.8
				492	The ``'\N{name}'`` escape sequence has been added. As in string literals,
				493	it expands to the named Unicode character (e.g. ``'\N{EM DASH}'``).
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	494
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	495
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	496	.. _contents-of-module-re:
				497
				498	Module Contents
				499	---------------
				500
				501	The module defines several functions, constants, and an exception. Some of the
				502	functions are simplified versions of the full featured methods for compiled
				503	regular expressions. Most non-trivial applications use the compiled
				504	form.
				505
Ethan Furman	c88c80b	2016-11-21 08:29:31 -0800	[diff] [blame]	506	.. versionchanged:: 3.6
				507	Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
				508	:class:`enum.IntFlag`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	509
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	510	.. function:: compile(pattern, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	511
Henk-Jaap Wagenaar	ed94a8b	2017-08-28 06:41:20 +0100	[diff] [blame]	512	Compile a regular expression pattern into a :ref:`regular expression object
				513	<re-objects>`, which can be used for matching using its
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	514	:func:`~Pattern.match`, :func:`~Pattern.search` and other methods, described
Henk-Jaap Wagenaar	ed94a8b	2017-08-28 06:41:20 +0100	[diff] [blame]	515	below.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	516
				517	The expression's behaviour can be modified by specifying a flags value.
				518	Values can be any of the following variables, combined using bitwise OR (the
				519	``|`` operator).
				520
				521	The sequence ::
				522
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	523	prog = re.compile(pattern)
				524	result = prog.match(string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	525
				526	is equivalent to ::
				527
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	528	result = re.match(pattern, string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	529
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	530	but using :func:`re.compile` and saving the resulting regular expression
				531	object for reuse is more efficient when the expression will be used several
				532	times in a single program.
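
A small sketch of the reuse pattern described above. The input list and the names ``word`` and ``first_words`` are assumptions made for the example:

```python
import re

# Compile once and keep the pattern object around for repeated use.
word = re.compile(r"[a-z]+", re.IGNORECASE)

lines = ["Spam and eggs", "42", "Ham"]   # illustrative input
first_words = []
for line in lines:
    # Same result as re.match(r"[a-z]+", line, re.IGNORECASE).
    m = word.match(line)
    first_words.append(m.group() if m else None)

print(first_words)
```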
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	533
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	534	.. note::
				535
				536	The compiled versions of the most recent patterns passed to
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	537	:func:`re.compile` and the module-level matching functions are cached, so
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	538	programs that use only a few regular expressions at a time needn't worry
				539	about compiling regular expressions.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	540
				541
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	542	.. data:: A
				543	ASCII
				544
Georg Brandl	4049ce0	2009-06-08 07:49:54 +0000	[diff] [blame]	545	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
				546	perform ASCII-only matching instead of full Unicode matching. This is only
				547	meaningful for Unicode patterns, and is ignored for byte patterns.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	548	Corresponds to the inline flag ``(?a)``.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	549
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	550	Note that for backward compatibility, the :const:`re.U` flag still
				551	exists (as well as its synonym :const:`re.UNICODE` and its embedded
Georg Brandl	ebeb44d	2010-07-29 11:15:36 +0000	[diff] [blame]	552	counterpart ``(?u)``), but these are redundant in Python 3 since
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	553	matches are Unicode by default for strings (and Unicode matching
				554	isn't allowed for bytes).
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	555
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	556
Sandro Tosi	da785fd	2012-01-01 12:55:20 +0100	[diff] [blame]	557	.. data:: DEBUG
				558
				559	Display debug information about the compiled expression.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	560	No corresponding inline flag.
Sandro Tosi	da785fd	2012-01-01 12:55:20 +0100	[diff] [blame]	561
				562
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	563	.. data:: I
				564	IGNORECASE
				565
Brian Ward	c9d6dbc	2017-05-24 00:03:38 -0700	[diff] [blame]	566	Perform case-insensitive matching; expressions like ``[A-Z]`` will also
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	567	match lowercase letters. Full Unicode matching (such as ``Ü`` matching
				568	``ü``) also works unless the :const:`re.ASCII` flag is used to disable
				569	non-ASCII matches. The current locale does not change the effect of this
				570	flag unless the :const:`re.LOCALE` flag is also used.
				571	Corresponds to the inline flag ``(?i)``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	572
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	573	Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
				574	combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
				575	letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
				576	letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
				577	'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
				578	If the :const:`ASCII` flag is used, only letters 'a' to 'z'
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	579	and 'A' to 'Z' are matched.
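
The following sketch illustrates the behaviour just described for the Kelvin sign and for ``Ü``/``ü``. The variable names are illustrative:

```python
import re

kelvin = "\u212a"   # KELVIN SIGN, case-folds to "k"

# With Unicode matching, [a-z] plus IGNORECASE also accepts the Kelvin sign.
kelvin_unicode = re.fullmatch(r"[a-z]", kelvin, re.IGNORECASE)

# With ASCII added, only 'a'-'z' and 'A'-'Z' are matched.
kelvin_ascii = re.fullmatch(r"[a-z]", kelvin, re.IGNORECASE | re.ASCII)

# Full Unicode case-insensitive matching of a non-ASCII letter.
umlaut_unicode = re.fullmatch("\u00fc", "\u00dc", re.IGNORECASE)             # ü vs Ü
umlaut_ascii = re.fullmatch("\u00fc", "\u00dc", re.IGNORECASE | re.ASCII)

print(bool(kelvin_unicode), bool(kelvin_ascii), bool(umlaut_unicode), bool(umlaut_ascii))
```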
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	580
				581	.. data:: L
				582	LOCALE
				583
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	584	Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
				585	dependent on the current locale. This flag can be used only with bytes
				586	patterns. The use of this flag is discouraged as the locale mechanism
				587	is very unreliable, it only handles one "culture" at a time, and it only
				588	works with 8-bit locales. Unicode matching is already enabled by default
				589	in Python 3 for Unicode (str) patterns, and it is able to handle different
				590	locales/languages.
				591	Corresponds to the inline flag ``(?L)``.
Serhiy Storchaka	22a309a	2014-12-01 11:50:07 +0200	[diff] [blame]	592
Serhiy Storchaka	9bd85b8	2016-06-11 19:15:00 +0300	[diff] [blame]	593	.. versionchanged:: 3.6
				594	:const:`re.LOCALE` can be used only with bytes patterns and is
				595	not compatible with :const:`re.ASCII`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	596
Serhiy Storchaka	898ff03	2017-05-05 08:53:40 +0300	[diff] [blame]	597	.. versionchanged:: 3.7
				598	Compiled regular expression objects with the :const:`re.LOCALE` flag no
				599	longer depend on the locale at compile time. Only the locale at
				600	matching time affects the result of matching.
				601
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	602
				603	.. data:: M
				604	MULTILINE
				605
				606	When specified, the pattern character ``'^'`` matches at the beginning of the
				607	string and at the beginning of each line (immediately following each newline);
				608	and the pattern character ``'$'`` matches at the end of the string and at the
				609	end of each line (immediately preceding each newline). By default, ``'^'``
				610	matches only at the beginning of the string, and ``'$'`` only at the end of the
				611	string and immediately before the newline (if any) at the end of the string.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	612	Corresponds to the inline flag ``(?m)``.
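
For instance, with a made-up three-line string the effect of the flag on ``'^'`` and ``'$'`` can be sketched like this:

```python
import re

text = "first\nsecond\nthird\n"   # illustrative input with a trailing newline

starts_default = re.findall(r"^\w+", text)              # only the start of the string
starts_multi = re.findall(r"^\w+", text, re.MULTILINE)  # the start of every line

ends_default = re.findall(r"\w+$", text)                # just before the final newline
ends_multi = re.findall(r"\w+$", text, re.MULTILINE)    # the end of every line

print(starts_default, starts_multi, ends_default, ends_multi)
```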
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	613
				614
				615	.. data:: S
				616	DOTALL
				617
				618	Make the ``'.'`` special character match any character at all, including a
				619	newline; without this flag, ``'.'`` will match anything except a newline.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	620	Corresponds to the inline flag ``(?s)``.
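
A minimal sketch, using an illustrative two-line string:

```python
import re

text = "a\nb"

without_flag = re.match(r"a.b", text)             # '.' refuses the newline
with_flag = re.match(r"a.b", text, re.DOTALL)     # '.' accepts the newline

print(without_flag, with_flag is not None)
```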
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	621
				622
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	623	.. data:: X
				624	VERBOSE
				625
Zachary Ware	71a0b43	2015-11-11 23:32:14 -0600	[diff] [blame]	626	This flag allows you to write regular expressions that look nicer and are
				627	more readable by allowing you to visually separate logical sections of the
				628	pattern and add comments. Whitespace within the pattern is ignored, except
Serhiy Storchaka	b0b44b4	2017-11-14 17:21:26 +0200	[diff] [blame]	629	when in a character class, or when preceded by an unescaped backslash,
				630	or within tokens like ``*?``, ``(?:`` or ``(?P<...>``.
Zachary Ware	71a0b43	2015-11-11 23:32:14 -0600	[diff] [blame]	631	When a line contains a ``#`` that is not in a character class and is not
				632	preceded by an unescaped backslash, all characters from the leftmost such
				633	``#`` through the end of the line are ignored.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	634
Zachary Ware	71a0b43	2015-11-11 23:32:14 -0600	[diff] [blame]	635	This means that the two following regular expression objects that match a
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	636	decimal number are functionally equal::
Georg Brandl	81ac1ce	2007-08-31 17:17:17 +0000	[diff] [blame]	637
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	638	a = re.compile(r"""\d + # the integral part
				639	\. # the decimal point
				640	\d * # some fractional digits""", re.X)
				641	b = re.compile(r"\d+\.\d*")
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	642
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	643	Corresponds to the inline flag ``(?x)``.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	644
				645
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	646	.. function:: search(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	647
Terry Jan Reedy	0edb5c1	2014-05-30 16:19:59 -0400	[diff] [blame]	648	Scan through string looking for the first location where the regular expression
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	649	pattern produces a match, and return a corresponding :ref:`match object
				650	<match-objects>`. Return ``None`` if no position in the string matches the
				651	pattern; note that this is different from finding a zero-length match at some
				652	point in the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	653
				654
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	655	.. function:: match(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	656
				657	If zero or more characters at the beginning of string match the regular
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	658	expression pattern, return a corresponding :ref:`match object
				659	<match-objects>`. Return ``None`` if the string does not match the pattern;
				660	note that this is different from a zero-length match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	661
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	662	Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
				663	at the beginning of the string and not at the beginning of each line.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	664
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	665	If you want to locate a match anywhere in string, use :func:`search`
				666	instead (see also :ref:`search-vs-match`).
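
The points above can be sketched as follows. The sample strings and names are assumptions made for the example:

```python
import re

text = "alpha\nbeta"

# match() is anchored at the start of the string, even in MULTILINE mode.
anchored = re.match(r"^beta", text, re.MULTILINE)

# search() may succeed at the start of the second line.
found = re.search(r"^beta", text, re.MULTILINE)

# A zero-length match is still a match object, not None.
empty = re.match(r"x*", "abc")

print(anchored, found.start(), repr(empty.group()))
```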
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	667
				668
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	669	.. function:: fullmatch(pattern, string, flags=0)
				670
				671	If the whole string matches the regular expression pattern, return a
				672	corresponding :ref:`match object <match-objects>`. Return ``None`` if the
				673	string does not match the pattern; note that this is different from a
				674	zero-length match.
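
A short sketch contrasting :func:`match` and :func:`fullmatch` on illustrative input:

```python
import re

prefix_only = re.match(r"\d+", "123abc")      # a prefix is enough for match()
whole = re.fullmatch(r"\d+", "123abc")        # fullmatch() needs the whole string
empty_ok = re.fullmatch(r"\d*", "")           # zero-length match, not None

print(prefix_only.group(), whole, empty_ok is not None)
```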
				675
				676	.. versionadded:: 3.4
				677
				678
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	679	.. function:: split(pattern, string, maxsplit=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	680
				681	Split string by the occurrences of pattern. If capturing parentheses are
				682	used in pattern, then the text of all groups in the pattern is also returned
				683	as part of the resulting list. If maxsplit is nonzero, at most maxsplit
				684	splits occur, and the remainder of the string is returned as the final element
Georg Brandl	9647389	2008-03-06 07:09:43 +0000	[diff] [blame]	685	of the list. ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	686
Serhiy Storchaka	c615be5	2017-11-28 22:51:38 +0200	[diff] [blame]	687	>>> re.split(r'\W+', 'Words, words, words.')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	688	['Words', 'words', 'words', '']
Serhiy Storchaka	c615be5	2017-11-28 22:51:38 +0200	[diff] [blame]	689	>>> re.split(r'(\W+)', 'Words, words, words.')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	690	['Words', ', ', 'words', ', ', 'words', '.', '']
Serhiy Storchaka	c615be5	2017-11-28 22:51:38 +0200	[diff] [blame]	691	>>> re.split(r'\W+', 'Words, words, words.', 1)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	692	['Words', 'words, words.']
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	693	>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
				694	['0', '3', '9']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	695
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	696	If there are capturing groups in the separator and it matches at the start of
				697	the string, the result will start with an empty string. The same holds for
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	698	the end of the string::
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	699
Serhiy Storchaka	c615be5	2017-11-28 22:51:38 +0200	[diff] [blame]	700	>>> re.split(r'(\W+)', '...words, words...')
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	701	['', '...', 'words', ', ', 'words', '...', '']
				702
				703	That way, separator components are always found at the same relative
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	704	indices within the result list.
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	705
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	706	Empty matches for the pattern split the string only when not adjacent
				707	to a previous empty match.
Thomas Wouters	89d996e	2007-09-08 17:39:28 +0000	[diff] [blame]	708
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	709	>>> re.split(r'\b', 'Words, words, words.')
				710	['', 'Words', ', ', 'words', ', ', 'words', '.']
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	711	>>> re.split(r'\W*', '...words...')
				712	['', '', 'w', 'o', 'r', 'd', 's', '', '']
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	713	>>> re.split(r'(\W*)', '...words...')
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	714	['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	715
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	716	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	717	Added the optional flags argument.
				718
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	719	.. versionchanged:: 3.7
				720	Added support of splitting on a pattern that could match an empty string.
				721
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	722
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	723	.. function:: findall(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	724
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	725	Return all non-overlapping matches of pattern in string, as a list of
Georg Brandl	3dbca81	2008-07-23 16:10:53 +0000	[diff] [blame]	726	strings. The string is scanned left-to-right, and matches are returned in
				727	the order found. If one or more groups are present in the pattern, return a
				728	list of groups; this will be a list of tuples if the pattern has more than
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	729	one group. Empty matches are included in the result.
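
How the shape of the result depends on the number of groups can be sketched as follows, using a made-up input string:

```python
import re

text = "width=80, height=24"   # illustrative input

no_groups = re.findall(r"\w+=\d+", text)       # whole matches
one_group = re.findall(r"=(\d+)", text)        # the text of the single group
two_groups = re.findall(r"(\w+)=(\d+)", text)  # one tuple per match

print(no_groups, one_group, two_groups)
```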
				730
				731	.. versionchanged:: 3.7
				732	Non-empty matches can now start just after a previous empty match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	733
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	734
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	735	.. function:: finditer(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	736
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	737	Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
				738	all non-overlapping matches for the RE pattern in string. The string
				739	is scanned left-to-right, and matches are returned in the order found. Empty
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	740	matches are included in the result.
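
Because each item is a match object, positions are available as well as text. A minimal sketch with an illustrative string:

```python
import re

spans = [(m.start(), m.end(), m.group())
         for m in re.finditer(r"\d+", "a1bb22ccc333")]

print(spans)
```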
				741
				742	.. versionchanged:: 3.7
				743	Non-empty matches can now start just after a previous empty match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	744
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	745
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	746	.. function:: sub(pattern, repl, string, count=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	747
				748	Return the string obtained by replacing the leftmost non-overlapping occurrences
				749	of pattern in string by the replacement repl. If the pattern isn't found,
				750	string is returned unchanged. repl can be a string or a function; if it is
				751	a string, any backslash escapes in it are processed. That is, ``\n`` is
Sandro Tosi	6a633bb	2011-08-19 22:54:50 +0200	[diff] [blame]	752	converted to a single newline character, ``\r`` is converted to a carriage return, and
Serhiy Storchaka	a54aae0	2015-03-24 22:58:14 +0200	[diff] [blame]	753	so forth. Unknown escapes such as ``\&`` are left alone. Backreferences, such
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	754	as ``\6``, are replaced with the substring matched by group 6 in the pattern.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	755	For example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	756
				757	>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
				758	... r'static PyObject*\npy_\1(void)\n{',
				759	... 'def myfunc():')
				760	'static PyObject*\npy_myfunc(void)\n{'
				761
				762	If repl is a function, it is called for every non-overlapping occurrence of
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	763	pattern. The function takes a single :ref:`match object <match-objects>`
				764	argument, and returns the replacement string. For example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	765
				766	>>> def dashrepl(matchobj):
				767	... if matchobj.group(0) == '-': return ' '
				768	... else: return '-'
				769	>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
				770	'pro--gram files'
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	771	>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
				772	'Baked Beans & Spam'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	773
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	774	The pattern may be a string or a :ref:`pattern object <re-objects>`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	775
				776	The optional argument count is the maximum number of pattern occurrences to be
				777	replaced; count must be a non-negative integer. If omitted or zero, all
				778	occurrences will be replaced. Empty matches for the pattern are replaced only
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	779	when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
				780	``'-a-b--d-'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	781
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	782	In string-type repl arguments, in addition to the character escapes and
				783	backreferences described above,
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	784	``\g<name>`` will use the substring matched by the group named ``name``, as
				785	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
				786	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
				787	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
				788	reference to group 20, not a reference to group 2 followed by the literal
				789	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
				790	substring matched by the RE.
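
A sketch of the ``\g<...>`` forms. The patterns, inputs and variable names are illustrative assumptions:

```python
import re

# Named groups referenced from the replacement string.
swapped = re.sub(r"(?P<year>\d{4})-(?P<month>\d{2})",
                 r"\g<month>/\g<year>", "2024-05")

# \g<2>0 is group 2 followed by a literal "0".
group_then_zero = re.sub(r"(\d)(\d)", r"\g<2>0", "12")

# \20 is read as a reference to group 20, which does not exist here.
try:
    re.sub(r"(\d)(\d)", r"\20", "12")
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True

print(swapped, group_then_zero, ambiguous_failed)
```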
				791
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	792	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	793	Added the optional flags argument.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	794
Serhiy Storchaka	7438e4b	2014-10-10 11:06:31 +0300	[diff] [blame]	795	.. versionchanged:: 3.5
				796	Unmatched groups are replaced with an empty string.
				797
Serhiy Storchaka	9bd85b8	2016-06-11 19:15:00 +0300	[diff] [blame]	798	.. versionchanged:: 3.6
Serhiy Storchaka	53c53ea	2016-12-06 19:15:29 +0200	[diff] [blame]	799	Unknown escapes in pattern consisting of ``'\'`` and an ASCII letter
				800	now are errors.
				801
Serhiy Storchaka	ff3dbe9	2016-12-06 19:25:19 +0200	[diff] [blame]	802	.. versionchanged:: 3.7
				803	Unknown escapes in repl consisting of ``'\'`` and an ASCII letter
				804	now are errors.
Serhiy Storchaka	a54aae0	2015-03-24 22:58:14 +0200	[diff] [blame]	805
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	806	Empty matches for the pattern are replaced when adjacent to a previous
				807	non-empty match.
				808
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	809
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	810	.. function:: subn(pattern, repl, string, count=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	811
				812	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
				813	number_of_subs_made)``.
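
For example, collapsing runs of whitespace in an illustrative string and counting the replacements:

```python
import re

new_string, count = re.subn(r"\s+", " ", "a  b \t c")

print(repr(new_string), count)
```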
				814
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	815	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	816	Added the optional flags argument.
				817
Serhiy Storchaka	7438e4b	2014-10-10 11:06:31 +0300	[diff] [blame]	818	.. versionchanged:: 3.5
				819	Unmatched groups are replaced with an empty string.
				820
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	821
Serhiy Storchaka	8fc7bc2	2017-04-13 19:17:36 +0300	[diff] [blame]	822	.. function:: escape(pattern)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	823
Serhiy Storchaka	5908300	2017-04-13 21:06:43 +0300	[diff] [blame]	824	Escape special characters in pattern.
Ezio Melotti	88fdeb4	2011-04-10 12:59:16 +0300	[diff] [blame]	825	This is useful if you want to match an arbitrary literal string that may
Serhiy Storchaka	8fc7bc2	2017-04-13 19:17:36 +0300	[diff] [blame]	826	have regular expression metacharacters in it. For example::
				827
				828	>>> print(re.escape('python.exe'))
				829	python\.exe
				830
				831	>>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
				832	>>> print('[%s]+' % re.escape(legal_chars))
Serhiy Storchaka	05cb728	2017-11-16 12:38:26 +0200	[diff] [blame]	833	[abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+
Serhiy Storchaka	8fc7bc2	2017-04-13 19:17:36 +0300	[diff] [blame]	834
				835	>>> operators = ['+', '-', '*', '/', '**']
				836	>>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
Serhiy Storchaka	5908300	2017-04-13 21:06:43 +0300	[diff] [blame]	837	/|\-|\+|\*\*|\*
Ezio Melotti	88fdeb4	2011-04-10 12:59:16 +0300	[diff] [blame]	838
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	839	This function must not be used for the replacement string in :func:`sub`
				840	and :func:`subn`; only backslashes should be escaped. For example::
				841
				842	>>> digits_re = r'\d+'
				843	>>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
				844	>>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
				845	/usr/sbin/sendmail - \d+ errors, \d+ warnings
				846
Ezio Melotti	88fdeb4	2011-04-10 12:59:16 +0300	[diff] [blame]	847	.. versionchanged:: 3.3
				848	The ``'_'`` character is no longer escaped.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	849
Serhiy Storchaka	5908300	2017-04-13 21:06:43 +0300	[diff] [blame]	850	.. versionchanged:: 3.7
				851	Only characters that can have special meaning in a regular expression
				852	are escaped.
				853
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	854
R. David Murray	522c32a	2010-07-10 14:23:36 +0000	[diff] [blame]	855	.. function:: purge()
				856
				857	Clear the regular expression cache.
				858
				859
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	860	.. exception:: error(msg, pattern=None, pos=None)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	861
				862	Exception raised when a string passed to one of the functions here is not a
				863	valid regular expression (for example, it might contain unmatched parentheses)
				864	or when some other error occurs during compilation or matching. It is never an
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	865	error if a string contains no match for a pattern. The error instance has
				866	the following additional attributes:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	867
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	868	.. attribute:: msg
				869
				870	The unformatted error message.
				871
				872	.. attribute:: pattern
				873
				874	The regular expression pattern.
				875
				876	.. attribute:: pos
				877
Serhiy Storchaka	12d6b5d	2017-05-27 16:12:48 +0300	[diff] [blame]	878	The index in pattern where compilation failed (may be ``None``).
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	879
				880	.. attribute:: lineno
				881
Serhiy Storchaka	12d6b5d	2017-05-27 16:12:48 +0300	[diff] [blame]	882	The line corresponding to pos (may be ``None``).
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	883
				884	.. attribute:: colno
				885
Serhiy Storchaka	12d6b5d	2017-05-27 16:12:48 +0300	[diff] [blame]	886	The column corresponding to pos (may be ``None``).
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	887
				888	.. versionchanged:: 3.5
				889	Added additional attributes.
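
As a minimal sketch of how these attributes might be inspected: the pattern below is a deliberately broken one chosen for illustration, and the exact wording of ``msg`` may differ between Python versions.

```python
import re

bad_pattern = "ab(cd"  # illustrative pattern with an unmatched parenthesis

try:
    re.compile(bad_pattern)
except re.error as exc:
    caught = exc

# pattern echoes the offending pattern; pos indexes into it (or is None).
print(caught.msg)
print(caught.pattern, caught.pos, caught.lineno, caught.colno)
```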
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	890
				891	.. _re-objects:
				892
				893	Regular Expression Objects
				894	--------------------------
				895
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	896	Compiled regular expression objects support the following methods and
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	897	attributes:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	898
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	899	.. method:: Pattern.search(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	900
Berker Peksag	84f387d	2016-06-08 14:56:56 +0300	[diff] [blame]	901	Scan through string looking for the first location where this regular
				902	expression produces a match, and return a corresponding :ref:`match object
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	903	<match-objects>`. Return ``None`` if no position in the string matches the
				904	pattern; note that this is different from finding a zero-length match at some
				905	point in the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	906
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	907	The optional second parameter pos gives an index in the string where the
				908	search is to start; it defaults to ``0``. This is not completely equivalent to
				909	slicing the string; the ``'^'`` pattern character matches at the real beginning
				910	of the string and at positions just after a newline, but not necessarily at the
				911	index where the search is to start.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	912
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	913	The optional parameter endpos limits how far the string will be searched; it
				914	will be as if the string is endpos characters long, so only the characters
				915	from pos to ``endpos - 1`` will be searched for a match. If endpos is less
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	916	than pos, no match will be found; otherwise, if rx is a compiled regular
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	917	expression object, ``rx.search(string, 0, 50)`` is equivalent to
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	918	``rx.search(string[:50], 0)``. ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	919
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	920	>>> pattern = re.compile("d")
				921	>>> pattern.search("dog") # Match at index 0
				922	<re.Match object; span=(0, 1), match='d'>
				923	>>> pattern.search("dog", 1) # No match; search doesn't include the "d"
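
The difference between passing pos and slicing, and the equivalence stated for endpos, can be checked with a short sketch. The variable names and sample text are illustrative only.

```python
import re

text = "abc"
starts_with_b = re.compile(r"^b")

# pos=1 starts the scan at index 1, but '^' still refers to the real
# beginning of the string, so no match is found.
with_pos = starts_with_b.search(text, 1)

# Slicing builds a new string that really begins with "b".
with_slice = starts_with_b.search(text[1:])

# endpos, by contrast, behaves like truncating the string.
find_c = re.compile(r"c")
with_endpos = find_c.search(text, 0, 2)
with_truncation = find_c.search(text[:2], 0)

print(with_pos, with_slice, with_endpos, with_truncation)
```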
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	924
				925
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	926	.. method:: Pattern.match(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	927
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	928	If zero or more characters at the beginning of string match this regular
				929	expression, return a corresponding :ref:`match object <match-objects>`.
				930	Return ``None`` if the string does not match the pattern; note that this is
				931	different from a zero-length match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	932
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	933	The optional pos and endpos parameters have the same meaning as for the
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	934	:meth:`~Pattern.search` method. ::
Benjamin Peterson	d7c3ed5	2010-06-27 22:32:30 +0000	[diff] [blame]	935
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	936	>>> pattern = re.compile("o")
				937	>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
				938	>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
				939	<re.Match object; span=(1, 2), match='o'>
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	940
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	941	If you want to locate a match anywhere in string, use
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	942	:meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	943
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	944
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	945	.. method:: Pattern.fullmatch(string[, pos[, endpos]])
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	946
				947	If the whole string matches this regular expression, return a corresponding
				948	:ref:`match object <match-objects>`. Return ``None`` if the string does not
				949	match the pattern; note that this is different from a zero-length match.
				950
				951	The optional pos and endpos parameters have the same meaning as for the
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	952	:meth:`~Pattern.search` method. ::
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	953
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	954	>>> pattern = re.compile("o[gh]")
				955	>>> pattern.fullmatch("dog") # No match as "o" is not at the start of "dog".
				956	>>> pattern.fullmatch("ogre") # No match as the pattern does not match the full string.
				957	>>> pattern.fullmatch("doggie", 1, 3) # Matches within given limits.
				958	<re.Match object; span=(1, 3), match='og'>
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	959
				960	.. versionadded:: 3.4
				961
				962
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	963	.. method:: Pattern.split(string, maxsplit=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	964
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	965	Identical to the :func:`split` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	966
				967
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	968	.. method:: Pattern.findall(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	969
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	970	Similar to the :func:`findall` function, using the compiled pattern, but
				971	also accepts optional pos and endpos parameters that limit the search
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	972	region, as for :meth:`~Pattern.search`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	973
				974
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	975	.. method:: Pattern.finditer(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	976
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	977	Similar to the :func:`finditer` function, using the compiled pattern, but
				978	also accepts optional pos and endpos parameters that limit the search
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	979	region, as for :meth:`~Pattern.search`.
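
A small sketch of how pos and endpos restrict both methods. The sample text is an assumption made for illustration. Note that reported spans remain indices into the original string.

```python
import re

digit = re.compile(r"\d")
text = "a1b2c3d4"

# Without limits, every digit is found.
all_digits = digit.findall(text)

# Only indices 2..5 ("b2c3") are searched.
limited = digit.findall(text, 2, 6)

# finditer() yields match objects whose spans refer to the full string.
limited_spans = [m.span() for m in digit.finditer(text, 2, 6)]

print(all_digits, limited, limited_spans)
```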
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	980
				981
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	982	.. method:: Pattern.sub(repl, string, count=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	983
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	984	Identical to the :func:`sub` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	985
				986
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	987	.. method:: Pattern.subn(repl, string, count=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	988
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	989	Identical to the :func:`subn` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	990
				991
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	992	.. attribute:: Pattern.flags
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	993
Georg Brandl	3a19e54	2012-03-17 17:29:27 +0100	[diff] [blame]	994	The regex matching flags. This is a combination of the flags given to
				995	:func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
				996	flags such as :data:`UNICODE` if the pattern is a Unicode string.
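
The three sources of flags can be told apart by testing individual bits. This is a sketch using illustrative patterns. The numeric value of ``flags`` is deliberately not shown, since only the individual bits matter here.

```python
import re

str_pattern = re.compile(r"(?m)^a", re.IGNORECASE)
bytes_pattern = re.compile(rb"a")

explicit = bool(str_pattern.flags & re.IGNORECASE)  # passed to compile()
inline = bool(str_pattern.flags & re.MULTILINE)     # from the (?m) inline flag
implicit = bool(str_pattern.flags & re.UNICODE)     # implied by a str pattern

# A bytes pattern does not get the implicit UNICODE flag.
bytes_unicode = bool(bytes_pattern.flags & re.UNICODE)

print(explicit, inline, implicit, bytes_unicode)
```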
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	997
				998
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	999	.. attribute:: Pattern.groups
Georg Brandl	af265f4	2008-12-07 15:06:20 +0000	[diff] [blame]	1000
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1001	The number of capturing groups in the pattern.
Georg Brandl	af265f4	2008-12-07 15:06:20 +0000	[diff] [blame]	1002
				1003
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1004	.. attribute:: Pattern.groupindex
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1005
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1006	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
				1007	numbers. The dictionary is empty if no symbolic groups were used in the
				1008	pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1009
				1010
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1011	.. attribute:: Pattern.pattern
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1012
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1013	The pattern string from which the pattern object was compiled.
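
For illustration, the ``groups``, ``groupindex``, and ``pattern`` attributes of a hypothetical date pattern might look as follows. ``groupindex`` is converted to a plain :class:`dict` for comparison.

```python
import re

source = r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})"
date_re = re.compile(source)

group_count = date_re.groups         # two named groups plus one unnamed
names = dict(date_re.groupindex)     # only named groups appear here
original = date_re.pattern           # the string passed to compile()

print(group_count, names, original)
```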
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1014
				1015
Serhiy Storchaka	fdbd011	2017-04-16 10:16:03 +0300	[diff] [blame]	1016	.. versionchanged:: 3.7
				1017	Added support for :func:`copy.copy` and :func:`copy.deepcopy`. Compiled
				1018	regular expression objects are considered atomic.
				1019
				1020
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1021	.. _match-objects:
				1022
				1023	Match Objects
				1024	-------------
				1025
Ezio Melotti	b87f82f	2012-11-04 06:59:22 +0200	[diff] [blame]	1026	Match objects always have a boolean value of ``True``.
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1027	Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
Ezio Melotti	b87f82f	2012-11-04 06:59:22 +0200	[diff] [blame]	1028	when there is no match, you can test whether there was a match with a simple
				1029	``if`` statement::
				1030
				1031	match = re.search(pattern, string)
				1032	if match:
				1033	process(match)
				1034
				1035	Match objects support the following methods and attributes:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1036
				1037
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1038	.. method:: Match.expand(template)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1039
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1040	Return the string obtained by doing backslash substitution on the template
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1041	string template, as done by the :meth:`~Pattern.sub` method.
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1042	Escapes such as ``\n`` are converted to the appropriate characters,
				1043	and numeric backreferences (``\1``, ``\2``) and named backreferences
				1044	(``\g<1>``, ``\g<name>``) are replaced by the contents of the
				1045	corresponding group.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1046
Serhiy Storchaka	7438e4b	2014-10-10 11:06:31 +0300	[diff] [blame]	1047	.. versionchanged:: 3.5
				1048	Unmatched groups are replaced with an empty string.
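
A brief sketch of template expansion, assuming Python 3.5 or later for the handling of the unmatched third group. The pattern and templates are illustrative.

```python
import re

m = re.match(r"(\w+) (\w+)(?: (\w+))?", "Isaac Newton")

# Numeric backreferences are replaced by the group contents.
swapped = m.expand(r"\2, \1")

# \g<1> names the group explicitly; \n becomes a real newline.
with_newline = m.expand(r"\g<1>\n")

# Group 3 did not take part in the match, so it expands to ''.
unmatched = m.expand(r"[\3]")

print(repr(swapped), repr(with_newline), repr(unmatched))
```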
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1049
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1050	.. method:: Match.group([group1, ...])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1051
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1052	Returns one or more subgroups of the match. If there is a single argument, the
				1053	result is a single string; if there are multiple arguments, the result is a
				1054	tuple with one item per argument. Without arguments, group1 defaults to zero
				1055	(the whole match is returned). If a groupN argument is zero, the corresponding
				1056	return value is the entire matching string; if it is in the inclusive range
				1057	[1..99], it is the string matching the corresponding parenthesized group. If a
				1058	group number is negative or larger than the number of groups defined in the
				1059	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
				1060	part of the pattern that did not match, the corresponding result is ``None``.
				1061	If a group is contained in a part of the pattern that matched multiple times,
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1062	the last match is returned. ::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1063
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1064	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
				1065	>>> m.group(0) # The entire match
				1066	'Isaac Newton'
				1067	>>> m.group(1) # The first parenthesized subgroup.
				1068	'Isaac'
				1069	>>> m.group(2) # The second parenthesized subgroup.
				1070	'Newton'
				1071	>>> m.group(1, 2) # Multiple arguments give us a tuple.
				1072	('Isaac', 'Newton')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1073
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1074	If the regular expression uses the ``(?P<name>...)`` syntax, the groupN
				1075	arguments may also be strings identifying groups by their group name. If a
				1076	string argument is not used as a group name in the pattern, an :exc:`IndexError`
				1077	exception is raised.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1078
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1079	A moderately complicated example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1080
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1081	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				1082	>>> m.group('first_name')
				1083	'Malcolm'
				1084	>>> m.group('last_name')
				1085	'Reynolds'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1086
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1087	Named groups can also be referred to by their index::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1088
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1089	>>> m.group(1)
				1090	'Malcolm'
				1091	>>> m.group(2)
				1092	'Reynolds'
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1093
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1094	If a group matches multiple times, only the last match is accessible::
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1095
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1096	>>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times.
				1097	>>> m.group(1) # Returns only the last match.
				1098	'c3'
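
The ``None`` and :exc:`IndexError` cases described above can be sketched as follows. The alternation pattern is an illustrative choice that guarantees one group takes no part in the match.

```python
import re

m = re.match(r"(a)|(b)", "a")

# Group 2 exists in the pattern but did not take part in the match.
unmatched = m.group(2)

# A group number beyond those defined in the pattern raises IndexError.
try:
    m.group(3)
    bad_number_raised = False
except IndexError:
    bad_number_raised = True

# So does a name that is not used as a group name in the pattern.
try:
    m.group("missing")
    bad_name_raised = False
except IndexError:
    bad_name_raised = True

print(unmatched, bad_number_raised, bad_name_raised)
```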
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1099
Brian Curtin	48f16f9	2010-04-08 13:55:29 +0000	[diff] [blame]	1100
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1101	.. method:: Match.__getitem__(g)
Eric V. Smith	605bdae	2016-09-11 08:55:43 -0400	[diff] [blame]	1102
				1103	This is identical to ``m.group(g)``. This allows easier access to
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1104	an individual group from a match::
Eric V. Smith	605bdae	2016-09-11 08:55:43 -0400	[diff] [blame]	1105
				1106	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
				1107	>>> m[0] # The entire match
				1108	'Isaac Newton'
				1109	>>> m[1] # The first parenthesized subgroup.
				1110	'Isaac'
				1111	>>> m[2] # The second parenthesized subgroup.
				1112	'Newton'
				1113
				1114	.. versionadded:: 3.6
				1115
				1116
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1117	.. method:: Match.groups(default=None)
Brian Curtin	48f16f9	2010-04-08 13:55:29 +0000	[diff] [blame]	1118
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1119	Return a tuple containing all the subgroups of the match, from 1 up to however
				1120	many groups are in the pattern. The default argument is used for groups that
				1121	did not participate in the match; it defaults to ``None``.
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1122
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1123	For example::
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1124
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1125	>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
				1126	>>> m.groups()
				1127	('24', '1632')
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1128
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1129	If we make the decimal place and everything after it optional, not all groups
				1130	might participate in the match. These groups will default to ``None`` unless
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1131	the default argument is given::
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1132
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1133	>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
				1134	>>> m.groups() # Second group defaults to None.
				1135	('24', None)
				1136	>>> m.groups('0') # Now, the second group defaults to '0'.
				1137	('24', '0')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1138
				1139
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1140	.. method:: Match.groupdict(default=None)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1141
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1142	Return a dictionary containing all the named subgroups of the match, keyed by
				1143	the subgroup name. The default argument is used for groups that did not
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1144	participate in the match; it defaults to ``None``. For example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1145
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1146	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				1147	>>> m.groupdict()
				1148	{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
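
The effect of the default argument can be seen when a named group does not participate in the match. The group names ``whole`` and ``frac`` are illustrative.

```python
import re

m = re.match(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?", "24")

# The optional fractional part is absent, so 'frac' falls back to the default.
without_default = m.groupdict()
with_default = m.groupdict("0")

print(without_default, with_default)
```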
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1149
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1150
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1151	.. method:: Match.start([group])
				1152	Match.end([group])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1153
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1154	Return the indices of the start and end of the substring matched by group;
				1155	group defaults to zero (meaning the whole matched substring). Return ``-1`` if
				1156	group exists but did not contribute to the match. For a match object m, and
				1157	a group g that did contribute to the match, the substring matched by group g
				1158	(equivalent to ``m.group(g)``) is ::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1159
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1160	m.string[m.start(g):m.end(g)]
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1161
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1162	Note that ``m.start(group)`` will equal ``m.end(group)`` if group matched a
				1163	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
				1164	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
				1165	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1166
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1167	An example that will remove ``remove_this`` from email addresses::
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1168
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1169	>>> email = "tony@tiremove_thisger.net"
				1170	>>> m = re.search("remove_this", email)
				1171	>>> email[:m.start()] + email[m.end():]
				1172	'tony@tiger.net'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1173
				1174
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1175	.. method:: Match.span([group])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1176
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1177	For a match m, return the 2-tuple ``(m.start(group), m.end(group))``. Note
				1178	that if group did not contribute to the match, this is ``(-1, -1)``.
				1179	group defaults to zero, the entire match.
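
The index rules for :meth:`~Match.start`, :meth:`~Match.end`, and :meth:`~Match.span` can be checked with a short sketch. The second pattern is an illustrative one in which a group does not contribute to the match.

```python
import re

m = re.search("b(c?)", "cba")
whole = (m.start(0), m.end(0))   # "b" occupies index 1
empty_group = m.span(1)          # (c?) matched the empty string at index 2
sliced = m.string[m.start(1):m.end(1)]

# A group that exists but did not contribute reports -1.
m2 = re.match(r"(a)|(b)", "a")
missing_start = m2.start(2)
missing_span = m2.span(2)

print(whole, empty_group, repr(sliced), missing_start, missing_span)
```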
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1180
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1181
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1182	.. attribute:: Match.pos
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1183
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1184	The value of pos which was passed to the :meth:`~Pattern.search` or
				1185	:meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
Georg Brandl	69c7a69	2012-03-14 08:02:43 +0100	[diff] [blame]	1186	the index into the string at which the RE engine started looking for a match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1187
				1188
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1189	.. attribute:: Match.endpos
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1190
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1191	The value of endpos which was passed to the :meth:`~Pattern.search` or
				1192	:meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
Georg Brandl	69c7a69	2012-03-14 08:02:43 +0100	[diff] [blame]	1193	the index into the string beyond which the RE engine will not go.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1194
				1195
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1196	.. attribute:: Match.lastindex
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1197
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1198	The integer index of the last matched capturing group, or ``None`` if no group
				1199	was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
				1200	``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
				1201	the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
				1202	string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1203
				1204
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1205	.. attribute:: Match.lastgroup
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1206
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1207	The name of the last matched capturing group, or ``None`` if the group didn't
				1208	have a name, or if no group was matched at all.
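
The ``lastindex`` values listed above, together with a few ``lastgroup`` cases, can be reproduced as follows. The group names ``first`` and ``second`` are illustrative.

```python
import re

patterns = [r"(a)b", r"((a)(b))", r"((ab))", r"(a)(b)"]
last_indexes = [re.match(p, "ab").lastindex for p in patterns]

# With no capturing group at all, both attributes are None.
no_groups = re.match(r"ab", "ab")

# lastgroup is the name of the last matched group, if it has one.
named = re.match(r"(?P<first>a)(?P<second>b)", "ab")
unnamed_last = re.match(r"(?P<first>a)(b)", "ab")

print(last_indexes, no_groups.lastindex, named.lastgroup, unnamed_last.lastgroup)
```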
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1209
				1210
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1211	.. attribute:: Match.re
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1212
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1213	The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1214	:meth:`~Pattern.search` method produced this match instance.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1215
				1216
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1217	.. attribute:: Match.string
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1218
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1219	The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.
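
A sketch showing how ``pos``, ``endpos``, ``re``, and ``string`` record the call that produced a match. The sample text is illustrative. When endpos is omitted, ``endpos`` reports the length of the string.

```python
import re

rx = re.compile("o")
text = "dog food"

limited = rx.search(text, 2, 6)   # search only indices 2..5
default = rx.search(text)         # pos and endpos left at their defaults

print(limited.pos, limited.endpos, limited.span())
print(default.pos, default.endpos, default.span())
print(limited.re is rx, limited.string is text)
```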
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1220
				1221
Serhiy Storchaka	fdbd011	2017-04-16 10:16:03 +0300	[diff] [blame]	1222	.. versionchanged:: 3.7
				1223	Added support for :func:`copy.copy` and :func:`copy.deepcopy`. Match objects
				1224	are considered atomic.
				1225
				1226
Raymond Hettinger	1fa7682	2010-12-06 23:31:36 +0000	[diff] [blame]	1227	.. _re-examples:
				1228
				1229	Regular Expression Examples
				1230	---------------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1231
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1232
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	1233	Checking for a Pair
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1234	^^^^^^^^^^^^^^^^^^^
				1235
				1236	In this example, we'll use the following helper function to display match
Stéphane Wirtel	859c068	2018-10-12 09:51:05 +0200	[diff] [blame^]	1237	objects a little more gracefully::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1238
				1239	def displaymatch(match):
				1240	if match is None:
				1241	return None
				1242	return '<Match: %r, groups=%r>' % (match.group(), match.groups())
				1243
				1244	Suppose you are writing a poker program where a player's hand is represented as
				1245	a 5-character string with each character representing a card, "a" for ace, "k"
Ezio Melotti	e5b2ac8	2011-12-17 01:17:17 +0200	[diff] [blame]	1246	for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1247	representing the card with that value.
				1248
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1249	To see if a given string is a valid hand, one could do the following::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1250
Ezio Melotti	e5b2ac8	2011-12-17 01:17:17 +0200	[diff] [blame]	1251	>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
				1252	>>> displaymatch(valid.match("akt5q")) # Valid.
				1253	"<Match: 'akt5q', groups=()>"
				1254	>>> displaymatch(valid.match("akt5e")) # Invalid.
				1255	>>> displaymatch(valid.match("akt")) # Invalid.
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1256	>>> displaymatch(valid.match("727ak")) # Valid.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1257	"<Match: '727ak', groups=()>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1258
				1259	That last hand, ``"727ak"``, contained a pair, that is, two cards of the same value.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1260	To match this with a regular expression, one could use backreferences as follows::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1261
				1262	>>> pair = re.compile(r".*(.).*\1")
				1263	>>> displaymatch(pair.match("717ak")) # Pair of 7s.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1264	"<Match: '717', groups=('7',)>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1265	>>> displaymatch(pair.match("718ak")) # No pairs.
				1266	>>> displaymatch(pair.match("354aa")) # Pair of aces.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1267	"<Match: '354aa', groups=('a',)>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1268
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	1269	To find out what card the pair consists of, one could use the
Stéphane Wirtel	859c068	2018-10-12 09:51:05 +0200	[diff] [blame^]	1270	:meth:`~Match.group` method of the match object in the following manner::
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1271
Stéphane Wirtel	859c068	2018-10-12 09:51:05 +0200	[diff] [blame^]	1272	>>> pair = re.compile(r".*(.).*\1")
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1273	>>> pair.match("717ak").group(1)
				1274	'7'
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1275
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1276	# Error because pair.match() returns None, which doesn't have a group() method:
				1277	>>> pair.match("718ak").group(1)
				1278	Traceback (most recent call last):
				1279	File "<pyshell#23>", line 1, in <module>
				1280	pair.match("718ak").group(1)
				1281	AttributeError: 'NoneType' object has no attribute 'group'
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1282
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1283	>>> pair.match("354aa").group(1)
				1284	'a'
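
To avoid the :exc:`AttributeError` shown above, the result can be tested before :meth:`~Match.group` is called. The helper ``pair_rank`` below is a hypothetical name introduced only for this sketch.

```python
import re

pair = re.compile(r".*(.).*\1")

def pair_rank(hand):
    """Return the card that forms a pair in hand, or None if there is none."""
    match = pair.match(hand)
    # match is None when the hand contains no repeated card.
    return match.group(1) if match else None

ranks = [pair_rank(hand) for hand in ("717ak", "718ak", "354aa")]
print(ranks)
```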
				1285
				1286
				1287	Simulating scanf()
				1288	^^^^^^^^^^^^^^^^^^
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1289
				1290	.. index:: single: scanf()
				1291
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1292	Python does not currently have an equivalent to :c:func:`scanf`. Regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1293	expressions are generally more powerful, though also more verbose, than
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1294	:c:func:`scanf` format strings. The table below offers some more-or-less
				1295	equivalent mappings between :c:func:`scanf` format tokens and regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1296	expressions.
				1297
				1298	+--------------------------------+---------------------------------------------+
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1299	| :c:func:`scanf` Token          | Regular Expression                          |
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1300	+================================+=============================================+
				1301	| ``%c``                         | ``.``                                       |
				1302	+--------------------------------+---------------------------------------------+
				1303	| ``%5c``                        | ``.{5}``                                    |
				1304	+--------------------------------+---------------------------------------------+
				1305	| ``%d``                         | ``[-+]?\d+``                                |
				1306	+--------------------------------+---------------------------------------------+
				1307	| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
				1308	+--------------------------------+---------------------------------------------+
				1309	| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
				1310	+--------------------------------+---------------------------------------------+
Ezio Melotti	a0b1d1e	2012-04-29 11:47:28 +0300	[diff] [blame]	1311	| ``%o``                         | ``[-+]?[0-7]+``                             |
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1312	+--------------------------------+---------------------------------------------+
				1313	| ``%s``                         | ``\S+``                                     |
				1314	+--------------------------------+---------------------------------------------+
				1315	| ``%u``                         | ``\d+``                                     |
				1316	+--------------------------------+---------------------------------------------+
Ezio Melotti	a0b1d1e	2012-04-29 11:47:28 +0300	[diff] [blame]	1317	| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1318	+--------------------------------+---------------------------------------------+
				1319
				1320	To extract the filename and numbers from a string like ::
				1321
				1322	/usr/sbin/sendmail - 0 errors, 4 warnings
				1323
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1324	you would use a :c:func:`scanf` format like ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1325
				1326	%s - %d errors, %d warnings
				1327
				1328	The equivalent regular expression would be ::
				1329
				1330	(\S+) - (\d+) errors, (\d+) warnings
				1331
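As a rough sketch, the regular expression above could be applied from Python
as follows. The helper name ``parse_status`` and the conversion of the counts
to :class:`int` are assumptions made for illustration; they are not part of
the original example.

```python
import re

# Pattern from the text: filename, error count, warning count.
STATUS_RE = re.compile(r"(\S+) - (\d+) errors, (\d+) warnings")

def parse_status(line):
    """Return (filename, errors, warnings), or None if the line does not match."""
    m = STATUS_RE.match(line)
    if m is None:
        return None
    filename, errors, warnings = m.groups()
    # scanf's %d yields integers, so convert the captured digit strings.
    return filename, int(errors), int(warnings)

result = parse_status("/usr/sbin/sendmail - 0 errors, 4 warnings")
print(result)  # ('/usr/sbin/sendmail', 0, 4)
```

Unlike :c:func:`scanf`, the match object returns strings, so any numeric
conversion has to be done explicitly.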
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1332
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1333	.. _search-vs-match:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1334
				1335	search() vs. match()
				1336	^^^^^^^^^^^^^^^^^^^^
				1337
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1338	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1339
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1340	Python offers two different primitive operations based on regular expressions:
				1341	:func:`re.match` checks for a match only at the beginning of the string, while
				1342	:func:`re.search` checks for a match anywhere in the string (this is what Perl
				1343	does by default).
				1344
				1345	For example::
				1346
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	1347	>>> re.match("c", "abcdef") # No match
				1348	>>> re.search("c", "abcdef") # Match
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1349	<re.Match object; span=(2, 3), match='c'>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1350
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1351	Regular expressions beginning with ``'^'`` can be used with :func:`search` to
				1352	restrict the match to the beginning of the string::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1353
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	1354	>>> re.match("c", "abcdef") # No match
				1355	>>> re.search("^c", "abcdef") # No match
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1356	>>> re.search("^a", "abcdef") # Match
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1357	<re.Match object; span=(0, 1), match='a'>
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1358
				1359	Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
				1360	beginning of the string, whereas using :func:`search` with a regular expression
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1361	beginning with ``'^'`` will match at the beginning of each line. ::
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1362
				1363	>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
				1364	>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1365	<re.Match object; span=(4, 5), match='X'>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1366
				1367
				1368	Making a Phonebook
				1369	^^^^^^^^^^^^^^^^^^
				1370
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1371	:func:`split` splits a string into a list, using the passed pattern as the delimiter. The
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1372	function is invaluable for converting textual data into data structures that can be
				1373	easily read and modified by Python, as demonstrated in the following example that
				1374	creates a phonebook.
				1375
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1376	First, here is the input. Normally it would come from a file; here we use
Stéphane Wirtel	859c068	2018-10-12 09:51:05 +0200	[diff] [blame^]	1377	triple-quoted string syntax:
				1378
				1379	.. doctest::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1380
Georg Brandl	557a3ec	2012-03-17 17:26:27 +0100	[diff] [blame]	1381	>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1382	...
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1383	... Ronald Heathmore: 892.345.3428 436 Finley Avenue
				1384	... Frank Burger: 925.541.7625 662 South Dogwood Way
				1385	...
				1386	...
				1387	... Heather Albrecht: 548.326.4584 919 Park Place"""
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1388
				1389	The entries are separated by one or more newlines. Now we convert the string
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1390	into a list with each nonempty line having its own entry:
				1391
				1392	.. doctest::
				1393	:options: +NORMALIZE_WHITESPACE
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1394
Georg Brandl	557a3ec	2012-03-17 17:26:27 +0100	[diff] [blame]	1395	>>> entries = re.split("\n+", text)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1396	>>> entries
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1397	['Ross McFluff: 834.345.1254 155 Elm Street',
				1398	'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
				1399	'Frank Burger: 925.541.7625 662 South Dogwood Way',
				1400	'Heather Albrecht: 548.326.4584 919 Park Place']
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1401
				1402	Finally, split each entry into a list with first name, last name, telephone
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1403	number, and address. We use the ``maxsplit`` parameter of :func:`split`
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1404	because the address itself contains spaces, which our splitting pattern matches:
				1405
				1406	.. doctest::
				1407	:options: +NORMALIZE_WHITESPACE
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1408
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1409	>>> [re.split(":? ", entry, 3) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1410	[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
				1411	['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
				1412	['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
				1413	['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
				1414
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1415	The ``:?`` pattern matches the colon after the last name, so that it does not
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1416	occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1417	house number from the street name:
				1418
				1419	.. doctest::
				1420	:options: +NORMALIZE_WHITESPACE
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1421
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1422	>>> [re.split(":? ", entry, 4) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1423	[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
				1424	['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
				1425	['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
				1426	['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
				1427
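The split entries can then be stored in a structure that is convenient to
query. The following is a minimal sketch, assuming a dictionary keyed by full
name is wanted; the ``phonebook`` variable and its ``"phone"`` and
``"address"`` keys are illustrative choices, not part of the original example.

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(r":? ", entry, maxsplit=3)
    phonebook[f"{first} {last}"] = {"phone": phone, "address": address}

print(phonebook["Frank Burger"]["phone"])  # 925.541.7625
```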
				1428
				1429	Text Munging
				1430	^^^^^^^^^^^^
				1431
				1432	:func:`sub` replaces every occurrence of a pattern with a string or the
				1433	result of a function. This example demonstrates using :func:`sub` with
				1434	a function to "munge" text, or randomize the order of all the characters
				1435	in each word of a sentence except for the first and last characters::
				1436
				1437	>>> def repl(m):
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	1438	...     inner_word = list(m.group(2))
				1439	...     random.shuffle(inner_word)
				1440	...     return m.group(1) + "".join(inner_word) + m.group(3)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1441	>>> text = "Professor Abdolmalek, please report your absences promptly."
Georg Brandl	db4e939	2010-07-12 09:06:13 +0000	[diff] [blame]	1442	>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1443	'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
Georg Brandl	db4e939	2010-07-12 09:06:13 +0000	[diff] [blame]	1444	>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1445	'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
				1446
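Because the shuffle is random, the output differs on every call. When a
repeatable result is needed (for instance in a test), one option is to use a
private, seeded generator. The sketch below assumes this approach; the
``munge`` wrapper and its ``seed`` parameter are hypothetical names introduced
for illustration.

```python
import random
import re

def munge(text, seed=None):
    """Shuffle the inner letters of each word, keeping first and last in place."""
    rng = random.Random(seed)  # private generator; a fixed seed repeats the result

    def repl(m):
        inner_word = list(m.group(2))
        rng.shuffle(inner_word)
        return m.group(1) + "".join(inner_word) + m.group(3)

    return re.sub(r"(\w)(\w+)(\w)", repl, text)

text = "Professor Abdolmalek, please report your absences promptly."
munged = munge(text, seed=0)
print(munged)
print(munge(text, seed=0) == munged)  # True: same seed, same shuffle
```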
				1447
				1448	Finding all Adverbs
				1449	^^^^^^^^^^^^^^^^^^^
				1450
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1451	:func:`findall` matches all occurrences of a pattern, not just the first
Andrés Delfino	5092439	2018-06-18 01:34:30 -0300	[diff] [blame]	1452	one as :func:`search` does. For example, if a writer wanted to
				1453	find all of the adverbs in some text, they might use :func:`findall` in
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1454	the following manner::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1455
				1456	>>> text = "He was carefully disguised but captured quickly by police."
				1457	>>> re.findall(r"\w+ly", text)
				1458	['carefully', 'quickly']
				1459
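Note that ``\w+ly`` also matches the start of a longer word that merely
contains "ly". A slightly stricter variant, shown here as a sketch, adds a
``\b`` word boundary after the suffix; the second sample sentence is made up
for illustration.

```python
import re

text = "He was carefully disguised but captured quickly by police."
print(re.findall(r"\w+ly\b", text))    # ['carefully', 'quickly']

sample = "He calmly analyzed the polygon."
loose = re.findall(r"\w+ly", sample)
strict = re.findall(r"\w+ly\b", sample)
print(loose)    # ['calmly', 'analy', 'poly']
print(strict)   # ['calmly']
```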
				1460
				1461	Finding all Adverbs and their Positions
				1462	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				1463
				1464	If one wants more information about all matches of a pattern than the matched
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1465	text, :func:`finditer` is useful as it provides :ref:`match objects
				1466	<match-objects>` instead of strings. Continuing with the previous example, if
Andrés Delfino	5092439	2018-06-18 01:34:30 -0300	[diff] [blame]	1467	a writer wanted to find all of the adverbs and their positions in
				1468	some text, they would use :func:`finditer` in the following manner::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1469
				1470	>>> text = "He was carefully disguised but captured quickly by police."
				1471	>>> for m in re.finditer(r"\w+ly", text):
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1472	...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1473	07-16: carefully
				1474	40-47: quickly
				1475
				1476
				1477	Raw String Notation
				1478	^^^^^^^^^^^^^^^^^^^
				1479
				1480	Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
				1481	every backslash (``'\'``) in a regular expression would have to be prefixed with
				1482	another one to escape it. For example, the two following lines of code are
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1483	functionally identical::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1484
				1485	>>> re.match(r"\W(.)\1\W", " ff ")
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1486	<re.Match object; span=(0, 4), match=' ff '>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1487	>>> re.match("\\W(.)\\1\\W", " ff ")
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1488	<re.Match object; span=(0, 4), match=' ff '>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1489
				1490	When one wants to match a literal backslash, it must be escaped in the regular
				1491	expression. With raw string notation, this means ``r"\\"``. Without raw string
				1492	notation, one must use ``"\\\\"``, making the following lines of code
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1493	functionally identical::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1494
				1495	>>> re.match(r"\\", r"\\")
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1496	<re.Match object; span=(0, 1), match='\\'>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1497	>>> re.match("\\\\", r"\\")
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1498	<re.Match object; span=(0, 1), match='\\'>
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1499
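The equivalence can also be checked directly by comparing the string objects
themselves. In the short sketch below, the Windows-style path is a
hypothetical value used only to have a backslash to search for.

```python
import re

raw = r"\W(.)\1\W"
escaped = "\\W(.)\\1\\W"
print(raw == escaped)            # True: both spell the same pattern text

# Each literal holds two characters, which the regex engine reads as
# one escaped (literal) backslash.
print(len(r"\\"), len("\\\\"))   # 2 2

path = r"C:\temp"                # hypothetical example string
backslash_at = re.search(r"\\", path).start()
print(backslash_at)              # 2
```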
				1500
				1501	Writing a Tokenizer
				1502	^^^^^^^^^^^^^^^^^^^
				1503
Georg Brandl	5d94134	2016-02-26 19:37:12 +0100	[diff] [blame]	1504	A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1505	analyzes a string to categorize groups of characters. This is a useful first
				1506	step in writing a compiler or interpreter.
				1507
				1508	The text categories are specified with regular expressions. The technique is
				1509	to combine those into a single master regular expression and to loop over
				1510	successive matches::
				1511
Raymond Hettinger	4b244ef	2011-05-23 12:45:34 -0700	[diff] [blame]	1512	import collections
				1513	import re
				1514
				1515	Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1516
Raymond Hettinger	c566431	2014-08-03 23:38:54 -0700	[diff] [blame]	1517	def tokenize(code):
Raymond Hettinger	4b244ef	2011-05-23 12:45:34 -0700	[diff] [blame]	1518	    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
				1519	    token_specification = [
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	1520	        ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
				1521	        ('ASSIGN',   r':='),           # Assignment operator
				1522	        ('END',      r';'),            # Statement terminator
				1523	        ('ID',       r'[A-Za-z]+'),    # Identifiers
				1524	        ('OP',       r'[+\-*/]'),      # Arithmetic operators
				1525	        ('NEWLINE',  r'\n'),           # Line endings
				1526	        ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
				1527	        ('MISMATCH', r'.'),            # Any other character
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1528	    ]
Raymond Hettinger	4b244ef	2011-05-23 12:45:34 -0700	[diff] [blame]	1529	    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
Raymond Hettinger	c566431	2014-08-03 23:38:54 -0700	[diff] [blame]	1530	    line_num = 1
				1531	    line_start = 0
				1532	    for mo in re.finditer(tok_regex, code):
				1533	        kind = mo.lastgroup
				1534	        value = mo.group(kind)
				1535	        if kind == 'NEWLINE':
				1536	            line_start = mo.end()
				1537	            line_num += 1
				1538	        elif kind == 'SKIP':
				1539	            pass
				1540	        elif kind == 'MISMATCH':
Raymond Hettinger	d0b9158	2017-02-06 07:15:31 -0800	[diff] [blame]	1541	            raise RuntimeError(f'{value!r} unexpected on line {line_num}')
Raymond Hettinger	c566431	2014-08-03 23:38:54 -0700	[diff] [blame]	1542	        else:
				1543	            if kind == 'ID' and value in keywords:
				1544	                kind = value
				1545	            column = mo.start() - line_start
				1546	            yield Token(kind, value, line_num, column)
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1547
Raymond Hettinger	4b244ef	2011-05-23 12:45:34 -0700	[diff] [blame]	1548	statements = '''
				1549	    IF quantity THEN
				1550	        total := total + price * quantity;
				1551	        tax := price * 0.05;
				1552	    ENDIF;
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1553	'''
Raymond Hettinger	23157e5	2011-05-13 01:38:31 -0700	[diff] [blame]	1554
				1555	for token in tokenize(statements):
				1556	    print(token)
				1557
				1558	The tokenizer produces the following output::
Raymond Hettinger	9c47d77	2011-05-13 01:03:50 -0700	[diff] [blame]	1559
Raymond Hettinger	c566431	2014-08-03 23:38:54 -0700	[diff] [blame]	1560	Token(typ='IF', value='IF', line=2, column=4)
				1561	Token(typ='ID', value='quantity', line=2, column=7)
				1562	Token(typ='THEN', value='THEN', line=2, column=16)
				1563	Token(typ='ID', value='total', line=3, column=8)
				1564	Token(typ='ASSIGN', value=':=', line=3, column=14)
				1565	Token(typ='ID', value='total', line=3, column=17)
				1566	Token(typ='OP', value='+', line=3, column=23)
				1567	Token(typ='ID', value='price', line=3, column=25)
				1568	Token(typ='OP', value='*', line=3, column=31)
				1569	Token(typ='ID', value='quantity', line=3, column=33)
				1570	Token(typ='END', value=';', line=3, column=41)
				1571	Token(typ='ID', value='tax', line=4, column=8)
				1572	Token(typ='ASSIGN', value=':=', line=4, column=12)
				1573	Token(typ='ID', value='price', line=4, column=15)
				1574	Token(typ='OP', value='*', line=4, column=21)
				1575	Token(typ='NUMBER', value='0.05', line=4, column=23)
				1576	Token(typ='END', value=';', line=4, column=27)
				1577	Token(typ='ENDIF', value='ENDIF', line=5, column=4)
				1578	Token(typ='END', value=';', line=5, column=9)
Berker Peksag	a0a42d2	2018-03-23 16:46:52 +0300	[diff] [blame]	1579
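The core of the technique is that each category is a named group in one
alternation, so ``lastgroup`` on the match object names the category that
matched. The following reduced sketch isolates that idea; the three-entry
``spec`` list is a hypothetical specification, much smaller than the one used
by the tokenizer above.

```python
import re

# Hypothetical, minimal specification: (category name, pattern) pairs.
spec = [("NUMBER", r"\d+"), ("WORD", r"[A-Za-z]+"), ("SKIP", r"\s+")]

# Build one master pattern of named groups joined by alternation.
master = "|".join(f"(?P<{name}>{pattern})" for name, pattern in spec)

kinds = [(mo.lastgroup, mo.group())
         for mo in re.finditer(master, "abc 42")
         if mo.lastgroup != "SKIP"]
print(kinds)  # [('WORD', 'abc'), ('NUMBER', '42')]
```

Alternatives are tried from left to right, which is why a catch-all
category such as ``MISMATCH`` in the tokenizer above has to come last.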
				1580
				1581	.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
				1582	Media, 2009. The third edition of the book no longer covers Python at all,
				1583	but the first edition covered writing good regular expression patterns in
				1584	great detail.