Blame - Doc/library/re.rst - platform/external/python/cpython3

Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1	:mod:`re` --- Regular expression operations
				2	===========================================
				3
				4	.. module:: re
				5	:synopsis: Regular expression operations.
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	6
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	7	.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
				8	.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
				9
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	10	Source code: :source:`Lib/re.py`
				11
				12	--------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	13
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	14	This module provides regular expression matching operations similar to
Georg Brandl	ed2a1db	2009-06-08 07:48:27 +0000	[diff] [blame]	15	those found in Perl.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	16
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	17	Both patterns and strings to be searched can be Unicode strings (:class:`str`)
				18	as well as 8-bit strings (:class:`bytes`).
				19	However, Unicode strings and 8-bit strings cannot be mixed:
Martin Panter	6245cb3	2016-04-15 02:14:19 +0000	[diff] [blame]	20	that is, you cannot match a Unicode string with a byte pattern or
Georg Brandl	ae2dbe2	2009-03-13 19:04:40 +0000	[diff] [blame]	21	vice-versa; similarly, when asking for a substitution, the replacement
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	22	string must be of the same type as both the pattern and the search string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	23
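As a small illustration of the rule above (the sample text ``order 66`` is made up for this sketch), a pattern and the string it searches must both be :class:`str` or both be :class:`bytes`; mixing them raises :exc:`TypeError`:

```python
import re

# str pattern against str text, bytes pattern against bytes text: both fine.
str_hit = re.search(r"\d+", "order 66").group()
bytes_hit = re.search(rb"\d+", b"order 66").group()

# A bytes pattern against a str raises TypeError.
try:
    re.search(rb"\d+", "order 66")
    mixed_error = None
except TypeError as exc:
    mixed_error = type(exc).__name__
```
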
				24	Regular expressions use the backslash character (``'\'``) to indicate
				25	special forms or to allow special characters to be used without invoking
				26	their special meaning. This collides with Python's usage of the same
				27	character for the same purpose in string literals; for example, to match
				28	a literal backslash, one might have to write ``'\\\\'`` as the pattern
				29	string, because the regular expression must be ``\\``, and each
				30	backslash must be expressed as ``\\`` inside a regular Python string
Pablo Galindo	e8239b8	2019-01-20 18:57:56 +0000	[diff] [blame]	31	literal. Also, note that any invalid escape sequence in a Python string
				32	literal now generates a :exc:`DeprecationWarning` and will become a
				33	:exc:`SyntaxError` in the future. This happens even if the sequence is
				34	a valid escape sequence for a regular expression.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	35
				36	The solution is to use Python's raw string notation for regular expression
				37	patterns; backslashes are not handled in any special way in a string literal
				38	prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
				39	``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	40	newline. Usually patterns will be expressed in Python code using this raw
				41	string notation.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	42
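The following sketch compares raw and ordinary string literals; the sample path ``C:\temp`` is only an illustrative value:

```python
import re

# r"\n" keeps the backslash (two characters); "\n" is a single newline.
lengths = (len(r"\n"), len("\n"))

# Both spellings produce the same two-character pattern, which matches
# one literal backslash.
same_pattern = (r"\\" == "\\\\")

# Find the backslash in the text C:\temp (written here with an escaped backslash).
backslash_span = re.search(r"\\", "C:\\temp").span()
```
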
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	43	Most regular expression operations are available both as
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	44	module-level functions and as methods on
				45	:ref:`compiled regular expressions <re-objects>`. The functions are shortcuts
				46	that don't require you to compile a regex object first, but they lack some
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	47	fine-tuning parameters.
				48
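A minimal sketch of the difference, using a made-up sample string: the method on a compiled pattern accepts a starting position, one of the fine-tuning parameters that the module-level function does not take.

```python
import re

text = "spam eggs spam"

# Module-level shortcut: no explicit compile step.
shortcut_span = re.search(r"spam", text).span()

# Compiled pattern: search() also accepts a start position (pos).
pattern = re.compile(r"spam")
later_span = pattern.search(text, 1).span()
```
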
Marco Buttu	ed6795e	2017-02-26 16:26:23 +0100	[diff] [blame]	49	.. seealso::
				50
Stéphane Wirtel	19177fb	2018-05-15 20:58:35 +0200	[diff] [blame]	51	The third-party `regex <https://pypi.org/project/regex/>`_ module,
Marco Buttu	ed6795e	2017-02-26 16:26:23 +0100	[diff] [blame]	52	which has an API compatible with the standard library :mod:`re` module,
				53	but offers additional functionality and more thorough Unicode support.
				54
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	55
				56	.. _re-syntax:
				57
				58	Regular Expression Syntax
				59	-------------------------
				60
				61	A regular expression (or RE) specifies a set of strings that matches it; the
				62	functions in this module let you check if a particular string matches a given
				63	regular expression (or if a given regular expression matches a particular
				64	string, which comes down to the same thing).
				65
				66	Regular expressions can be concatenated to form new regular expressions; if A
				67	and B are both regular expressions, then AB is also a regular expression.
				68	In general, if a string p matches A and another string q matches B, the
				69	string pq will match AB. This holds unless A or B contains low-precedence
				70	operations, there are boundary conditions between A and B, or either has
				71	numbered group references. Thus, complex expressions can easily be constructed from simpler
				72	primitive expressions like the ones described here. For details of the theory
Berker Peksag	a0a42d2	2018-03-23 16:46:52 +0300	[diff] [blame]	73	and implementation of regular expressions, consult the Friedl book [Frie09]_,
				74	or almost any textbook about compiler construction.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	75
				76	A brief explanation of the format of regular expressions follows. For further
Christian Heimes	2202f87	2008-02-06 14:31:34 +0000	[diff] [blame]	77	information and a gentler presentation, consult the :ref:`regex-howto`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	78
				79	Regular expressions can contain both special and ordinary characters. Most
				80	ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
				81	expressions; they simply match themselves. You can concatenate ordinary
				82	characters, so ``last`` matches the string ``'last'``. (In the rest of this
				83	section, we'll write REs in ``this special style``, usually without quotes, and
				84	strings to be matched ``'in single quotes'``.)
				85
				86	Some characters, like ``'|'`` or ``'('``, are special. Special
				87	characters either stand for classes of ordinary characters, or affect
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	88	how the regular expressions around them are interpreted.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	89
Martin Panter	684340e	2016-10-15 01:18:16 +0000	[diff] [blame]	90	Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
				91	directly nested. This avoids ambiguity with the non-greedy modifier suffix
				92	``?``, and with other modifiers in other implementations. To apply a second
				93	repetition to an inner repetition, parentheses may be used. For example,
				94	the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
				95
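A short sketch of the nesting rule: wrapping the inner repetition in a non-capturing group makes a second repetition legal, while stacking two qualifiers directly is rejected when the pattern is compiled.

```python
import re

# (?:a{6})* matches any multiple of six 'a' characters.
six = re.compile(r"(?:a{6})*")
twelve_ok = six.fullmatch("a" * 12) is not None
seven_ok = six.fullmatch("a" * 7) is not None

# Two directly nested qualifiers raise re.error at compile time.
try:
    re.compile(r"a**")
    nested_rejected = False
except re.error:
    nested_rejected = True
```
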
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	96
				97	The special characters are:
				98
Serhiy Storchaka	913876d	2018-10-28 13:41:26 +0200	[diff] [blame]	99	.. index:: single: . (dot); in regular expressions
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	100
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	101	``.``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	102	(Dot.) In the default mode, this matches any character except a newline. If
				103	the :const:`DOTALL` flag has been specified, this matches any character
				104	including a newline.
				105
Serhiy Storchaka	913876d	2018-10-28 13:41:26 +0200	[diff] [blame]	106	.. index:: single: ^ (caret); in regular expressions
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	107
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	108	``^``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	109	(Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
				110	matches immediately after each newline.
				111
Serhiy Storchaka	913876d	2018-10-28 13:41:26 +0200	[diff] [blame]	112	.. index:: single: $ (dollar); in regular expressions
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	113
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	114	``$``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	115	Matches the end of the string or just before the newline at the end of the
				116	string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
				117	matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
				118	only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
Christian Heimes	25bb783	2008-01-11 16:17:00 +0000	[diff] [blame]	119	matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
				120	a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
				121	the newline, and one at the end of the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	122
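The ``$`` cases described above can be checked with a short sketch:

```python
import re

text = "foo1\nfoo2\n"

# Default mode: $ matches only at the very end (or before a final newline).
default_hit = re.search(r"foo.$", text).group()

# MULTILINE mode: $ also matches before every newline.
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group()

# A lone $ finds two empty matches in 'foo\n'.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]
```
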
Serhiy Storchaka	913876d	2018-10-28 13:41:26 +0200	[diff] [blame]	123	.. index:: single: * (asterisk); in regular expressions
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	124
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	125	``*``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	126	Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
				127	many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
				128	by any number of 'b's.
				129
Serhiy Storchaka	913876d	2018-10-28 13:41:26 +0200	[diff] [blame]	130	.. index:: single: + (plus); in regular expressions
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	131
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	132	``+``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	133	Causes the resulting RE to match 1 or more repetitions of the preceding RE.
				134	``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
				135	match just 'a'.
				136
Serhiy Storchaka	913876d	2018-10-28 13:41:26 +0200	[diff] [blame]	137	.. index:: single: ? (question mark); in regular expressions
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	138
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	139	``?``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	140	Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
				141	``ab?`` will match either 'a' or 'ab'.
				142
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	143	.. index::
				144	single: *?; in regular expressions
				145	single: +?; in regular expressions
				146	single: ??; in regular expressions
				147
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	148	``*?``, ``+?``, ``??``
				149	The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
				150	as much text as possible. Sometimes this behaviour isn't desired; if the RE
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	151	``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
				152	string, and not just ``'<a>'``. Adding ``?`` after the qualifier makes it
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	153	perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
Georg Brandl	7ff033b	2016-04-12 07:51:41 +0200	[diff] [blame]	154	characters as possible will be matched. Using the RE ``<.*?>`` will match
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	155	only ``'<a>'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	156
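As a quick sketch of greedy versus non-greedy matching on the string used above:

```python
import re

text = "<a> b <c>"

greedy = re.match(r"<.*>", text).group()   # runs to the last '>'
lazy = re.match(r"<.*?>", text).group()    # stops at the first '>'
```
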
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	157	.. index::
Serhiy Storchaka	913876d	2018-10-28 13:41:26 +0200	[diff] [blame]	158	single: {} (curly brackets); in regular expressions
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	159
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	160	``{m}``
				161	Specifies that exactly m copies of the previous RE should be matched; fewer
				162	matches cause the entire RE not to match. For example, ``a{6}`` will match
				163	exactly six ``'a'`` characters, but not five.
				164
				165	``{m,n}``
				166	Causes the resulting RE to match from m to n repetitions of the preceding
				167	RE, attempting to match as many repetitions as possible. For example,
				168	``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting m specifies a
				169	lower bound of zero, and omitting n specifies an infinite upper bound. As an
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	170	example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
				171	followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	172	modifier would be confused with the previously described form.
				173
				174	``{m,n}?``
				175	Causes the resulting RE to match from m to n repetitions of the preceding
				176	RE, attempting to match as few repetitions as possible. This is the
				177	non-greedy version of the previous qualifier. For example, on the
				178	6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
				179	while ``a{3,5}?`` will only match 3 characters.
				180
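The counted forms can be compared side by side; this sketch reuses the example strings from the descriptions above:

```python
import re

s = "aaaaaa"

greedy = re.match(r"a{3,5}", s).group()   # takes as many as allowed
lazy = re.match(r"a{3,5}?", s).group()    # takes as few as allowed

# a{4,}b needs at least four 'a' characters before the 'b'.
candidates = ("aaaab", "a" * 1000 + "b", "aaab")
open_ended = [re.fullmatch(r"a{4,}b", c) is not None for c in candidates]
```
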
Serhiy Storchaka	913876d	2018-10-28 13:41:26 +0200	[diff] [blame]	181	.. index:: single: \ (backslash); in regular expressions
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	182
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	183	``\``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	184	Either escapes special characters (permitting you to match characters like
				185	``'*'``, ``'?'``, and so forth), or signals a special sequence; special
				186	sequences are discussed below.
				187
				188	If you're not using a raw string to express the pattern, remember that Python
				189	also uses the backslash as an escape sequence in string literals; if the escape
				190	sequence isn't recognized by Python's parser, the backslash and subsequent
				191	character are included in the resulting string. However, if Python would
				192	recognize the resulting sequence, the backslash should be repeated twice. This
				193	is complicated and hard to understand, so it's highly recommended that you use
				194	raw strings for all but the simplest expressions.
				195
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	196	.. index::
Serhiy Storchaka	913876d	2018-10-28 13:41:26 +0200	[diff] [blame]	197	single: [] (square brackets); in regular expressions
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	198
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	199	``[]``
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	200	Used to indicate a set of characters. In a set:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	201
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	202	* Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
				203	``'m'``, or ``'k'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	204
Serhiy Storchaka	913876d	2018-10-28 13:41:26 +0200	[diff] [blame]	205	.. index:: single: - (minus); in regular expressions
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	206
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	207	* Ranges of characters can be indicated by giving two characters and separating
				208	them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
				209	``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
				210	``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	211	``[a\-z]``) or if it's placed as the first or last character
				212	(e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	213
				214	* Special characters lose their special meaning inside sets. For example,
				215	``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
				216	``'*'``, or ``')'``.
				217
Serhiy Storchaka	913876d	2018-10-28 13:41:26 +0200	[diff] [blame]	218	.. index:: single: \ (backslash); in regular expressions
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	219
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	220	* Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
				221	inside a set, although the characters they match depend on whether
				222	:const:`ASCII` or :const:`LOCALE` mode is in force.
				223
Serhiy Storchaka	913876d	2018-10-28 13:41:26 +0200	[diff] [blame]	224	.. index:: single: ^ (caret); in regular expressions
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	225
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	226	* Characters that are not within a range can be matched by :dfn:`complementing`
				227	the set. If the first character of the set is ``'^'``, all the characters
				228	that are not in the set will be matched. For example, ``[^5]`` will match
				229	any character except ``'5'``, and ``[^^]`` will match any character except
				230	``'^'``. ``^`` has no special meaning if it's not the first character in
				231	the set.
				232
				233	* To match a literal ``']'`` inside a set, precede it with a backslash, or
				234	place it at the beginning of the set. For example, ``[()[\]{}]`` and
				235	``[]()[{}]`` will both match a parenthesis.
Mark Summerfield	9e670c2	2008-05-31 13:05:34 +0000	[diff] [blame]	236
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	237	.. .. index:: single: --; in regular expressions
				238	.. .. index:: single: &&; in regular expressions
				239	.. .. index:: single: ~~; in regular expressions
				240	.. .. index:: single: ||; in regular expressions
				241
Serhiy Storchaka	05cb728	2017-11-16 12:38:26 +0200	[diff] [blame]	242	* Support of nested sets and set operations as in `Unicode Technical
				243	Standard #18`_ might be added in the future. This would change the
				244	syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
				245	in ambiguous cases for the time being.
Andrés Delfino	7dfbd49	2018-10-06 16:48:30 -0300	[diff] [blame]	246	That includes sets starting with a literal ``'['`` or containing literal
Serhiy Storchaka	05cb728	2017-11-16 12:38:26 +0200	[diff] [blame]	247	character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``. To
				248	avoid a warning, escape them with a backslash.
				249
				250	.. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/
				251
				252	.. versionchanged:: 3.7
				253	:exc:`FutureWarning` is raised if a character set contains constructs
				254	that will change semantically in the future.
				255
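The set rules above can be exercised with a few small searches. The sample strings are made up for this sketch:

```python
import re

# Ranges: any hexadecimal digit.
hex_digits = re.findall(r"[0-9A-Fa-f]", "0x1F g")

# An escaped '-' is a literal hyphen, not a range operator.
literal_dash = re.findall(r"[a\-z]", "a-z m")

# Special characters lose their meaning inside a set.
specials = re.findall(r"[(+*)]", "f(x)*2+1")

# A leading '^' complements the set.
not_five = re.findall(r"[^5]", "1525")

# A backslash lets ']' appear inside the set.
brackets = re.findall(r"[()[\]{}]", "a[0]")
```
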
Serhiy Storchaka	913876d	2018-10-28 13:41:26 +0200	[diff] [blame]	256	.. index:: single: | (vertical bar); in regular expressions
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	257
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	258	``|``
				259	``A|B``, where A and B can be arbitrary REs, creates a regular expression that
				260	will match either A or B. An arbitrary number of REs can be separated by the
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	261	``'|'`` in this way. This can be used inside groups (see below) as well. As
				262	the target string is scanned, REs separated by ``'|'`` are tried from left to
				263	right. When one pattern completely matches, that branch is accepted. This means
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	264	that once A matches, B will not be tested further, even if it would
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	265	produce a longer overall match. In other words, the ``'|'`` operator is never
				266	greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
				267	character class, as in ``[|]``.
				268
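A brief sketch of the left-to-right rule for alternation, using made-up inputs:

```python
import re

# The first branch that matches wins, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group()

# Putting the longer alternative first changes the result.
reordered = re.match(r"ab|a", "ab").group()

# An escaped bar matches the literal character.
literal_bars = re.findall(r"\|", "a|b|c")
```
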
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	269	.. index::
Serhiy Storchaka	913876d	2018-10-28 13:41:26 +0200	[diff] [blame]	270	single: () (parentheses); in regular expressions
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	271
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	272	``(...)``
				273	Matches whatever regular expression is inside the parentheses, and indicates the
				274	start and end of a group; the contents of a group can be retrieved after a match
				275	has been performed, and can be matched later in the string with the ``\number``
				276	special sequence, described below. To match the literals ``'('`` or ``')'``,
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	277	use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	278
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	279	.. index:: single: (?; in regular expressions
				280
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	281	``(?...)``
				282	This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
				283	otherwise). The first character after the ``'?'`` determines what the meaning
				284	and further syntax of the construct is. Extensions usually do not create a new
				285	group; ``(?P<name>...)`` is the only exception to this rule. Following are the
				286	currently supported extensions.
				287
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	288	``(?aiLmsux)``
				289	(One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
				290	``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
Andrew M. Kuchling	1c50e86	2009-06-01 00:11:36 +0000	[diff] [blame]	291	letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	292	:const:`re.I` (ignore case), :const:`re.L` (locale dependent),
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	293	:const:`re.M` (multi-line), :const:`re.S` (dot matches all),
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	294	:const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
				295	for the entire regular expression.
				296	(The flags are described in :ref:`contents-of-module-re`.)
				297	This is useful if you wish to include the flags as part of the
				298	regular expression, instead of passing a flag argument to the
Serhiy Storchaka	bd48d27	2016-09-11 12:50:02 +0300	[diff] [blame]	299	:func:`re.compile` function. Flags should be used first in the
				300	expression string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	301
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	302	.. index:: single: (?:; in regular expressions
				303
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	304	``(?:...)``
Georg Brandl	3122ce3	2010-10-29 06:17:38 +0000	[diff] [blame]	305	A non-capturing version of regular parentheses. Matches whatever regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	306	expression is inside the parentheses, but the substring matched by the group
				307	cannot be retrieved after performing a match or referenced later in the
				308	pattern.
				309
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	310	``(?aiLmsux-imsx:...)``
				311	(Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
				312	``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
				313	one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
				314	The letters set or remove the corresponding flags:
				315	:const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
				316	:const:`re.L` (locale dependent), :const:`re.M` (multi-line),
				317	:const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
				318	and :const:`re.X` (verbose), for the part of the expression.
				319	(The flags are described in :ref:`contents-of-module-re`.)
				320
				321	The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
				322	as inline flags, so they can't be combined or follow ``'-'``. Instead,
				323	when one of them appears in an inline group, it overrides the matching mode
				324	in the enclosing group. In Unicode patterns ``(?a:...)`` switches to
				325	ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
				326	(default). In bytes patterns ``(?L:...)`` switches to locale dependent
				327	matching, and ``(?a:...)`` switches to ASCII-only matching (default).
				328	This override is only in effect for the narrow inline group, and the
				329	original matching mode is restored outside of the group.
Serhiy Storchaka	be9a4e5	2016-09-10 00:57:55 +0300	[diff] [blame]	330
Zachary Ware	c307672	2016-09-09 15:47:05 -0700	[diff] [blame]	331	.. versionadded:: 3.6
Serhiy Storchaka	be9a4e5	2016-09-10 00:57:55 +0300	[diff] [blame]	332
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	333	.. versionchanged:: 3.7
				334	The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.
				335
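A minimal sketch of scoped inline flags, assuming Python 3.7 or later for the ``(?a:...)`` form; the sample words are illustrative only:

```python
import re

# (?i:...) applies IGNORECASE only to "hello"; " World" stays case-sensitive.
scoped = re.compile(r"(?i:hello) World")
scoped_upper = scoped.fullmatch("HELLO World") is not None
scoped_lower = scoped.fullmatch("hello world") is not None

# (?a:...) switches only the enclosed part of a str pattern to ASCII-only
# matching, so \w inside the group no longer matches a non-ASCII letter.
ascii_part = re.fullmatch(r"(?a:\w)\w", "a\u00e9") is not None
ascii_all = re.fullmatch(r"(?a:\w\w)", "a\u00e9") is not None
```
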
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	336	.. index:: single: (?P<; in regular expressions
				337
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	338	``(?P<name>...)``
				339	Similar to regular parentheses, but the substring matched by the group is
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	340	accessible via the symbolic group name *name*. Group names must be valid
				341	Python identifiers, and each group name must be defined only once within a
				342	regular expression. A symbolic group is also a numbered group, just as if
				343	the group were not named.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	344
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	345	Named groups can be referenced in three contexts. If the pattern is
				346	``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
				347	single or double quotes):
				348
				349	+---------------------------------------+----------------------------------+
				350	| Context of reference to group "quote" | Ways to reference it             |
				351	+=======================================+==================================+
				352	| in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
				353	|                                       | * ``\1``                         |
				354	+---------------------------------------+----------------------------------+
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	355	| when processing match object m        | * ``m.group('quote')``           |
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	356	|                                       | * ``m.end('quote')`` (etc.)      |
				357	+---------------------------------------+----------------------------------+
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	358	| in a string passed to the repl        | * ``\g<quote>``                  |
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	359	| argument of ``re.sub()``              | * ``\g<1>``                      |
				360	|                                       | * ``\1``                         |
				361	+---------------------------------------+----------------------------------+
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	362
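The three reference contexts can be sketched as follows; the input text and the replacement string are made up for illustration:

```python
import re

# Context 1: (?P=quote) refers back to the group inside the pattern itself.
pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
text = 'say "hi" now'

# Context 2: the match object exposes the group by name.
m = pattern.search(text)
whole = m.group()
quote_char = m.group("quote")
quote_end = m.end("quote")

# Context 3: \g<quote> refers to the group in the repl argument of sub().
replaced = pattern.sub(r"\g<quote>...\g<quote>", text)
```
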
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	363	.. index:: single: (?P=; in regular expressions
				364
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	365	``(?P=name)``
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	366	A backreference to a named group; it matches whatever text was matched by the
				367	earlier group named *name*.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	368
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	369	.. index:: single: (?#; in regular expressions
				370
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	371	``(?#...)``
				372	A comment; the contents of the parentheses are simply ignored.
				373
animalize	4a7f44a	2019-02-18 21:26:37 +0800	[diff] [blame]	374	.. index:: single: (?=; in regular expressions
				375
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	376	``(?=...)``
				377	Matches if ``...`` matches next, but doesn't consume any of the string. This is
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	378	called a :dfn:`lookahead assertion`. For example, ``Isaac (?=Asimov)`` will match
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	379	``'Isaac '`` only if it's followed by ``'Asimov'``.
				380
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	381	.. index:: single: (?!; in regular expressions
				382
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	383	``(?!...)``
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	384	Matches if ``...`` doesn't match next. This is a :dfn:`negative lookahead assertion`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	385	For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's not
				386	followed by ``'Asimov'``.
				387
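A short sketch of both lookahead forms; ``'Isaac Newton'`` is an extra sample input chosen for contrast:

```python
import re

# Positive lookahead: 'Asimov' must follow, but is not part of the match.
positive = re.search(r"Isaac (?=Asimov)", "Isaac Asimov").group()
positive_miss = re.search(r"Isaac (?=Asimov)", "Isaac Newton")

# Negative lookahead: 'Asimov' must not follow.
negative = re.search(r"Isaac (?!Asimov)", "Isaac Newton").group()
negative_miss = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")
```
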
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	388	.. index:: single: (?<=; in regular expressions
				389
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	390	``(?<=...)``
				391	Matches if the current position in the string is preceded by a match for ``...``
				392	that ends at the current position. This is called a :dfn:`positive lookbehind
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	393	assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	394	lookbehind will back up 3 characters and check if the contained pattern matches.
				395	The contained pattern must only match strings of some fixed length, meaning that
				396	``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
Ezio Melotti	0a6b541	2012-04-29 07:34:46 +0300	[diff] [blame]	397	patterns which start with positive lookbehind assertions will not match at the
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	398	beginning of the string being searched; you will most likely want to use the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	399	:func:`search` function rather than the :func:`match` function:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	400
				401	>>> import re
				402	>>> m = re.search('(?<=abc)def', 'abcdef')
				403	>>> m.group(0)
				404	'def'
				405
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	406	This example looks for a word following a hyphen:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	407
Cheryl Sabella	6677142	2018-02-02 16:16:27 -0500	[diff] [blame]	408	>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	409	>>> m.group(0)
				410	'egg'
				411
Georg Brandl	8c16cb9	2016-02-25 20:17:45 +0100	[diff] [blame]	412	.. versionchanged:: 3.5
Serhiy Storchaka	4eea62f	2015-02-21 10:07:35 +0200	[diff] [blame]	413	Added support for group references of fixed length.
				414
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	415	.. index:: single: (?<!; in regular expressions
				416
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	417	``(?<!...)``
				418	Matches if the current position in the string is not preceded by a match for
				419	``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
				420	positive lookbehind assertions, the contained pattern must only match strings of
				421	some fixed length. Patterns which start with negative lookbehind assertions may
				422	match at the beginning of the string being searched.
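
As a small illustration of a negative lookbehind, the sketch below picks out
numbers that are not preceded by a dollar sign. The sample text is made up for
the example.

```python
import re

# '30' is skipped because '$' precedes it; '12' is kept.
unpriced = re.findall(r'(?<!\$)\b\d+', 'cost $30, qty 12')
print(unpriced)  # ['12']

# Unlike a positive lookbehind, a negative one can succeed at the very
# start of the string, where nothing precedes the current position.
at_start = re.match(r'(?<!a)b', 'b')
print(at_start.group(0))  # 'b'
```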
				423
``(?(id/name)yes-pattern|no-pattern)``
Will try to match with ``yes-pattern`` if the group with given id or
name exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
optional and can be omitted. For example,
``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
not with ``'<user@host.com'`` nor ``'user@host.com>'``.
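
The four cases named above can be run directly. This sketch uses the pattern
from the text; ``email`` and ``results`` are illustrative names.

```python
import re

# Group 1 is the optional '<'; the conditional requires '>' only if it matched.
email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

candidates = ['<user@host.com>', 'user@host.com',
              '<user@host.com', 'user@host.com>']
results = {text: email.match(text) is not None for text in candidates}

for text, ok in results.items():
    print(text, ok)
```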
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	431
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	432
				433	The special sequences consist of ``'\'`` and a character from the list below.
Martin Panter	98e9051	2016-06-12 06:17:29 +0000	[diff] [blame]	434	If the ordinary character is not an ASCII digit or an ASCII letter, then the
Serhiy Storchaka	9bd85b8	2016-06-11 19:15:00 +0300	[diff] [blame]	435	resulting RE will match the second character. For example, ``\$`` matches the
				436	character ``'$'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	437
Serhiy Storchaka	913876d	2018-10-28 13:41:26 +0200	[diff] [blame]	438	.. index:: single: \ (backslash); in regular expressions
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	439
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	440	``\number``
				441	Matches the contents of the group of the same number. Groups are numbered
				442	starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
Georg Brandl	2070e83	2013-10-06 12:58:20 +0200	[diff] [blame]	443	but not ``'thethe'`` (note the space after the group). This special sequence
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	444	can only be used to match one of the first 99 groups. If the first digit of
				445	number is 0, or number is 3 octal digits long, it will not be interpreted as
				446	a group match, but as the character with octal value number. Inside the
				447	``'['`` and ``']'`` of a character class, all numeric escapes are treated as
				448	characters.
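
A brief sketch of how numeric escapes are read, using the examples from the
entry above plus an octal escape and a character class:

```python
import re

doubled = re.compile(r'(.+) \1')

# \1 must repeat whatever group 1 matched, after the literal space.
print(bool(doubled.fullmatch('the the')))  # True
print(bool(doubled.fullmatch('55 55')))    # True
print(doubled.search('thethe'))            # None

# Three octal digits are a character, not a group reference: \101 is 'A'.
octal = re.fullmatch(r'\101', 'A')

# Inside a character class, \1 is the character with value 1.
in_class = re.fullmatch(r'(a)[\1]', 'a\x01')
```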
				449
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	450	.. index:: single: \A; in regular expressions
				451
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	452	``\A``
				453	Matches only at the start of the string.
				454
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	455	.. index:: single: \b; in regular expressions
				456
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	457	``\b``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	458	Matches the empty string, but only at the beginning or end of a word.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	459	A word is defined as a sequence of word characters. Note that formally,
Ezio Melotti	5a045b9	2012-02-29 11:48:44 +0200	[diff] [blame]	460	``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
				461	(or vice versa), or between ``\w`` and the beginning/end of the string.
				462	This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
				463	``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
				464
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	465	By default Unicode alphanumerics are the ones used in Unicode patterns, but
				466	this can be changed by using the :const:`ASCII` flag. Word boundaries are
				467	determined by the current locale if the :const:`LOCALE` flag is used.
				468	Inside a character range, ``\b`` represents the backspace character, for
				469	compatibility with Python's string literals.
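
The ``\b`` examples above can be checked as follows; the last line shows the
backspace meaning inside a character class.

```python
import re

word = re.compile(r'\bfoo\b')
samples = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']

# Keep only the samples in which 'foo' stands as a whole word.
whole_word = [s for s in samples if word.search(s)]
print(whole_word)  # ['foo', 'foo.', '(foo)', 'bar foo baz']

# Inside [...], \b is the backspace character (0x08), not a boundary.
backspace = re.fullmatch(r'[\b]', '\x08')
```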
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	470
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	471	.. index:: single: \B; in regular expressions
				472
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	473	``\B``
Ezio Melotti	5a045b9	2012-02-29 11:48:44 +0200	[diff] [blame]	474	Matches the empty string, but only when it is not at the beginning or end
				475	of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
				476	``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	477	``\B`` is just the opposite of ``\b``, so word characters in Unicode
				478	patterns are Unicode alphanumerics or the underscore, although this can
				479	be changed by using the :const:`ASCII` flag. Word boundaries are
				480	determined by the current locale if the :const:`LOCALE` flag is used.
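
The same kind of check for ``\B``, using the strings listed above:

```python
import re

samples = ['python', 'py3', 'py2', 'py', 'py.', 'py!']

# 'py' must be followed by another word character for \B to succeed.
inside_word = [s for s in samples if re.search(r'py\B', s)]
print(inside_word)  # ['python', 'py3', 'py2']
```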
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	481
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	482	.. index:: single: \d; in regular expressions
				483
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	484	``\d``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	485	For Unicode (str) patterns:
Mark Dickinson	1f26828	2009-07-28 17:22:36 +0000	[diff] [blame]	486	Matches any Unicode decimal digit (that is, any character in
				487	Unicode character category [Nd]). This includes ``[0-9]``, and
				488	also many other digit characters. If the :const:`ASCII` flag is
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	489	used only ``[0-9]`` is matched.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	490
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	491	For 8-bit (bytes) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	492	Matches any decimal digit; this is equivalent to ``[0-9]``.
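
A minimal sketch of the difference between Unicode and ASCII digit matching.
The sample string mixes ASCII digits with U+0662 (ARABIC-INDIC DIGIT TWO),
chosen here only as an example of a non-ASCII character in category Nd.

```python
import re

text = '1' + '\u0662' + '3'

unicode_digits = re.findall(r'\d', text)           # all three characters
ascii_digits = re.findall(r'\d', text, re.ASCII)   # only '1' and '3'
bytes_digits = re.findall(rb'\d', b'1a2')          # bytes patterns: [0-9]

print(len(unicode_digits), ascii_digits, bytes_digits)
```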
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	493
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	494	.. index:: single: \D; in regular expressions
				495
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	496	``\D``
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	497	Matches any character which is not a decimal digit. This is
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	498	the opposite of ``\d``. If the :const:`ASCII` flag is used this
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	499	becomes the equivalent of ``[^0-9]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	500
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	501	.. index:: single: \s; in regular expressions
				502
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	503	``\s``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	504	For Unicode (str) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	505	Matches Unicode whitespace characters (which includes
				506	``[ \t\n\r\f\v]``, and also many other characters, for example the
				507	non-breaking spaces mandated by typography rules in many
				508	languages). If the :const:`ASCII` flag is used, only
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	509	``[ \t\n\r\f\v]`` is matched.
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	510
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	511	For 8-bit (bytes) patterns:
				512	Matches characters considered whitespace in the ASCII character set;
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	513	this is equivalent to ``[ \t\n\r\f\v]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	514
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	515	.. index:: single: \S; in regular expressions
				516
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	517	``\S``
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	518	Matches any character which is not a whitespace character. This is
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	519	the opposite of ``\s``. If the :const:`ASCII` flag is used this
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	520	becomes the equivalent of ``[^ \t\n\r\f\v]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	521
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	522	.. index:: single: \w; in regular expressions
				523
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	524	``\w``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	525	For Unicode (str) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	526	Matches Unicode word characters; this includes most characters
				527	that can be part of a word in any language, as well as numbers and
				528	the underscore. If the :const:`ASCII` flag is used, only
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	529	``[a-zA-Z0-9_]`` is matched.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	530
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	531	For 8-bit (bytes) patterns:
				532	Matches characters considered alphanumeric in the ASCII character set;
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	533	this is equivalent to ``[a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
				534	used, matches characters considered alphanumeric in the current locale
				535	and the underscore.
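
To illustrate the three cases for ``\w``, the sketch below runs the same idea
against a str pattern, a str pattern with :const:`ASCII`, and a bytes pattern.
The sample text is invented for the example.

```python
import re

text = 'na\u00efve caf\u00e9_1'   # 'naïve café_1'

unicode_words = re.findall(r'\w+', text)
ascii_words = re.findall(r'\w+', text, re.ASCII)

# In a bytes pattern the UTF-8 bytes of 'é' are not word characters.
bytes_words = re.findall(rb'\w+', 'caf\u00e9'.encode('utf-8'))

print(unicode_words)  # ['naïve', 'café_1']
print(ascii_words)    # ['na', 've', 'caf', '_1']
print(bytes_words)    # [b'caf']
```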
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	536
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	537	.. index:: single: \W; in regular expressions
				538
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	539	``\W``
Matches any character which is not a word character. This is
the opposite of ``\w``. If the :const:`ASCII` flag is used this
becomes the equivalent of ``[^a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
used, matches characters which are neither alphanumeric in the current locale
nor the underscore.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	545
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	546	.. index:: single: \Z; in regular expressions
				547
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	548	``\Z``
				549	Matches only at the end of the string.
				550
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	551	.. index::
				552	single: \a; in regular expressions
				553	single: \b; in regular expressions
				554	single: \f; in regular expressions
				555	single: \n; in regular expressions
				556	single: \N; in regular expressions
				557	single: \r; in regular expressions
				558	single: \t; in regular expressions
				559	single: \u; in regular expressions
				560	single: \U; in regular expressions
				561	single: \v; in regular expressions
				562	single: \x; in regular expressions
				563	single: \\; in regular expressions
				564
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	565	Most of the standard escapes supported by Python string literals are also
				566	accepted by the regular expression parser::
				567
				568	\a \b \f \n
Serhiy Storchaka	a445feb	2018-02-10 00:08:17 +0200	[diff] [blame]	569	\N \r \t \u
				570	\U \v \x \\
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	571
Ezio Melotti	285e51b	2012-04-29 04:52:30 +0300	[diff] [blame]	572	(Note that ``\b`` is used to represent word boundaries, and means "backspace"
				573	only inside character classes.)
				574
Serhiy Storchaka	a445feb	2018-02-10 00:08:17 +0200	[diff] [blame]	575	``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	576	patterns. In bytes patterns they are errors.
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	577
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	578	Octal escapes are included in a limited form. If the first digit is a 0, or if
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	579	there are three octal digits, it is considered an octal escape. Otherwise, it is
				580	a group reference. As for string literals, octal escapes are always at most
				581	three digits in length.
				582
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	583	.. versionchanged:: 3.3
				584	The ``'\u'`` and ``'\U'`` escape sequences have been added.
				585
Serhiy Storchaka	9bd85b8	2016-06-11 19:15:00 +0300	[diff] [blame]	586	.. versionchanged:: 3.6
Unknown escapes consisting of ``'\'`` and an ASCII letter are now errors.
Serhiy Storchaka	a54aae0	2015-03-24 22:58:14 +0200	[diff] [blame]	588
Serhiy Storchaka	a445feb	2018-02-10 00:08:17 +0200	[diff] [blame]	589	.. versionchanged:: 3.8
				590	The ``'\N{name}'`` escape sequence has been added. As in string literals,
				591	it expands to the named Unicode character (e.g. ``'\N{EM DASH}'``).
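
The rules for ``\u`` and for unknown letter escapes can be sketched as below,
assuming Python 3.6 or later. The ``try_compile`` helper is hypothetical and
exists only for this example.

```python
import re

def try_compile(pattern):
    """Return True if the pattern compiles, False if re.error is raised."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# \u is recognized in a str pattern ...
e_acute = re.fullmatch(r'\u00e9', '\u00e9')

# ... but is an error in a bytes pattern.
bytes_u_ok = try_compile(rb'\u00e9')

# An unknown escape made of '\' and an ASCII letter is an error (3.6+).
unknown_ok = try_compile(r'\q')

print(bool(e_acute), bytes_u_ok, unknown_ok)  # True False False
```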
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	592
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	593
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	594	.. _contents-of-module-re:
				595
				596	Module Contents
				597	---------------
				598
The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions. Most non-trivial applications use the compiled form.
				603
Ethan Furman	c88c80b	2016-11-21 08:29:31 -0800	[diff] [blame]	604	.. versionchanged:: 3.6
				605	Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
				606	:class:`enum.IntFlag`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	607
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	608	.. function:: compile(pattern, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	609
Compile a regular expression pattern into a :ref:`regular expression object
<re-objects>`, which can be used for matching using its
:meth:`~Pattern.match`, :meth:`~Pattern.search` and other methods, described
below.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	614
The expression's behaviour can be modified by specifying a flags value.
Values can be any of the following variables, combined using bitwise OR (the
``|`` operator).
				618
				619	The sequence ::
				620
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	621	prog = re.compile(pattern)
				622	result = prog.match(string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	623
				624	is equivalent to ::
				625
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	626	result = re.match(pattern, string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	627
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	628	but using :func:`re.compile` and saving the resulting regular expression
				629	object for reuse is more efficient when the expression will be used several
				630	times in a single program.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	631
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	632	.. note::
				633
				634	The compiled versions of the most recent patterns passed to
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	635	:func:`re.compile` and the module-level matching functions are cached, so
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	636	programs that use only a few regular expressions at a time needn't worry
				637	about compiling regular expressions.
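
A short sketch of reusing a compiled pattern and of combining flags with ``|``.
The input lines and variable names are made up for illustration.

```python
import re

# Compile once, then reuse the pattern object for every input line.
prog = re.compile(r'\d+')
lines = ['10 apples', 'no digits', '7 pears']

compiled_hits = []
module_hits = []
for line in lines:
    m = prog.match(line)           # method on the compiled object
    if m:
        compiled_hits.append(m.group(0))
    m = re.match(r'\d+', line)     # module-level equivalent
    if m:
        module_hits.append(m.group(0))

print(compiled_hits)  # ['10', '7']

# Flags are combined with bitwise OR.
loose = re.compile(r'a.c', re.IGNORECASE | re.DOTALL)
print(bool(loose.fullmatch('A\nC')))  # True
```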
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	638
				639
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	640	.. data:: A
				641	ASCII
				642
Georg Brandl	4049ce0	2009-06-08 07:49:54 +0000	[diff] [blame]	643	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
				644	perform ASCII-only matching instead of full Unicode matching. This is only
				645	meaningful for Unicode patterns, and is ignored for byte patterns.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	646	Corresponds to the inline flag ``(?a)``.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	647
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	648	Note that for backward compatibility, the :const:`re.U` flag still
				649	exists (as well as its synonym :const:`re.UNICODE` and its embedded
Georg Brandl	ebeb44d	2010-07-29 11:15:36 +0000	[diff] [blame]	650	counterpart ``(?u)``), but these are redundant in Python 3 since
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	651	matches are Unicode by default for strings (and Unicode matching
				652	isn't allowed for bytes).
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	653
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	654
Sandro Tosi	da785fd	2012-01-01 12:55:20 +0100	[diff] [blame]	655	.. data:: DEBUG
				656
Display debug information about the compiled expression.
No corresponding inline flag.
Sandro Tosi	da785fd	2012-01-01 12:55:20 +0100	[diff] [blame]	659
				660
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	661	.. data:: I
				662	IGNORECASE
				663
Brian Ward	c9d6dbc	2017-05-24 00:03:38 -0700	[diff] [blame]	664	Perform case-insensitive matching; expressions like ``[A-Z]`` will also
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	665	match lowercase letters. Full Unicode matching (such as ``Ü`` matching
				666	``ü``) also works unless the :const:`re.ASCII` flag is used to disable
				667	non-ASCII matches. The current locale does not change the effect of this
				668	flag unless the :const:`re.LOCALE` flag is also used.
				669	Corresponds to the inline flag ``(?i)``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	670
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	671	Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
				672	combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
				673	letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
				674	letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
				675	'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
				676	If the :const:`ASCII` flag is used, only letters 'a' to 'z'
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	677	and 'A' to 'Z' are matched.
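
As a rough check of the behaviour described above, the sketch below tests two
of the listed non-ASCII letters against ``[a-z]``, with and without
:const:`ASCII`.

```python
import re

kelvin = '\u212a'   # KELVIN SIGN
long_s = '\u017f'   # LATIN SMALL LETTER LONG S

# Full Unicode case-insensitive matching.
umlaut = re.fullmatch('\u00fc', '\u00dc', re.IGNORECASE)   # 'ü' vs 'Ü'
kelvin_match = re.fullmatch(r'[a-z]', kelvin, re.IGNORECASE)
long_s_match = re.fullmatch(r'[a-z]', long_s, re.IGNORECASE)

# With ASCII added, only 'a'-'z' and 'A'-'Z' are matched.
kelvin_ascii = re.fullmatch(r'[a-z]', kelvin, re.IGNORECASE | re.ASCII)
umlaut_ascii = re.fullmatch('\u00fc', '\u00dc', re.IGNORECASE | re.ASCII)
```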
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	678
				679	.. data:: L
				680	LOCALE
				681
Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
dependent on the current locale. This flag can be used only with bytes
patterns. The use of this flag is discouraged because the locale mechanism
is very unreliable: it handles only one "culture" at a time and works only
with 8-bit locales. Unicode matching is already enabled by default
in Python 3 for Unicode (str) patterns, and it is able to handle different
locales/languages.
Corresponds to the inline flag ``(?L)``.
Serhiy Storchaka	22a309a	2014-12-01 11:50:07 +0200	[diff] [blame]	690
Serhiy Storchaka	9bd85b8	2016-06-11 19:15:00 +0300	[diff] [blame]	691	.. versionchanged:: 3.6
				692	:const:`re.LOCALE` can be used only with bytes patterns and is
				693	not compatible with :const:`re.ASCII`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	694
Serhiy Storchaka	898ff03	2017-05-05 08:53:40 +0300	[diff] [blame]	695	.. versionchanged:: 3.7
				696	Compiled regular expression objects with the :const:`re.LOCALE` flag no
				697	longer depend on the locale at compile time. Only the locale at
				698	matching time affects the result of matching.
				699
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	700
				701	.. data:: M
				702	MULTILINE
				703
				704	When specified, the pattern character ``'^'`` matches at the beginning of the
				705	string and at the beginning of each line (immediately following each newline);
				706	and the pattern character ``'$'`` matches at the end of the string and at the
				707	end of each line (immediately preceding each newline). By default, ``'^'``
				708	matches only at the beginning of the string, and ``'$'`` only at the end of the
				709	string and immediately before the newline (if any) at the end of the string.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	710	Corresponds to the inline flag ``(?m)``.
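
The effect of :const:`MULTILINE` on ``'^'`` and ``'$'`` can be seen on a
two-line string that ends with a newline:

```python
import re

text = 'one\ntwo\n'

# Default: '^' only at the start; '$' at the end or before a final newline.
first_default = re.findall(r'^\w+', text)
last_default = re.findall(r'\w+$', text)

# MULTILINE: both anchors also apply at every line.
first_multi = re.findall(r'^\w+', text, re.MULTILINE)
last_multi = re.findall(r'\w+$', text, re.MULTILINE)

print(first_default, last_default)  # ['one'] ['two']
print(first_multi, last_multi)      # ['one', 'two'] ['one', 'two']
```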
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	711
				712
				713	.. data:: S
				714	DOTALL
				715
				716	Make the ``'.'`` special character match any character at all, including a
				717	newline; without this flag, ``'.'`` will match anything except a newline.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	718	Corresponds to the inline flag ``(?s)``.
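
A minimal illustration of :const:`DOTALL` and its inline form:

```python
import re

without_flag = re.match(r'a.b', 'a\nb')            # '.' skips the newline
with_flag = re.match(r'a.b', 'a\nb', re.DOTALL)    # '.' matches it
inline = re.match(r'(?s)a.b', 'a\nb')              # same, as an inline flag

print(without_flag, bool(with_flag), bool(inline))  # None True True
```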
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	719
				720
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	721	.. data:: X
				722	VERBOSE
				723
Serhiy Storchaka	913876d	2018-10-28 13:41:26 +0200	[diff] [blame]	724	.. index:: single: # (hash); in regular expressions
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	725
Zachary Ware	71a0b43	2015-11-11 23:32:14 -0600	[diff] [blame]	726	This flag allows you to write regular expressions that look nicer and are
				727	more readable by allowing you to visually separate logical sections of the
				728	pattern and add comments. Whitespace within the pattern is ignored, except
Serhiy Storchaka	b0b44b4	2017-11-14 17:21:26 +0200	[diff] [blame]	729	when in a character class, or when preceded by an unescaped backslash,
				730	or within tokens like ``*?``, ``(?:`` or ``(?P<...>``.
Zachary Ware	71a0b43	2015-11-11 23:32:14 -0600	[diff] [blame]	731	When a line contains a ``#`` that is not in a character class and is not
				732	preceded by an unescaped backslash, all characters from the leftmost such
				733	``#`` through the end of the line are ignored.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	734
Zachary Ware	71a0b43	2015-11-11 23:32:14 -0600	[diff] [blame]	735	This means that the two following regular expression objects that match a
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	736	decimal number are functionally equal::
Georg Brandl	81ac1ce	2007-08-31 17:17:17 +0000	[diff] [blame]	737
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	742
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	743	Corresponds to the inline flag ``(?x)``.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	744
				745
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	746	.. function:: search(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	747
Terry Jan Reedy	0edb5c1	2014-05-30 16:19:59 -0400	[diff] [blame]	748	Scan through string looking for the first location where the regular expression
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	749	pattern produces a match, and return a corresponding :ref:`match object
				750	<match-objects>`. Return ``None`` if no position in the string matches the
				751	pattern; note that this is different from finding a zero-length match at some
				752	point in the string.
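
The distinction between "no match" and a zero-length match can be sketched as
follows:

```python
import re

# 'x*' can match the empty string, so search() succeeds at position 0.
empty = re.search(r'x*', 'abc')
print(empty.span(), repr(empty.group(0)))  # (0, 0) ''

# 'x+' needs at least one 'x', so there is no match at all.
missing = re.search(r'x+', 'abc')
print(missing)  # None
```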
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	753
				754
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	755	.. function:: match(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	756
				757	If zero or more characters at the beginning of string match the regular
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	758	expression pattern, return a corresponding :ref:`match object
				759	<match-objects>`. Return ``None`` if the string does not match the pattern;
				760	note that this is different from a zero-length match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	761
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	762	Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
				763	at the beginning of the string and not at the beginning of each line.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	764
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	765	If you want to locate a match anywhere in string, use :func:`search`
				766	instead (see also :ref:`search-vs-match`).
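
A small sketch of the :const:`MULTILINE` caveat for :func:`match`:

```python
import re

text = 'a\nb'

# match() is tied to the start of the string, even in MULTILINE mode.
anchored = re.match(r'^b', text, re.MULTILINE)

# search() may use '^' at the start of the second line.
found = re.search(r'^b', text, re.MULTILINE)

print(anchored, found.start())  # None 2
```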
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	767
				768
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	769	.. function:: fullmatch(pattern, string, flags=0)
				770
				771	If the whole string matches the regular expression pattern, return a
				772	corresponding :ref:`match object <match-objects>`. Return ``None`` if the
				773	string does not match the pattern; note that this is different from a
				774	zero-length match.
				775
				776	.. versionadded:: 3.4
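
For comparison with :func:`match`, a brief sketch:

```python
import re

prefix = re.match(r'\d+', '123abc')       # matches the leading '123'
whole = re.fullmatch(r'\d+', '123abc')    # None: 'abc' is left over
exact = re.fullmatch(r'\d+', '123')       # the whole string matches

# A zero-length match is still a match object, not None.
empty = re.fullmatch(r'\d*', '')

print(prefix.group(0), whole, exact.group(0), empty.span())
```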
				777
				778
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	779	.. function:: split(pattern, string, maxsplit=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	780
Split string by the occurrences of pattern. If capturing parentheses are
used in pattern, then the text of all groups in the pattern is also returned
as part of the resulting list. If maxsplit is nonzero, at most maxsplit
splits occur, and the remainder of the string is returned as the final element
of the list. ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	786
Serhiy Storchaka	c615be5	2017-11-28 22:51:38 +0200	[diff] [blame]	787	>>> re.split(r'\W+', 'Words, words, words.')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	788	['Words', 'words', 'words', '']
Serhiy Storchaka	c615be5	2017-11-28 22:51:38 +0200	[diff] [blame]	789	>>> re.split(r'(\W+)', 'Words, words, words.')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	790	['Words', ', ', 'words', ', ', 'words', '.', '']
Serhiy Storchaka	c615be5	2017-11-28 22:51:38 +0200	[diff] [blame]	791	>>> re.split(r'\W+', 'Words, words, words.', 1)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	792	['Words', 'words, words.']
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	793	>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
				794	['0', '3', '9']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	795
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	796	If there are capturing groups in the separator and it matches at the start of
				797	the string, the result will start with an empty string. The same holds for
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	798	the end of the string::
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	799
Serhiy Storchaka	c615be5	2017-11-28 22:51:38 +0200	[diff] [blame]	800	>>> re.split(r'(\W+)', '...words, words...')
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	801	['', '...', 'words', ', ', 'words', '...', '']
				802
				803	That way, separator components are always found at the same relative
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	804	indices within the result list.
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	805
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	806	Empty matches for the pattern split the string only when not adjacent
				807	to a previous empty match.
Thomas Wouters	89d996e	2007-09-08 17:39:28 +0000	[diff] [blame]	808
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	809	>>> re.split(r'\b', 'Words, words, words.')
				810	['', 'Words', ', ', 'words', ', ', 'words', '.']
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	811	>>> re.split(r'\W*', '...words...')
				812	['', '', 'w', 'o', 'r', 'd', 's', '', '']
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	813	>>> re.split(r'(\W*)', '...words...')
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	814	['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	815
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	816	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	817	Added the optional flags argument.
				818
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	819	.. versionchanged:: 3.7
				820	Added support of splitting on a pattern that could match an empty string.
				821
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	822
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	823	.. function:: findall(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	824
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	825	Return all non-overlapping matches of pattern in string, as a list of
Georg Brandl	3dbca81	2008-07-23 16:10:53 +0000	[diff] [blame]	826	strings. The string is scanned left-to-right, and matches are returned in
				827	the order found. If one or more groups are present in the pattern, return a
				828	list of groups; this will be a list of tuples if the pattern has more than
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	829	one group. Empty matches are included in the result.
				830
				831	.. versionchanged:: 3.7
				832	Non-empty matches can now start just after a previous empty match.
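
How the shape of the result depends on the number of groups can be sketched
with a made-up input string:

```python
import re

text = 'a=1, b=22'

no_groups = re.findall(r'\d+', text)          # whole matches
one_group = re.findall(r'(\w)=\d+', text)     # the single group's text
two_groups = re.findall(r'(\w)=(\d+)', text)  # one tuple per match

print(no_groups)   # ['1', '22']
print(one_group)   # ['a', 'b']
print(two_groups)  # [('a', '1'), ('b', '22')]
```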
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	833
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	834
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	835	.. function:: finditer(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	836
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	837	Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
				838	all non-overlapping matches for the RE pattern in string. The string
				839	is scanned left-to-right, and matches are returned in the order found. Empty
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	840	matches are included in the result.
				841
				842	.. versionchanged:: 3.7
				843	Non-empty matches can now start just after a previous empty match.
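
Because :func:`finditer` yields match objects, positions are available as well
as text. A minimal sketch:

```python
import re

# Collect (start, end, text) for each run of digits.
spans = [(m.start(), m.end(), m.group(0))
         for m in re.finditer(r'\d+', 'a1b22')]
print(spans)  # [(1, 2, '1'), (3, 5, '22')]
```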
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	844
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	845
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	846	.. function:: sub(pattern, repl, string, count=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	847
				848	Return the string obtained by replacing the leftmost non-overlapping occurrences
				849	of pattern in string by the replacement repl. If the pattern isn't found,
				850	string is returned unchanged. repl can be a string or a function; if it is
				851	a string, any backslash escapes in it are processed. That is, ``\n`` is
Sandro Tosi	6a633bb	2011-08-19 22:54:50 +0200	[diff] [blame]	852	converted to a single newline character, ``\r`` is converted to a carriage return, and
Serhiy Storchaka	a54aae0	2015-03-24 22:58:14 +0200	[diff] [blame]	853	so forth. Unknown escapes such as ``\&`` are left alone. Backreferences, such
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	854	as ``\6``, are replaced with the substring matched by group 6 in the pattern.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	855	For example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	856
				857	>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
				858	...        r'static PyObject*\npy_\1(void)\n{',
				859	...        'def myfunc():')
				860	'static PyObject*\npy_myfunc(void)\n{'
				861
				862	If repl is a function, it is called for every non-overlapping occurrence of
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	863	pattern. The function takes a single :ref:`match object <match-objects>`
				864	argument, and returns the replacement string. For example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	865
				866	>>> def dashrepl(matchobj):
				867	...     if matchobj.group(0) == '-': return ' '
				868	...     else: return '-'
				869	>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
				870	'pro--gram files'
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	871	>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
				872	'Baked Beans & Spam'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	873
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	874	The pattern may be a string or a :ref:`pattern object <re-objects>`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	875
				876	The optional argument count is the maximum number of pattern occurrences to be
				877	replaced; count must be a non-negative integer. If omitted or zero, all
				878	occurrences will be replaced. Empty matches for the pattern are replaced only
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	879	when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
				880	``'-a-b--d-'``.
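The effect of count, and the empty-match rule just described, can be checked with a short sketch. The first input string is invented for illustration, and the second result assumes Python 3.7 or later:

```python
import re

# count=2 stops after the first two replacements.
limited = re.sub(r"\d", "#", "a1b2c3", count=2)

# 'x*' matches the empty string between characters as well as the run "x".
with_empty = re.sub("x*", "-", "abxd")

print(limited)     # a#b#c3
print(with_empty)  # -a-b--d-
```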
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	881
Serhiy Storchaka	ddb961d	2018-10-26 09:00:49 +0300	[diff] [blame]	882	.. index:: single: \g; in regular expressions
				883
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	884	In string-type repl arguments, in addition to the character escapes and
				885	backreferences described above,
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	886	``\g<name>`` will use the substring matched by the group named ``name``, as
				887	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
				888	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
				889	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
				890	reference to group 20, not a reference to group 2 followed by the literal
				891	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
				892	substring matched by the RE.
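A brief sketch of the three ``\g<...>`` forms; the patterns and input strings are illustrative only:

```python
import re

# Named references: swap the two named groups.
swapped = re.sub(r"(?P<year>\d{4})-(?P<month>\d{2})",
                 r"\g<month>/\g<year>", "2018-10")

# \g<1>0 means group 1 followed by a literal '0'.
# Writing r"\10" here would instead refer to group 10.
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")

# \g<0> stands for the entire match.
wrapped = re.sub(r"\w+", r"<\g<0>>", "hi there")

print(swapped)  # 10/2018
print(padded)   # a10b20
print(wrapped)  # <hi> <there>
```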
				893
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	894	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	895	Added the optional flags argument.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	896
Serhiy Storchaka	7438e4b	2014-10-10 11:06:31 +0300	[diff] [blame]	897	.. versionchanged:: 3.5
				898	Unmatched groups are replaced with an empty string.
				899
Serhiy Storchaka	9bd85b8	2016-06-11 19:15:00 +0300	[diff] [blame]	900	.. versionchanged:: 3.6
Serhiy Storchaka	53c53ea	2016-12-06 19:15:29 +0200	[diff] [blame]	901	Unknown escapes in pattern consisting of ``'\'`` and an ASCII letter
				902	are now errors.
				903
Serhiy Storchaka	ff3dbe9	2016-12-06 19:25:19 +0200	[diff] [blame]	904	.. versionchanged:: 3.7
				905	Unknown escapes in repl consisting of ``'\'`` and an ASCII letter
				906	are now errors.
Serhiy Storchaka	a54aae0	2015-03-24 22:58:14 +0200	[diff] [blame]	907
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	908	Empty matches for the pattern are replaced when adjacent to a previous
				909	non-empty match.
				910
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	911
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	912	.. function:: subn(pattern, repl, string, count=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	913
				914	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
				915	number_of_subs_made)``.
				916
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	917	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	918	Added the optional flags argument.
				919
Serhiy Storchaka	7438e4b	2014-10-10 11:06:31 +0300	[diff] [blame]	920	.. versionchanged:: 3.5
				921	Unmatched groups are replaced with an empty string.
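A minimal sketch of the returned tuple, using an invented input string:

```python
import re

# Collapse each run of whitespace to one space and count the replacements.
new_string, number_of_subs_made = re.subn(r"\s+", " ", "a  b \t c")

print(new_string)           # a b c
print(number_of_subs_made)  # 2
```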
				922
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	923
Serhiy Storchaka	8fc7bc2	2017-04-13 19:17:36 +0300	[diff] [blame]	924	.. function:: escape(pattern)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	925
Serhiy Storchaka	5908300	2017-04-13 21:06:43 +0300	[diff] [blame]	926	Escape special characters in pattern.
Ezio Melotti	88fdeb4	2011-04-10 12:59:16 +0300	[diff] [blame]	927	This is useful if you want to match an arbitrary literal string that may
Serhiy Storchaka	8fc7bc2	2017-04-13 19:17:36 +0300	[diff] [blame]	928	have regular expression metacharacters in it. For example::
				929
				930	>>> print(re.escape('python.exe'))
				931	python\.exe
				932
				933	>>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
				934	>>> print('[%s]+' % re.escape(legal_chars))
Serhiy Storchaka	05cb728	2017-11-16 12:38:26 +0200	[diff] [blame]	935	[abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+
Serhiy Storchaka	8fc7bc2	2017-04-13 19:17:36 +0300	[diff] [blame]	936
				937	>>> operators = ['+', '-', '*', '/', '**']
				938	>>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
Serhiy Storchaka	5908300	2017-04-13 21:06:43 +0300	[diff] [blame]	939	/|\-|\+|\*\*|\*
Ezio Melotti	88fdeb4	2011-04-10 12:59:16 +0300	[diff] [blame]	940
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	941	This function must not be used for the replacement string in :func:`sub`
				942	and :func:`subn`; only backslashes should be escaped. For example::
				943
				944	>>> digits_re = r'\d+'
				945	>>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
				946	>>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
				947	/usr/sbin/sendmail - \d+ errors, \d+ warnings
				948
Ezio Melotti	88fdeb4	2011-04-10 12:59:16 +0300	[diff] [blame]	949	.. versionchanged:: 3.3
				950	The ``'_'`` character is no longer escaped.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	951
Serhiy Storchaka	5908300	2017-04-13 21:06:43 +0300	[diff] [blame]	952	.. versionchanged:: 3.7
				953	Only characters that can have special meaning in a regular expression
				954	are escaped.
				955
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	956
R. David Murray	522c32a	2010-07-10 14:23:36 +0000	[diff] [blame]	957	.. function:: purge()
				958
				959	Clear the regular expression cache.
				960
				961
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	962	.. exception:: error(msg, pattern=None, pos=None)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	963
				964	Exception raised when a string passed to one of the functions here is not a
				965	valid regular expression (for example, it might contain unmatched parentheses)
				966	or when some other error occurs during compilation or matching. It is never an
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	967	error if a string contains no match for a pattern. The error instance has
				968	the following additional attributes:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	969
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	970	.. attribute:: msg
				971
				972	The unformatted error message.
				973
				974	.. attribute:: pattern
				975
				976	The regular expression pattern.
				977
				978	.. attribute:: pos
				979
Serhiy Storchaka	12d6b5d	2017-05-27 16:12:48 +0300	[diff] [blame]	980	The index in pattern where compilation failed (may be ``None``).
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	981
				982	.. attribute:: lineno
				983
Serhiy Storchaka	12d6b5d	2017-05-27 16:12:48 +0300	[diff] [blame]	984	The line corresponding to pos (may be ``None``).
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	985
				986	.. attribute:: colno
				987
Serhiy Storchaka	12d6b5d	2017-05-27 16:12:48 +0300	[diff] [blame]	988	The column corresponding to pos (may be ``None``).
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	989
				990	.. versionchanged:: 3.5
				991	Added additional attributes.
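The attributes can be inspected by catching the exception. The following is a small sketch with a deliberately invalid pattern chosen for illustration; the exact wording of ``msg`` may differ between Python versions, so it is printed rather than assumed:

```python
import re

try:
    re.compile(r"abc(def")  # unmatched opening parenthesis
except re.error as exc:
    caught = exc

print(caught.msg)      # human-readable description of the problem
print(caught.pattern)  # abc(def
print(caught.pos, caught.lineno, caught.colno)
```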
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	992
				993	.. _re-objects:
				994
				995	Regular Expression Objects
				996	--------------------------
				997
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	998	Compiled regular expression objects support the following methods and
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	999	attributes:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1000
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1001	.. method:: Pattern.search(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1002
Berker Peksag	84f387d	2016-06-08 14:56:56 +0300	[diff] [blame]	1003	Scan through string looking for the first location where this regular
				1004	expression produces a match, and return a corresponding :ref:`match object
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1005	<match-objects>`. Return ``None`` if no position in the string matches the
				1006	pattern; note that this is different from finding a zero-length match at some
				1007	point in the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1008
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1009	The optional second parameter pos gives an index in the string where the
				1010	search is to start; it defaults to ``0``. This is not completely equivalent to
				1011	slicing the string; the ``'^'`` pattern character matches at the real beginning
				1012	of the string and at positions just after a newline, but not necessarily at the
				1013	index where the search is to start.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1014
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1015	The optional parameter endpos limits how far the string will be searched; it
				1016	will be as if the string is endpos characters long, so only the characters
				1017	from pos to ``endpos - 1`` will be searched for a match. If endpos is less
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	1018	than pos, no match will be found; otherwise, if rx is a compiled regular
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1019	expression object, ``rx.search(string, 0, 50)`` is equivalent to
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1020	``rx.search(string[:50], 0)``. ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1021
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1022	>>> pattern = re.compile("d")
				1023	>>> pattern.search("dog") # Match at index 0
				1024	<re.Match object; span=(0, 1), match='d'>
				1025	>>> pattern.search("dog", 1) # No match; search doesn't include the "d"
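The difference between passing pos and slicing the string shows up with the ``'^'`` anchor. A minimal sketch, using an invented two-character string:

```python
import re

pattern = re.compile(r"^b")
text = "ab"

# With pos=1 the search starts at "b", but '^' still refers to the
# real beginning of the string, so nothing matches.
with_pos = pattern.search(text, 1)

# Slicing produces a new string that really does start with "b".
with_slice = pattern.search(text[1:])

# endpos=2 hides the final "g", so this search finds nothing.
with_endpos = re.compile("g").search("dog", 0, 2)

print(with_pos)             # None
print(with_slice.group(0))  # b
print(with_endpos)          # None
```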
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1026
				1027
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1028	.. method:: Pattern.match(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1029
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1030	If zero or more characters at the beginning of string match this regular
				1031	expression, return a corresponding :ref:`match object <match-objects>`.
				1032	Return ``None`` if the string does not match the pattern; note that this is
				1033	different from a zero-length match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1034
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1035	The optional pos and endpos parameters have the same meaning as for the
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1036	:meth:`~Pattern.search` method. ::
Benjamin Peterson	d7c3ed5	2010-06-27 22:32:30 +0000	[diff] [blame]	1037
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1038	>>> pattern = re.compile("o")
				1039	>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
				1040	>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
				1041	<re.Match object; span=(1, 2), match='o'>
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1042
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1043	If you want to locate a match anywhere in string, use
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1044	:meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1045
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1046
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1047	.. method:: Pattern.fullmatch(string[, pos[, endpos]])
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	1048
				1049	If the whole string matches this regular expression, return a corresponding
				1050	:ref:`match object <match-objects>`. Return ``None`` if the string does not
				1051	match the pattern; note that this is different from a zero-length match.
				1052
				1053	The optional pos and endpos parameters have the same meaning as for the
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1054	:meth:`~Pattern.search` method. ::
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	1055
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1056	>>> pattern = re.compile("o[gh]")
				1057	>>> pattern.fullmatch("dog") # No match as "o" is not at the start of "dog".
				1058	>>> pattern.fullmatch("ogre") # No match as not the full string matches.
				1059	>>> pattern.fullmatch("doggie", 1, 3) # Matches within given limits.
				1060	<re.Match object; span=(1, 3), match='og'>
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	1061
				1062	.. versionadded:: 3.4
				1063
				1064
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1065	.. method:: Pattern.split(string, maxsplit=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1066
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1067	Identical to the :func:`split` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1068
				1069
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1070	.. method:: Pattern.findall(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1071
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1072	Similar to the :func:`findall` function, using the compiled pattern, but
				1073	also accepts optional pos and endpos parameters that limit the search
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1074	region, as for :meth:`~Pattern.search`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1075
				1076
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1077	.. method:: Pattern.finditer(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1078
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1079	Similar to the :func:`finditer` function, using the compiled pattern, but
				1080	also accepts optional pos and endpos parameters that limit the search
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1081	region, as for :meth:`~Pattern.search`.
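A short sketch of how pos and endpos narrow the region for both methods; the sample string is made up, and reported positions remain relative to the original string:

```python
import re

digits = re.compile(r"\d")
text = "a1b2c3d4"

from_pos = digits.findall(text, 2)    # search text[2:] onwards
bounded = digits.findall(text, 2, 6)  # search indices 2 to 5 only
starts = [m.start() for m in digits.finditer(text, 2, 6)]

print(from_pos)  # ['2', '3', '4']
print(bounded)   # ['2', '3']
print(starts)    # [3, 5]
```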
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1082
				1083
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1084	.. method:: Pattern.sub(repl, string, count=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1085
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1086	Identical to the :func:`sub` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1087
				1088
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1089	.. method:: Pattern.subn(repl, string, count=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1090
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1091	Identical to the :func:`subn` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1092
				1093
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1094	.. attribute:: Pattern.flags
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1095
Georg Brandl	3a19e54	2012-03-17 17:29:27 +0100	[diff] [blame]	1096	The regex matching flags. This is a combination of the flags given to
				1097	:func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
				1098	flags such as :data:`UNICODE` if the pattern is a Unicode string.
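The three sources of flags can be observed by testing individual bits. A minimal sketch with an illustrative pattern:

```python
import re

p = re.compile(r"(?i)abc", re.MULTILINE)

from_inline = bool(p.flags & re.IGNORECASE)   # set by (?i) in the pattern
from_argument = bool(p.flags & re.MULTILINE)  # passed to compile()
implicit = bool(p.flags & re.UNICODE)         # implied by a str pattern

# A bytes pattern does not get the implicit UNICODE flag.
bytes_unicode = bool(re.compile(b"abc").flags & re.UNICODE)

print(from_inline, from_argument, implicit, bytes_unicode)
# True True True False
```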
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1099
				1100
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1101	.. attribute:: Pattern.groups
Georg Brandl	af265f4	2008-12-07 15:06:20 +0000	[diff] [blame]	1102
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1103	The number of capturing groups in the pattern.
Georg Brandl	af265f4	2008-12-07 15:06:20 +0000	[diff] [blame]	1104
				1105
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1106	.. attribute:: Pattern.groupindex
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1107
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1108	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
				1109	numbers. The dictionary is empty if no symbolic groups were used in the
				1110	pattern.
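For instance, with a pattern that mixes named and unnamed groups (the group names here are chosen only for illustration):

```python
import re

p = re.compile(r"(?P<year>\d{4})-(\d{2})-(?P<day>\d{2})")

group_count = p.groups          # counts named and unnamed groups
names = dict(p.groupindex)      # only the named groups appear here

print(group_count)  # 3
print(names)        # {'year': 1, 'day': 3}
```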
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1111
				1112
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1113	.. attribute:: Pattern.pattern
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1114
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1115	The pattern string from which the pattern object was compiled.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1116
				1117
Serhiy Storchaka	fdbd011	2017-04-16 10:16:03 +0300	[diff] [blame]	1118	.. versionchanged:: 3.7
				1119	Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Compiled
				1120	regular expression objects are considered atomic.
				1121
				1122
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1123	.. _match-objects:
				1124
				1125	Match Objects
				1126	-------------
				1127
Ezio Melotti	b87f82f	2012-11-04 06:59:22 +0200	[diff] [blame]	1128	Match objects always have a boolean value of ``True``.
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1129	Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
Ezio Melotti	b87f82f	2012-11-04 06:59:22 +0200	[diff] [blame]	1130	when there is no match, you can test whether there was a match with a simple
				1131	``if`` statement::
				1132
				1133	match = re.search(pattern, string)
				1134	if match:
				1135	    process(match)
				1136
				1137	Match objects support the following methods and attributes:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1138
				1139
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1140	.. method:: Match.expand(template)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1141
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1142	Return the string obtained by doing backslash substitution on the template
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1143	string template, as done by the :meth:`~Pattern.sub` method.
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1144	Escapes such as ``\n`` are converted to the appropriate characters,
				1145	and numeric backreferences (``\1``, ``\2``) and named backreferences
				1146	(``\g<1>``, ``\g<name>``) are replaced by the contents of the
				1147	corresponding group.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1148
Serhiy Storchaka	7438e4b	2014-10-10 11:06:31 +0300	[diff] [blame]	1149	.. versionchanged:: 3.5
				1150	Unmatched groups are replaced with an empty string.
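A minimal sketch of a template that mixes a named and a numeric backreference; the group names are illustrative:

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# \g<last> refers to the named group, \1 to the first group by number.
expanded = m.expand(r"\g<last>, \1")

print(expanded)  # Newton, Isaac
```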
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1151
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1152	.. method:: Match.group([group1, ...])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1153
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1154	Returns one or more subgroups of the match. If there is a single argument, the
				1155	result is a single string; if there are multiple arguments, the result is a
				1156	tuple with one item per argument. Without arguments, group1 defaults to zero
				1157	(the whole match is returned). If a groupN argument is zero, the corresponding
				1158	return value is the entire matching string; if it is in the inclusive range
				1159	[1..99], it is the string matching the corresponding parenthesized group. If a
				1160	group number is negative or larger than the number of groups defined in the
				1161	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
				1162	part of the pattern that did not match, the corresponding result is ``None``.
				1163	If a group is contained in a part of the pattern that matched multiple times,
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1164	the last match is returned. ::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1165
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1166	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
				1167	>>> m.group(0) # The entire match
				1168	'Isaac Newton'
				1169	>>> m.group(1) # The first parenthesized subgroup.
				1170	'Isaac'
				1171	>>> m.group(2) # The second parenthesized subgroup.
				1172	'Newton'
				1173	>>> m.group(1, 2) # Multiple arguments give us a tuple.
				1174	('Isaac', 'Newton')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1175
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1176	If the regular expression uses the ``(?P<name>...)`` syntax, the groupN
				1177	arguments may also be strings identifying groups by their group name. If a
				1178	string argument is not used as a group name in the pattern, an :exc:`IndexError`
				1179	exception is raised.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1180
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1181	A moderately complicated example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1182
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1183	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				1184	>>> m.group('first_name')
				1185	'Malcolm'
				1186	>>> m.group('last_name')
				1187	'Reynolds'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1188
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1189	Named groups can also be referred to by their index::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1190
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1191	>>> m.group(1)
				1192	'Malcolm'
				1193	>>> m.group(2)
				1194	'Reynolds'
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1195
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1196	If a group matches multiple times, only the last match is accessible::
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1197
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1198	>>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times.
				1199	>>> m.group(1) # Returns only the last match.
				1200	'c3'
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1201
Brian Curtin	48f16f9	2010-04-08 13:55:29 +0000	[diff] [blame]	1202
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1203	.. method:: Match.__getitem__(g)
Eric V. Smith	605bdae	2016-09-11 08:55:43 -0400	[diff] [blame]	1204
				1205	This is identical to ``m.group(g)``. This allows easier access to
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1206	an individual group from a match::
Eric V. Smith	605bdae	2016-09-11 08:55:43 -0400	[diff] [blame]	1207
				1208	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
				1209	>>> m[0] # The entire match
				1210	'Isaac Newton'
				1211	>>> m[1] # The first parenthesized subgroup.
				1212	'Isaac'
				1213	>>> m[2] # The second parenthesized subgroup.
				1214	'Newton'
				1215
				1216	.. versionadded:: 3.6
				1217
				1218
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1219	.. method:: Match.groups(default=None)
Brian Curtin	48f16f9	2010-04-08 13:55:29 +0000	[diff] [blame]	1220
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1221	Return a tuple containing all the subgroups of the match, from 1 up to however
				1222	many groups are in the pattern. The default argument is used for groups that
				1223	did not participate in the match; it defaults to ``None``.
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1224
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1225	For example::
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1226
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1227	>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
				1228	>>> m.groups()
				1229	('24', '1632')
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1230
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1231	If we make the decimal place and everything after it optional, not all groups
				1232	might participate in the match. These groups will default to ``None`` unless
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1233	the default argument is given::
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1234
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1235	>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
				1236	>>> m.groups() # Second group defaults to None.
				1237	('24', None)
				1238	>>> m.groups('0') # Now, the second group defaults to '0'.
				1239	('24', '0')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1240
				1241
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1242	.. method:: Match.groupdict(default=None)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1243
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1244	Return a dictionary containing all the named subgroups of the match, keyed by
				1245	the subgroup name. The default argument is used for groups that did not
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1246	participate in the match; it defaults to ``None``. For example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1247
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1248	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				1249	>>> m.groupdict()
				1250	{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
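The default argument only matters when a named group did not take part in the match. A small sketch, using hypothetical group names ``whole`` and ``frac``:

```python
import re

m = re.match(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?", "24")

without_default = m.groupdict()   # 'frac' did not participate
with_default = m.groupdict('0')   # '0' is substituted for it

print(without_default)  # {'whole': '24', 'frac': None}
print(with_default)     # {'whole': '24', 'frac': '0'}
```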
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1251
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1252
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1253	.. method:: Match.start([group])
				1254	Match.end([group])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1255
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1256	Return the indices of the start and end of the substring matched by group;
				1257	group defaults to zero (meaning the whole matched substring). Return ``-1`` if
				1258	group exists but did not contribute to the match. For a match object m, and
				1259	a group g that did contribute to the match, the substring matched by group g
				1260	(equivalent to ``m.group(g)``) is ::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1261
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1262	m.string[m.start(g):m.end(g)]
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1263
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1264	Note that ``m.start(group)`` will equal ``m.end(group)`` if group matched a
				1265	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
				1266	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
				1267	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1268
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1269	An example that will remove *remove_this* from email addresses::
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1270
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1271	>>> email = "tony@tiremove_thisger.net"
				1272	>>> m = re.search("remove_this", email)
				1273	>>> email[:m.start()] + email[m.end():]
				1274	'tony@tiger.net'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1275
				1276
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1277	.. method:: Match.span([group])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1278
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1279	For a match m, return the 2-tuple ``(m.start(group), m.end(group))``. Note
				1280	that if group did not contribute to the match, this is ``(-1, -1)``.
				1281	group defaults to zero, the entire match.
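A sketch contrasting a group that matched the empty string with one that did not participate at all; the pattern is a small variation invented for this illustration:

```python
import re

m = re.search(r"b(c?)(d)?", "cba")

whole = m.span()          # the whole match, "b"
empty_group = m.span(1)   # (c?) matched an empty string after "b"
unused_group = m.span(2)  # (d)? did not contribute to the match

print(whole)         # (1, 2)
print(empty_group)   # (2, 2)
print(unused_group)  # (-1, -1)
```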
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1282
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1283
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1284	.. attribute:: Match.pos
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1285
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1286	The value of pos which was passed to the :meth:`~Pattern.search` or
				1287	:meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
Georg Brandl	69c7a69	2012-03-14 08:02:43 +0100	[diff] [blame]	1288	the index into the string at which the RE engine started looking for a match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1289
				1290
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1291	.. attribute:: Match.endpos
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1292
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1293	The value of *endpos* which was passed to the :meth:`~Pattern.search` or
				1294	:meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
Georg Brandl	69c7a69	2012-03-14 08:02:43 +0100	[diff] [blame]	1295	the index into the string beyond which the RE engine will not go.
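A short sketch of how ``pos`` and ``endpos`` show up on the match object may help. The sample string and the index values are arbitrary choices for this illustration:

```python
import re

pattern = re.compile(r"\d+")

# Only the slice between index 5 (inclusive) and index 9 (exclusive) is searched,
# so the first run of digits ("123") is skipped and the final "6" is out of reach.
m = pattern.search("ab123cd456", 5, 9)

print(m.pos, m.endpos)     # the limits that were passed to search()
print(m.span(), m.group())  # indices are still relative to the whole string
```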
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1296
				1297
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1298	.. attribute:: Match.lastindex
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1299
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1300	The integer index of the last matched capturing group, or ``None`` if no group
				1301	was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
				1302	``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
				1303	the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
				1304	string.
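The claims above can be checked with a few lines of code. This is only a sketch; the variable names are invented for the example:

```python
import re

# The four expressions discussed above, all applied to the string 'ab'.
patterns = [r"(a)b", r"((a)(b))", r"((ab))", r"(a)(b)"]
last_indexes = {p: re.match(p, "ab").lastindex for p in patterns}

# A pattern without any capturing group leaves lastindex as None.
no_groups = re.match(r"ab", "ab").lastindex

print(last_indexes, no_groups)
```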
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1305
				1306
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1307	.. attribute:: Match.lastgroup
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1308
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1309	The name of the last matched capturing group, or ``None`` if the group didn't
				1310	have a name, or if no group was matched at all.
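For comparison, here is a minimal sketch with one pattern that uses named groups and one that does not. The group names ``first`` and ``last`` and the sample input are assumptions made for this example:

```python
import re

named = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Jane Doe")
unnamed = re.match(r"(\w+) (\w+)", "Jane Doe")

# The last group to match is group 2 in both cases, but only the first
# pattern gives that group a name.
print(named.lastindex, named.lastgroup)
print(unnamed.lastindex, unnamed.lastgroup)
```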
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1311
				1312
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1313	.. attribute:: Match.re
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1314
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1315	The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1316	:meth:`~Pattern.search` method produced this match instance.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1317
				1318
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1319	.. attribute:: Match.string
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1320
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1321	The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1322
				1323
Serhiy Storchaka	fdbd011	2017-04-16 10:16:03 +0300	[diff] [blame]	1324	.. versionchanged:: 3.7
				1325	Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Match objects
				1326	are considered atomic.
				1327
				1328
Raymond Hettinger	1fa7682	2010-12-06 23:31:36 +0000	[diff] [blame]	1329	.. _re-examples:
				1330
				1331	Regular Expression Examples
				1332	---------------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1333
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1334
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	1335	Checking for a Pair
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1336	^^^^^^^^^^^^^^^^^^^
				1337
				1338	In this example, we'll use the following helper function to display match
Stéphane Wirtel	859c068	2018-10-12 09:51:05 +0200	[diff] [blame]	1339	objects a little more gracefully::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1340
				1341	def displaymatch(match):
				1342	    if match is None:
				1343	        return None
				1344	    return '<Match: %r, groups=%r>' % (match.group(), match.groups())
				1345
				1346	Suppose you are writing a poker program where a player's hand is represented as
				1347	a 5-character string with each character representing a card, "a" for ace, "k"
Ezio Melotti	e5b2ac8	2011-12-17 01:17:17 +0200	[diff] [blame]	1348	for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1349	representing the card with that value.
				1350
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1351	To see if a given string is a valid hand, one could do the following::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1352
Ezio Melotti	e5b2ac8	2011-12-17 01:17:17 +0200	[diff] [blame]	1353	>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
				1354	>>> displaymatch(valid.match("akt5q")) # Valid.
				1355	"<Match: 'akt5q', groups=()>"
				1356	>>> displaymatch(valid.match("akt5e")) # Invalid.
				1357	>>> displaymatch(valid.match("akt")) # Invalid.
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1358	>>> displaymatch(valid.match("727ak")) # Valid.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1359	"<Match: '727ak', groups=()>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1360
				1361	That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1362	To match this with a regular expression, one could use backreferences as such::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1363
				1364	>>> pair = re.compile(r".*(.).*\1")
				1365	>>> displaymatch(pair.match("717ak")) # Pair of 7s.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1366	"<Match: '717', groups=('7',)>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1367	>>> displaymatch(pair.match("718ak")) # No pairs.
				1368	>>> displaymatch(pair.match("354aa")) # Pair of aces.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1369	"<Match: '354aa', groups=('a',)>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1370
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	1371	To find out what card the pair consists of, one could use the
Stéphane Wirtel	859c068	2018-10-12 09:51:05 +0200	[diff] [blame]	1372	:meth:`~Match.group` method of the match object in the following manner::
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1373
Stéphane Wirtel	859c068	2018-10-12 09:51:05 +0200	[diff] [blame]	1374	>>> pair = re.compile(r".*(.).*\1")
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1375	>>> pair.match("717ak").group(1)
				1376	'7'
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1377
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1378	# Error because pair.match() returns None, which doesn't have a group() method:
				1379	>>> pair.match("718ak").group(1)
				1380	Traceback (most recent call last):
				1381	  File "<pyshell#23>", line 1, in <module>
				1382	    pair.match("718ak").group(1)
				1383	AttributeError: 'NoneType' object has no attribute 'group'
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1384
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1385	>>> pair.match("354aa").group(1)
				1386	'a'
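One way to avoid the :exc:`AttributeError` shown above is to test the result for ``None`` before calling :meth:`~Match.group`. The helper name ``pair_card`` below is hypothetical, and the sketch assumes the pattern ``r".*(.).*\1"``:

```python
import re

# Any card, somewhere later followed by the same card again.
pair = re.compile(r".*(.).*\1")

def pair_card(hand):
    """Return the card that forms a pair in hand, or None if there is no pair."""
    m = pair.match(hand)
    return m.group(1) if m is not None else None

for hand in ("717ak", "718ak", "354aa"):
    print(hand, pair_card(hand))
```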
				1387
				1388
				1389	Simulating scanf()
				1390	^^^^^^^^^^^^^^^^^^
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1391
				1392	.. index:: single: scanf()
				1393
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1394	Python does not currently have an equivalent to :c:func:`scanf`. Regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1395	expressions are generally more powerful, though also more verbose, than
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1396	:c:func:`scanf` format strings. The table below offers some more-or-less
				1397	equivalent mappings between :c:func:`scanf` format tokens and regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1398	expressions.
				1399
				1400	+--------------------------------+---------------------------------------------+
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1401	| :c:func:`scanf` Token          | Regular Expression                          |
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1402	+================================+=============================================+
				1403	| ``%c``                         | ``.``                                       |
				1404	+--------------------------------+---------------------------------------------+
				1405	| ``%5c``                        | ``.{5}``                                    |
				1406	+--------------------------------+---------------------------------------------+
				1407	| ``%d``                         | ``[-+]?\d+``                                |
				1408	+--------------------------------+---------------------------------------------+
				1409	| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
				1410	+--------------------------------+---------------------------------------------+
				1411	| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
				1412	+--------------------------------+---------------------------------------------+
Ezio Melotti	a0b1d1e	2012-04-29 11:47:28 +0300	[diff] [blame]	1413	| ``%o``                         | ``[-+]?[0-7]+``                             |
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1414	+--------------------------------+---------------------------------------------+
				1415	| ``%s``                         | ``\S+``                                     |
				1416	+--------------------------------+---------------------------------------------+
				1417	| ``%u``                         | ``\d+``                                     |
				1418	+--------------------------------+---------------------------------------------+
Ezio Melotti	a0b1d1e	2012-04-29 11:47:28 +0300	[diff] [blame]	1419	| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1420	+--------------------------------+---------------------------------------------+
				1421
				1422	To extract the filename and numbers from a string like ::
				1423
				1424	/usr/sbin/sendmail - 0 errors, 4 warnings
				1425
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1426	you would use a :c:func:`scanf` format like ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1427
				1428	%s - %d errors, %d warnings
				1429
				1430	The equivalent regular expression would be ::
				1431
				1432	(\S+) - (\d+) errors, (\d+) warnings
				1433
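Applied from Python, the regular expression above could be used roughly as follows. The variable names are invented for this sketch, and the numeric fields are converted with :func:`int` because groups are always returned as strings:

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

# Unlike scanf(), the regular expression only extracts text;
# converting to numbers is a separate, explicit step.
filename = m.group(1)
num_errors = int(m.group(2))
num_warnings = int(m.group(3))

print(filename, num_errors, num_warnings)
```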
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1434
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1435	.. _search-vs-match:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1436
				1437	search() vs. match()
				1438	^^^^^^^^^^^^^^^^^^^^
				1439
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1440	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1441
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1442	Python offers two different primitive operations based on regular expressions:
				1443	:func:`re.match` checks for a match only at the beginning of the string, while
				1444	:func:`re.search` checks for a match anywhere in the string (this is what Perl
				1445	does by default).
				1446
				1447	For example::
				1448
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	1449	>>> re.match("c", "abcdef") # No match
				1450	>>> re.search("c", "abcdef") # Match
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1451	<re.Match object; span=(2, 3), match='c'>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1452
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1453	Regular expressions beginning with ``'^'`` can be used with :func:`search` to
				1454	restrict the match to the beginning of the string::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1455
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	1456	>>> re.match("c", "abcdef") # No match
				1457	>>> re.search("^c", "abcdef") # No match
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1458	>>> re.search("^a", "abcdef") # Match
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1459	<re.Match object; span=(0, 1), match='a'>
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1460
				1461	Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
				1462	beginning of the string, whereas using :func:`search` with a regular expression
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1463	beginning with ``'^'`` will match at the beginning of each line. ::
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1464
				1465	>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
				1466	>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1467	<re.Match object; span=(4, 5), match='X'>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1468
				1469
				1470	Making a Phonebook
				1471	^^^^^^^^^^^^^^^^^^
				1472
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1473	:func:`split` splits a string into a list delimited by the passed pattern. The
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1474	function is invaluable for converting textual data into data structures that can be
				1475	easily read and modified by Python as demonstrated in the following example that
				1476	creates a phonebook.
				1477
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1478	First, here is the input. Normally it would come from a file; here we use
Stéphane Wirtel	859c068	2018-10-12 09:51:05 +0200	[diff] [blame]	1479	triple-quoted string syntax.
				1480
				1481	.. doctest::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1482
Georg Brandl	557a3ec	2012-03-17 17:26:27 +0100	[diff] [blame]	1483	>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1484	...
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1485	... Ronald Heathmore: 892.345.3428 436 Finley Avenue
				1486	... Frank Burger: 925.541.7625 662 South Dogwood Way
				1487	...
				1488	...
				1489	... Heather Albrecht: 548.326.4584 919 Park Place"""
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1490
				1491	The entries are separated by one or more newlines. Now we convert the string
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1492	into a list with each nonempty line having its own entry:
				1493
				1494	.. doctest::
				1495	:options: +NORMALIZE_WHITESPACE
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1496
Georg Brandl	557a3ec	2012-03-17 17:26:27 +0100	[diff] [blame]	1497	>>> entries = re.split("\n+", text)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1498	>>> entries
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1499	['Ross McFluff: 834.345.1254 155 Elm Street',
				1500	'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
				1501	'Frank Burger: 925.541.7625 662 South Dogwood Way',
				1502	'Heather Albrecht: 548.326.4584 919 Park Place']
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1503
				1504	Finally, split each entry into a list with first name, last name, telephone
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1505	number, and address. We use the ``maxsplit`` parameter of :func:`split`
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1506	because the address has spaces, our splitting pattern, in it:
				1507
				1508	.. doctest::
				1509	:options: +NORMALIZE_WHITESPACE
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1510
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1511	>>> [re.split(":? ", entry, 3) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1512	[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
				1513	['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
				1514	['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
				1515	['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
				1516
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1517	The ``:?`` pattern matches the colon after the last name, so that it does not
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1518	occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1519	house number from the street name:
				1520
				1521	.. doctest::
				1522	:options: +NORMALIZE_WHITESPACE
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1523
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1524	>>> [re.split(":? ", entry, 4) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1525	[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
				1526	['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
				1527	['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
				1528	['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
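To go one step further, the split fields can be collected into a dictionary keyed by full name. This is only a sketch of one possible layout; the ``phonebook`` structure and its ``"phone"`` and ``"address"`` keys are assumptions, not part of the example above:

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # At most three splits, so the address keeps its internal spaces.
    first, last, phone, address = re.split(r":? ", entry, maxsplit=3)
    phonebook[f"{first} {last}"] = {"phone": phone, "address": address}

print(phonebook["Frank Burger"])
```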
				1529
				1530
				1531	Text Munging
				1532	^^^^^^^^^^^^
				1533
				1534	:func:`sub` replaces every occurrence of a pattern with a string or the
				1535	result of a function. This example demonstrates using :func:`sub` with
				1536	a function to "munge" text, or randomize the order of all the characters
				1537	in each word of a sentence except for the first and last characters::
				1538
				1539	>>> def repl(m):
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	1540	...     inner_word = list(m.group(2))
				1541	...     random.shuffle(inner_word)
				1542	...     return m.group(1) + "".join(inner_word) + m.group(3)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1543	>>> text = "Professor Abdolmalek, please report your absences promptly."
Georg Brandl	db4e939	2010-07-12 09:06:13 +0000	[diff] [blame]	1544	>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1545	'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
Georg Brandl	db4e939	2010-07-12 09:06:13 +0000	[diff] [blame]	1546	>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1547	'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
				1548
				1549
				1550	Finding all Adverbs
				1551	^^^^^^^^^^^^^^^^^^^
				1552
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1553	:func:`findall` matches all occurrences of a pattern, not just the first
Andrés Delfino	5092439	2018-06-18 01:34:30 -0300	[diff] [blame]	1554	one as :func:`search` does. For example, if a writer wanted to
				1555	find all of the adverbs in some text, they might use :func:`findall` in
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1556	the following manner::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1557
				1558	>>> text = "He was carefully disguised but captured quickly by police."
				1559	>>> re.findall(r"\w+ly", text)
				1560	['carefully', 'quickly']
				1561
				1562
				1563	Finding all Adverbs and their Positions
				1564	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				1565
				1566	If one wants more information about all matches of a pattern than the matched
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1567	text, :func:`finditer` is useful as it provides :ref:`match objects
				1568	<match-objects>` instead of strings. Continuing with the previous example, if
Andrés Delfino	5092439	2018-06-18 01:34:30 -0300	[diff] [blame]	1569	a writer wanted to find all of the adverbs and their positions in
				1570	some text, they would use :func:`finditer` in the following manner::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1571
				1572	>>> text = "He was carefully disguised but captured quickly by police."
				1573	>>> for m in re.finditer(r"\w+ly", text):
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1574	...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1575	07-16: carefully
				1576	40-47: quickly
				1577
				1578
				1579	Raw String Notation
				1580	^^^^^^^^^^^^^^^^^^^
				1581
				1582	Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
				1583	every backslash (``'\'``) in a regular expression would have to be prefixed with
				1584	another one to escape it. For example, the two following lines of code are
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1585	functionally identical::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1586
				1587	>>> re.match(r"\W(.)\1\W", " ff ")
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1588	<re.Match object; span=(0, 4), match=' ff '>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1589	>>> re.match("\\W(.)\\1\\W", " ff ")
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1590	<re.Match object; span=(0, 4), match=' ff '>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1591
				1592	When one wants to match a literal backslash, it must be escaped in the regular
				1593	expression. With raw string notation, this means ``r"\\"``. Without raw string
				1594	notation, one must use ``"\\\\"``, making the following lines of code
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1595	functionally identical::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1596
				1597	>>> re.match(r"\\", r"\\")
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1598	<re.Match object; span=(0, 1), match='\\'>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1599	>>> re.match("\\\\", r"\\")
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1600	<re.Match object; span=(0, 1), match='\\'>
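The equivalence of the two spellings can also be checked directly on the string objects. The variable names here are made up for this sketch:

```python
import re

raw = r"\\"        # raw string: two backslash characters
escaped = "\\\\"   # ordinary string: four typed backslashes, two characters

same_pattern = raw == escaped
pattern_length = len(raw)

# As a regular expression, the two characters match one literal backslash.
subject = "\\"     # a string holding a single backslash
m = re.match(raw, subject)

print(same_pattern, pattern_length, len(m.group()))
```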
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1601
				1602
				1603	Writing a Tokenizer
				1604	^^^^^^^^^^^^^^^^^^^
				1605
Georg Brandl	5d94134	2016-02-26 19:37:12 +0100	[diff] [blame]	1606	A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1607	analyzes a string to categorize groups of characters. This is a useful first
				1608	step in writing a compiler or interpreter.
				1609
				1610	The text categories are specified with regular expressions. The technique is
				1611	to combine those into a single master regular expression and to loop over
				1612	successive matches::
				1613
Raymond Hettinger	4b244ef	2011-05-23 12:45:34 -0700	[diff] [blame]	1614	import collections
				1615	import re
				1616
Raymond Hettinger	b83942c	2018-11-09 01:19:33 -0800	[diff] [blame]	1617	Token = collections.namedtuple('Token', ['type', 'value', 'line', 'column'])
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1618
Raymond Hettinger	c566431	2014-08-03 23:38:54 -0700	[diff] [blame]	1619	def tokenize(code):
Raymond Hettinger	4b244ef	2011-05-23 12:45:34 -0700	[diff] [blame]	1620	    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
				1621	    token_specification = [
Raymond Hettinger	b83942c	2018-11-09 01:19:33 -0800	[diff] [blame]	1622	        ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
				1623	        ('ASSIGN',   r':='),           # Assignment operator
				1624	        ('END',      r';'),            # Statement terminator
				1625	        ('ID',       r'[A-Za-z]+'),    # Identifiers
				1626	        ('OP',       r'[+\-*/]'),      # Arithmetic operators
				1627	        ('NEWLINE',  r'\n'),           # Line endings
				1628	        ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
				1629	        ('MISMATCH', r'.'),            # Any other character
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1630	    ]
Raymond Hettinger	4b244ef	2011-05-23 12:45:34 -0700	[diff] [blame]	1631	    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
Raymond Hettinger	c566431	2014-08-03 23:38:54 -0700	[diff] [blame]	1632	    line_num = 1
				1633	    line_start = 0
				1634	    for mo in re.finditer(tok_regex, code):
				1635	        kind = mo.lastgroup
Raymond Hettinger	b83942c	2018-11-09 01:19:33 -0800	[diff] [blame]	1636	        value = mo.group()
				1637	        column = mo.start() - line_start
				1638	        if kind == 'NUMBER':
				1639	            value = float(value) if '.' in value else int(value)
				1640	        elif kind == 'ID' and value in keywords:
				1641	            kind = value
				1642	        elif kind == 'NEWLINE':
Raymond Hettinger	c566431	2014-08-03 23:38:54 -0700	[diff] [blame]	1643	            line_start = mo.end()
				1644	            line_num += 1
Raymond Hettinger	b83942c	2018-11-09 01:19:33 -0800	[diff] [blame]	1645	            continue
Raymond Hettinger	c566431	2014-08-03 23:38:54 -0700	[diff] [blame]	1646	        elif kind == 'SKIP':
Raymond Hettinger	b83942c	2018-11-09 01:19:33 -0800	[diff] [blame]	1647	            continue
Raymond Hettinger	c566431	2014-08-03 23:38:54 -0700	[diff] [blame]	1648	        elif kind == 'MISMATCH':
Raymond Hettinger	d0b9158	2017-02-06 07:15:31 -0800	[diff] [blame]	1649	            raise RuntimeError(f'{value!r} unexpected on line {line_num}')
Raymond Hettinger	b83942c	2018-11-09 01:19:33 -0800	[diff] [blame]	1650	        yield Token(kind, value, line_num, column)
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1651
Raymond Hettinger	4b244ef	2011-05-23 12:45:34 -0700	[diff] [blame]	1652	statements = '''
				1653	    IF quantity THEN
				1654	        total := total + price * quantity;
				1655	        tax := price * 0.05;
				1656	    ENDIF;
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1657	'''
Raymond Hettinger	23157e5	2011-05-13 01:38:31 -0700	[diff] [blame]	1658
				1659	for token in tokenize(statements):
				1660	    print(token)
				1661
				1662	The tokenizer produces the following output::
Raymond Hettinger	9c47d77	2011-05-13 01:03:50 -0700	[diff] [blame]	1663
Raymond Hettinger	b83942c	2018-11-09 01:19:33 -0800	[diff] [blame]	1664	Token(type='IF', value='IF', line=2, column=4)
				1665	Token(type='ID', value='quantity', line=2, column=7)
				1666	Token(type='THEN', value='THEN', line=2, column=16)
				1667	Token(type='ID', value='total', line=3, column=8)
				1668	Token(type='ASSIGN', value=':=', line=3, column=14)
				1669	Token(type='ID', value='total', line=3, column=17)
				1670	Token(type='OP', value='+', line=3, column=23)
				1671	Token(type='ID', value='price', line=3, column=25)
				1672	Token(type='OP', value='*', line=3, column=31)
				1673	Token(type='ID', value='quantity', line=3, column=33)
				1674	Token(type='END', value=';', line=3, column=41)
				1675	Token(type='ID', value='tax', line=4, column=8)
				1676	Token(type='ASSIGN', value=':=', line=4, column=12)
				1677	Token(type='ID', value='price', line=4, column=15)
				1678	Token(type='OP', value='*', line=4, column=21)
				1679	Token(type='NUMBER', value=0.05, line=4, column=23)
				1680	Token(type='END', value=';', line=4, column=27)
				1681	Token(type='ENDIF', value='ENDIF', line=5, column=4)
				1682	Token(type='END', value=';', line=5, column=9)
Berker Peksag	a0a42d2	2018-03-23 16:46:52 +0300	[diff] [blame]	1683
				1684
				1685	.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
				1686	Media, 2009. The third edition of the book no longer covers Python at all,
				1687	but the first edition covered writing good regular expression patterns in
				1688	great detail.