Blame - Doc/library/re.rst - platform/external/python/cpython3


Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1	:mod:`re` --- Regular expression operations
				2	===========================================
				3
				4	.. module:: re
				5	:synopsis: Regular expression operations.
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	6
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	7	.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
				8	.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
				9
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	10	Source code: :source:`Lib/re.py`
				11
				12	--------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	13
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	14	This module provides regular expression matching operations similar to
Georg Brandl	ed2a1db	2009-06-08 07:48:27 +0000	[diff] [blame]	15	those found in Perl.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	16
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	17	Both patterns and strings to be searched can be Unicode strings (:class:`str`)
				18	as well as 8-bit strings (:class:`bytes`).
				19	However, Unicode strings and 8-bit strings cannot be mixed:
Martin Panter	6245cb3	2016-04-15 02:14:19 +0000	[diff] [blame]	20	that is, you cannot match a Unicode string with a byte pattern or
Georg Brandl	ae2dbe2	2009-03-13 19:04:40 +0000	[diff] [blame]	21	vice-versa; similarly, when asking for a substitution, the replacement
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	22	string must be of the same type as both the pattern and the search string.
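The type rule above can be sketched as follows. This is a minimal illustration; the sample text and variable names are invented for the example.

```python
import re

# A str pattern works on str input, and a bytes pattern on bytes input.
str_match = re.search(r"\d+", "order 66")
bytes_match = re.search(rb"\d+", b"order 66")

# Mixing the two types raises TypeError.
try:
    re.search(rb"\d+", "order 66")
    mixed_ok = True
except TypeError:
    mixed_ok = False
```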
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	23
				24	Regular expressions use the backslash character (``'\'``) to indicate
				25	special forms or to allow special characters to be used without invoking
				26	their special meaning. This collides with Python's usage of the same
				27	character for the same purpose in string literals; for example, to match
				28	a literal backslash, one might have to write ``'\\\\'`` as the pattern
				29	string, because the regular expression must be ``\\``, and each
				30	backslash must be expressed as ``\\`` inside a regular Python string
				31	literal.
				32
				33	The solution is to use Python's raw string notation for regular expression
				34	patterns; backslashes are not handled in any special way in a string literal
				35	prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
				36	``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	37	newline. Usually patterns will be expressed in Python code using this raw
				38	string notation.
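A short sketch of the difference between raw and ordinary string literals; the ``path`` value is an assumption made up for the example.

```python
import re

# The raw literal keeps the backslash: two characters, '\' and 'n'.
raw_len = len(r"\n")
# The ordinary literal is a single newline character.
plain_len = len("\n")

# Both spellings below denote the same two-character regex \\,
# which matches one literal backslash.
path = "C:\\temp"  # the seven-character string C:\temp
hit_plain = re.search("\\\\", path)
hit_raw = re.search(r"\\", path)
```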
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	39
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	40	It is important to note that most regular expression operations are available as
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	41	module-level functions and methods on
				42	:ref:`compiled regular expressions <re-objects>`. The functions are shortcuts
				43	that don't require you to compile a regex object first, but miss some
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	44	fine-tuning parameters.
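As an illustration of one such fine-tuning parameter, the sketch below uses the optional start position accepted by the ``search()`` method of a compiled pattern. The sample text is invented for the example.

```python
import re

text = "cat bat rat"

# Module-level shortcut: no explicit compile step.
first = re.search(r"[bcr]at", text).group()

# Compiled pattern: search() also accepts a start position,
# which the module-level function does not.
pattern = re.compile(r"[bcr]at")
later = pattern.search(text, 4).group()
```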
				45
Marco Buttu	ed6795e	2017-02-26 16:26:23 +0100	[diff] [blame]	46	.. seealso::
				47
Miss Islington (bot)	51b2f6d	2018-05-16 07:05:46 -0700	[diff] [blame]	48	The third-party `regex <https://pypi.org/project/regex/>`_ module,
Marco Buttu	ed6795e	2017-02-26 16:26:23 +0100	[diff] [blame]	49	which has an API compatible with the standard library :mod:`re` module,
				50	but offers additional functionality and more thorough Unicode support.
				51
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	52
				53	.. _re-syntax:
				54
				55	Regular Expression Syntax
				56	-------------------------
				57
				58	A regular expression (or RE) specifies a set of strings that matches it; the
				59	functions in this module let you check if a particular string matches a given
				60	regular expression (or if a given regular expression matches a particular
				61	string, which comes down to the same thing).
				62
				63	Regular expressions can be concatenated to form new regular expressions; if A
				64	and B are both regular expressions, then AB is also a regular expression.
				65	In general, if a string p matches A and another string q matches B, the
				66	string pq will match AB. This holds unless A or B contains low-precedence
				67	operations, boundary conditions between A and B, or numbered group
				68	references. Thus, complex expressions can easily be constructed from simpler
				69	primitive expressions like the ones described here. For details of the theory
Miss Islington (bot)	67d3f8b	2018-03-23 08:55:26 -0700	[diff] [blame]	70	and implementation of regular expressions, consult the Friedl book [Frie09]_,
				71	or almost any textbook about compiler construction.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	72
				73	A brief explanation of the format of regular expressions follows. For further
Christian Heimes	2202f87	2008-02-06 14:31:34 +0000	[diff] [blame]	74	information and a gentler presentation, consult the :ref:`regex-howto`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	75
				76	Regular expressions can contain both special and ordinary characters. Most
				77	ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
				78	expressions; they simply match themselves. You can concatenate ordinary
				79	characters, so ``last`` matches the string ``'last'``. (In the rest of this
				80	section, we'll write REs in ``this special style``, usually without quotes, and
				81	strings to be matched ``'in single quotes'``.)
				82
				83	Some characters, like ``'|'`` or ``'('``, are special. Special
				84	characters either stand for classes of ordinary characters, or affect
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	85	how the regular expressions around them are interpreted.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	86
Martin Panter	684340e	2016-10-15 01:18:16 +0000	[diff] [blame]	87	Repetition qualifiers (``*``, ``+``, ``?``, ``{m,n}``, etc.) cannot be
				88	directly nested. This avoids ambiguity with the non-greedy modifier suffix
				89	``?``, and with other modifiers in other implementations. To apply a second
				90	repetition to an inner repetition, parentheses may be used. For example,
				91	the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
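The nesting rule can be checked with a small sketch like the one below, which assumes that a directly stacked repetition is rejected with ``re.error`` at compile time.

```python
import re

# Directly stacking two repetitions is rejected at compile time.
try:
    re.compile(r"a{6}*")
    stacked_ok = True
except re.error:
    stacked_ok = False

# Wrapping the inner repetition in a non-capturing group is accepted.
six_a = re.compile(r"(?:a{6})*")
twelve = six_a.fullmatch("a" * 12) is not None
seven = six_a.fullmatch("a" * 7) is not None
```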
				92
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	93
				94	The special characters are:
				95
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	96	.. index:: single: .; in regular expressions
				97
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	98	``.``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	99	(Dot.) In the default mode, this matches any character except a newline. If
				100	the :const:`DOTALL` flag has been specified, this matches any character
				101	including a newline.
				102
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	103	.. index:: single: ^; in regular expressions
				104
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	105	``^``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	106	(Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
				107	matches immediately after each newline.
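A brief sketch of how :const:`DOTALL` and :const:`MULTILINE` change ``.`` and ``^``; the two-line sample string is made up for the example.

```python
import re

text = "one\ntwo"

# Without DOTALL, '.' stops at the newline, so each line is a separate match.
dot_default = re.findall(r".+", text)
# With DOTALL, '.' also matches the newline.
dot_all = re.findall(r".+", text, re.DOTALL)

# '^' matches only at the start of the string by default...
caret_default = re.findall(r"^\w+", text)
# ...and also after each newline in MULTILINE mode.
caret_multi = re.findall(r"^\w+", text, re.MULTILINE)
```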
				108
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	109	.. index:: single: $; in regular expressions
				110
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	111	``$``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	112	Matches the end of the string or just before the newline at the end of the
				113	string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
				114	matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
				115	only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
Christian Heimes	25bb783	2008-01-11 16:17:00 +0000	[diff] [blame]	116	matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
				117	a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
				118	the newline, and one at the end of the string.
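The ``$`` examples from the text can be reproduced with a sketch along these lines:

```python
import re

text = "foo1\nfoo2\n"

# Default mode: '$' matches at the end, or just before a final newline.
default_hit = re.search(r"foo.$", text).group()
# MULTILINE mode: '$' also matches before every newline.
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group()

# A bare '$' finds two empty matches in 'foo\n'.
dollar_spans = [m.span() for m in re.finditer(r"$", "foo\n")]
```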
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	119
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	120	.. index:: single: *; in regular expressions
				121
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	122	``*``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	123	Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
				124	many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
				125	by any number of 'b's.
				126
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	127	.. index:: single: +; in regular expressions
				128
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	129	``+``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	130	Causes the resulting RE to match 1 or more repetitions of the preceding RE.
				131	``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
				132	match just 'a'.
				133
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	134	.. index:: single: ?; in regular expressions
				135
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	136	``?``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	137	Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
				138	``ab?`` will match either 'a' or 'ab'.
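The three basic qualifiers can be compared side by side. In this sketch, ``accepts`` is a hypothetical helper written for the example, not part of the module.

```python
import re

def accepts(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["a", "ab", "abbb"]

star = accepts(r"ab*", candidates)      # zero or more 'b'
plus = accepts(r"ab+", candidates)      # one or more 'b'
optional = accepts(r"ab?", candidates)  # zero or one 'b'
```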
				139
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	140	.. index::
				141	single: *?; in regular expressions
				142	single: +?; in regular expressions
				143	single: ??; in regular expressions
				144
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	145	``*?``, ``+?``, ``??``
				146	The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
				147	as much text as possible. Sometimes this behaviour isn't desired; if the RE
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	148	``<.*>`` is matched against ``'<a> b <c>'``, it will match the entire
				149	string, and not just ``'<a>'``. Adding ``?`` after the qualifier makes it
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	150	perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
Georg Brandl	7ff033b	2016-04-12 07:51:41 +0200	[diff] [blame]	151	characters as possible will be matched. Using the RE ``<.*?>`` will match
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	152	only ``'<a>'``.
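The greedy and non-greedy behaviour described above, as a minimal sketch:

```python
import re

text = "<a> b <c>"

# Greedy: '.*' runs to the last '>' in the string.
greedy = re.match(r"<.*>", text).group()
# Non-greedy: '.*?' stops at the first '>'.
lazy = re.match(r"<.*?>", text).group()
```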
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	153
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	154	.. index::
				155	single: {; in regular expressions
				156	single: }; in regular expressions
				157
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	158	``{m}``
				159	Specifies that exactly m copies of the previous RE should be matched; fewer
				160	matches cause the entire RE not to match. For example, ``a{6}`` will match
				161	exactly six ``'a'`` characters, but not five.
				162
				163	``{m,n}``
				164	Causes the resulting RE to match from m to n repetitions of the preceding
				165	RE, attempting to match as many repetitions as possible. For example,
				166	``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting m specifies a
				167	lower bound of zero, and omitting n specifies an infinite upper bound. As an
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	168	example, ``a{4,}b`` will match ``'aaaab'`` or a thousand ``'a'`` characters
				169	followed by a ``'b'``, but not ``'aaab'``. The comma may not be omitted or the
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	170	modifier would be confused with the previously described form.
				171
				172	``{m,n}?``
				173	Causes the resulting RE to match from m to n repetitions of the preceding
				174	RE, attempting to match as few repetitions as possible. This is the
				175	non-greedy version of the previous qualifier. For example, on the
				176	6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
				177	while ``a{3,5}?`` will only match 3 characters.
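A small sketch of the bounded qualifiers, using the strings from the text above:

```python
import re

s = "aaaaaa"

# Greedy form takes as many repetitions as allowed (5 of the 6).
greedy_len = len(re.match(r"a{3,5}", s).group())
# Non-greedy form takes as few as allowed (3).
lazy_len = len(re.match(r"a{3,5}?", s).group())

# Open upper bound: at least four 'a' characters, then 'b'.
open_ended = [
    re.fullmatch(r"a{4,}b", candidate) is not None
    for candidate in ("aaab", "aaaab", "a" * 1000 + "b")
]
```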
				178
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	179	.. index:: single: \; in regular expressions
				180
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	181	``\``
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	182	Either escapes special characters (permitting you to match characters like
				183	``'*'``, ``'?'``, and so forth), or signals a special sequence; special
				184	sequences are discussed below.
				185
				186	If you're not using a raw string to express the pattern, remember that Python
				187	also uses the backslash as an escape sequence in string literals; if the escape
				188	sequence isn't recognized by Python's parser, the backslash and subsequent
				189	character are included in the resulting string. However, if Python would
				190	recognize the resulting sequence, the backslash should be repeated twice. This
				191	is complicated and hard to understand, so it's highly recommended that you use
				192	raw strings for all but the simplest expressions.
				193
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	194	.. index::
				195	single: [; in regular expressions
				196	single: ]; in regular expressions
				197
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	198	``[]``
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	199	Used to indicate a set of characters. In a set:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	200
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	201	* Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
				202	``'m'``, or ``'k'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	203
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	204	.. index:: single: -; in regular expressions
				205
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	206	* Ranges of characters can be indicated by giving two characters and separating
				207	them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
				208	``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
				209	``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	210	``[a\-z]``) or if it's placed as the first or last character
				211	(e.g. ``[-a]`` or ``[a-]``), it will match a literal ``'-'``.
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	212
				213	* Special characters lose their special meaning inside sets. For example,
				214	``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
				215	``'*'``, or ``')'``.
				216
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	217	.. index:: single: \; in regular expressions
				218
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	219	* Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
				220	inside a set, although the characters they match depend on whether
				221	:const:`ASCII` or :const:`LOCALE` mode is in force.
				222
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	223	.. index:: single: ^; in regular expressions
				224
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	225	* Characters that are not within a range can be matched by :dfn:`complementing`
				226	the set. If the first character of the set is ``'^'``, all the characters
				227	that are not in the set will be matched. For example, ``[^5]`` will match
				228	any character except ``'5'``, and ``[^^]`` will match any character except
				229	``'^'``. ``^`` has no special meaning if it's not the first character in
				230	the set.
				231
				232	* To match a literal ``']'`` inside a set, precede it with a backslash, or
				233	place it at the beginning of the set. For example, both ``[()[\]{}]`` and
				234	``[]()[{}]`` will match a parenthesis.
Mark Summerfield	9e670c2	2008-05-31 13:05:34 +0000	[diff] [blame]	235
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	236	.. .. index:: single: --; in regular expressions
				237	.. .. index:: single: &&; in regular expressions
				238	.. .. index:: single: ~~; in regular expressions
				239	.. .. index:: single: ||; in regular expressions
				240
Serhiy Storchaka	05cb728	2017-11-16 12:38:26 +0200	[diff] [blame]	241	* Support of nested sets and set operations as in `Unicode Technical
				242	Standard #18`_ might be added in the future. This would change the
				243	syntax, so to facilitate this change a :exc:`FutureWarning` will be raised
				244	in ambiguous cases for the time being.
Miss Islington (bot)	4322b8d	2018-10-06 12:56:45 -0700	[diff] [blame]	245	That includes sets starting with a literal ``'['`` or containing literal
Serhiy Storchaka	05cb728	2017-11-16 12:38:26 +0200	[diff] [blame]	246	character sequences ``'--'``, ``'&&'``, ``'~~'``, and ``'||'``. To
				247	avoid a warning, escape them with a backslash.
				248
				249	.. _Unicode Technical Standard #18: https://unicode.org/reports/tr18/
				250
				251	.. versionchanged:: 3.7
				252	:exc:`FutureWarning` is raised if a character set contains constructs
				253	that will change semantically in the future.
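The character-set rules listed above can be exercised with a sketch like the following. The sample strings are invented for the example.

```python
import re

# Ranges: any hexadecimal digit.
hex_digits = re.findall(r"[0-9A-Fa-f]", "0xBEEF, zz")

# An escaped '-' is a literal hyphen, not a range operator.
literal_dash = re.findall(r"[a\-z]", "a-z b")

# Special characters lose their meaning inside a set.
specials = re.findall(r"[(+*)]", "(1+2)*3")

# A leading '^' complements the set.
not_five = re.findall(r"[^5]", "1525")

# A literal ']' is escaped with a backslash inside the set.
brackets = re.findall(r"[()[\]{}]", "f(x)[0]{}")
```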
				254
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	255	.. index:: single: |; in regular expressions
				256
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	257	``|``
				258	``A|B``, where A and B can be arbitrary REs, creates a regular expression that
				259	will match either A or B. An arbitrary number of REs can be separated by the
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	260	``'|'`` in this way. This can be used inside groups (see below) as well. As
				261	the target string is scanned, REs separated by ``'|'`` are tried from left to
				262	right. When one pattern completely matches, that branch is accepted. This means
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	263	that once A matches, B will not be tested further, even if it would
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	264	produce a longer overall match. In other words, the ``'|'`` operator is never
				265	greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
				266	character class, as in ``[|]``.
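A minimal sketch of the left-to-right behaviour of alternation, and of the two ways to match a literal vertical bar:

```python
import re

# The first alternative that matches wins, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group()
# Reordering the alternatives changes the result.
reordered = re.match(r"ab|a", "ab").group()

# A literal vertical bar: escaped, or inside a character class.
escaped = re.search(r"\|", "a|b").span()
in_class = re.search(r"[|]", "a|b").span()
```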
				267
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	268	.. index::
				269	single: (; in regular expressions
				270	single: ); in regular expressions
				271
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	272	``(...)``
				273	Matches whatever regular expression is inside the parentheses, and indicates the
				274	start and end of a group; the contents of a group can be retrieved after a match
				275	has been performed, and can be matched later in the string with the ``\number``
				276	special sequence, described below. To match the literals ``'('`` or ``')'``,
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	277	use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	278
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	279	.. index:: single: (?; in regular expressions
				280
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	281	``(?...)``
				282	This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
				283	otherwise). The first character after the ``'?'`` determines what the meaning
				284	and further syntax of the construct is. Extensions usually do not create a new
				285	group; ``(?P<name>...)`` is the only exception to this rule. Following are the
				286	currently supported extensions.
				287
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	288	``(?aiLmsux)``
				289	(One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
				290	``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
Andrew M. Kuchling	1c50e86	2009-06-01 00:11:36 +0000	[diff] [blame]	291	letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	292	:const:`re.I` (ignore case), :const:`re.L` (locale dependent),
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	293	:const:`re.M` (multi-line), :const:`re.S` (dot matches all),
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	294	:const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
				295	for the entire regular expression.
				296	(The flags are described in :ref:`contents-of-module-re`.)
				297	This is useful if you wish to include the flags as part of the
				298	regular expression, instead of passing a flag argument to the
Serhiy Storchaka	bd48d27	2016-09-11 12:50:02 +0300	[diff] [blame]	299	:func:`re.compile` function. Flags should be used first in the
				300	expression string.
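As a rough illustration, an inline flag at the start of the pattern behaves like the corresponding flag argument. The sample text is invented for the example.

```python
import re

# Flag written inside the pattern...
inline = re.compile(r"(?i)spam")
# ...and the same flag passed as an argument.
explicit = re.compile(r"spam", re.IGNORECASE)

inline_hit = inline.search("SPAM and eggs").group()
explicit_hit = explicit.search("SPAM and eggs").group()
```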
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	301
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	302	.. index:: single: (?:; in regular expressions
				303
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	304	``(?:...)``
Georg Brandl	3122ce3	2010-10-29 06:17:38 +0000	[diff] [blame]	305	A non-capturing version of regular parentheses. Matches whatever regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	306	expression is inside the parentheses, but the substring matched by the group
				307	cannot be retrieved after performing a match or referenced later in the
				308	pattern.
				309
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	310	``(?aiLmsux-imsx:...)``
				311	(Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
				312	``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
				313	one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
				314	The letters set or remove the corresponding flags:
				315	:const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
				316	:const:`re.L` (locale dependent), :const:`re.M` (multi-line),
				317	:const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
				318	and :const:`re.X` (verbose), for the part of the expression.
				319	(The flags are described in :ref:`contents-of-module-re`.)
				320
				321	The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
				322	as inline flags, so they can't be combined or follow ``'-'``. Instead,
				323	when one of them appears in an inline group, it overrides the matching mode
				324	in the enclosing group. In Unicode patterns ``(?a:...)`` switches to
				325	ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
				326	(default). In bytes patterns ``(?L:...)`` switches to locale-dependent
				327	matching, and ``(?a:...)`` switches to ASCII-only matching (default).
				328	This override is only in effect for the narrow inline group, and the
				329	original matching mode is restored outside of the group.
Serhiy Storchaka	be9a4e5	2016-09-10 00:57:55 +0300	[diff] [blame]	330
Zachary Ware	c307672	2016-09-09 15:47:05 -0700	[diff] [blame]	331	.. versionadded:: 3.6
Serhiy Storchaka	be9a4e5	2016-09-10 00:57:55 +0300	[diff] [blame]	332
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	333	.. versionchanged:: 3.7
				334	The letters ``'a'``, ``'L'`` and ``'u'`` also can be used in a group.
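A sketch of flags scoped to a group, assuming Python 3.6 or later as noted above; the patterns are made up for the example.

```python
import re

# (?i:...) turns on IGNORECASE only inside the group.
scoped = re.compile(r"(?i:hello) World")
inside_relaxed = scoped.fullmatch("HELLO World") is not None
outside_strict = scoped.fullmatch("HELLO WORLD") is not None

# (?-i:...) turns IGNORECASE off inside the group.
removed = re.compile(r"(?-i:a)b", re.IGNORECASE)
rest_relaxed = removed.fullmatch("aB") is not None
group_strict = removed.fullmatch("AB") is not None
```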
				335
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	336	.. index:: single: (?P<; in regular expressions
				337
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	338	``(?P<name>...)``
				339	Similar to regular parentheses, but the substring matched by the group is
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	340	accessible via the symbolic group name *name*. Group names must be valid
				341	Python identifiers, and each group name must be defined only once within a
				342	regular expression. A symbolic group is also a numbered group, just as if
				343	the group were not named.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	344
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	345	Named groups can be referenced in three contexts. If the pattern is
				346	``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
				347	single or double quotes):
				348
				349	+---------------------------------------+----------------------------------+
				350	| Context of reference to group "quote" | Ways to reference it             |
				351	+=======================================+==================================+
				352	| in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
				353	|                                       | * ``\1``                         |
				354	+---------------------------------------+----------------------------------+
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	355	| when processing match object m        | * ``m.group('quote')``           |
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	356	|                                       | * ``m.end('quote')`` (etc.)      |
				357	+---------------------------------------+----------------------------------+
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	358	| in a string passed to the repl        | * ``\g<quote>``                  |
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	359	| argument of ``re.sub()``              | * ``\g<1>``                      |
				360	|                                       | * ``\1``                         |
				361	+---------------------------------------+----------------------------------+
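The three reference contexts can be tried out with a sketch like this one; the sample sentence is an assumption made up for the example.

```python
import re

# In the pattern itself: (?P=quote) refers back to the opening quote.
pattern = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
text = 'say "hi" now'

# On the match object: look the group up by name.
m = pattern.search(text)
whole = m.group()
quote_char = m.group("quote")
quote_end = m.end("quote")

# In a replacement string: \g<quote> inserts the matched quote character.
replaced = pattern.sub(r"\g<quote>...\g<quote>", text)
```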
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	362
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	363	.. index:: single: (?P=; in regular expressions
				364
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	365	``(?P=name)``
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	366	A backreference to a named group; it matches whatever text was matched by the
				367	earlier group named *name*.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	368
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	369	.. index:: single: (?#; in regular expressions
				370
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	371	``(?#...)``
				372	A comment; the contents of the parentheses are simply ignored.
				373
				374	``(?=...)``
				375	Matches if ``...`` matches next, but doesn't consume any of the string. This is
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	376	called a :dfn:`lookahead assertion`. For example, ``Isaac (?=Asimov)`` will match
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	377	``'Isaac '`` only if it's followed by ``'Asimov'``.
				378
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	379	.. index:: single: (?!; in regular expressions
				380
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	381	``(?!...)``
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	382	Matches if ``...`` doesn't match next. This is a :dfn:`negative lookahead assertion`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	383	For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's not
				384	followed by ``'Asimov'``.
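Both lookahead forms, sketched with the example from the text. Note that the text inside the assertion is not part of the match.

```python
import re

# Positive lookahead: 'Asimov' must follow, but is not consumed.
positive = re.search(r"Isaac (?=Asimov)", "Isaac Asimov").group()

# Negative lookahead: fails when 'Asimov' follows...
blocked = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")
# ...and succeeds otherwise.
allowed = re.search(r"Isaac (?!Asimov)", "Isaac Newton").group()
```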
				385
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	386	.. index:: single: (?<=; in regular expressions
				387
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	388	``(?<=...)``
				389	Matches if the current position in the string is preceded by a match for ``...``
				390	that ends at the current position. This is called a :dfn:`positive lookbehind
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	391	assertion`. ``(?<=abc)def`` will find a match in ``'abcdef'``, since the
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	392	lookbehind will back up 3 characters and check if the contained pattern matches.
				393	The contained pattern must only match strings of some fixed length, meaning that
				394	``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
Ezio Melotti	0a6b541	2012-04-29 07:34:46 +0300	[diff] [blame]	395	patterns which start with positive lookbehind assertions will not match at the
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	396	beginning of the string being searched; you will most likely want to use the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	397	:func:`search` function rather than the :func:`match` function:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	398
				399	>>> import re
				400	>>> m = re.search('(?<=abc)def', 'abcdef')
				401	>>> m.group(0)
				402	'def'
				403
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	404	This example looks for a word following a hyphen:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	405
Miss Islington (bot)	c7de1d7	2018-02-02 13:50:44 -0800	[diff] [blame]	406	>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	407	>>> m.group(0)
				408	'egg'
				409
Georg Brandl	8c16cb9	2016-02-25 20:17:45 +0100	[diff] [blame]	410	.. versionchanged:: 3.5
Serhiy Storchaka	4eea62f	2015-02-21 10:07:35 +0200	[diff] [blame]	411	Added support for group references of fixed length.
				412
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	413	.. index:: single: (?<!; in regular expressions
				414
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	415	``(?<!...)``
				416	Matches if the current position in the string is not preceded by a match for
				417	``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
				418	positive lookbehind assertions, the contained pattern must only match strings of
				419	some fixed length. Patterns which start with negative lookbehind assertions may
				420	match at the beginning of the string being searched.
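A minimal sketch of negative lookbehind and of the fixed-length restriction, assuming a variable-length lookbehind is rejected with ``re.error`` at compile time.

```python
import re

# 'def' is rejected when it is preceded by 'abc'...
after_abc = re.search(r"(?<!abc)def", "abcdef")
# ...and accepted otherwise, including at the very start of the string.
after_xyz = re.search(r"(?<!abc)def", "xyzdef").group()
at_start = re.match(r"(?<!abc)def", "def").group()

# A lookbehind whose contents can vary in length does not compile.
try:
    re.compile(r"(?<=a*)b")
    variable_ok = True
except re.error:
    variable_ok = False
```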
				421
				422	``(?(id/name)yes-pattern|no-pattern)``
orsenthil@gmail.com	476021b	2011-03-12 10:46:25 +0800	[diff] [blame]	423	Will try to match with ``yes-pattern`` if the group with given id or
				424	name exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
				425	optional and can be omitted. For example,
				426	``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
				427	will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
Serhiy Storchaka	a4d170d	2013-12-23 18:20:51 +0200	[diff] [blame]	428	not with ``'<user@host.com'`` nor ``'user@host.com>'``.
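The conditional pattern from the text, checked against the four strings it mentions:

```python
import re

# If group 1 (the '<') matched, require '>'; otherwise require end of string.
email = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

results = {
    candidate: email.match(candidate) is not None
    for candidate in (
        "<user@host.com>",
        "user@host.com",
        "<user@host.com",
        "user@host.com>",
    )
}
```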
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	429
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	430
				431	The special sequences consist of ``'\'`` and a character from the list below.
Martin Panter	98e9051	2016-06-12 06:17:29 +0000	[diff] [blame]	432	If the ordinary character is not an ASCII digit or an ASCII letter, then the
Serhiy Storchaka	9bd85b8	2016-06-11 19:15:00 +0300	[diff] [blame]	433	resulting RE will match the second character. For example, ``\$`` matches the
				434	character ``'$'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	435
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	436	.. index:: single: \; in regular expressions
				437
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	438	``\number``
				439	Matches the contents of the group of the same number. Groups are numbered
				440	starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
Georg Brandl	2070e83	2013-10-06 12:58:20 +0200	[diff] [blame]	441	but not ``'thethe'`` (note the space after the group). This special sequence
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	442	can only be used to match one of the first 99 groups. If the first digit of
				443	number is 0, or number is 3 octal digits long, it will not be interpreted as
				444	a group match, but as the character with octal value number. Inside the
				445	``'['`` and ``']'`` of a character class, all numeric escapes are treated as
				446	characters.
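A short sketch of a numbered backreference, reusing the pattern from the text. The repeated-word search is an added illustration.

```python
import re

doubled = re.compile(r'(.+) \1')

with_space = doubled.match('the the')   # group 1 is repeated after a space
digits = doubled.match('55 55')
no_space = doubled.match('thethe')      # the pattern requires the space

# Illustrative use: find a word that is accidentally repeated.
repeated = re.search(r'\b(\w+) \1\b', 'Paris in the the spring')
repeated_word = repeated.group(1) if repeated else None
```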
				447
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	448	.. index:: single: \A; in regular expressions
				449
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	450	``\A``
				451	Matches only at the start of the string.
				452
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	453	.. index:: single: \b; in regular expressions
				454
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	455	``\b``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	456	Matches the empty string, but only at the beginning or end of a word.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	457	A word is defined as a sequence of word characters. Note that formally,
Ezio Melotti	5a045b9	2012-02-29 11:48:44 +0200	[diff] [blame]	458	``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
				459	(or vice versa), or between ``\w`` and the beginning/end of the string.
				460	This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
				461	``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
				462
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	463	By default Unicode alphanumerics are the ones used in Unicode patterns, but
				464	this can be changed by using the :const:`ASCII` flag. Word boundaries are
				465	determined by the current locale if the :const:`LOCALE` flag is used.
				466	Inside a character range, ``\b`` represents the backspace character, for
				467	compatibility with Python's string literals.
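The word-boundary examples above can be checked with a small sketch. The list of sample strings comes from the text; the variable names are illustrative.

```python
import re

word = re.compile(r'\bfoo\b')
samples = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']
hits = [s for s in samples if word.search(s)]

# Inside a character class, \b stands for the backspace character instead.
backspace = re.search(r'[\b]', 'a\x08c')
```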
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	468
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	469	.. index:: single: \B; in regular expressions
				470
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	471	``\B``
Ezio Melotti	5a045b9	2012-02-29 11:48:44 +0200	[diff] [blame]	472	Matches the empty string, but only when it is not at the beginning or end
				473	of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
				474	``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	475	``\B`` is just the opposite of ``\b``, so word characters in Unicode
				476	patterns are Unicode alphanumerics or the underscore, although this can
				477	be changed by using the :const:`ASCII` flag. Word boundaries are
				478	determined by the current locale if the :const:`LOCALE` flag is used.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	479
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	480	.. index:: single: \d; in regular expressions
				481
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	482	``\d``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	483	For Unicode (str) patterns:
Mark Dickinson	1f26828	2009-07-28 17:22:36 +0000	[diff] [blame]	484	Matches any Unicode decimal digit (that is, any character in
				485	Unicode character category [Nd]). This includes ``[0-9]``, and
				486	also many other digit characters. If the :const:`ASCII` flag is
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	487	used only ``[0-9]`` is matched.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	488
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	489	For 8-bit (bytes) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	490	Matches any decimal digit; this is equivalent to ``[0-9]``.
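As a rough illustration of the difference between Unicode and ASCII digit matching, the sketch below uses two Arabic-Indic digits (category Nd) as sample input.

```python
import re

# "42" followed by ARABIC-INDIC DIGIT FOUR and ARABIC-INDIC DIGIT TWO.
text = '42 \u0664\u0662'

unicode_digits = re.findall(r'\d+', text)          # both runs are digits
ascii_digits = re.findall(r'\d+', text, re.ASCII)  # only [0-9] counts
bytes_digits = re.findall(rb'\d+', b'42 abc')      # bytes patterns: [0-9]
```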
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	491
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	492	.. index:: single: \D; in regular expressions
				493
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	494	``\D``
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	495	Matches any character which is not a decimal digit. This is
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	496	the opposite of ``\d``. If the :const:`ASCII` flag is used this
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	497	becomes the equivalent of ``[^0-9]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	498
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	499	.. index:: single: \s; in regular expressions
				500
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	501	``\s``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	502	For Unicode (str) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	503	Matches Unicode whitespace characters (which includes
				504	``[ \t\n\r\f\v]``, and also many other characters, for example the
				505	non-breaking spaces mandated by typography rules in many
				506	languages). If the :const:`ASCII` flag is used, only
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	507	``[ \t\n\r\f\v]`` is matched.
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	508
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	509	For 8-bit (bytes) patterns:
				510	Matches characters considered whitespace in the ASCII character set;
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	511	this is equivalent to ``[ \t\n\r\f\v]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	512
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	513	.. index:: single: \S; in regular expressions
				514
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	515	``\S``
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	516	Matches any character which is not a whitespace character. This is
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	517	the opposite of ``\s``. If the :const:`ASCII` flag is used this
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	518	becomes the equivalent of ``[^ \t\n\r\f\v]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	519
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	520	.. index:: single: \w; in regular expressions
				521
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	522	``\w``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	523	For Unicode (str) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	524	Matches Unicode word characters; this includes most characters
				525	that can be part of a word in any language, as well as numbers and
				526	the underscore. If the :const:`ASCII` flag is used, only
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	527	``[a-zA-Z0-9_]`` is matched.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	528
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	529	For 8-bit (bytes) patterns:
				530	Matches characters considered alphanumeric in the ASCII character set;
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	531	this is equivalent to ``[a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
				532	used, matches characters considered alphanumeric in the current locale
				533	and the underscore.
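A small sketch of how ``\w`` behaves for the different pattern types. The sample text is made up and contains two non-ASCII letters.

```python
import re

text = 'na\u00efve caf\u00e9_1'   # "naïve café_1"

unicode_words = re.findall(r'\w+', text)           # accented letters are word characters
ascii_words = re.findall(r'\w+', text, re.ASCII)   # only [a-zA-Z0-9_]
bytes_words = re.findall(rb'\w+', 'na\u00efve'.encode('utf-8'))
```

With a bytes pattern the UTF-8 encoded bytes of the accented letter are not word characters, so the word is split around them.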
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	534
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	535	.. index:: single: \W; in regular expressions
				536
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	537	``\W``
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	538	Matches any character which is not a word character. This is
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	539	the opposite of ``\w``. If the :const:`ASCII` flag is used this
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	540	becomes the equivalent of ``[^a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	541	used, matches characters which are neither alphanumeric in the current locale
				542	nor the underscore.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	543
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	544	.. index:: single: \Z; in regular expressions
				545
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	546	``\Z``
				547	Matches only at the end of the string.
				548
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	549	.. index::
				550	single: \a; in regular expressions
				551	single: \b; in regular expressions
				552	single: \f; in regular expressions
				553	single: \n; in regular expressions
				554	single: \N; in regular expressions
				555	single: \r; in regular expressions
				556	single: \t; in regular expressions
				557	single: \u; in regular expressions
				558	single: \U; in regular expressions
				559	single: \v; in regular expressions
				560	single: \x; in regular expressions
				561	single: \\; in regular expressions
				562
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	563	Most of the standard escapes supported by Python string literals are also
				564	accepted by the regular expression parser::
				565
				566	\a \b \f \n
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	567	\r \t \u \U
				568	\v \x \\
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	569
Ezio Melotti	285e51b	2012-04-29 04:52:30 +0300	[diff] [blame]	570	(Note that ``\b`` is used to represent word boundaries, and means "backspace"
				571	only inside character classes.)
				572
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	573	``'\u'`` and ``'\U'`` escape sequences are only recognized in Unicode
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	574	patterns. In bytes patterns they are errors.
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	575
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	576	Octal escapes are included in a limited form. If the first digit is a 0, or if
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	577	there are three octal digits, it is considered an octal escape. Otherwise, it is
				578	a group reference. As for string literals, octal escapes are always at most
				579	three digits in length.
				580
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	581	.. versionchanged:: 3.3
				582	The ``'\u'`` and ``'\U'`` escape sequences have been added.
				583
Serhiy Storchaka	9bd85b8	2016-06-11 19:15:00 +0300	[diff] [blame]	584	.. versionchanged:: 3.6
Martin Panter	98e9051	2016-06-12 06:17:29 +0000	[diff] [blame]	585	Unknown escapes consisting of ``'\'`` and an ASCII letter now are errors.
Serhiy Storchaka	a54aae0	2015-03-24 22:58:14 +0200	[diff] [blame]	586
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	587
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	588
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	589	.. _contents-of-module-re:
				590
				591	Module Contents
				592	---------------
				593
				594	The module defines several functions, constants, and an exception. Some of the
				595	functions are simplified versions of the full-featured methods for compiled
				596	regular expressions. Most non-trivial applications use the compiled
				597	form.
				598
Ethan Furman	c88c80b	2016-11-21 08:29:31 -0800	[diff] [blame]	599	.. versionchanged:: 3.6
				600	Flag constants are now instances of :class:`RegexFlag`, which is a subclass of
				601	:class:`enum.IntFlag`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	602
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	603	.. function:: compile(pattern, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	604
Henk-Jaap Wagenaar	ed94a8b	2017-08-28 06:41:20 +0100	[diff] [blame]	605	Compile a regular expression pattern into a :ref:`regular expression object
				606	<re-objects>`, which can be used for matching using its
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	607	:func:`~Pattern.match`, :func:`~Pattern.search` and other methods, described
Henk-Jaap Wagenaar	ed94a8b	2017-08-28 06:41:20 +0100	[diff] [blame]	608	below.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	609
				610	The expression's behaviour can be modified by specifying a flags value.
				611	Values can be any of the following variables, combined using bitwise OR (the
				612	``|`` operator).
				613
				614	The sequence ::
				615
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	616	prog = re.compile(pattern)
				617	result = prog.match(string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	618
				619	is equivalent to ::
				620
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	621	result = re.match(pattern, string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	622
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	623	but using :func:`re.compile` and saving the resulting regular expression
				624	object for reuse is more efficient when the expression will be used several
				625	times in a single program.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	626
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	627	.. note::
				628
				629	The compiled versions of the most recent patterns passed to
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	630	:func:`re.compile` and the module-level matching functions are cached, so
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	631	programs that use only a few regular expressions at a time needn't worry
				632	about compiling regular expressions.
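A minimal sketch of compiling once and reusing the pattern object, and of combining flags with bitwise OR. The sample data and names are illustrative.

```python
import re

# Compile once, then reuse the pattern object for every line.
number = re.compile(r'\d+')
lines = ['order 66', 'no digits here', 'room 101']
found = [m.group() for m in map(number.search, lines) if m]

# The module-level function gives the same results for the same pattern.
found_again = [m.group() for m in (re.search(r'\d+', line) for line in lines) if m]

# Flags are combined with the bitwise OR operator.
spam = re.compile(r'^spam$', re.IGNORECASE | re.MULTILINE)
spam_lines = spam.findall('eggs\nSPAM\nSpam')
```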
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	633
				634
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	635	.. data:: A
				636	ASCII
				637
Georg Brandl	4049ce0	2009-06-08 07:49:54 +0000	[diff] [blame]	638	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
				639	perform ASCII-only matching instead of full Unicode matching. This is only
				640	meaningful for Unicode patterns, and is ignored for byte patterns.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	641	Corresponds to the inline flag ``(?a)``.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	642
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	643	Note that for backward compatibility, the :const:`re.U` flag still
				644	exists (as well as its synonym :const:`re.UNICODE` and its embedded
Georg Brandl	ebeb44d	2010-07-29 11:15:36 +0000	[diff] [blame]	645	counterpart ``(?u)``), but these are redundant in Python 3 since
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	646	matches are Unicode by default for strings (and Unicode matching
				647	isn't allowed for bytes).
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	648
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	649
Sandro Tosi	da785fd	2012-01-01 12:55:20 +0100	[diff] [blame]	650	.. data:: DEBUG
				651
				652	Display debug information about the compiled expression.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	653	No corresponding inline flag.
Sandro Tosi	da785fd	2012-01-01 12:55:20 +0100	[diff] [blame]	654
				655
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	656	.. data:: I
				657	IGNORECASE
				658
Brian Ward	c9d6dbc	2017-05-24 00:03:38 -0700	[diff] [blame]	659	Perform case-insensitive matching; expressions like ``[A-Z]`` will also
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	660	match lowercase letters. Full Unicode matching (such as ``Ü`` matching
				661	``ü``) also works unless the :const:`re.ASCII` flag is used to disable
				662	non-ASCII matches. The current locale does not change the effect of this
				663	flag unless the :const:`re.LOCALE` flag is also used.
				664	Corresponds to the inline flag ``(?i)``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	665
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	666	Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
				667	combination with the :const:`IGNORECASE` flag, they will match the 52 ASCII
				668	letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
				669	letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
				670	'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
				671	If the :const:`ASCII` flag is used, only letters 'a' to 'z'
Serhiy Storchaka	3557b05	2017-10-24 23:31:42 +0300	[diff] [blame]	672	and 'A' to 'Z' are matched.
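The interaction between :const:`IGNORECASE` and :const:`ASCII` can be sketched as follows. It assumes a Python version in which the behaviour described above applies; the variable names are illustrative.

```python
import re

upper_range = re.fullmatch(r'[A-Z]+', 'python', re.IGNORECASE)

# Full Unicode case folding: "ü" matches "Ü" unless ASCII is requested.
umlaut = re.fullmatch('\u00fc', '\u00dc', re.IGNORECASE)
umlaut_ascii = re.fullmatch('\u00fc', '\u00dc', re.IGNORECASE | re.ASCII)

# U+212A KELVIN SIGN is one of the extra letters matched by [a-z].
kelvin = re.fullmatch(r'[a-z]', '\u212a', re.IGNORECASE)
kelvin_ascii = re.fullmatch(r'[a-z]', '\u212a', re.IGNORECASE | re.ASCII)
```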
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	673
				674	.. data:: L
				675	LOCALE
				676
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	677	Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
				678	dependent on the current locale. This flag can be used only with bytes
				679	patterns. The use of this flag is discouraged: the locale mechanism
				680	is very unreliable, handles only one "culture" at a time, and
				681	works only with 8-bit locales. Unicode matching is already enabled by default
				682	in Python 3 for Unicode (str) patterns, and it is able to handle different
				683	locales/languages.
				684	Corresponds to the inline flag ``(?L)``.
Serhiy Storchaka	22a309a	2014-12-01 11:50:07 +0200	[diff] [blame]	685
Serhiy Storchaka	9bd85b8	2016-06-11 19:15:00 +0300	[diff] [blame]	686	.. versionchanged:: 3.6
				687	:const:`re.LOCALE` can be used only with bytes patterns and is
				688	not compatible with :const:`re.ASCII`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	689
Serhiy Storchaka	898ff03	2017-05-05 08:53:40 +0300	[diff] [blame]	690	.. versionchanged:: 3.7
				691	Compiled regular expression objects with the :const:`re.LOCALE` flag no
				692	longer depend on the locale at compile time. Only the locale at
				693	matching time affects the result of matching.
				694
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	695
				696	.. data:: M
				697	MULTILINE
				698
				699	When specified, the pattern character ``'^'`` matches at the beginning of the
				700	string and at the beginning of each line (immediately following each newline);
				701	and the pattern character ``'$'`` matches at the end of the string and at the
				702	end of each line (immediately preceding each newline). By default, ``'^'``
				703	matches only at the beginning of the string, and ``'$'`` only at the end of the
				704	string and immediately before the newline (if any) at the end of the string.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	705	Corresponds to the inline flag ``(?m)``.
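A brief sketch of how :const:`MULTILINE` changes ``'^'`` and ``'$'``. The sample text is made up.

```python
import re

text = 'first line\nsecond line\n'

starts_default = re.findall(r'^\w+', text)               # start of string only
starts_multi = re.findall(r'^\w+', text, re.MULTILINE)   # start of every line

ends_default = re.findall(r'\w+$', text)                 # before the final newline only
ends_multi = re.findall(r'\w+$', text, re.MULTILINE)     # end of every line
```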
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	706
				707
				708	.. data:: S
				709	DOTALL
				710
				711	Make the ``'.'`` special character match any character at all, including a
				712	newline; without this flag, ``'.'`` will match anything except a newline.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	713	Corresponds to the inline flag ``(?s)``.
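The effect of :const:`DOTALL` in a short sketch, using an illustrative two-line string:

```python
import re

text = 'a\nb'

without_flag = re.match(r'a.b', text)           # '.' does not match the newline
with_flag = re.match(r'a.b', text, re.DOTALL)   # '.' matches any character
inline_flag = re.match(r'(?s)a.b', text)        # same effect via the inline flag
```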
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	714
				715
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	716	.. data:: X
				717	VERBOSE
				718
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	719	.. index:: single: #; in regular expressions
				720
Zachary Ware	71a0b43	2015-11-11 23:32:14 -0600	[diff] [blame]	721	This flag allows you to write regular expressions that look nicer and are
				722	more readable by allowing you to visually separate logical sections of the
				723	pattern and add comments. Whitespace within the pattern is ignored, except
Serhiy Storchaka	b0b44b4	2017-11-14 17:21:26 +0200	[diff] [blame]	724	when in a character class, or when preceded by an unescaped backslash,
				725	or within tokens like ``*?``, ``(?:`` or ``(?P<...>``.
Zachary Ware	71a0b43	2015-11-11 23:32:14 -0600	[diff] [blame]	726	When a line contains a ``#`` that is not in a character class and is not
				727	preceded by an unescaped backslash, all characters from the leftmost such
				728	``#`` through the end of the line are ignored.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	729
Zachary Ware	71a0b43	2015-11-11 23:32:14 -0600	[diff] [blame]	730	This means that the two following regular expression objects that match a
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	731	decimal number are functionally equal::
Georg Brandl	81ac1ce	2007-08-31 17:17:17 +0000	[diff] [blame]	732
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	733	a = re.compile(r"""\d + # the integral part
				734	\. # the decimal point
				735	\d * # some fractional digits""", re.X)
				736	b = re.compile(r"\d+\.\d*")
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	737
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	738	Corresponds to the inline flag ``(?x)``.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	739
				740
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	741	.. function:: search(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	742
Terry Jan Reedy	0edb5c1	2014-05-30 16:19:59 -0400	[diff] [blame]	743	Scan through string looking for the first location where the regular expression
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	744	pattern produces a match, and return a corresponding :ref:`match object
				745	<match-objects>`. Return ``None`` if no position in the string matches the
				746	pattern; note that this is different from finding a zero-length match at some
				747	point in the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	748
				749
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	750	.. function:: match(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	751
				752	If zero or more characters at the beginning of string match the regular
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	753	expression pattern, return a corresponding :ref:`match object
				754	<match-objects>`. Return ``None`` if the string does not match the pattern;
				755	note that this is different from a zero-length match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	756
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	757	Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
				758	at the beginning of the string and not at the beginning of each line.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	759
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	760	If you want to locate a match anywhere in string, use :func:`search`
				761	instead (see also :ref:`search-vs-match`).
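The difference between :func:`match` and :func:`search` can be sketched like this. The strings are illustrative.

```python
import re

no_match = re.match('c', 'abcdef')     # "c" is not at the beginning
found = re.search('c', 'abcdef')       # found further into the string

# MULTILINE does not let match() start at a later line.
line_match = re.match('^b', 'a\nb', re.MULTILINE)
line_search = re.search('^b', 'a\nb', re.MULTILINE)

# A zero-length match is still a match object, not None.
empty = re.search('x*', 'abc')
```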
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	762
				763
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	764	.. function:: fullmatch(pattern, string, flags=0)
				765
				766	If the whole string matches the regular expression pattern, return a
				767	corresponding :ref:`match object <match-objects>`. Return ``None`` if the
				768	string does not match the pattern; note that this is different from a
				769	zero-length match.
				770
				771	.. versionadded:: 3.4
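A small sketch contrasting :func:`fullmatch` with :func:`match`, using made-up input:

```python
import re

whole = re.fullmatch(r'\d+', '2024')        # the entire string is digits
partial = re.fullmatch(r'\d+', '2024-01')   # trailing "-01" prevents a full match
prefix = re.match(r'\d+', '2024-01')        # match() only needs a matching prefix
```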
				772
				773
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	774	.. function:: split(pattern, string, maxsplit=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	775
				776	Split string by the occurrences of pattern. If capturing parentheses are
				777	used in pattern, then the text of all groups in the pattern are also returned
				778	as part of the resulting list. If maxsplit is nonzero, at most maxsplit
				779	splits occur, and the remainder of the string is returned as the final element
Georg Brandl	9647389	2008-03-06 07:09:43 +0000	[diff] [blame]	780	of the list. ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	781
Serhiy Storchaka	c615be5	2017-11-28 22:51:38 +0200	[diff] [blame]	782	>>> re.split(r'\W+', 'Words, words, words.')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	783	['Words', 'words', 'words', '']
Serhiy Storchaka	c615be5	2017-11-28 22:51:38 +0200	[diff] [blame]	784	>>> re.split(r'(\W+)', 'Words, words, words.')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	785	['Words', ', ', 'words', ', ', 'words', '.', '']
Serhiy Storchaka	c615be5	2017-11-28 22:51:38 +0200	[diff] [blame]	786	>>> re.split(r'\W+', 'Words, words, words.', 1)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	787	['Words', 'words, words.']
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	788	>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
				789	['0', '3', '9']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	790
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	791	If there are capturing groups in the separator and it matches at the start of
				792	the string, the result will start with an empty string. The same holds for
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	793	the end of the string::
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	794
Serhiy Storchaka	c615be5	2017-11-28 22:51:38 +0200	[diff] [blame]	795	>>> re.split(r'(\W+)', '...words, words...')
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	796	['', '...', 'words', ', ', 'words', '...', '']
				797
				798	That way, separator components are always found at the same relative
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	799	indices within the result list.
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	800
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	801	Empty matches for the pattern split the string only when not adjacent
				802	to a previous empty match.
Thomas Wouters	89d996e	2007-09-08 17:39:28 +0000	[diff] [blame]	803
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	804	>>> re.split(r'\b', 'Words, words, words.')
				805	['', 'Words', ', ', 'words', ', ', 'words', '.']
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	806	>>> re.split(r'\W*', '...words...')
				807	['', '', 'w', 'o', 'r', 'd', 's', '', '']
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	808	>>> re.split(r'(\W*)', '...words...')
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	809	['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	810
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	811	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	812	Added the optional flags argument.
				813
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	814	.. versionchanged:: 3.7
				815	Added support of splitting on a pattern that could match an empty string.
				816
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	817
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	818	.. function:: findall(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	819
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	820	Return all non-overlapping matches of pattern in string, as a list of
Georg Brandl	3dbca81	2008-07-23 16:10:53 +0000	[diff] [blame]	821	strings. The string is scanned left-to-right, and matches are returned in
				822	the order found. If one or more groups are present in the pattern, return a
				823	list of groups; this will be a list of tuples if the pattern has more than
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	824	one group. Empty matches are included in the result.
				825
				826	.. versionchanged:: 3.7
				827	Non-empty matches can now start just after a previous empty match.
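How the shape of the :func:`findall` result depends on the number of groups, as a minimal sketch with illustrative data:

```python
import re

text = 'a=1, b=22'

no_groups = re.findall(r'\w+=\d+', text)      # list of whole matches
one_group = re.findall(r'(\w+)=\d+', text)    # list of the group's text
two_groups = re.findall(r'(\w+)=(\d+)', text) # list of tuples, one per match
```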
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	828
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	829
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	830	.. function:: finditer(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	831
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	832	Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
				833	all non-overlapping matches for the RE pattern in string. The string
				834	is scanned left-to-right, and matches are returned in the order found. Empty
Serhiy Storchaka	70d56fb	2017-12-04 14:29:05 +0200	[diff] [blame]	835	matches are included in the result.
				836
				837	.. versionchanged:: 3.7
				838	Non-empty matches can now start just after a previous empty match.
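Because :func:`finditer` yields match objects, positions are available as well as the matched text. A short sketch with an illustrative string:

```python
import re

# Collect the start offset and text of every run of digits.
runs = [(m.start(), m.group()) for m in re.finditer(r'\d+', 'a1bb22ccc333')]
```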
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	839
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	840
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	841	.. function:: sub(pattern, repl, string, count=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	842
				843	Return the string obtained by replacing the leftmost non-overlapping occurrences
				844	of pattern in string by the replacement repl. If the pattern isn't found,
				845	string is returned unchanged. repl can be a string or a function; if it is
				846	a string, any backslash escapes in it are processed. That is, ``\n`` is
Sandro Tosi	6a633bb	2011-08-19 22:54:50 +0200	[diff] [blame]	847	converted to a single newline character, ``\r`` is converted to a carriage return, and
Serhiy Storchaka	a54aae0	2015-03-24 22:58:14 +0200	[diff] [blame]	848	so forth. Unknown escapes such as ``\&`` are left alone. Backreferences, such
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	849	as ``\6``, are replaced with the substring matched by group 6 in the pattern.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	850	For example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	851
				852	>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
				853	... r'static PyObject*\npy_\1(void)\n{',
				854	... 'def myfunc():')
				855	'static PyObject*\npy_myfunc(void)\n{'
				856
				857	If repl is a function, it is called for every non-overlapping occurrence of
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	858	pattern. The function takes a single :ref:`match object <match-objects>`
				859	argument, and returns the replacement string. For example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	860
				861	>>> def dashrepl(matchobj):
				862	...     if matchobj.group(0) == '-': return ' '
				863	...     else: return '-'
				864	>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
				865	'pro--gram files'
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	866	>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
				867	'Baked Beans & Spam'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	868
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	869	The pattern may be a string or a :ref:`pattern object <re-objects>`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	870
				871	The optional argument count is the maximum number of pattern occurrences to be
				872	replaced; count must be a non-negative integer. If omitted or zero, all
				873	occurrences will be replaced. Empty matches for the pattern are replaced only
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	874	when not adjacent to a previous empty match, so ``sub('x*', '-', 'abxd')`` returns
				875	``'-a-b--d-'``.
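
As a minimal sketch of the count argument and the empty-match rule described
above (the sample strings are illustrative and not part of the reference):

```python
import re

text = 'a1 b22 c333'

# count=1 replaces only the leftmost occurrence.
first_only = re.sub(r'\d+', '#', text, count=1)

# Omitting count (or passing 0) replaces every occurrence.
all_of_them = re.sub(r'\d+', '#', text)

# 'x*' can match the empty string; an empty match is replaced unless it is
# adjacent to a previous empty match.
with_empty = re.sub('x*', '-', 'abxd')

print(first_only)   # a# b22 c333
print(all_of_them)  # a# b# c#
print(with_empty)   # -a-b--d-
```

Passing count by keyword keeps the call unambiguous, since flags follows it in
the signature.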
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	876
Serhiy Storchaka	9a75b84	2018-10-26 11:18:42 +0300	[diff] [blame^]	877	.. index:: single: \g; in regular expressions
				878
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	879	In string-type repl arguments, in addition to the character escapes and
				880	backreferences described above,
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	881	``\g<name>`` will use the substring matched by the group named ``name``, as
				882	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
				883	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
				884	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
				885	reference to group 20, not a reference to group 2 followed by the literal
				886	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
				887	substring matched by the RE.
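
The following sketch illustrates the ``\g<...>`` forms described above. The
group names ``key`` and ``value`` and the sample strings are made up for the
illustration:

```python
import re

# Named groups referenced by name in the replacement.
swapped = re.sub(r'(?P<key>\w+)=(?P<value>\w+)',
                 r'\g<value>=\g<key>', 'colour=red')

# \g<2>0 means "group 2, then a literal 0".
padded = re.sub(r'(\w)(\d)', r'\g<2>0', 'a7')

# \20 would instead be read as a reference to group 20, which does not
# exist in this pattern, so an error is raised.
try:
    re.sub(r'(\w)(\d)', r'\20', 'a7')
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True

# \g<0> stands for the entire match.
wrapped = re.sub(r'\d+', r'<\g<0>>', 'item 42')

print(swapped)           # red=colour
print(padded)            # 70
print(ambiguous_failed)  # True
print(wrapped)           # item <42>
```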
				888
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	889	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	890	Added the optional flags argument.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	891
Serhiy Storchaka	7438e4b	2014-10-10 11:06:31 +0300	[diff] [blame]	892	.. versionchanged:: 3.5
				893	Unmatched groups are replaced with an empty string.
				894
Serhiy Storchaka	9bd85b8	2016-06-11 19:15:00 +0300	[diff] [blame]	895	.. versionchanged:: 3.6
Serhiy Storchaka	53c53ea	2016-12-06 19:15:29 +0200	[diff] [blame]	896	Unknown escapes in pattern consisting of ``'\'`` and an ASCII letter
				897	now are errors.
				898
Serhiy Storchaka	ff3dbe9	2016-12-06 19:25:19 +0200	[diff] [blame]	899	.. versionchanged:: 3.7
				900	Unknown escapes in repl consisting of ``'\'`` and an ASCII letter
				901	now are errors.
Serhiy Storchaka	a54aae0	2015-03-24 22:58:14 +0200	[diff] [blame]	902
Serhiy Storchaka	fbb490f	2018-01-04 11:06:13 +0200	[diff] [blame]	903	Empty matches for the pattern are replaced when adjacent to a previous
				904	non-empty match.
				905
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	906
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	907	.. function:: subn(pattern, repl, string, count=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	908
				909	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
				910	number_of_subs_made)``.
				911
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	912	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	913	Added the optional flags argument.
				914
Serhiy Storchaka	7438e4b	2014-10-10 11:06:31 +0300	[diff] [blame]	915	.. versionchanged:: 3.5
				916	Unmatched groups are replaced with an empty string.
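
A short sketch of the returned tuple (the input string is illustrative only):

```python
import re

# subn() returns the new string together with the number of substitutions.
new_text, n_subs = re.subn(r'\s+', ' ', 'too   many    spaces')

# When nothing matches, the string is unchanged and the count is 0.
same_text, zero = re.subn(r'\d', '#', 'no digits here')

print((new_text, n_subs))  # ('too many spaces', 2)
print((same_text, zero))   # ('no digits here', 0)
```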
				917
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	918
Serhiy Storchaka	8fc7bc2	2017-04-13 19:17:36 +0300	[diff] [blame]	919	.. function:: escape(pattern)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	920
Serhiy Storchaka	5908300	2017-04-13 21:06:43 +0300	[diff] [blame]	921	Escape special characters in pattern.
Ezio Melotti	88fdeb4	2011-04-10 12:59:16 +0300	[diff] [blame]	922	This is useful if you want to match an arbitrary literal string that may
Serhiy Storchaka	8fc7bc2	2017-04-13 19:17:36 +0300	[diff] [blame]	923	have regular expression metacharacters in it. For example::
				924
				925	>>> print(re.escape('python.exe'))
				926	python\.exe
				927
				928	>>> legal_chars = string.ascii_lowercase + string.digits + "!#$%&'*+-.^_`|~:"
				929	>>> print('[%s]+' % re.escape(legal_chars))
Serhiy Storchaka	05cb728	2017-11-16 12:38:26 +0200	[diff] [blame]	930	[abcdefghijklmnopqrstuvwxyz0123456789!\#\$%\&'\*\+\-\.\^_`\|\~:]+
Serhiy Storchaka	8fc7bc2	2017-04-13 19:17:36 +0300	[diff] [blame]	931
				932	>>> operators = ['+', '-', '*', '/', '**']
				933	>>> print('|'.join(map(re.escape, sorted(operators, reverse=True))))
Serhiy Storchaka	5908300	2017-04-13 21:06:43 +0300	[diff] [blame]	934	/|\-|\+|\*\*|\*
Ezio Melotti	88fdeb4	2011-04-10 12:59:16 +0300	[diff] [blame]	935
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	936	This function must not be used for the replacement string in :func:`sub`
				937	and :func:`subn`; only backslashes should be escaped. For example::
				938
				939	>>> digits_re = r'\d+'
				940	>>> sample = '/usr/sbin/sendmail - 0 errors, 12 warnings'
				941	>>> print(re.sub(digits_re, digits_re.replace('\\', r'\\'), sample))
				942	/usr/sbin/sendmail - \d+ errors, \d+ warnings
				943
Ezio Melotti	88fdeb4	2011-04-10 12:59:16 +0300	[diff] [blame]	944	.. versionchanged:: 3.3
				945	The ``'_'`` character is no longer escaped.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	946
Serhiy Storchaka	5908300	2017-04-13 21:06:43 +0300	[diff] [blame]	947	.. versionchanged:: 3.7
				948	Only characters that can have special meaning in a regular expression
				949	are escaped.
				950
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	951
R. David Murray	522c32a	2010-07-10 14:23:36 +0000	[diff] [blame]	952	.. function:: purge()
				953
				954	Clear the regular expression cache.
				955
				956
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	957	.. exception:: error(msg, pattern=None, pos=None)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	958
				959	Exception raised when a string passed to one of the functions here is not a
				960	valid regular expression (for example, it might contain unmatched parentheses)
				961	or when some other error occurs during compilation or matching. It is never an
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	962	error if a string contains no match for a pattern. The error instance has
				963	the following additional attributes:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	964
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	965	.. attribute:: msg
				966
				967	The unformatted error message.
				968
				969	.. attribute:: pattern
				970
				971	The regular expression pattern.
				972
				973	.. attribute:: pos
				974
Serhiy Storchaka	12d6b5d	2017-05-27 16:12:48 +0300	[diff] [blame]	975	The index in pattern where compilation failed (may be ``None``).
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	976
				977	.. attribute:: lineno
				978
Serhiy Storchaka	12d6b5d	2017-05-27 16:12:48 +0300	[diff] [blame]	979	The line corresponding to pos (may be ``None``).
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	980
				981	.. attribute:: colno
				982
Serhiy Storchaka	12d6b5d	2017-05-27 16:12:48 +0300	[diff] [blame]	983	The column corresponding to pos (may be ``None``).
Serhiy Storchaka	ad446d5	2014-11-10 13:49:00 +0200	[diff] [blame]	984
				985	.. versionchanged:: 3.5
				986	Added additional attributes.
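
As a sketch of how these attributes might be inspected, the following
deliberately compiles a broken pattern (the pattern ``(unclosed`` is only an
illustration):

```python
import re

try:
    re.compile(r'(unclosed')
except re.error as exc:
    # Copy the attributes out; "exc" is unbound once the except block ends.
    details = {
        'msg': exc.msg,
        'pattern': exc.pattern,
        'pos': exc.pos,
        'lineno': exc.lineno,
        'colno': exc.colno,
    }

for name, value in details.items():
    print(name, '=', repr(value))
```

The exact wording of ``msg`` is not specified here, so code should not depend
on it.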
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	987
				988	.. _re-objects:
				989
				990	Regular Expression Objects
				991	--------------------------
				992
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	993	Compiled regular expression objects support the following methods and
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	994	attributes:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	995
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	996	.. method:: Pattern.search(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	997
Berker Peksag	84f387d	2016-06-08 14:56:56 +0300	[diff] [blame]	998	Scan through string looking for the first location where this regular
				999	expression produces a match, and return a corresponding :ref:`match object
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1000	<match-objects>`. Return ``None`` if no position in the string matches the
				1001	pattern; note that this is different from finding a zero-length match at some
				1002	point in the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1003
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1004	The optional second parameter pos gives an index in the string where the
				1005	search is to start; it defaults to ``0``. This is not completely equivalent to
				1006	slicing the string; the ``'^'`` pattern character matches at the real beginning
				1007	of the string and at positions just after a newline, but not necessarily at the
				1008	index where the search is to start.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1009
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1010	The optional parameter endpos limits how far the string will be searched; it
				1011	will be as if the string is endpos characters long, so only the characters
				1012	from pos to ``endpos - 1`` will be searched for a match. If endpos is less
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	1013	than pos, no match will be found; otherwise, if rx is a compiled regular
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1014	expression object, ``rx.search(string, 0, 50)`` is equivalent to
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1015	``rx.search(string[:50], 0)``. ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1016
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1017	>>> pattern = re.compile("d")
				1018	>>> pattern.search("dog") # Match at index 0
				1019	<re.Match object; span=(0, 1), match='d'>
				1020	>>> pattern.search("dog", 1) # No match; search doesn't include the "d"
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1021
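The difference between passing pos and slicing the string can be sketched as
follows (the names ``rx`` and ``text`` are illustrative):

```python
import re

rx = re.compile(r'^b')
text = 'ab'

# With pos=1 the search starts at index 1, but '^' still refers to the real
# beginning of the string, so there is no match.
via_pos = rx.search(text, 1)

# Slicing creates a new string that begins with 'b', so '^' matches there.
via_slice = rx.search(text[1:])

print(via_pos)            # None
print(via_slice.group())  # b
```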
				1022
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1023	.. method:: Pattern.match(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1024
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1025	If zero or more characters at the beginning of string match this regular
				1026	expression, return a corresponding :ref:`match object <match-objects>`.
				1027	Return ``None`` if the string does not match the pattern; note that this is
				1028	different from a zero-length match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1029
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1030	The optional pos and endpos parameters have the same meaning as for the
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1031	:meth:`~Pattern.search` method. ::
Benjamin Peterson	d7c3ed5	2010-06-27 22:32:30 +0000	[diff] [blame]	1032
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1033	>>> pattern = re.compile("o")
				1034	>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
				1035	>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
				1036	<re.Match object; span=(1, 2), match='o'>
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1037
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1038	If you want to locate a match anywhere in string, use
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1039	:meth:`~Pattern.search` instead (see also :ref:`search-vs-match`).
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1040
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1041
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1042	.. method:: Pattern.fullmatch(string[, pos[, endpos]])
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	1043
				1044	If the whole string matches this regular expression, return a corresponding
				1045	:ref:`match object <match-objects>`. Return ``None`` if the string does not
				1046	match the pattern; note that this is different from a zero-length match.
				1047
				1048	The optional pos and endpos parameters have the same meaning as for the
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1049	:meth:`~Pattern.search` method. ::
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	1050
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1051	>>> pattern = re.compile("o[gh]")
				1052	>>> pattern.fullmatch("dog") # No match as "o" is not at the start of "dog".
				1053	>>> pattern.fullmatch("ogre") # No match as not the full string matches.
				1054	>>> pattern.fullmatch("doggie", 1, 3) # Matches within given limits.
				1055	<re.Match object; span=(1, 3), match='og'>
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	1056
				1057	.. versionadded:: 3.4
				1058
				1059
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1060	.. method:: Pattern.split(string, maxsplit=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1061
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1062	Identical to the :func:`split` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1063
				1064
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1065	.. method:: Pattern.findall(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1066
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1067	Similar to the :func:`findall` function, using the compiled pattern, but
				1068	also accepts optional pos and endpos parameters that limit the search
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1069	region, in the same way as for :meth:`~Pattern.search`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1070
				1071
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1072	.. method:: Pattern.finditer(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1073
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1074	Similar to the :func:`finditer` function, using the compiled pattern, but
				1075	also accepts optional pos and endpos parameters that limit the search
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1076	region, in the same way as for :meth:`~Pattern.search`.
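
A small sketch of the pos and endpos parameters on the compiled-pattern
versions of ``findall`` and ``finditer`` (the sample text is illustrative):

```python
import re

digits = re.compile(r'\d')
text = '0123456789'

# Without limits, the whole string is searched.
everything = digits.findall(text)

# Only the characters at indices 3, 4 and 5 are searched.
window = digits.findall(text, 3, 6)

# finditer() accepts the same limits and yields match objects.
tail_spans = [m.span() for m in digits.finditer(text, 8)]

print(len(everything))  # 10
print(window)           # ['3', '4', '5']
print(tail_spans)       # [(8, 9), (9, 10)]
```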
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1077
				1078
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1079	.. method:: Pattern.sub(repl, string, count=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1080
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1081	Identical to the :func:`sub` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1082
				1083
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1084	.. method:: Pattern.subn(repl, string, count=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1085
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1086	Identical to the :func:`subn` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1087
				1088
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1089	.. attribute:: Pattern.flags
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1090
Georg Brandl	3a19e54	2012-03-17 17:29:27 +0100	[diff] [blame]	1091	The regex matching flags. This is a combination of the flags given to
				1092	:func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
				1093	flags such as :data:`UNICODE` if the pattern is a Unicode string.
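
As a sketch, the three sources of flags can be checked with bitwise tests (the
patterns are illustrative):

```python
import re

# IGNORECASE is passed to compile(); MULTILINE comes from the inline (?m).
rx = re.compile(r'(?m)^spam', re.IGNORECASE)

has_ignorecase = bool(rx.flags & re.IGNORECASE)
has_multiline = bool(rx.flags & re.MULTILINE)
# UNICODE is implied because the pattern is a str.
has_unicode = bool(rx.flags & re.UNICODE)

# A bytes pattern does not get the implicit UNICODE flag.
bytes_rx = re.compile(rb'spam')
bytes_has_unicode = bool(bytes_rx.flags & re.UNICODE)

print(has_ignorecase, has_multiline, has_unicode)  # True True True
print(bytes_has_unicode)                           # False
```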
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1094
				1095
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1096	.. attribute:: Pattern.groups
Georg Brandl	af265f4	2008-12-07 15:06:20 +0000	[diff] [blame]	1097
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1098	The number of capturing groups in the pattern.
Georg Brandl	af265f4	2008-12-07 15:06:20 +0000	[diff] [blame]	1099
				1100
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1101	.. attribute:: Pattern.groupindex
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1102
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1103	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
				1104	numbers. The dictionary is empty if no symbolic groups were used in the
				1105	pattern.
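
For illustration, a pattern with two named groups and one unnamed group (the
group names ``year`` and ``month`` are made up for this sketch):

```python
import re

rx = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})')

# Only named groups appear in groupindex; the third group has no name.
name_to_number = dict(rx.groupindex)

# groups counts every capturing group, named or not.
total_groups = rx.groups

print(name_to_number)  # {'year': 1, 'month': 2}
print(total_groups)    # 3
```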
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1106
				1107
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1108	.. attribute:: Pattern.pattern
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1109
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1110	The pattern string from which the pattern object was compiled.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1111
				1112
Serhiy Storchaka	fdbd011	2017-04-16 10:16:03 +0300	[diff] [blame]	1113	.. versionchanged:: 3.7
				1114	Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Compiled
				1115	regular expression objects are considered atomic.
				1116
				1117
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1118	.. _match-objects:
				1119
				1120	Match Objects
				1121	-------------
				1122
Ezio Melotti	b87f82f	2012-11-04 06:59:22 +0200	[diff] [blame]	1123	Match objects always have a boolean value of ``True``.
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1124	Since :meth:`~Pattern.match` and :meth:`~Pattern.search` return ``None``
Ezio Melotti	b87f82f	2012-11-04 06:59:22 +0200	[diff] [blame]	1125	when there is no match, you can test whether there was a match with a simple
				1126	``if`` statement::
				1127
				1128	match = re.search(pattern, string)
				1129	if match:
				1130	    process(match)
				1131
				1132	Match objects support the following methods and attributes:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1133
				1134
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1135	.. method:: Match.expand(template)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1136
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1137	Return the string obtained by doing backslash substitution on the template
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1138	string template, as done by the :meth:`~Pattern.sub` method.
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1139	Escapes such as ``\n`` are converted to the appropriate characters,
				1140	and numeric backreferences (``\1``, ``\2``) and named backreferences
				1141	(``\g<1>``, ``\g<name>``) are replaced by the contents of the
				1142	corresponding group.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1143
Serhiy Storchaka	7438e4b	2014-10-10 11:06:31 +0300	[diff] [blame]	1144	.. versionchanged:: 3.5
				1145	Unmatched groups are replaced with an empty string.
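
A brief sketch of template expansion on a match object (the group names
``first`` and ``last`` are illustrative):

```python
import re

m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Isaac Newton')

# Named and numeric backreferences can be mixed in one template.
reordered = m.expand(r'\g<last>, \1')

# Escapes such as \n in the template become the real character.
two_lines = m.expand(r'\2\n\1')

print(reordered)        # Newton, Isaac
print(repr(two_lines))  # 'Newton\nIsaac'
```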
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1146
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1147	.. method:: Match.group([group1, ...])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1148
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1149	Returns one or more subgroups of the match. If there is a single argument, the
				1150	result is a single string; if there are multiple arguments, the result is a
				1151	tuple with one item per argument. Without arguments, group1 defaults to zero
				1152	(the whole match is returned). If a groupN argument is zero, the corresponding
				1153	return value is the entire matching string; if it is in the inclusive range
				1154	[1..99], it is the string matching the corresponding parenthesized group. If a
				1155	group number is negative or larger than the number of groups defined in the
				1156	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
				1157	part of the pattern that did not match, the corresponding result is ``None``.
				1158	If a group is contained in a part of the pattern that matched multiple times,
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1159	the last match is returned. ::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1160
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1161	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
				1162	>>> m.group(0) # The entire match
				1163	'Isaac Newton'
				1164	>>> m.group(1) # The first parenthesized subgroup.
				1165	'Isaac'
				1166	>>> m.group(2) # The second parenthesized subgroup.
				1167	'Newton'
				1168	>>> m.group(1, 2) # Multiple arguments give us a tuple.
				1169	('Isaac', 'Newton')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1170
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1171	If the regular expression uses the ``(?P<name>...)`` syntax, the groupN
				1172	arguments may also be strings identifying groups by their group name. If a
				1173	string argument is not used as a group name in the pattern, an :exc:`IndexError`
				1174	exception is raised.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1175
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1176	A moderately complicated example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1177
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1178	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				1179	>>> m.group('first_name')
				1180	'Malcolm'
				1181	>>> m.group('last_name')
				1182	'Reynolds'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1183
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1184	Named groups can also be referred to by their index::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1185
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1186	>>> m.group(1)
				1187	'Malcolm'
				1188	>>> m.group(2)
				1189	'Reynolds'
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1190
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1191	If a group matches multiple times, only the last match is accessible::
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1192
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1193	>>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times.
				1194	>>> m.group(1) # Returns only the last match.
				1195	'c3'
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1196
Brian Curtin	48f16f9	2010-04-08 13:55:29 +0000	[diff] [blame]	1197
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1198	.. method:: Match.__getitem__(g)
Eric V. Smith	605bdae	2016-09-11 08:55:43 -0400	[diff] [blame]	1199
				1200	This is identical to ``m.group(g)``. This allows easier access to
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1201	an individual group from a match::
Eric V. Smith	605bdae	2016-09-11 08:55:43 -0400	[diff] [blame]	1202
				1203	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
				1204	>>> m[0] # The entire match
				1205	'Isaac Newton'
				1206	>>> m[1] # The first parenthesized subgroup.
				1207	'Isaac'
				1208	>>> m[2] # The second parenthesized subgroup.
				1209	'Newton'
				1210
				1211	.. versionadded:: 3.6
				1212
				1213
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1214	.. method:: Match.groups(default=None)
Brian Curtin	48f16f9	2010-04-08 13:55:29 +0000	[diff] [blame]	1215
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1216	Return a tuple containing all the subgroups of the match, from 1 up to however
				1217	many groups are in the pattern. The default argument is used for groups that
				1218	did not participate in the match; it defaults to ``None``.
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1219
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1220	For example::
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1221
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1222	>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
				1223	>>> m.groups()
				1224	('24', '1632')
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1225
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1226	If we make the decimal place and everything after it optional, not all groups
				1227	might participate in the match. These groups will default to ``None`` unless
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1228	the default argument is given::
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1229
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1230	>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
				1231	>>> m.groups() # Second group defaults to None.
				1232	('24', None)
				1233	>>> m.groups('0') # Now, the second group defaults to '0'.
				1234	('24', '0')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1235
				1236
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1237	.. method:: Match.groupdict(default=None)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1238
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1239	Return a dictionary containing all the named subgroups of the match, keyed by
				1240	the subgroup name. The default argument is used for groups that did not
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1241	participate in the match; it defaults to ``None``. For example::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1242
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1243	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				1244	>>> m.groupdict()
				1245	{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1246
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1247
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1248	.. method:: Match.start([group])
				1249	Match.end([group])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1250
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1251	Return the indices of the start and end of the substring matched by group;
				1252	group defaults to zero (meaning the whole matched substring). Return ``-1`` if
				1253	group exists but did not contribute to the match. For a match object m, and
				1254	a group g that did contribute to the match, the substring matched by group g
				1255	(equivalent to ``m.group(g)``) is ::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1256
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1257	m.string[m.start(g):m.end(g)]
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1258
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1259	Note that ``m.start(group)`` will equal ``m.end(group)`` if group matched a
				1260	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
				1261	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
				1262	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1263
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1264	An example that will remove *remove_this* from email addresses::
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	1265
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1266	>>> email = "tony@tiremove_thisger.net"
				1267	>>> m = re.search("remove_this", email)
				1268	>>> email[:m.start()] + email[m.end():]
				1269	'tony@tiger.net'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1270
				1271
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1272	.. method:: Match.span([group])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1273
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1274	For a match m, return the 2-tuple ``(m.start(group), m.end(group))``. Note
				1275	that if group did not contribute to the match, this is ``(-1, -1)``.
				1276	group defaults to zero, the entire match.
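
The ``-1`` results for a group that exists but did not contribute can be
sketched with an alternation (the pattern is illustrative):

```python
import re

# Only the first alternative takes part in this match.
m = re.match(r'(a)|(b)', 'a')

whole = m.span()      # group 0, the entire match
used = m.span(1)      # group 1 matched 'a'
unused = m.span(2)    # group 2 exists but did not contribute
unused_start = m.start(2)
unused_end = m.end(2)

print(whole, used, unused)       # (0, 1) (0, 1) (-1, -1)
print(unused_start, unused_end)  # -1 -1
```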
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1277
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1278
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1279	.. attribute:: Match.pos
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1280
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1281	The value of pos which was passed to the :meth:`~Pattern.search` or
				1282	:meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
Georg Brandl	69c7a69	2012-03-14 08:02:43 +0100	[diff] [blame]	1283	the index into the string at which the RE engine started looking for a match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1284
				1285
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1286	.. attribute:: Match.endpos
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1287
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1288	The value of endpos which was passed to the :meth:`~Pattern.search` or
				1289	:meth:`~Pattern.match` method of a :ref:`regex object <re-objects>`. This is
Georg Brandl	69c7a69	2012-03-14 08:02:43 +0100	[diff] [blame]	1290	the index into the string beyond which the RE engine will not go.
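
A sketch of how pos and endpos appear on the resulting match object (the
sample text is illustrative; when endpos is not passed, the value observed
here is the length of the string):

```python
import re

rx = re.compile(r'\d+')
text = 'abc 123 def 456'

# Start looking at index 8, which skips the first run of digits.
m = rx.search(text, 8)

print(m.group())  # 456
print(m.pos)      # 8
print(m.endpos)   # 15
```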
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1291
				1292
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1293	.. attribute:: Match.lastindex
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1294
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1295	The integer index of the last matched capturing group, or ``None`` if no group
				1296	was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
				1297	``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
				1298	the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
				1299	string.
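
The cases listed above can be checked with a short sketch (the dictionary name
``observed`` is illustrative):

```python
import re

patterns = [r'(a)b', r'((a)(b))', r'((ab))', r'(a)(b)']

# Apply each expression to 'ab' and record lastindex.
observed = {p: re.match(p, 'ab').lastindex for p in patterns}

# A pattern without any group gives None.
no_groups = re.match(r'ab', 'ab').lastindex

for p, index in observed.items():
    print(p, index)
print(no_groups)  # None
```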
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1300
				1301
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1302	.. attribute:: Match.lastgroup
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1303
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1304	The name of the last matched capturing group, or ``None`` if the group didn't
				1305	have a name, or if no group was matched at all.
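
One use of this attribute is to label matches by which named alternative
matched. The following is a minimal sketch of that idea; the group names
``number``, ``name`` and ``op`` are hypothetical:

```python
import re

token_rx = re.compile(
    r'(?P<number>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<op>[+\-*/=])'
)

# Each match comes from exactly one named alternative, so lastgroup
# identifies the kind of token. Spaces match nothing and are skipped.
tokens = [(m.lastgroup, m.group()) for m in token_rx.finditer('x = 42 + y')]

# With an unnamed group, lastgroup is None.
unnamed = re.match(r'(a)', 'a').lastgroup

print(tokens)
# [('name', 'x'), ('op', '='), ('number', '42'), ('op', '+'), ('name', 'y')]
print(unnamed)  # None
```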
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1306
				1307
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1308	.. attribute:: Match.re
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1309
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1310	The :ref:`regular expression object <re-objects>` whose :meth:`~Pattern.match` or
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1311	:meth:`~Pattern.search` method produced this match instance.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1312
				1313
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1314	.. attribute:: Match.string
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1315
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1316	The string passed to :meth:`~Pattern.match` or :meth:`~Pattern.search`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1317
				1318
Serhiy Storchaka	fdbd011	2017-04-16 10:16:03 +0300	[diff] [blame]	1319	.. versionchanged:: 3.7
				1320	Added support of :func:`copy.copy` and :func:`copy.deepcopy`. Match objects
				1321	are considered atomic.
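"Atomic" here means that copying a match object gives back the same object rather than a duplicate. A minimal sketch, assuming Python 3.7 or later (the sample string is arbitrary):

```python
import copy
import re

m = re.search(r"\d+", "order 42")

# Both copy functions return the original match object itself.
shallow = copy.copy(m)
deep = copy.deepcopy(m)
print(shallow is m, deep is m)
```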
				1322
				1323
Raymond Hettinger	1fa7682	2010-12-06 23:31:36 +0000	[diff] [blame]	1324	.. _re-examples:
				1325
				1326	Regular Expression Examples
				1327	---------------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1328
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1329
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	1330	Checking for a Pair
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1331	^^^^^^^^^^^^^^^^^^^
				1332
				1333	In this example, we'll use the following helper function to display match
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1334	objects a little more gracefully:
				1335
.. testcode::

   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())
				1342
				1343	Suppose you are writing a poker program where a player's hand is represented as
				1344	a 5-character string with each character representing a card, "a" for ace, "k"
Ezio Melotti	e5b2ac8	2011-12-17 01:17:17 +0200	[diff] [blame]	1345	for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1346	representing the card with that value.
				1347
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1348	To see if a given string is a valid hand, one could do the following::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1349
Ezio Melotti	e5b2ac8	2011-12-17 01:17:17 +0200	[diff] [blame]	1350	>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
				1351	>>> displaymatch(valid.match("akt5q")) # Valid.
				1352	"<Match: 'akt5q', groups=()>"
				1353	>>> displaymatch(valid.match("akt5e")) # Invalid.
				1354	>>> displaymatch(valid.match("akt")) # Invalid.
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1355	>>> displaymatch(valid.match("727ak")) # Valid.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1356	"<Match: '727ak', groups=()>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1357
				1358	That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1359	To match this with a regular expression, one could use backreferences as such::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1360
   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1367
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	1368	To find out what card the pair consists of, one could use the
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1369	:meth:`~Match.group` method of the match object in the following manner:
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1370
.. doctest::

   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'
				1385
				1386
				1387	Simulating scanf()
				1388	^^^^^^^^^^^^^^^^^^
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1389
				1390	.. index:: single: scanf()
				1391
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1392	Python does not currently have an equivalent to :c:func:`scanf`. Regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1393	expressions are generally more powerful, though also more verbose, than
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1394	:c:func:`scanf` format strings. The table below offers some more-or-less
				1395	equivalent mappings between :c:func:`scanf` format tokens and regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1396	expressions.
				1397
+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+
				1419
				1420	To extract the filename and numbers from a string like ::
				1421
				1422	/usr/sbin/sendmail - 0 errors, 4 warnings
				1423
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1424	you would use a :c:func:`scanf` format like ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1425
				1426	%s - %d errors, %d warnings
				1427
				1428	The equivalent regular expression would be ::
				1429
				1430	(\S+) - (\d+) errors, (\d+) warnings
				1431
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1432
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1433	.. _search-vs-match:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1434
				1435	search() vs. match()
				1436	^^^^^^^^^^^^^^^^^^^^
				1437
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1438	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1439
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1440	Python offers two different primitive operations based on regular expressions:
				1441	:func:`re.match` checks for a match only at the beginning of the string, while
				1442	:func:`re.search` checks for a match anywhere in the string (this is what Perl
				1443	does by default).
				1444
				1445	For example::
				1446
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	1447	>>> re.match("c", "abcdef") # No match
				1448	>>> re.search("c", "abcdef") # Match
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1449	<re.Match object; span=(2, 3), match='c'>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1450
Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1453
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	1454	>>> re.match("c", "abcdef") # No match
				1455	>>> re.search("^c", "abcdef") # No match
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1456	>>> re.search("^a", "abcdef") # Match
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1457	<re.Match object; span=(0, 1), match='a'>
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1458
				1459	Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
				1460	beginning of the string, whereas using :func:`search` with a regular expression
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1461	beginning with ``'^'`` will match at the beginning of each line. ::
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1462
				1463	>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
				1464	>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1465	<re.Match object; span=(4, 5), match='X'>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1466
				1467
				1468	Making a Phonebook
				1469	^^^^^^^^^^^^^^^^^^
				1470
:func:`split` splits a string into a list delimited by the passed pattern. The
function is invaluable for converting textual data into data structures that can
be easily read and modified by Python, as demonstrated in the following example
that creates a phonebook.
				1475
First, here is the input. Normally it would come from a file; here we use
triple-quoted string syntax::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1478
Georg Brandl	557a3ec	2012-03-17 17:26:27 +0100	[diff] [blame]	1479	>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1480	...
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1481	... Ronald Heathmore: 892.345.3428 436 Finley Avenue
				1482	... Frank Burger: 925.541.7625 662 South Dogwood Way
				1483	...
				1484	...
				1485	... Heather Albrecht: 548.326.4584 919 Park Place"""
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1486
				1487	The entries are separated by one or more newlines. Now we convert the string
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1488	into a list with each nonempty line having its own entry:
				1489
				1490	.. doctest::
				1491	:options: +NORMALIZE_WHITESPACE
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1492
Georg Brandl	557a3ec	2012-03-17 17:26:27 +0100	[diff] [blame]	1493	>>> entries = re.split("\n+", text)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1494	>>> entries
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1495	['Ross McFluff: 834.345.1254 155 Elm Street',
				1496	'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
				1497	'Frank Burger: 925.541.7625 662 South Dogwood Way',
				1498	'Heather Albrecht: 548.326.4584 919 Park Place']
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1499
Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
because the address itself contains spaces, which are our splitting pattern:
				1503
				1504	.. doctest::
				1505	:options: +NORMALIZE_WHITESPACE
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1506
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1507	>>> [re.split(":? ", entry, 3) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1508	[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
				1509	['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
				1510	['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
				1511	['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
				1512
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1513	The ``:?`` pattern matches the colon after the last name, so that it does not
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1514	occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1515	house number from the street name:
				1516
				1517	.. doctest::
				1518	:options: +NORMALIZE_WHITESPACE
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1519
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1520	>>> [re.split(":? ", entry, 4) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1521	[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
				1522	['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
				1523	['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
				1524	['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
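One possible next step, not part of the example above, is to turn the split entries into a dictionary keyed by full name. The sketch below repeats two of the entries so that it runs on its own; the name ``phonebook`` and the choice of a tuple value are assumptions:

```python
import re

entries = [
    "Ross McFluff: 834.345.1254 155 Elm Street",
    "Heather Albrecht: 548.326.4584 919 Park Place",
]

phonebook = {}
for entry in entries:
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(r":? ", entry, maxsplit=3)
    phonebook[f"{first} {last}"] = (phone, address)

print(phonebook["Ross McFluff"])
```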
				1525
				1526
				1527	Text Munging
				1528	^^^^^^^^^^^^
				1529
				1530	:func:`sub` replaces every occurrence of a pattern with a string or the
				1531	result of a function. This example demonstrates using :func:`sub` with
				1532	a function to "munge" text, or randomize the order of all the characters
				1533	in each word of a sentence except for the first and last characters::
				1534
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
				1544
				1545
				1546	Finding all Adverbs
				1547	^^^^^^^^^^^^^^^^^^^
				1548
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1549	:func:`findall` matches all occurrences of a pattern, not just the first
Miss Islington (bot)	5f16585	2018-06-17 21:49:43 -0700	[diff] [blame]	1550	one as :func:`search` does. For example, if a writer wanted to
				1551	find all of the adverbs in some text, they might use :func:`findall` in
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1552	the following manner::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1553
				1554	>>> text = "He was carefully disguised but captured quickly by police."
				1555	>>> re.findall(r"\w+ly", text)
				1556	['carefully', 'quickly']
				1557
				1558
				1559	Finding all Adverbs and their Positions
				1560	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				1561
				1562	If one wants more information about all matches of a pattern than the matched
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1563	text, :func:`finditer` is useful as it provides :ref:`match objects
				1564	<match-objects>` instead of strings. Continuing with the previous example, if
Miss Islington (bot)	5f16585	2018-06-17 21:49:43 -0700	[diff] [blame]	1565	a writer wanted to find all of the adverbs and their positions in
				1566	some text, they would use :func:`finditer` in the following manner::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1567
   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly
				1573
				1574
				1575	Raw String Notation
				1576	^^^^^^^^^^^^^^^^^^^
				1577
				1578	Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
				1579	every backslash (``'\'``) in a regular expression would have to be prefixed with
				1580	another one to escape it. For example, the two following lines of code are
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1581	functionally identical::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1582
				1583	>>> re.match(r"\W(.)\1\W", " ff ")
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1584	<re.Match object; span=(0, 4), match=' ff '>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1585	>>> re.match("\\W(.)\\1\\W", " ff ")
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1586	<re.Match object; span=(0, 4), match=' ff '>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1587
				1588	When one wants to match a literal backslash, it must be escaped in the regular
				1589	expression. With raw string notation, this means ``r"\\"``. Without raw string
				1590	notation, one must use ``"\\\\"``, making the following lines of code
Serhiy Storchaka	cd195e2	2017-10-14 11:14:26 +0300	[diff] [blame]	1591	functionally identical::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1592
				1593	>>> re.match(r"\\", r"\\")
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1594	<re.Match object; span=(0, 1), match='\\'>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1595	>>> re.match("\\\\", r"\\")
Serhiy Storchaka	0b5e61d	2017-10-04 20:09:49 +0300	[diff] [blame]	1596	<re.Match object; span=(0, 1), match='\\'>
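The equivalence can also be seen by comparing the pattern strings themselves. A small sketch (the names ``raw`` and ``escaped`` are illustrative):

```python
import re

# Both spellings produce the same two-character string: two backslashes.
raw = r"\\"
escaped = "\\\\"
print(raw == escaped, len(raw))

# As a regular expression, those two characters match ONE literal backslash,
# so only the first backslash of the subject string is consumed.
m = re.match(raw, r"\\")
print(m.span())
```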
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1597
				1598
				1599	Writing a Tokenizer
				1600	^^^^^^^^^^^^^^^^^^^
				1601
Georg Brandl	5d94134	2016-02-26 19:37:12 +0100	[diff] [blame]	1602	A `tokenizer or scanner <https://en.wikipedia.org/wiki/Lexical_analysis>`_
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1603	analyzes a string to categorize groups of characters. This is a useful first
				1604	step in writing a compiler or interpreter.
				1605
				1606	The text categories are specified with regular expressions. The technique is
				1607	to combine those into a single master regular expression and to loop over
				1608	successive matches::
				1609
   import collections
   import re

   Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

   def tokenize(code):
       keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
       token_specification = [
           ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
           ('ASSIGN',   r':='),           # Assignment operator
           ('END',      r';'),            # Statement terminator
           ('ID',       r'[A-Za-z]+'),    # Identifiers
           ('OP',       r'[+\-*/]'),      # Arithmetic operators
           ('NEWLINE',  r'\n'),           # Line endings
           ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
           ('MISMATCH', r'.'),            # Any other character
       ]
       tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
       line_num = 1
       line_start = 0
       for mo in re.finditer(tok_regex, code):
           kind = mo.lastgroup
           value = mo.group(kind)
           if kind == 'NEWLINE':
               line_start = mo.end()
               line_num += 1
           elif kind == 'SKIP':
               pass
           elif kind == 'MISMATCH':
               raise RuntimeError(f'{value!r} unexpected on line {line_num}')
           else:
               if kind == 'ID' and value in keywords:
                   kind = value
               column = mo.start() - line_start
               yield Token(kind, value, line_num, column)

   statements = '''
       IF quantity THEN
           total := total + price * quantity;
           tax := price * 0.05;
       ENDIF;
   '''

   for token in tokenize(statements):
       print(token)
				1655
				1656	The tokenizer produces the following output::
Raymond Hettinger	9c47d77	2011-05-13 01:03:50 -0700	[diff] [blame]	1657
Raymond Hettinger	c566431	2014-08-03 23:38:54 -0700	[diff] [blame]	1658	Token(typ='IF', value='IF', line=2, column=4)
				1659	Token(typ='ID', value='quantity', line=2, column=7)
				1660	Token(typ='THEN', value='THEN', line=2, column=16)
				1661	Token(typ='ID', value='total', line=3, column=8)
				1662	Token(typ='ASSIGN', value=':=', line=3, column=14)
				1663	Token(typ='ID', value='total', line=3, column=17)
				1664	Token(typ='OP', value='+', line=3, column=23)
				1665	Token(typ='ID', value='price', line=3, column=25)
				1666	Token(typ='OP', value='*', line=3, column=31)
				1667	Token(typ='ID', value='quantity', line=3, column=33)
				1668	Token(typ='END', value=';', line=3, column=41)
				1669	Token(typ='ID', value='tax', line=4, column=8)
				1670	Token(typ='ASSIGN', value=':=', line=4, column=12)
				1671	Token(typ='ID', value='price', line=4, column=15)
				1672	Token(typ='OP', value='*', line=4, column=21)
				1673	Token(typ='NUMBER', value='0.05', line=4, column=23)
				1674	Token(typ='END', value=';', line=4, column=27)
				1675	Token(typ='ENDIF', value='ENDIF', line=5, column=4)
				1676	Token(typ='END', value=';', line=5, column=9)
Miss Islington (bot)	67d3f8b	2018-03-23 08:55:26 -0700	[diff] [blame]	1677
				1678
				1679	.. [Frie09] Friedl, Jeffrey. Mastering Regular Expressions. 3rd ed., O'Reilly
				1680	Media, 2009. The third edition of the book no longer covers Python at all,
				1681	but the first edition covered writing good regular expression patterns in
				1682	great detail.