Blame - Doc/library/re.rst - platform/external/python/cpython3

Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1	:mod:`re` --- Regular expression operations
				2	===========================================
				3
				4	.. module:: re
				5	:synopsis: Regular expression operations.
				6	.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
				7	.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
				8
				9
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	10	This module provides regular expression matching operations similar to
Georg Brandl	ed2a1db	2009-06-08 07:48:27 +0000	[diff] [blame]	11	those found in Perl.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	12
Both patterns and strings to be searched can be Unicode strings as well as
8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a byte pattern or
vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	18
				19	Regular expressions use the backslash character (``'\'``) to indicate
				20	special forms or to allow special characters to be used without invoking
				21	their special meaning. This collides with Python's usage of the same
				22	character for the same purpose in string literals; for example, to match
				23	a literal backslash, one might have to write ``'\\\\'`` as the pattern
				24	string, because the regular expression must be ``\\``, and each
				25	backslash must be expressed as ``\\`` inside a regular Python string
				26	literal.
				27
				28	The solution is to use Python's raw string notation for regular expression
				29	patterns; backslashes are not handled in any special way in a string literal
				30	prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
				31	``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	32	newline. Usually patterns will be expressed in Python code using this raw
				33	string notation.
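
As a minimal sketch of the point above (the variable names are illustrative
only), the following compares the regular and raw spellings of a pattern that
matches one literal backslash:

```python
import re

# A regular string literal needs four backslashes to spell the
# two-character regular expression \\ ; a raw literal needs only two.
plain_pattern = '\\\\'
raw_pattern = r'\\'
same_pattern = plain_pattern == raw_pattern

# Either spelling matches a single literal backslash in the searched text.
text = 'C:\\temp'  # the seven-character string C:\temp
found = re.search(raw_pattern, text).group(0)

# r"\n" holds two characters (backslash, 'n'); "\n" holds one (newline).
lengths = (len(r"\n"), len("\n"))
```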
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	34
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	35	It is important to note that most regular expression operations are available as
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	36	module-level functions and methods on
				37	:ref:`compiled regular expressions <re-objects>`. The functions are shortcuts
				38	that don't require you to compile a regex object first, but miss some
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	39	fine-tuning parameters.
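
One such fine-tuning parameter is the optional ``pos`` argument accepted by
the ``search`` method of a compiled pattern. The sketch below is only an
illustration; the sample text and names are made up:

```python
import re

text = 'dog cat dog'
pattern = re.compile(r'dog')

# Module-level shortcut: the scan always starts at the beginning.
first = re.search(r'dog', text).start()

# Compiled pattern: the second positional argument (pos) moves the
# starting point of the scan, so the first 'dog' is skipped.
second = pattern.search(text, 1).start()
```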
				40
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	41
				42	.. _re-syntax:
				43
				44	Regular Expression Syntax
				45	-------------------------
				46
				47	A regular expression (or RE) specifies a set of strings that matches it; the
				48	functions in this module let you check if a particular string matches a given
				49	regular expression (or if a given regular expression matches a particular
				50	string, which comes down to the same thing).
				51
Regular expressions can be concatenated to form new regular expressions; if A
and B are both regular expressions, then AB is also a regular expression.
In general, if a string p matches A and another string q matches B, the
string pq will match AB. This holds unless A or B contain low-precedence
operations, boundary conditions between A and B, or numbered group
references. Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here. For details of the theory
and implementation of regular expressions, consult the Friedl book referenced
below, or almost any textbook about compiler construction.
				61
				62	A brief explanation of the format of regular expressions follows. For further
Christian Heimes	2202f87	2008-02-06 14:31:34 +0000	[diff] [blame]	63	information and a gentler presentation, consult the :ref:`regex-howto`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	64
				65	Regular expressions can contain both special and ordinary characters. Most
				66	ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
				67	expressions; they simply match themselves. You can concatenate ordinary
				68	characters, so ``last`` matches the string ``'last'``. (In the rest of this
				69	section, we'll write RE's in ``this special style``, usually without quotes, and
				70	strings to be matched ``'in single quotes'``.)
				71
Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted. Regular
expression pattern strings may not contain null bytes, but can specify
the null byte using a ``\number`` notation such as ``'\x00'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	77
				78
				79	The special characters are:
				80
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	81	``'.'``
				82	(Dot.) In the default mode, this matches any character except a newline. If
				83	the :const:`DOTALL` flag has been specified, this matches any character
				84	including a newline.
				85
				86	``'^'``
				87	(Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
				88	matches immediately after each newline.
				89
				90	``'$'``
				91	Matches the end of the string or just before the newline at the end of the
				92	string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
				93	matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
				94	only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
Christian Heimes	25bb783	2008-01-11 16:17:00 +0000	[diff] [blame]	95	matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
				96	a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
				97	the newline, and one at the end of the string.
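
The ``$`` cases described above can be checked with a short sketch (the
variable names are illustrative):

```python
import re

text = 'foo1\nfoo2\n'

# Default mode: '$' matches only at the end of the string or just
# before the final newline.
default_match = re.search(r'foo.$', text).group(0)

# MULTILINE mode: '$' also matches before every newline.
multiline_match = re.search(r'foo.$', text, re.MULTILINE).group(0)

# A lone '$' finds two empty matches in 'foo\n'.
dollar_positions = [m.start() for m in re.finditer(r'$', 'foo\n')]
```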
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	98
				99	``'*'``
				100	Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
				101	many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
				102	by any number of 'b's.
				103
				104	``'+'``
				105	Causes the resulting RE to match 1 or more repetitions of the preceding RE.
				106	``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
				107	match just 'a'.
				108
				109	``'?'``
				110	Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
				111	``ab?`` will match either 'a' or 'ab'.
				112
				113	``*?``, ``+?``, ``??``
				114	The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
				115	as much text as possible. Sometimes this behaviour isn't desired; if the RE
				116	``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
				117	string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
				118	perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
				119	characters as possible will be matched. Using ``.*?`` in the previous
				120	expression will match only ``'<H1>'``.
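
A short sketch of the difference, using the same sample string as above:

```python
import re

text = '<H1>title</H1>'

# Greedy: '.*' consumes as much as possible, so the match runs to the
# last '>' in the string.
greedy = re.match(r'<.*>', text).group(0)

# Non-greedy: '.*?' stops at the first '>' that lets the match succeed.
minimal = re.match(r'<.*?>', text).group(0)
```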
				121
				122	``{m}``
				123	Specifies that exactly m copies of the previous RE should be matched; fewer
				124	matches cause the entire RE not to match. For example, ``a{6}`` will match
				125	exactly six ``'a'`` characters, but not five.
				126
				127	``{m,n}``
				128	Causes the resulting RE to match from m to n repetitions of the preceding
				129	RE, attempting to match as many repetitions as possible. For example,
				130	``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting m specifies a
				131	lower bound of zero, and omitting n specifies an infinite upper bound. As an
				132	example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
				133	followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
				134	modifier would be confused with the previously described form.
				135
				136	``{m,n}?``
				137	Causes the resulting RE to match from m to n repetitions of the preceding
				138	RE, attempting to match as few repetitions as possible. This is the
				139	non-greedy version of the previous qualifier. For example, on the
				140	6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
				141	while ``a{3,5}?`` will only match 3 characters.
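
The bounded qualifiers can be compared side by side; this is only a sketch
built from the examples given above:

```python
import re

# Greedy form takes the upper bound when it can.
greedy = re.match(r'a{3,5}', 'aaaaaa').group(0)

# Non-greedy form stops at the lower bound.
lazy = re.match(r'a{3,5}?', 'aaaaaa').group(0)

# Open upper bound: at least four 'a' characters are required before 'b'.
four_ok = re.match(r'a{4,}b', 'aaaab') is not None
three_fails = re.match(r'a{4,}b', 'aaab') is None
```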
				142
				143	``'\'``
				144	Either escapes special characters (permitting you to match characters like
				145	``'*'``, ``'?'``, and so forth), or signals a special sequence; special
				146	sequences are discussed below.
				147
If you're not using a raw string to express the pattern, remember that Python
also uses the backslash as an escape sequence in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and subsequent
character are included in the resulting string. However, if Python would
recognize the resulting sequence, the backslash should be doubled. This
is complicated and hard to understand, so it's highly recommended that you use
raw strings for all but the simplest expressions.
				155
				156	``[]``
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	157	Used to indicate a set of characters. In a set:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	158
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	159	* Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
				160	``'m'``, or ``'k'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	161
* Ranges of characters can be indicated by giving two characters and separating
them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
it will match a literal ``'-'``.
				168
				169	* Special characters lose their special meaning inside sets. For example,
				170	``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
				171	``'*'``, or ``')'``.
				172
* Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
inside a set, although the characters they match depend on whether
:const:`ASCII` or :const:`LOCALE` mode is in force.
				176
				177	* Characters that are not within a range can be matched by :dfn:`complementing`
				178	the set. If the first character of the set is ``'^'``, all the characters
				179	that are not in the set will be matched. For example, ``[^5]`` will match
				180	any character except ``'5'``, and ``[^^]`` will match any character except
				181	``'^'``. ``^`` has no special meaning if it's not the first character in
				182	the set.
				183
* To match a literal ``']'`` inside a set, precede it with a backslash, or
place it at the beginning of the set. For example, ``[()[\]{}]`` and
``[]()[{}]`` will both match a parenthesis.
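
The set rules listed above can be exercised with a few calls to
``re.findall``. This is a sketch only; the sample strings are invented for
illustration:

```python
import re

# Ranges: two-digit numbers from 00 to 59.
minutes = re.findall(r'[0-5][0-9]', '07 42 61 59')

# An escaped '-' inside a set is a literal hyphen, not a range.
hyphen = re.findall(r'[a\-z]', 'a-z b')

# A leading '^' complements the set.
not_five = re.findall(r'[^5]', '1525')

# A literal ']' is either escaped or placed first in the set.
sample = 'f(x)[0]{}'
escaped = re.findall(r'[()[\]{}]', sample)
leading = re.findall(r'[]()[{}]', sample)
```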
Mark Summerfield	9e670c2	2008-05-31 13:05:34 +0000	[diff] [blame]	187
``'|'``
``A|B``, where A and B can be arbitrary REs, creates a regular expression that
will match either A or B. An arbitrary number of REs can be separated by the
``'|'`` in this way. This can be used inside groups (see below) as well. As
the target string is scanned, REs separated by ``'|'`` are tried from left to
right. When one pattern completely matches, that branch is accepted. This means
that once ``A`` matches, ``B`` will not be tested further, even if it would
produce a longer overall match. In other words, the ``'|'`` operator is never
greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
character class, as in ``[|]``.
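
A small sketch of the left-to-right rule for alternation and of the two ways
to match a literal vertical bar (sample strings are illustrative):

```python
import re

# The first branch that matches wins, even if a later branch would
# produce a longer match.
first_branch = re.match(r'a|ab', 'ab').group(0)
reordered = re.match(r'ab|a', 'ab').group(0)

# A literal vertical bar: escaped, or inside a character class.
split_escaped = re.split(r'\|', 'x|y|z')
split_class = re.split(r'[|]', 'x|y|z')
```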
				198
				199	``(...)``
Matches whatever regular expression is inside the parentheses, and indicates the
start and end of a group; the contents of a group can be retrieved after a match
has been performed, and can be matched later in the string with the ``\number``
special sequence, described below. To match the literals ``'('`` or ``')'``,
use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
				205
				206	``(?...)``
				207	This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
				208	otherwise). The first character after the ``'?'`` determines what the meaning
				209	and further syntax of the construct is. Extensions usually do not create a new
				210	group; ``(?P<name>...)`` is the only exception to this rule. Following are the
				211	currently supported extensions.
				212
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	213	``(?aiLmsux)``
				214	(One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
				215	``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
Andrew M. Kuchling	1c50e86	2009-06-01 00:11:36 +0000	[diff] [blame]	216	letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	217	:const:`re.I` (ignore case), :const:`re.L` (locale dependent),
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	218	:const:`re.M` (multi-line), :const:`re.S` (dot matches all),
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	219	and :const:`re.X` (verbose), for the entire regular expression. (The
				220	flags are described in :ref:`contents-of-module-re`.) This
				221	is useful if you wish to include the flags as part of the regular
				222	expression, instead of passing a flag argument to the
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	223	:func:`re.compile` function.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	224
				225	Note that the ``(?x)`` flag changes how the expression is parsed. It should be
				226	used first in the expression string, or after one or more whitespace characters.
				227	If there are non-whitespace characters before the flag, the results are
				228	undefined.
				229
				230	``(?:...)``
Georg Brandl	3122ce3	2010-10-29 06:17:38 +0000	[diff] [blame]	231	A non-capturing version of regular parentheses. Matches whatever regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	232	expression is inside the parentheses, but the substring matched by the group
				233	cannot be retrieved after performing a match or referenced later in the
				234	pattern.
				235
				236	``(?P<name>...)``
Similar to regular parentheses, but the substring matched by the group is
accessible via the symbolic group name *name*. Group names must be valid
Python identifiers, and each group name must be defined only once within a
regular expression. A symbolic group is also a numbered group, just as if
the group were not named.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	242
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	243	Named groups can be referenced in three contexts. If the pattern is
				244	``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
				245	single or double quotes):
				246
+---------------------------------------+----------------------------------+
| Context of reference to group "quote" | Ways to reference it             |
+=======================================+==================================+
| in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
|                                       | * ``\1``                         |
+---------------------------------------+----------------------------------+
| when processing match object ``m``    | * ``m.group('quote')``           |
|                                       | * ``m.end('quote')`` (etc.)      |
+---------------------------------------+----------------------------------+
| in a string passed to the ``repl``    | * ``\g<quote>``                  |
| argument of ``re.sub()``              | * ``\g<1>``                      |
|                                       | * ``\1``                         |
+---------------------------------------+----------------------------------+
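
The three contexts can be tried out with the ``quote`` pattern shown above.
This is a sketch; the sample strings and variable names are illustrative:

```python
import re

# Context 1: the group is referenced inside the pattern itself.
pattern = re.compile(r'''(?P<quote>['"]).*?(?P=quote)''')

# Context 2: the group is read back from the match object.
m = pattern.search('say "hi" now')
whole = m.group(0)
quote_char = m.group('quote')
quote_span = (m.start('quote'), m.end('quote'))

# Context 3: the group is referenced in the repl argument of sub().
replaced = pattern.sub(r'\g<quote>...\g<quote>', "a 'b' c")
```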
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	260
				261	``(?P=name)``
A backreference to a named group; it matches whatever text was matched by the
earlier group named *name*.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	264
				265	``(?#...)``
				266	A comment; the contents of the parentheses are simply ignored.
				267
				268	``(?=...)``
				269	Matches if ``...`` matches next, but doesn't consume any of the string. This is
				270	called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
				271	``'Isaac '`` only if it's followed by ``'Asimov'``.
				272
				273	``(?!...)``
				274	Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
				275	For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's not
				276	followed by ``'Asimov'``.
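
Both lookahead forms can be checked with the ``Isaac`` examples above; the
second sample name is made up for illustration:

```python
import re

# Positive lookahead: 'Asimov' must follow, but is not consumed.
positive = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov').group(0)

# Negative lookahead: the match fails when 'Asimov' follows ...
rejected = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov') is None

# ... and succeeds when anything else follows.
accepted = re.search(r'Isaac (?!Asimov)', 'Isaac Newton').group(0)
```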
				277
				278	``(?<=...)``
Matches if the current position in the string is preceded by a match for ``...``
that ends at the current position. This is called a :dfn:`positive lookbehind
assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
lookbehind will back up 3 characters and check if the contained pattern matches.
The contained pattern must only match strings of some fixed length, meaning that
``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Group
references are not supported even if they match strings of some fixed length.
Note that patterns which start with positive lookbehind assertions will not
match at the beginning of the string being searched; you will most likely want
to use the :func:`search` function rather than the :func:`match` function:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	290
				291	>>> import re
				292	>>> m = re.search('(?<=abc)def', 'abcdef')
				293	>>> m.group(0)
				294	'def'
				295
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	296	This example looks for a word following a hyphen:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	297
>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
				299	>>> m.group(0)
				300	'egg'
				301
				302	``(?<!...)``
				303	Matches if the current position in the string is not preceded by a match for
				304	``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
				305	positive lookbehind assertions, the contained pattern must only match strings of
Serhiy Storchaka	a3369a5	2015-02-21 12:08:52 +0200	[diff] [blame]	306	some fixed length and shouldn't contain group references.
				307	Patterns which start with negative lookbehind assertions may
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	308	match at the beginning of the string being searched.
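
A sketch of a negative lookbehind, using an invented example that picks out
numbers not preceded by a dollar sign:

```python
import re

# '30' is skipped because it is preceded by '$'; '12' is kept.
plain_numbers = re.findall(r'(?<!\$)\b\d+', 'cost $30, qty 12')

# Unlike the positive form, a leading negative lookbehind can succeed
# at the very start of the string, where nothing precedes the match.
at_start = re.match(r'(?<!-)\w+', 'spam').group(0)
```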
				309
``(?(id/name)yes-pattern|no-pattern)``
Will try to match with ``yes-pattern`` if the group with given id or
name exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
optional and can be omitted. For example,
``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
not with ``'<user@host.com'`` nor ``'user@host.com>'``.
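
The four cases named above can be checked with :func:`match`; this is a
sketch, and the dictionary is only a convenient way to collect the results:

```python
import re

pattern = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

candidates = [
    '<user@host.com>',
    'user@host.com',
    '<user@host.com',
    'user@host.com>',
]

# True where the pattern matches from the start of the string.
results = {s: pattern.match(s) is not None for s in candidates}
```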
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	317
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	318
				319	The special sequences consist of ``'\'`` and a character from the list below.
				320	If the ordinary character is not on the list, then the resulting RE will match
				321	the second character. For example, ``\$`` matches the character ``'$'``.
				322
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	323	``\number``
				324	Matches the contents of the group of the same number. Groups are numbered
				325	starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
Georg Brandl	2070e83	2013-10-06 12:58:20 +0200	[diff] [blame]	326	but not ``'thethe'`` (note the space after the group). This special sequence
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	327	can only be used to match one of the first 99 groups. If the first digit of
				328	number is 0, or number is 3 octal digits long, it will not be interpreted as
				329	a group match, but as the character with octal value number. Inside the
				330	``'['`` and ``']'`` of a character class, all numeric escapes are treated as
				331	characters.
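
A sketch of the two readings of ``\number``: a group reference, and an
octal character escape when three octal digits are given:

```python
import re

# \1 must repeat whatever group 1 matched, after a single space.
doubled_word = re.match(r'(.+) \1', 'the the') is not None
doubled_number = re.match(r'(.+) \1', '55 55') is not None
no_space = re.match(r'(.+) \1', 'thethe') is None

# Three octal digits are read as a character: octal 101 is 'A'.
octal_char = re.match(r'\101', 'A') is not None
```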
				332
				333	``\A``
				334	Matches only at the start of the string.
				335
				336	``\b``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	337	Matches the empty string, but only at the beginning or end of a word.
				338	A word is defined as a sequence of Unicode alphanumeric or underscore
				339	characters, so the end of a word is indicated by whitespace or a
Ezio Melotti	5a045b9	2012-02-29 11:48:44 +0200	[diff] [blame]	340	non-alphanumeric, non-underscore Unicode character. Note that formally,
				341	``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
				342	(or vice versa), or between ``\w`` and the beginning/end of the string.
				343	This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
				344	``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
				345
				346	By default Unicode alphanumerics are the ones used, but this can be changed
				347	by using the :const:`ASCII` flag. Inside a character range, ``\b``
				348	represents the backspace character, for compatibility with Python's string
				349	literals.
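
The word-boundary examples above can be verified with a short sketch (the
list name is illustrative):

```python
import re

words = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']

# Keep only the strings in which 'foo' appears as a whole word.
hits = [w for w in words if re.search(r'\bfoo\b', w)]

# Inside a character class, \b is the backspace character instead.
backspace_found = re.search(r'[\b]', 'a\bc') is not None
```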
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	350
				351	``\B``
Ezio Melotti	5a045b9	2012-02-29 11:48:44 +0200	[diff] [blame]	352	Matches the empty string, but only when it is not at the beginning or end
				353	of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
				354	``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
				355	``\B`` is just the opposite of ``\b``, so word characters are
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	356	Unicode alphanumerics or the underscore, although this can be changed
				357	by using the :const:`ASCII` flag.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	358
				359	``\d``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	360	For Unicode (str) patterns:
Mark Dickinson	1f26828	2009-07-28 17:22:36 +0000	[diff] [blame]	361	Matches any Unicode decimal digit (that is, any character in
				362	Unicode character category [Nd]). This includes ``[0-9]``, and
				363	also many other digit characters. If the :const:`ASCII` flag is
				364	used only ``[0-9]`` is matched (but the flag affects the entire
				365	regular expression, so in such cases using an explicit ``[0-9]``
				366	may be a better choice).
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	367	For 8-bit (bytes) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	368	Matches any decimal digit; this is equivalent to ``[0-9]``.
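
A sketch of how ``\d`` behaves for str and bytes patterns. The sample text
uses U+0663 (ARABIC-INDIC DIGIT THREE) as an example of a non-ASCII decimal
digit:

```python
import re

text = '7\u0663'  # ASCII '7' followed by ARABIC-INDIC DIGIT THREE

# Unicode pattern: both characters are decimal digits.
unicode_digits = re.findall(r'\d', text)

# With the ASCII flag only [0-9] is matched.
ascii_digits = re.findall(r'\d', text, re.ASCII)

# Bytes pattern: always equivalent to [0-9].
byte_digits = re.findall(rb'\d', b'7x')
```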
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	369
				370	``\D``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	371	Matches any character which is not a Unicode decimal digit. This is
				372	the opposite of ``\d``. If the :const:`ASCII` flag is used this
				373	becomes the equivalent of ``[^0-9]`` (but the flag affects the entire
				374	regular expression, so in such cases using an explicit ``[^0-9]`` may
				375	be a better choice).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	376
				377	``\s``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	378	For Unicode (str) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	379	Matches Unicode whitespace characters (which includes
				380	``[ \t\n\r\f\v]``, and also many other characters, for example the
				381	non-breaking spaces mandated by typography rules in many
				382	languages). If the :const:`ASCII` flag is used, only
				383	``[ \t\n\r\f\v]`` is matched (but the flag affects the entire
				384	regular expression, so in such cases using an explicit
				385	``[ \t\n\r\f\v]`` may be a better choice).
				386
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	387	For 8-bit (bytes) patterns:
				388	Matches characters considered whitespace in the ASCII character set;
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	389	this is equivalent to ``[ \t\n\r\f\v]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	390
				391	``\S``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	392	Matches any character which is not a Unicode whitespace character. This is
				393	the opposite of ``\s``. If the :const:`ASCII` flag is used this
				394	becomes the equivalent of ``[^ \t\n\r\f\v]`` (but the flag affects the entire
				395	regular expression, so in such cases using an explicit ``[^ \t\n\r\f\v]`` may
				396	be a better choice).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	397
				398	``\w``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	399	For Unicode (str) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	400	Matches Unicode word characters; this includes most characters
				401	that can be part of a word in any language, as well as numbers and
				402	the underscore. If the :const:`ASCII` flag is used, only
				403	``[a-zA-Z0-9_]`` is matched (but the flag affects the entire
				404	regular expression, so in such cases using an explicit
				405	``[a-zA-Z0-9_]`` may be a better choice).
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	406	For 8-bit (bytes) patterns:
				407	Matches characters considered alphanumeric in the ASCII character set;
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	408	this is equivalent to ``[a-zA-Z0-9_]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	409
				410	``\W``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	411	Matches any character which is not a Unicode word character. This is
				412	the opposite of ``\w``. If the :const:`ASCII` flag is used this
				413	becomes the equivalent of ``[^a-zA-Z0-9_]`` (but the flag affects the
				414	entire regular expression, so in such cases using an explicit
				415	``[^a-zA-Z0-9_]`` may be a better choice).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	416
				417	``\Z``
				418	Matches only at the end of the string.
				419
				420	Most of the standard escapes supported by Python string literals are also
				421	accepted by the regular expression parser::
				422
				423	\a \b \f \n
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	424	\r \t \u \U
				425	\v \x \\
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	426
Ezio Melotti	285e51b	2012-04-29 04:52:30 +0300	[diff] [blame]	427	(Note that ``\b`` is used to represent word boundaries, and means "backspace"
				428	only inside character classes.)
				429
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	430	``'\u'`` and ``'\U'`` escape sequences are only recognized in Unicode
				431	patterns. In bytes patterns they are not treated specially.
				432
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	433	Octal escapes are included in a limited form. If the first digit is a 0, or if
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	434	there are three octal digits, it is considered an octal escape. Otherwise, it is
				435	a group reference. As for string literals, octal escapes are always at most
				436	three digits in length.
				437
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	438	.. versionchanged:: 3.3
				439	The ``'\u'`` and ``'\U'`` escape sequences have been added.
				440
				441
Georg Brandl	bb2d669	2014-10-28 21:41:51 +0100	[diff] [blame]	442	.. seealso::
				443
				444	Mastering Regular Expressions
				445	Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
				446	second edition of the book no longer covers Python at all, but the first
				447	edition covered writing good regular expression patterns in great detail.
				448
				449
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	450
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	451	.. _contents-of-module-re:
				452
				453	Module Contents
				454	---------------
				455
The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions. Most non-trivial applications use the compiled form.
				460
				461
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	462	.. function:: compile(pattern, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	463
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	464	Compile a regular expression pattern into a regular expression object, which
Ezio Melotti	642d4b6	2014-06-20 00:52:11 +0300	[diff] [blame]	465	can be used for matching using its :func:`~regex.match` and
				466	:func:`~regex.search` methods, described below.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	467
The expression's behaviour can be modified by specifying a flags value.
Values can be any of the following variables, combined using bitwise OR (the
``|`` operator).
				471
				472	The sequence ::
				473
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	474	prog = re.compile(pattern)
				475	result = prog.match(string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	476
				477	is equivalent to ::
				478
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	479	result = re.match(pattern, string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	480
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	481	but using :func:`re.compile` and saving the resulting regular expression
				482	object for reuse is more efficient when the expression will be used several
				483	times in a single program.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	484
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	485	.. note::
				486
				487	The compiled versions of the most recent patterns passed to
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	488	:func:`re.compile` and the module-level matching functions are cached, so
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	489	programs that use only a few regular expressions at a time needn't worry
				490	about compiling regular expressions.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	491
				492
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	493	.. data:: A
				494	ASCII
				495
Georg Brandl	4049ce0	2009-06-08 07:49:54 +0000	[diff] [blame]	496	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
				497	perform ASCII-only matching instead of full Unicode matching. This is only
				498	meaningful for Unicode patterns, and is ignored for byte patterns.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	499
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	500	Note that for backward compatibility, the :const:`re.U` flag still
				501	exists (as well as its synonym :const:`re.UNICODE` and its embedded
Georg Brandl	ebeb44d	2010-07-29 11:15:36 +0000	[diff] [blame]	502	counterpart ``(?u)``), but these are redundant in Python 3 since
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	503	matches are Unicode by default for strings (and Unicode matching
				504	isn't allowed for bytes).
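
A small sketch of the difference, assuming a str pattern and a non-ASCII decimal digit chosen purely for illustration:

```python
import re

digit = "\u0663"  # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit

# Default for str patterns: full Unicode matching, so \d accepts the digit.
unicode_result = re.fullmatch(r"\d", digit)

# With re.ASCII, \d is restricted to [0-9], so the same digit is rejected.
ascii_result = re.fullmatch(r"\d", digit, re.ASCII)

print(unicode_result is not None)  # True
print(ascii_result is None)        # True
```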
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	505
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	506
Sandro Tosi	da785fd	2012-01-01 12:55:20 +0100	[diff] [blame]	507	.. data:: DEBUG
				508
Display debug information about the compiled expression.
				510
				511
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	512	.. data:: I
				513	IGNORECASE
				514
				515	Perform case-insensitive matching; expressions like ``[A-Z]`` will match
Mark Summerfield	8676534	2008-08-20 07:40:18 +0000	[diff] [blame]	516	lowercase letters, too. This is not affected by the current locale
				517	and works for Unicode characters as expected.
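
For instance, a sketch with an invented sample string:

```python
import re

text = "abc DEF"  # hypothetical input

# Without the flag, [A-Z] matches only uppercase letters.
case_sensitive = re.findall(r"[A-Z]+", text)

# With IGNORECASE, the same character class also matches lowercase letters.
case_insensitive = re.findall(r"[A-Z]+", text, re.IGNORECASE)

print(case_sensitive)    # ['DEF']
print(case_insensitive)  # ['abc', 'DEF']
```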
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	518
				519
				520	.. data:: L
				521	LOCALE
				522
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	523	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	524	current locale. The use of this flag is discouraged as the locale mechanism
				525	is very unreliable, and it only handles one "culture" at a time anyway;
Georg Brandl	ebeb44d	2010-07-29 11:15:36 +0000	[diff] [blame]	526	you should use Unicode matching instead, which is the default in Python 3
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	527	for Unicode (str) patterns.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	528
				529
				530	.. data:: M
				531	MULTILINE
				532
				533	When specified, the pattern character ``'^'`` matches at the beginning of the
				534	string and at the beginning of each line (immediately following each newline);
				535	and the pattern character ``'$'`` matches at the end of the string and at the
				536	end of each line (immediately preceding each newline). By default, ``'^'``
				537	matches only at the beginning of the string, and ``'$'`` only at the end of the
				538	string and immediately before the newline (if any) at the end of the string.
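
The effect on ``'^'`` can be sketched as follows; the three-line sample text is an assumption made for the example.

```python
import re

text = "first\nsecond\nthird"  # hypothetical input

# Default: '^' matches only at the very beginning of the string.
default_hits = re.findall(r"^\w+", text)

# MULTILINE: '^' also matches immediately after each newline.
multiline_hits = re.findall(r"^\w+", text, re.MULTILINE)

print(default_hits)    # ['first']
print(multiline_hits)  # ['first', 'second', 'third']
```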
				539
				540
				541	.. data:: S
				542	DOTALL
				543
				544	Make the ``'.'`` special character match any character at all, including a
				545	newline; without this flag, ``'.'`` will match anything except a newline.
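
A minimal sketch of the difference, using an invented two-line string:

```python
import re

text = "a\nb"  # hypothetical input containing a newline

# Without DOTALL, '.' refuses to match the newline.
without_flag = re.match(r"a.b", text)

# With DOTALL, '.' matches the newline as well.
with_flag = re.match(r"a.b", text, re.DOTALL)

print(without_flag is None)   # True
print(with_flag is not None)  # True
```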
				546
				547
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	548	.. data:: X
				549	VERBOSE
				550
This flag allows you to write regular expressions that look nicer. Whitespace
within the pattern is ignored, except when in a character class or preceded by
an unescaped backslash. When a line contains a ``'#'`` that is neither in a
character class nor preceded by an unescaped backslash, all characters from the
leftmost such ``'#'`` through the end of the line are ignored.
				556
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	557	That means that the two following regular expression objects that match a
				558	decimal number are functionally equal::
Georg Brandl	81ac1ce	2007-08-31 17:17:17 +0000	[diff] [blame]	559
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	560	a = re.compile(r"""\d + # the integral part
				561	\. # the decimal point
				562	\d * # some fractional digits""", re.X)
				563	b = re.compile(r"\d+\.\d*")
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	564
				565
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	566
				567
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	568	.. function:: search(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	569
Terry Jan Reedy	0edb5c1	2014-05-30 16:19:59 -0400	[diff] [blame]	570	Scan through string looking for the first location where the regular expression
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	571	pattern produces a match, and return a corresponding :ref:`match object
				572	<match-objects>`. Return ``None`` if no position in the string matches the
				573	pattern; note that this is different from finding a zero-length match at some
				574	point in the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	575
				576
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	577	.. function:: match(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	578
				579	If zero or more characters at the beginning of string match the regular
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	580	expression pattern, return a corresponding :ref:`match object
				581	<match-objects>`. Return ``None`` if the string does not match the pattern;
				582	note that this is different from a zero-length match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	583
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	584	Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
				585	at the beginning of the string and not at the beginning of each line.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	586
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	587	If you want to locate a match anywhere in string, use :func:`search`
				588	instead (see also :ref:`search-vs-match`).
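
The distinction can be sketched like this; the sample text is invented for the example.

```python
import re

text = "one\ntwo"  # hypothetical input

# match() is anchored at the start of the string, even with MULTILINE.
anchored = re.match(r"two", text, re.MULTILINE)

# search() scans the string, and '^' in MULTILINE mode matches after the newline.
scanned = re.search(r"^two", text, re.MULTILINE)

print(anchored)        # None
print(scanned.span())  # (4, 7)
```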
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	589
				590
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	591	.. function:: fullmatch(pattern, string, flags=0)
				592
				593	If the whole string matches the regular expression pattern, return a
				594	corresponding :ref:`match object <match-objects>`. Return ``None`` if the
				595	string does not match the pattern; note that this is different from a
				596	zero-length match.
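
A short sketch contrasting :func:`fullmatch` with :func:`match`, and a failed match with a zero-length match; the inputs are illustrative only.

```python
import re

# match() accepts a matching prefix; fullmatch() requires the whole string to match.
prefix_only = re.match(r"\d+", "123abc")
whole_fail = re.fullmatch(r"\d+", "123abc")
whole_ok = re.fullmatch(r"\d+", "123")

# A zero-length match is still a match object, not None.
empty_ok = re.fullmatch(r"\d*", "")

print(prefix_only.group())  # '123'
print(whole_fail)           # None
print(whole_ok.group())     # '123'
print(empty_ok.group())     # ''
```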
				597
				598	.. versionadded:: 3.4
				599
				600
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	601	.. function:: split(pattern, string, maxsplit=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	602
				603	Split string by the occurrences of pattern. If capturing parentheses are
				604	used in pattern, then the text of all groups in the pattern are also returned
				605	as part of the resulting list. If maxsplit is nonzero, at most maxsplit
				606	splits occur, and the remainder of the string is returned as the final element
Georg Brandl	9647389	2008-03-06 07:09:43 +0000	[diff] [blame]	607	of the list. ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	608
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
				614	['Words', 'words, words.']
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	615	>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
				616	['0', '3', '9']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	617
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	618	If there are capturing groups in the separator and it matches at the start of
				619	the string, the result will start with an empty string. The same holds for
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	620	the end of the string:
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	621
>>> re.split(r'(\W+)', '...words, words...')
				623	['', '...', 'words', ', ', 'words', '...', '']
				624
				625	That way, separator components are always found at the same relative
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	626	indices within the result list.
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	627
Thomas Wouters	89d996e	2007-09-08 17:39:28 +0000	[diff] [blame]	628	Note that split will never split a string on an empty pattern match.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	629	For example:
Thomas Wouters	89d996e	2007-09-08 17:39:28 +0000	[diff] [blame]	630
				631	>>> re.split('x*', 'foo')
				632	['foo']
				633	>>> re.split("(?m)^$", "foo\n\nbar\n")
				634	['foo\n\nbar\n']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	635
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	636	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	637	Added the optional flags argument.
				638
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	639
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	640	.. function:: findall(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	641
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	642	Return all non-overlapping matches of pattern in string, as a list of
Georg Brandl	3dbca81	2008-07-23 16:10:53 +0000	[diff] [blame]	643	strings. The string is scanned left-to-right, and matches are returned in
				644	the order found. If one or more groups are present in the pattern, return a
				645	list of groups; this will be a list of tuples if the pattern has more than
				646	one group. Empty matches are included in the result unless they touch the
				647	beginning of another match.
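
How the shape of the result depends on the number of groups can be sketched as follows, using an invented ``key=value`` string:

```python
import re

text = "a=1, b=22"  # hypothetical input

# No groups: a list of the full matches.
no_groups = re.findall(r"\w+=\d+", text)

# One group: a list of that group's text.
one_group = re.findall(r"\w+=(\d+)", text)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w+)=(\d+)", text)

print(no_groups)   # ['a=1', 'b=22']
print(one_group)   # ['1', '22']
print(two_groups)  # [('a', '1'), ('b', '22')]
```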
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	648
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	649
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	650	.. function:: finditer(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	651
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	652	Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
				653	all non-overlapping matches for the RE pattern in string. The string
				654	is scanned left-to-right, and matches are returned in the order found. Empty
Georg Brandl	3dbca81	2008-07-23 16:10:53 +0000	[diff] [blame]	655	matches are included in the result unless they touch the beginning of another
				656	match.
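
Because :func:`finditer` yields match objects rather than strings, positions are available as well as text. A minimal sketch with an invented input:

```python
import re

text = "a1bb22"  # hypothetical input

# Collect (start, end, matched text) for each run of digits, left to right.
hits = [(m.start(), m.end(), m.group()) for m in re.finditer(r"\d+", text)]

print(hits)  # [(1, 2, '1'), (4, 6, '22')]
```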
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	657
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	658
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	659	.. function:: sub(pattern, repl, string, count=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	660
				661	Return the string obtained by replacing the leftmost non-overlapping occurrences
				662	of pattern in string by the replacement repl. If the pattern isn't found,
				663	string is returned unchanged. repl can be a string or a function; if it is
				664	a string, any backslash escapes in it are processed. That is, ``\n`` is
Sandro Tosi	6a633bb	2011-08-19 22:54:50 +0200	[diff] [blame]	665	converted to a single newline character, ``\r`` is converted to a carriage return, and
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	666	so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
				667	as ``\6``, are replaced with the substring matched by group 6 in the pattern.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	668	For example:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	669
>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'
				674
				675	If repl is a function, it is called for every non-overlapping occurrence of
				676	pattern. The function takes a single match object argument, and returns the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	677	replacement string. For example:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	678
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
				682	>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
				683	'pro--gram files'
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	684	>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
				685	'Baked Beans & Spam'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	686
Georg Brandl	1b5ab45	2009-08-13 07:56:35 +0000	[diff] [blame]	687	The pattern may be a string or an RE object.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	688
				689	The optional argument count is the maximum number of pattern occurrences to be
				690	replaced; count must be a non-negative integer. If omitted or zero, all
				691	occurrences will be replaced. Empty matches for the pattern are replaced only
				692	when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
				693	``'-a-b-c-'``.
				694
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame]	695	In string-type repl arguments, in addition to the character escapes and
				696	backreferences described above,
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	697	``\g<name>`` will use the substring matched by the group named ``name``, as
				698	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
				699	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
				700	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
				701	reference to group 20, not a reference to group 2 followed by the literal
				702	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
				703	substring matched by the RE.
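
The ``\g<...>`` forms can be sketched as follows; the patterns and inputs are invented for the example.

```python
import re

# Named groups referenced by name in the replacement string.
swapped = re.sub(r"(?P<year>\d{4})-(?P<month>\d{2})",
                 r"\g<month>/\g<year>",
                 "released 2014-06")

# \g<2>0 means "group 2, then a literal 0"; it is not read as group 20.
followed_by_zero = re.sub(r"(\d)(\d)", r"\g<2>0", "12")

# \g<0> inserts the entire match.
bracketed = re.sub(r"\d+", r"[\g<0>]", "a1b22")

print(swapped)           # 'released 06/2014'
print(followed_by_zero)  # '20'
print(bracketed)         # 'a[1]b[22]'
```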
				704
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	705	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	706	Added the optional flags argument.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	707
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	708
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	709	.. function:: subn(pattern, repl, string, count=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	710
				711	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
				712	number_of_subs_made)``.
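
For example, a sketch that collapses runs of whitespace and reports how many runs were replaced (the input is illustrative):

```python
import re

# Each run of whitespace becomes a single space; subn() also returns the count.
new_string, count = re.subn(r"\s+", " ", "a  b \t c")

print(new_string)  # 'a b c'
print(count)       # 2
```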
				713
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	714	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	715	Added the optional flags argument.
				716
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	717
				718	.. function:: escape(string)
				719
Ezio Melotti	88fdeb4	2011-04-10 12:59:16 +0300	[diff] [blame]	720	Escape all the characters in pattern except ASCII letters, numbers and ``'_'``.
				721	This is useful if you want to match an arbitrary literal string that may
				722	have regular expression metacharacters in it.
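
A sketch of the typical use, assuming a literal string that happens to contain metacharacters. The exact escaped form is not shown because it varies between Python versions.

```python
import re

literal = "1+1=2 (approx.)"  # hypothetical text to find verbatim
text = "so 1+1=2 (approx.) holds"

# Used directly as a pattern, '+', '(', ')' and '.' are metacharacters.
unescaped = re.search(literal, text)

# After escaping, the pattern matches the literal text itself.
escaped = re.search(re.escape(literal), text)

print(unescaped)        # None
print(escaped.group())  # '1+1=2 (approx.)'
```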
				723
				724	.. versionchanged:: 3.3
				725	The ``'_'`` character is no longer escaped.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	726
				727
R. David Murray	522c32a	2010-07-10 14:23:36 +0000	[diff] [blame]	728	.. function:: purge()
				729
				730	Clear the regular expression cache.
				731
				732
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	733	.. exception:: error
				734
				735	Exception raised when a string passed to one of the functions here is not a
				736	valid regular expression (for example, it might contain unmatched parentheses)
				737	or when some other error occurs during compilation or matching. It is never an
				738	error if a string contains no match for a pattern.
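
The two cases can be sketched as follows; the invalid pattern is an illustrative example of unmatched parentheses.

```python
import re

# An invalid pattern raises re.error at compile time.
try:
    re.compile("(unclosed")
    compile_failed = False
except re.error:
    compile_failed = True

# A valid pattern that simply finds nothing returns None and raises no error.
no_match = re.search("x", "abc")

print(compile_failed)  # True
print(no_match)        # None
```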
				739
				740
				741	.. _re-objects:
				742
				743	Regular Expression Objects
				744	--------------------------
				745
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	746	Compiled regular expression objects support the following methods and
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	747	attributes:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	748
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	749	.. method:: regex.search(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	750
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	751	Scan through string looking for a location where this regular expression
				752	produces a match, and return a corresponding :ref:`match object
				753	<match-objects>`. Return ``None`` if no position in the string matches the
				754	pattern; note that this is different from finding a zero-length match at some
				755	point in the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	756
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	757	The optional second parameter pos gives an index in the string where the
				758	search is to start; it defaults to ``0``. This is not completely equivalent to
				759	slicing the string; the ``'^'`` pattern character matches at the real beginning
				760	of the string and at positions just after a newline, but not necessarily at the
				761	index where the search is to start.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	762
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	763	The optional parameter endpos limits how far the string will be searched; it
				764	will be as if the string is endpos characters long, so only the characters
				765	from pos to ``endpos - 1`` will be searched for a match. If endpos is less
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	766	than pos, no match will be found; otherwise, if rx is a compiled regular
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	767	expression object, ``rx.search(string, 0, 50)`` is equivalent to
				768	``rx.search(string[:50], 0)``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	769
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	770	>>> pattern = re.compile("d")
				771	>>> pattern.search("dog") # Match at index 0
Ezio Melotti	7571941	2013-11-23 20:27:27 +0200	[diff] [blame]	772	<_sre.SRE_Match object; span=(0, 1), match='d'>
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	773	>>> pattern.search("dog", 1) # No match; search doesn't include the "d"
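
The difference between pos and slicing, as far as ``'^'`` is concerned, can be sketched like this (the pattern and input are invented for the example):

```python
import re

rx = re.compile(r"^b")

# With pos=1 the search starts at "b", but '^' still refers to the real
# beginning of the string, so nothing matches.
with_pos = rx.search("ab", 1)

# Slicing creates a new string whose real beginning is "b".
with_slice = rx.search("ab"[1:])

print(with_pos)            # None
print(with_slice.group())  # 'b'
```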
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	774
				775
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	776	.. method:: regex.match(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	777
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	778	If zero or more characters at the beginning of string match this regular
				779	expression, return a corresponding :ref:`match object <match-objects>`.
				780	Return ``None`` if the string does not match the pattern; note that this is
				781	different from a zero-length match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	782
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	783	The optional pos and endpos parameters have the same meaning as for the
				784	:meth:`~regex.search` method.
Benjamin Peterson	d7c3ed5	2010-06-27 22:32:30 +0000	[diff] [blame]	785
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	786	>>> pattern = re.compile("o")
				787	>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
				788	>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
Ezio Melotti	7571941	2013-11-23 20:27:27 +0200	[diff] [blame]	789	<_sre.SRE_Match object; span=(1, 2), match='o'>
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	790
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	791	If you want to locate a match anywhere in string, use
				792	:meth:`~regex.search` instead (see also :ref:`search-vs-match`).
				793
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	794
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	795	.. method:: regex.fullmatch(string[, pos[, endpos]])
				796
				797	If the whole string matches this regular expression, return a corresponding
				798	:ref:`match object <match-objects>`. Return ``None`` if the string does not
				799	match the pattern; note that this is different from a zero-length match.
				800
				801	The optional pos and endpos parameters have the same meaning as for the
				802	:meth:`~regex.search` method.
				803
				804	>>> pattern = re.compile("o[gh]")
				805	>>> pattern.fullmatch("dog") # No match as "o" is not at the start of "dog".
				806	>>> pattern.fullmatch("ogre") # No match as not the full string matches.
				807	>>> pattern.fullmatch("doggie", 1, 3) # Matches within given limits.
Serhiy Storchaka	475546f	2013-12-02 20:23:19 +0200	[diff] [blame]	808	<_sre.SRE_Match object; span=(1, 3), match='og'>
Serhiy Storchaka	32eddc1	2013-11-23 23:20:30 +0200	[diff] [blame]	809
				810	.. versionadded:: 3.4
				811
				812
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	813	.. method:: regex.split(string, maxsplit=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	814
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	815	Identical to the :func:`split` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	816
				817
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	818	.. method:: regex.findall(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	819
Similar to the :func:`findall` function, using the compiled pattern, but
also accepts optional pos and endpos parameters that limit the search
region in the same way as for :meth:`match`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	823
				824
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	825	.. method:: regex.finditer(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	826
Similar to the :func:`finditer` function, using the compiled pattern, but
also accepts optional pos and endpos parameters that limit the search
region in the same way as for :meth:`match`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	830
				831
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	832	.. method:: regex.sub(repl, string, count=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	833
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	834	Identical to the :func:`sub` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	835
				836
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	837	.. method:: regex.subn(repl, string, count=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	838
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	839	Identical to the :func:`subn` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	840
				841
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	842	.. attribute:: regex.flags
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	843
Georg Brandl	3a19e54	2012-03-17 17:29:27 +0100	[diff] [blame]	844	The regex matching flags. This is a combination of the flags given to
				845	:func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
				846	flags such as :data:`UNICODE` if the pattern is a Unicode string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	847
				848
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	849	.. attribute:: regex.groups
Georg Brandl	af265f4	2008-12-07 15:06:20 +0000	[diff] [blame]	850
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	851	The number of capturing groups in the pattern.
Georg Brandl	af265f4	2008-12-07 15:06:20 +0000	[diff] [blame]	852
				853
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	854	.. attribute:: regex.groupindex
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	855
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	856	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
				857	numbers. The dictionary is empty if no symbolic groups were used in the
				858	pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	859
				860
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	861	.. attribute:: regex.pattern
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	862
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	863	The pattern string from which the RE object was compiled.
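
The attributes above can be inspected on any compiled pattern. A minimal sketch, with a pattern invented for the example:

```python
import re

source = r"(?P<key>\w+)=(\d+)"
rx = re.compile(source, re.IGNORECASE)

group_count = rx.groups               # number of capturing groups
name_to_index = dict(rx.groupindex)   # copied into a plain dict for display
has_ignorecase = bool(rx.flags & re.IGNORECASE)  # flag passed to compile()
has_unicode = bool(rx.flags & re.UNICODE)        # implicit for str patterns

print(group_count)           # 2
print(name_to_index)         # {'key': 1}
print(has_ignorecase)        # True
print(has_unicode)           # True
print(rx.pattern == source)  # True
```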
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	864
				865
				866	.. _match-objects:
				867
				868	Match Objects
				869	-------------
				870
Ezio Melotti	b87f82f	2012-11-04 06:59:22 +0200	[diff] [blame]	871	Match objects always have a boolean value of ``True``.
				872	Since :meth:`~regex.match` and :meth:`~regex.search` return ``None``
				873	when there is no match, you can test whether there was a match with a simple
				874	``if`` statement::
				875
match = re.search(pattern, string)
if match:
    process(match)
				879
				880	Match objects support the following methods and attributes:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	881
				882
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	883	.. method:: match.expand(template)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	884
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	885	Return the string obtained by doing backslash substitution on the template
				886	string template, as done by the :meth:`~regex.sub` method.
				887	Escapes such as ``\n`` are converted to the appropriate characters,
				888	and numeric backreferences (``\1``, ``\2``) and named backreferences
				889	(``\g<1>``, ``\g<name>``) are replaced by the contents of the
				890	corresponding group.
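
A short sketch mixing a named and a numeric backreference in one template; the pattern, input, and template are illustrative.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# \g<last> refers to the named group, \1 to the first group by number.
expanded = m.expand(r"\g<last>, \1")

print(expanded)  # 'Newton, Isaac'
```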
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	891
				892
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	893	.. method:: match.group([group1, ...])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	894
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	895	Returns one or more subgroups of the match. If there is a single argument, the
				896	result is a single string; if there are multiple arguments, the result is a
				897	tuple with one item per argument. Without arguments, group1 defaults to zero
				898	(the whole match is returned). If a groupN argument is zero, the corresponding
				899	return value is the entire matching string; if it is in the inclusive range
				900	[1..99], it is the string matching the corresponding parenthesized group. If a
				901	group number is negative or larger than the number of groups defined in the
				902	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
				903	part of the pattern that did not match, the corresponding result is ``None``.
				904	If a group is contained in a part of the pattern that matched multiple times,
				905	the last match is returned.
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	906
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	907	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
				908	>>> m.group(0) # The entire match
				909	'Isaac Newton'
				910	>>> m.group(1) # The first parenthesized subgroup.
				911	'Isaac'
				912	>>> m.group(2) # The second parenthesized subgroup.
				913	'Newton'
				914	>>> m.group(1, 2) # Multiple arguments give us a tuple.
				915	('Isaac', 'Newton')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	916
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	917	If the regular expression uses the ``(?P<name>...)`` syntax, the groupN
				918	arguments may also be strings identifying groups by their group name. If a
				919	string argument is not used as a group name in the pattern, an :exc:`IndexError`
				920	exception is raised.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	921
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	922	A moderately complicated example:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	923
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	924	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				925	>>> m.group('first_name')
				926	'Malcolm'
				927	>>> m.group('last_name')
				928	'Reynolds'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	929
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	930	Named groups can also be referred to by their index:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	931
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	932	>>> m.group(1)
				933	'Malcolm'
				934	>>> m.group(2)
				935	'Reynolds'
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	936
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	937	If a group matches multiple times, only the last match is accessible:
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	938
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	939	>>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times.
				940	>>> m.group(1) # Returns only the last match.
				941	'c3'
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	942
Brian Curtin	48f16f9	2010-04-08 13:55:29 +0000	[diff] [blame]	943
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	944	.. method:: match.groups(default=None)
Brian Curtin	48f16f9	2010-04-08 13:55:29 +0000	[diff] [blame]	945
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	946	Return a tuple containing all the subgroups of the match, from 1 up to however
				947	many groups are in the pattern. The default argument is used for groups that
				948	did not participate in the match; it defaults to ``None``.
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	949
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	950	For example:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	951
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	952	>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
				953	>>> m.groups()
				954	('24', '1632')
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	955
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	956	If we make the decimal place and everything after it optional, not all groups
				957	might participate in the match. These groups will default to ``None`` unless
				958	the default argument is given:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	959
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	960	>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
				961	>>> m.groups() # Second group defaults to None.
				962	('24', None)
				963	>>> m.groups('0') # Now, the second group defaults to '0'.
				964	('24', '0')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	965
				966
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	967	.. method:: match.groupdict(default=None)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	968
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	969	Return a dictionary containing all the named subgroups of the match, keyed by
				970	the subgroup name. The default argument is used for groups that did not
				971	participate in the match; it defaults to ``None``. For example:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	972
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	973	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				974	>>> m.groupdict()
				975	{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
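The default argument behaves for named groups as it does for :meth:`~match.groups`. The sketch below uses a hypothetical pattern with an optional named group (``whole`` and ``frac`` are illustrative names, not part of the module):

```python
import re

# The fractional part is optional, so the "frac" group may not participate.
m = re.match(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?", "24")

without_default = m.groupdict()          # non-participating group maps to None
with_default = m.groupdict(default="0")  # non-participating group maps to '0'
print(without_default, with_default)
```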
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	976
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	977
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	978	.. method:: match.start([group])
				979	match.end([group])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	980
Return the indices of the start and end of the substring matched by *group*;
*group* defaults to zero (meaning the whole matched substring). Return ``-1`` if
*group* exists but did not contribute to the match. For a match object *m*, and
a group *g* that did contribute to the match, the substring matched by group *g*
(equivalent to ``m.group(g)``) is ::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	986
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	987	m.string[m.start(g):m.end(g)]
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	988
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	989	Note that ``m.start(group)`` will equal ``m.end(group)`` if group matched a
				990	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
				991	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
				992	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	993
An example that will remove *remove_this* from email addresses:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	995
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	996	>>> email = "tony@tiremove_thisger.net"
				997	>>> m = re.search("remove_this", email)
				998	>>> email[:m.start()] + email[m.end():]
				999	'tony@tiger.net'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1000
				1001
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1002	.. method:: match.span([group])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1003
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1004	For a match m, return the 2-tuple ``(m.start(group), m.end(group))``. Note
				1005	that if group did not contribute to the match, this is ``(-1, -1)``.
				1006	group defaults to zero, the entire match.
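As an illustration of the three cases described above, the following sketch uses a made-up pattern whose second group is optional and does not take part in the match:

```python
import re

m = re.search(r"(b)(x)?", "abc")

whole = m.span()     # group 0, the entire match
first = m.span(1)    # a group that contributed to the match
second = m.span(2)   # a group that exists but did not contribute
print(whole, first, second)
```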
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1007
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1008
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1009	.. attribute:: match.pos
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1010
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1011	The value of pos which was passed to the :meth:`~regex.search` or
Georg Brandl	69c7a69	2012-03-14 08:02:43 +0100	[diff] [blame]	1012	:meth:`~regex.match` method of a :ref:`regex object <re-objects>`. This is
				1013	the index into the string at which the RE engine started looking for a match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1014
				1015
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1016	.. attribute:: match.endpos
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1017
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1018	The value of endpos which was passed to the :meth:`~regex.search` or
Georg Brandl	69c7a69	2012-03-14 08:02:43 +0100	[diff] [blame]	1019	:meth:`~regex.match` method of a :ref:`regex object <re-objects>`. This is
				1020	the index into the string beyond which the RE engine will not go.
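A short sketch of how ``pos`` and ``endpos`` reflect the search window. The pattern and sample text are invented for illustration; the positional arguments are the *pos* and *endpos* parameters of the compiled pattern's ``search`` method.

```python
import re

digits = re.compile(r"\d+")
text = "ab 123 cd 456"

# Search only within text[7:12]; the final '6' lies beyond endpos.
m = digits.search(text, 7, 12)
window = (m.pos, m.endpos)
found = m.group()

# Without explicit bounds, the window covers the whole string.
full = digits.search(text)
full_window = (full.pos, full.endpos)
print(window, found, full_window)
```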
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1021
				1022
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1023	.. attribute:: match.lastindex
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1024
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1025	The integer index of the last matched capturing group, or ``None`` if no group
				1026	was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
				1027	``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
				1028	the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
				1029	string.
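The cases listed above can be checked directly. This sketch simply evaluates ``lastindex`` for each of the four expressions, plus one pattern with no groups at all:

```python
import re

patterns = [r"(a)b", r"((a)(b))", r"((ab))", r"(a)(b)"]

# Map each pattern to the lastindex of its match against 'ab'.
last_indexes = {p: re.match(p, "ab").lastindex for p in patterns}

# A pattern without capturing groups reports None.
no_groups = re.match(r"ab", "ab").lastindex
print(last_indexes, no_groups)
```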
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1030
				1031
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1032	.. attribute:: match.lastgroup
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1033
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1034	The name of the last matched capturing group, or ``None`` if the group didn't
				1035	have a name, or if no group was matched at all.
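For example, with a pattern made of alternatives, ``lastgroup`` tells which alternative matched. The group names ``word`` and ``number`` below are hypothetical, and the third alternative is deliberately left unnamed:

```python
import re

token = re.compile(r"(?P<word>[A-Za-z]+)|(?P<number>\d+)|(\s+)")

# One entry per match: the name of the alternative that matched,
# or None for the unnamed whitespace group.
kinds = [m.lastgroup for m in token.finditer("abc 42")]
print(kinds)
```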
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1036
				1037
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1038	.. attribute:: match.re
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1039
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1040	The regular expression object whose :meth:`~regex.match` or
				1041	:meth:`~regex.search` method produced this match instance.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1042
				1043
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1044	.. attribute:: match.string
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1045
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1046	The string passed to :meth:`~regex.match` or :meth:`~regex.search`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1047
				1048
Raymond Hettinger	1fa7682	2010-12-06 23:31:36 +0000	[diff] [blame]	1049	.. _re-examples:
				1050
				1051	Regular Expression Examples
				1052	---------------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1053
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1054
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	1055	Checking for a Pair
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1056	^^^^^^^^^^^^^^^^^^^
				1057
				1058	In this example, we'll use the following helper function to display match
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1059	objects a little more gracefully:
				1060
				1061	.. testcode::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1062
def displaymatch(match):
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())
				1067
				1068	Suppose you are writing a poker program where a player's hand is represented as
				1069	a 5-character string with each character representing a card, "a" for ace, "k"
Ezio Melotti	e5b2ac8	2011-12-17 01:17:17 +0200	[diff] [blame]	1070	for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1071	representing the card with that value.
				1072
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1073	To see if a given string is a valid hand, one could do the following:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1074
Ezio Melotti	e5b2ac8	2011-12-17 01:17:17 +0200	[diff] [blame]	1075	>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
				1076	>>> displaymatch(valid.match("akt5q")) # Valid.
				1077	"<Match: 'akt5q', groups=()>"
				1078	>>> displaymatch(valid.match("akt5e")) # Invalid.
				1079	>>> displaymatch(valid.match("akt")) # Invalid.
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1080	>>> displaymatch(valid.match("727ak")) # Valid.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1081	"<Match: '727ak', groups=()>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1082
That last hand, ``"727ak"``, contained a pair, that is, two cards of the same value.
To match this with a regular expression, one could use backreferences like this:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1085
>>> pair = re.compile(r".*(.).*\1")
				1087	>>> displaymatch(pair.match("717ak")) # Pair of 7s.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1088	"<Match: '717', groups=('7',)>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1089	>>> displaymatch(pair.match("718ak")) # No pairs.
				1090	>>> displaymatch(pair.match("354aa")) # Pair of aces.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1091	"<Match: '354aa', groups=('a',)>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1092
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	1093	To find out what card the pair consists of, one could use the
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1094	:meth:`~match.group` method of the match object in the following manner:
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1095
				1096	.. doctest::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1097
				1098	>>> pair.match("717ak").group(1)
				1099	'7'
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1100
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1101	# Error because re.match() returns None, which doesn't have a group() method:
				1102	>>> pair.match("718ak").group(1)
				1103	Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    re.match(r".*(.).*\1", "718ak").group(1)
				1106	AttributeError: 'NoneType' object has no attribute 'group'
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1107
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1108	>>> pair.match("354aa").group(1)
				1109	'a'
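To avoid the :exc:`AttributeError` shown above, the result can be tested before calling :meth:`~match.group`. The helper name ``pair_card`` is an assumption made for this sketch:

```python
import re

pair = re.compile(r".*(.).*\1")

def pair_card(hand):
    """Return the card that forms a pair in hand, or None if there is no pair."""
    m = pair.match(hand)
    # match() returns None when nothing matches, so check before using it.
    return m.group(1) if m is not None else None

print(pair_card("717ak"), pair_card("718ak"), pair_card("354aa"))
```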
				1110
				1111
				1112	Simulating scanf()
				1113	^^^^^^^^^^^^^^^^^^
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1114
				1115	.. index:: single: scanf()
				1116
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1117	Python does not currently have an equivalent to :c:func:`scanf`. Regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1118	expressions are generally more powerful, though also more verbose, than
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1119	:c:func:`scanf` format strings. The table below offers some more-or-less
				1120	equivalent mappings between :c:func:`scanf` format tokens and regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1121	expressions.
				1122
+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+
				1144
				1145	To extract the filename and numbers from a string like ::
				1146
				1147	/usr/sbin/sendmail - 0 errors, 4 warnings
				1148
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1149	you would use a :c:func:`scanf` format like ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1150
				1151	%s - %d errors, %d warnings
				1152
				1153	The equivalent regular expression would be ::
				1154
				1155	(\S+) - (\d+) errors, (\d+) warnings
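Applied from Python, the expression above might be used as follows. This is a sketch only; the variable names are illustrative, and the numeric fields are converted with :func:`int` because groups are always returned as strings.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

filename = m.group(1)
errors = int(m.group(2))
warnings_count = int(m.group(3))
print(filename, errors, warnings_count)
```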
				1156
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1157
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1158	.. _search-vs-match:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1159
				1160	search() vs. match()
				1161	^^^^^^^^^^^^^^^^^^^^
				1162
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1163	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1164
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1165	Python offers two different primitive operations based on regular expressions:
				1166	:func:`re.match` checks for a match only at the beginning of the string, while
				1167	:func:`re.search` checks for a match anywhere in the string (this is what Perl
				1168	does by default).
				1169
				1170	For example::
				1171
				1172	>>> re.match("c", "abcdef") # No match
				1173	>>> re.search("c", "abcdef") # Match
Ezio Melotti	7571941	2013-11-23 20:27:27 +0200	[diff] [blame]	1174	<_sre.SRE_Match object; span=(2, 3), match='c'>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1175
Regular expressions beginning with ``'^'`` can be used with :func:`search` to
restrict the match to the beginning of the string::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1178
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1179	>>> re.match("c", "abcdef") # No match
				1180	>>> re.search("^c", "abcdef") # No match
				1181	>>> re.search("^a", "abcdef") # Match
Ezio Melotti	7571941	2013-11-23 20:27:27 +0200	[diff] [blame]	1182	<_sre.SRE_Match object; span=(0, 1), match='a'>
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1183
				1184	Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
				1185	beginning of the string, whereas using :func:`search` with a regular expression
				1186	beginning with ``'^'`` will match at the beginning of each line.
				1187
				1188	>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
				1189	>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
Ezio Melotti	7571941	2013-11-23 20:27:27 +0200	[diff] [blame]	1190	<_sre.SRE_Match object; span=(4, 5), match='X'>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1191
				1192
				1193	Making a Phonebook
				1194	^^^^^^^^^^^^^^^^^^
				1195
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1196	:func:`split` splits a string into a list delimited by the passed pattern. The
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1197	method is invaluable for converting textual data into data structures that can be
				1198	easily read and modified by Python as demonstrated in the following example that
				1199	creates a phonebook.
				1200
First, here is the input. Normally it may come from a file; here we are using
triple-quoted string syntax:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1203
Georg Brandl	557a3ec	2012-03-17 17:26:27 +0100	[diff] [blame]	1204	>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1205	...
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1206	... Ronald Heathmore: 892.345.3428 436 Finley Avenue
				1207	... Frank Burger: 925.541.7625 662 South Dogwood Way
				1208	...
				1209	...
				1210	... Heather Albrecht: 548.326.4584 919 Park Place"""
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1211
				1212	The entries are separated by one or more newlines. Now we convert the string
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1213	into a list with each nonempty line having its own entry:
				1214
				1215	.. doctest::
				1216	:options: +NORMALIZE_WHITESPACE
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1217
Georg Brandl	557a3ec	2012-03-17 17:26:27 +0100	[diff] [blame]	1218	>>> entries = re.split("\n+", text)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1219	>>> entries
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1220	['Ross McFluff: 834.345.1254 155 Elm Street',
				1221	'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
				1222	'Frank Burger: 925.541.7625 662 South Dogwood Way',
				1223	'Heather Albrecht: 548.326.4584 919 Park Place']
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1224
				1225	Finally, split each entry into a list with first name, last name, telephone
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1226	number, and address. We use the ``maxsplit`` parameter of :func:`split`
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1227	because the address has spaces, our splitting pattern, in it:
				1228
				1229	.. doctest::
				1230	:options: +NORMALIZE_WHITESPACE
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1231
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1232	>>> [re.split(":? ", entry, 3) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1233	[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
				1234	['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
				1235	['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
				1236	['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
				1237
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1238	The ``:?`` pattern matches the colon after the last name, so that it does not
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1239	occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1240	house number from the street name:
				1241
				1242	.. doctest::
				1243	:options: +NORMALIZE_WHITESPACE
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1244
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1245	>>> [re.split(":? ", entry, 4) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1246	[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
				1247	['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
				1248	['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
				1249	['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
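The two splitting steps can be combined to build a lookup structure. The sketch below assumes a dictionary keyed by ``(first, last)`` name tuples, which is one possible layout rather than anything prescribed by the module:

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue
Frank Burger: 925.541.7625 662 South Dogwood Way


Heather Albrecht: 548.326.4584 919 Park Place"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(r":? ", entry, maxsplit=3)
    phonebook[(first, last)] = (phone, address)

print(phonebook[("Frank", "Burger")])
```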
				1250
				1251
				1252	Text Munging
				1253	^^^^^^^^^^^^
				1254
				1255	:func:`sub` replaces every occurrence of a pattern with a string or the
				1256	result of a function. This example demonstrates using :func:`sub` with
				1257	a function to "munge" text, or randomize the order of all the characters
				1258	in each word of a sentence except for the first and last characters::
				1259
>>> import random
>>> def repl(m):
...     inner_word = list(m.group(2))
...     random.shuffle(inner_word)
...     return m.group(1) + "".join(inner_word) + m.group(3)
				1264	>>> text = "Professor Abdolmalek, please report your absences promptly."
Georg Brandl	db4e939	2010-07-12 09:06:13 +0000	[diff] [blame]	1265	>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1266	'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
Georg Brandl	db4e939	2010-07-12 09:06:13 +0000	[diff] [blame]	1267	>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1268	'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
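Because the shuffle is random, the output differs on every call. Where repeatable results are wanted (in tests, for instance), one option is to pass a seeded :class:`random.Random` instance. The wrapper ``munge`` and its ``seed`` parameter are assumptions introduced for this sketch:

```python
import random
import re

def munge(text, seed=None):
    """Shuffle the inner letters of each word; a fixed seed gives repeatable output."""
    rng = random.Random(seed)

    def repl(m):
        inner_word = list(m.group(2))
        rng.shuffle(inner_word)
        return m.group(1) + "".join(inner_word) + m.group(3)

    return re.sub(r"(\w)(\w+)(\w)", repl, text)

text = "Professor Abdolmalek, please report your absences promptly."
munged = munge(text, seed=0)
print(munged)
```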
				1269
				1270
				1271	Finding all Adverbs
				1272	^^^^^^^^^^^^^^^^^^^
				1273
:func:`findall` matches all occurrences of a pattern, not just the first
one as :func:`search` does. For example, a writer who wanted to find all of
the adverbs in some text might use :func:`findall` in the following manner:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1278
				1279	>>> text = "He was carefully disguised but captured quickly by police."
				1280	>>> re.findall(r"\w+ly", text)
				1281	['carefully', 'quickly']
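Note that ``\w+ly`` also matches inside longer words. A possible refinement, shown here as a sketch with an extended sample sentence of our own, is to anchor the pattern with ``\b`` word boundaries:

```python
import re

text = "He was carefully disguised but captured quickly by police, flying home."

# Without boundaries, the 'fly' in 'flying' is reported as well.
loose = re.findall(r"\w+ly", text)

# With \b on both sides, only whole words ending in 'ly' are returned.
strict = re.findall(r"\b\w+ly\b", text)
print(loose, strict)
```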
				1282
				1283
				1284	Finding all Adverbs and their Positions
				1285	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				1286
If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :ref:`match objects
<match-objects>` instead of strings. Continuing with the previous example, a
writer who wanted to find all of the adverbs and their positions in some text
would use :func:`finditer` in the following manner:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1292
				1293	>>> text = "He was carefully disguised but captured quickly by police."
>>> for m in re.finditer(r"\w+ly", text):
...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1296	07-16: carefully
				1297	40-47: quickly
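When the positions are needed as data rather than printed text, the same loop can collect them. This sketch stores each adverb with its :meth:`~match.span`; the name ``positions`` is illustrative:

```python
import re

text = "He was carefully disguised but captured quickly by police."

# Pair each matched adverb with its (start, end) indices.
positions = [(m.group(0), m.span()) for m in re.finditer(r"\w+ly", text)]
print(positions)
```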
				1298
				1299
				1300	Raw String Notation
				1301	^^^^^^^^^^^^^^^^^^^
				1302
				1303	Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
				1304	every backslash (``'\'``) in a regular expression would have to be prefixed with
				1305	another one to escape it. For example, the two following lines of code are
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1306	functionally identical:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1307
				1308	>>> re.match(r"\W(.)\1\W", " ff ")
Ezio Melotti	7571941	2013-11-23 20:27:27 +0200	[diff] [blame]	1309	<_sre.SRE_Match object; span=(0, 4), match=' ff '>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1310	>>> re.match("\\W(.)\\1\\W", " ff ")
Ezio Melotti	7571941	2013-11-23 20:27:27 +0200	[diff] [blame]	1311	<_sre.SRE_Match object; span=(0, 4), match=' ff '>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1312
				1313	When one wants to match a literal backslash, it must be escaped in the regular
				1314	expression. With raw string notation, this means ``r"\\"``. Without raw string
				1315	notation, one must use ``"\\\\"``, making the following lines of code
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1316	functionally identical:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1317
				1318	>>> re.match(r"\\", r"\\")
Ezio Melotti	7571941	2013-11-23 20:27:27 +0200	[diff] [blame]	1319	<_sre.SRE_Match object; span=(0, 1), match='\\'>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1320	>>> re.match("\\\\", r"\\")
Ezio Melotti	7571941	2013-11-23 20:27:27 +0200	[diff] [blame]	1321	<_sre.SRE_Match object; span=(0, 1), match='\\'>
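The equivalence can also be seen by comparing the string objects themselves, since both notations produce the same characters. A small sketch:

```python
import re

# Both spellings yield the identical pattern string.
raw = r"\W(.)\1\W"
escaped = "\\W(.)\\1\\W"
same_pattern = raw == escaped

# A pattern matching one literal backslash is two characters long either way.
backslash_raw = r"\\"
backslash_escaped = "\\\\"
same_backslash = backslash_raw == backslash_escaped
print(same_pattern, same_backslash, len(backslash_raw))
```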
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1322
				1323
				1324	Writing a Tokenizer
				1325	^^^^^^^^^^^^^^^^^^^
				1326
				1327	A `tokenizer or scanner <http://en.wikipedia.org/wiki/Lexical_analysis>`_
				1328	analyzes a string to categorize groups of characters. This is a useful first
				1329	step in writing a compiler or interpreter.
				1330
				1331	The text categories are specified with regular expressions. The technique is
				1332	to combine those into a single master regular expression and to loop over
				1333	successive matches::
				1334
import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(code):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',  r'\d+(\.\d*)?'), # Integer or decimal number
        ('ASSIGN',  r':='),          # Assignment operator
        ('END',     r';'),           # Statement terminator
        ('ID',      r'[A-Za-z]+'),   # Identifiers
        ('OP',      r'[+\-*/]'),     # Arithmetic operators
        ('NEWLINE', r'\n'),          # Line endings
        ('SKIP',    r'[ \t]+'),      # Skip over spaces and tabs
        ('MISMATCH',r'.'),           # Any other character
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group(kind)
        if kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
        elif kind == 'SKIP':
            pass
        elif kind == 'MISMATCH':
            raise RuntimeError('%r unexpected on line %d' % (value, line_num))
        else:
            if kind == 'ID' and value in keywords:
                kind = value
            column = mo.start() - line_start
            yield Token(kind, value, line_num, column)

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)
				1380
				1381	The tokenizer produces the following output::
Raymond Hettinger	9c47d77	2011-05-13 01:03:50 -0700	[diff] [blame]	1382
Raymond Hettinger	c566431	2014-08-03 23:38:54 -0700	[diff] [blame]	1383	Token(typ='IF', value='IF', line=2, column=4)
				1384	Token(typ='ID', value='quantity', line=2, column=7)
				1385	Token(typ='THEN', value='THEN', line=2, column=16)
				1386	Token(typ='ID', value='total', line=3, column=8)
				1387	Token(typ='ASSIGN', value=':=', line=3, column=14)
				1388	Token(typ='ID', value='total', line=3, column=17)
				1389	Token(typ='OP', value='+', line=3, column=23)
				1390	Token(typ='ID', value='price', line=3, column=25)
				1391	Token(typ='OP', value='*', line=3, column=31)
				1392	Token(typ='ID', value='quantity', line=3, column=33)
				1393	Token(typ='END', value=';', line=3, column=41)
				1394	Token(typ='ID', value='tax', line=4, column=8)
				1395	Token(typ='ASSIGN', value=':=', line=4, column=12)
				1396	Token(typ='ID', value='price', line=4, column=15)
				1397	Token(typ='OP', value='*', line=4, column=21)
				1398	Token(typ='NUMBER', value='0.05', line=4, column=23)
				1399	Token(typ='END', value=';', line=4, column=27)
				1400	Token(typ='ENDIF', value='ENDIF', line=5, column=4)
				1401	Token(typ='END', value=';', line=5, column=9)