Blame - Doc/library/re.rst - platform/external/python/cpython3

Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1	:mod:`re` --- Regular expression operations
				2	===========================================
				3
				4	.. module:: re
				5	:synopsis: Regular expression operations.
				6	.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
				7	.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
				8
				9
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	10	This module provides regular expression matching operations similar to
Georg Brandl	ed2a1db	2009-06-08 07:48:27 +0000	[diff] [blame]	11	those found in Perl.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	12
Both patterns and strings to be searched can be Unicode strings as well as
8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a byte pattern or
vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.
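
The type rule above can be checked with a short sketch. The sample strings and variable names are illustrative assumptions, not part of the module:

```python
import re

# str pattern with a str subject, bytes pattern with a bytes subject: both fine.
str_match = re.search(r'\d+', 'order 66')
bytes_match = re.search(rb'\d+', b'order 66')

# Mixing a bytes pattern with a str subject raises TypeError.
try:
    re.search(rb'\d+', 'order 66')
    mixed_ok = True
except TypeError:
    mixed_ok = False

# In a substitution, the replacement must have the same type as well.
replaced = re.sub(rb'\d+', b'N', b'order 66')
```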
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	18
				19	Regular expressions use the backslash character (``'\'``) to indicate
				20	special forms or to allow special characters to be used without invoking
				21	their special meaning. This collides with Python's usage of the same
				22	character for the same purpose in string literals; for example, to match
				23	a literal backslash, one might have to write ``'\\\\'`` as the pattern
				24	string, because the regular expression must be ``\\``, and each
				25	backslash must be expressed as ``\\`` inside a regular Python string
				26	literal.
				27
				28	The solution is to use Python's raw string notation for regular expression
				29	patterns; backslashes are not handled in any special way in a string literal
				30	prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
				31	``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	32	newline. Usually patterns will be expressed in Python code using this raw
				33	string notation.
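
As a small illustration of the difference (the subject string is an arbitrary example), the following sketch compares the two spellings of the same pattern:

```python
import re

# Both literals denote the same two-character pattern: backslash, backslash.
plain = '\\\\'
raw = r'\\'
same_pattern = (plain == raw)

# That pattern matches one literal backslash in the subject.
subject = 'C:\\temp'               # the 7-character string C:\temp
backslash_pos = re.search(raw, subject).start()

# r"\n" holds two characters; "\n" holds a single newline.
lengths = (len(r"\n"), len("\n"))
```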
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	34
It is important to note that most regular expression operations are available
both as module-level functions and as methods on
:ref:`compiled regular expressions <re-objects>`. The functions are shortcuts
that don't require you to compile a regex object first, but they lack some
fine-tuning parameters.
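
One such fine-tuning parameter is the optional *pos* argument of the compiled object's search method. A minimal sketch, with an arbitrary sample string:

```python
import re

text = 'id=17; id=42'

# Module-level shortcut: scanning always starts at the beginning of the string.
first = re.search(r'\d+', text).group()

# Compiled object: the same search, but starting at index 6.
number = re.compile(r'\d+')
second = number.search(text, 6).group()
```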
				40
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	41	.. seealso::
				42
				43	Mastering Regular Expressions
				44	Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	45	second edition of the book no longer covers Python at all, but the first
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	46	edition covered writing good regular expression patterns in great detail.
				47
				48
				49	.. _re-syntax:
				50
				51	Regular Expression Syntax
				52	-------------------------
				53
				54	A regular expression (or RE) specifies a set of strings that matches it; the
				55	functions in this module let you check if a particular string matches a given
				56	regular expression (or if a given regular expression matches a particular
				57	string, which comes down to the same thing).
				58
				59	Regular expressions can be concatenated to form new regular expressions; if A
				60	and B are both regular expressions, then AB is also a regular expression.
				61	In general, if a string p matches A and another string q matches B, the
				62	string pq will match AB. This holds unless A or B contain low precedence
				63	operations; boundary conditions between A and B; or have numbered group
				64	references. Thus, complex expressions can easily be constructed from simpler
				65	primitive expressions like the ones described here. For details of the theory
				66	and implementation of regular expressions, consult the Friedl book referenced
				67	above, or almost any textbook about compiler construction.
				68
				69	A brief explanation of the format of regular expressions follows. For further
Christian Heimes	2202f87	2008-02-06 14:31:34 +0000	[diff] [blame]	70	information and a gentler presentation, consult the :ref:`regex-howto`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	71
				72	Regular expressions can contain both special and ordinary characters. Most
				73	ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
				74	expressions; they simply match themselves. You can concatenate ordinary
				75	characters, so ``last`` matches the string ``'last'``. (In the rest of this
				76	section, we'll write RE's in ``this special style``, usually without quotes, and
				77	strings to be matched ``'in single quotes'``.)
				78
Some characters, like ``'|'`` or ``'('``, are special. Special
				80	characters either stand for classes of ordinary characters, or affect
				81	how the regular expressions around them are interpreted. Regular
				82	expression pattern strings may not contain null bytes, but can specify
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	83	the null byte using a ``\number`` notation such as ``'\x00'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	84
				85
				86	The special characters are:
				87
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	88	``'.'``
				89	(Dot.) In the default mode, this matches any character except a newline. If
				90	the :const:`DOTALL` flag has been specified, this matches any character
				91	including a newline.
				92
				93	``'^'``
				94	(Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
				95	matches immediately after each newline.
				96
				97	``'$'``
				98	Matches the end of the string or just before the newline at the end of the
				99	string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
				100	matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
				101	only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
Christian Heimes	25bb783	2008-01-11 16:17:00 +0000	[diff] [blame]	102	matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
				103	a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
				104	the newline, and one at the end of the string.
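
The ``$`` cases described above can be reproduced with the following sketch:

```python
import re

text = 'foo1\nfoo2\n'

# Default mode: '$' matches only at the very end or before the final newline.
default_match = re.search(r'foo.$', text).group()

# MULTILINE mode: '$' also matches before every newline.
multiline_match = re.search(r'foo.$', text, re.MULTILINE).group()

# A lone '$' finds two empty matches in 'foo\n'.
dollar_positions = [m.start() for m in re.finditer(r'$', 'foo\n')]
```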
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	105
				106	``'*'``
				107	Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
				108	many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
				109	by any number of 'b's.
				110
				111	``'+'``
				112	Causes the resulting RE to match 1 or more repetitions of the preceding RE.
				113	``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
				114	match just 'a'.
				115
				116	``'?'``
				117	Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
				118	``ab?`` will match either 'a' or 'ab'.
				119
				120	``*?``, ``+?``, ``??``
				121	The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
				122	as much text as possible. Sometimes this behaviour isn't desired; if the RE
				123	``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
				124	string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
				125	perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
				126	characters as possible will be matched. Using ``.*?`` in the previous
				127	expression will match only ``'<H1>'``.
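
A short sketch of the greedy and non-greedy forms on the string used above:

```python
import re

html = '<H1>title</H1>'

greedy = re.match(r'<.*>', html).group()    # runs to the last '>'
lazy = re.match(r'<.*?>', html).group()     # stops at the first '>'
```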
				128
				129	``{m}``
				130	Specifies that exactly m copies of the previous RE should be matched; fewer
				131	matches cause the entire RE not to match. For example, ``a{6}`` will match
				132	exactly six ``'a'`` characters, but not five.
				133
				134	``{m,n}``
				135	Causes the resulting RE to match from m to n repetitions of the preceding
				136	RE, attempting to match as many repetitions as possible. For example,
				137	``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting m specifies a
				138	lower bound of zero, and omitting n specifies an infinite upper bound. As an
				139	example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
				140	followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
				141	modifier would be confused with the previously described form.
				142
				143	``{m,n}?``
				144	Causes the resulting RE to match from m to n repetitions of the preceding
				145	RE, attempting to match as few repetitions as possible. This is the
				146	non-greedy version of the previous qualifier. For example, on the
				147	6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
				148	while ``a{3,5}?`` will only match 3 characters.
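
The bounded repetition forms can be compared side by side. This sketch uses ``re.fullmatch`` for the open-ended case so that the whole string has to match:

```python
import re

s = 'aaaaaa'

greedy = re.match(r'a{3,5}', s).group()     # takes as many as allowed
lazy = re.match(r'a{3,5}?', s).group()      # takes as few as allowed

# Omitting the upper bound: at least four 'a' characters, then 'b'.
four_ok = re.fullmatch(r'a{4,}b', 'aaaab') is not None
three_ok = re.fullmatch(r'a{4,}b', 'aaab') is not None
```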
				149
				150	``'\'``
				151	Either escapes special characters (permitting you to match characters like
				152	``'*'``, ``'?'``, and so forth), or signals a special sequence; special
				153	sequences are discussed below.
				154
				155	If you're not using a raw string to express the pattern, remember that Python
				156	also uses the backslash as an escape sequence in string literals; if the escape
				157	sequence isn't recognized by Python's parser, the backslash and subsequent
				158	character are included in the resulting string. However, if Python would
				159	recognize the resulting sequence, the backslash should be repeated twice. This
				160	is complicated and hard to understand, so it's highly recommended that you use
				161	raw strings for all but the simplest expressions.
				162
				163	``[]``
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	164	Used to indicate a set of characters. In a set:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	165
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	166	* Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
				167	``'m'``, or ``'k'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	168
* Ranges of characters can be indicated by giving two characters and separating
them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
it will match a literal ``'-'``.
				175
				176	* Special characters lose their special meaning inside sets. For example,
				177	``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
				178	``'*'``, or ``')'``.
				179
* Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
inside a set, although the characters they match depend on whether
:const:`ASCII` or :const:`LOCALE` mode is in force.
				183
				184	* Characters that are not within a range can be matched by :dfn:`complementing`
				185	the set. If the first character of the set is ``'^'``, all the characters
				186	that are not in the set will be matched. For example, ``[^5]`` will match
				187	any character except ``'5'``, and ``[^^]`` will match any character except
				188	``'^'``. ``^`` has no special meaning if it's not the first character in
				189	the set.
				190
* To match a literal ``']'`` inside a set, precede it with a backslash, or
place it at the beginning of the set. For example, ``[()[\]{}]`` and
``[]()[{}]`` will both match a parenthesis.
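
The set rules listed above can be exercised with ``re.findall``. The subject strings in this sketch are arbitrary examples:

```python
import re

hex_digits = re.findall(r'[0-9A-Fa-f]', 'x1Fg')      # a range-based set
with_hyphen = re.findall(r'[a\-z]', 'a-mz')          # escaped '-' is literal
specials = re.findall(r'[(+*)]', '(a+b)*')           # no special meaning in a set
not_five = re.findall(r'[^5]', '1525')               # complemented set

# Two ways to include a literal ']' in a set.
escaped = re.findall(r'[()[\]{}]', 'f(x)[0]{}')
leading = re.findall(r'[]()[{}]', 'f(x)[0]{}')
```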
Mark Summerfield	9e670c2	2008-05-31 13:05:34 +0000	[diff] [blame]	194
``'|'``
``A|B``, where A and B can be arbitrary REs, creates a regular expression that
will match either A or B. An arbitrary number of REs can be separated by the
``'|'`` in this way. This can be used inside groups (see below) as well. As
the target string is scanned, REs separated by ``'|'`` are tried from left to
right. When one pattern completely matches, that branch is accepted. This means
that once ``A`` matches, ``B`` will not be tested further, even if it would
produce a longer overall match. In other words, the ``'|'`` operator is never
greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
character class, as in ``[|]``.
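
Because alternatives are tried from left to right, their order matters. A brief sketch:

```python
import re

# The first alternative that matches wins, even if a later one is longer.
first_wins = re.match(r'a|ab', 'ab').group()
reordered = re.match(r'ab|a', 'ab').group()

# Two ways to match a literal vertical bar.
escaped_bar = re.findall(r'\|', 'a|b|c')
class_bar = re.findall(r'[|]', 'a|b|c')
```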
				205
				206	``(...)``
				207	Matches whatever regular expression is inside the parentheses, and indicates the
				208	start and end of a group; the contents of a group can be retrieved after a match
				209	has been performed, and can be matched later in the string with the ``\number``
special sequence, described below. To match the literals ``'('`` or ``')'``,
use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
				212
				213	``(?...)``
				214	This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
				215	otherwise). The first character after the ``'?'`` determines what the meaning
				216	and further syntax of the construct is. Extensions usually do not create a new
				217	group; ``(?P<name>...)`` is the only exception to this rule. Following are the
				218	currently supported extensions.
				219
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	220	``(?aiLmsux)``
				221	(One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
				222	``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
Andrew M. Kuchling	1c50e86	2009-06-01 00:11:36 +0000	[diff] [blame]	223	letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	224	:const:`re.I` (ignore case), :const:`re.L` (locale dependent),
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	225	:const:`re.M` (multi-line), :const:`re.S` (dot matches all),
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	226	and :const:`re.X` (verbose), for the entire regular expression. (The
				227	flags are described in :ref:`contents-of-module-re`.) This
				228	is useful if you wish to include the flags as part of the regular
				229	expression, instead of passing a flag argument to the
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	230	:func:`re.compile` function.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	231
				232	Note that the ``(?x)`` flag changes how the expression is parsed. It should be
				233	used first in the expression string, or after one or more whitespace characters.
				234	If there are non-whitespace characters before the flag, the results are
				235	undefined.
				236
				237	``(?:...)``
Georg Brandl	3122ce3	2010-10-29 06:17:38 +0000	[diff] [blame]	238	A non-capturing version of regular parentheses. Matches whatever regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	239	expression is inside the parentheses, but the substring matched by the group
				240	cannot be retrieved after performing a match or referenced later in the
				241	pattern.
				242
				243	``(?P<name>...)``
				244	Similar to regular parentheses, but the substring matched by the group is
accessible via the symbolic group name *name*. Group names must be valid
Python identifiers, and each group name must be defined only once within a
regular expression. A symbolic group is also a numbered group, just as if
the group were not named.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	249
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame^]	250	Named groups can be referenced in three contexts. If the pattern is
				251	``(?P<quote>['"]).*?(?P=quote)`` (i.e. matching a string quoted with either
				252	single or double quotes):
				253
+---------------------------------------+----------------------------------+
| Context of reference to group "quote" | Ways to reference it             |
+=======================================+==================================+
| in the same pattern itself            | * ``(?P=quote)`` (as shown)      |
|                                       | * ``\1``                         |
+---------------------------------------+----------------------------------+
| when processing match object ``m``    | * ``m.group('quote')``           |
|                                       | * ``m.end('quote')`` (etc.)      |
+---------------------------------------+----------------------------------+
| in a string passed to the ``repl``    | * ``\g<quote>``                  |
| argument of ``re.sub()``              | * ``\g<1>``                      |
|                                       | * ``\1``                         |
+---------------------------------------+----------------------------------+
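
The three contexts can be tried out with the ``quote`` pattern. In this sketch the sample text and the replacement strings are illustrative assumptions:

```python
import re

text = '''say "it's fine" now'''
pattern = re.compile(r'''(?P<quote>['"]).*?(?P=quote)''')

# 1. Inside the pattern: (?P=quote) and \1 refer to the same group.
by_name = pattern.search(text).group()
by_number = re.search(r'''(?P<quote>['"]).*?\1''', text).group()

# 2. On the match object.
m = pattern.search(text)
quote_char = m.group('quote')
quote_end = m.end('quote')

# 3. In the repl argument of re.sub(): replace the quoted text by empty quotes.
sub_named = pattern.sub(r'\g<quote>\g<quote>', text)
sub_numbered = pattern.sub(r'\g<1>\1', text)
```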
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	267
				268	``(?P=name)``
A backreference to a named group; it matches whatever text was matched by the
earlier group named *name*.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	271
				272	``(?#...)``
				273	A comment; the contents of the parentheses are simply ignored.
				274
				275	``(?=...)``
				276	Matches if ``...`` matches next, but doesn't consume any of the string. This is
				277	called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
				278	``'Isaac '`` only if it's followed by ``'Asimov'``.
				279
				280	``(?!...)``
				281	Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
				282	For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's not
				283	followed by ``'Asimov'``.
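
Both lookahead forms can be checked against two sample names (the second name is an arbitrary example):

```python
import re

# Positive lookahead: 'Isaac ' is matched only when 'Asimov' follows,
# and 'Asimov' itself is not part of the match.
pos_hit = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov').group()
pos_miss = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')

# Negative lookahead: the reverse condition.
neg_hit = re.search(r'Isaac (?!Asimov)', 'Isaac Newton').group()
neg_miss = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')
```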
				284
				285	``(?<=...)``
				286	Matches if the current position in the string is preceded by a match for ``...``
				287	that ends at the current position. This is called a :dfn:`positive lookbehind
				288	assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
				289	lookbehind will back up 3 characters and check if the contained pattern matches.
				290	The contained pattern must only match strings of some fixed length, meaning that
``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
Ezio Melotti	0a6b541	2012-04-29 07:34:46 +0300	[diff] [blame]	292	patterns which start with positive lookbehind assertions will not match at the
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	293	beginning of the string being searched; you will most likely want to use the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	294	:func:`search` function rather than the :func:`match` function:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	295
				296	>>> import re
				297	>>> m = re.search('(?<=abc)def', 'abcdef')
				298	>>> m.group(0)
				299	'def'
				300
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	301	This example looks for a word following a hyphen:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	302
>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
				304	>>> m.group(0)
				305	'egg'
				306
				307	``(?<!...)``
				308	Matches if the current position in the string is not preceded by a match for
				309	``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
				310	positive lookbehind assertions, the contained pattern must only match strings of
				311	some fixed length. Patterns which start with negative lookbehind assertions may
				312	match at the beginning of the string being searched.
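
A sketch of a negative lookbehind, together with the fixed-length restriction that applies to both lookbehind forms. The sample strings are arbitrary:

```python
import re

# Words that are not immediately preceded by a hyphen.
words = re.findall(r'(?<!-)\b\w+', 'spam-egg ham')

# A negative lookbehind can succeed at the very start of the string.
at_start = re.match(r'(?<!-)\w+', 'spam').group()

# A variable-length pattern inside a lookbehind is rejected at compile time.
try:
    re.compile(r'(?<=a*)b')
    variable_width_ok = True
except re.error:
    variable_width_ok = False
```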
				313
``(?(id/name)yes-pattern|no-pattern)``
Will try to match with ``yes-pattern`` if the group with the given id or
name exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
optional and can be omitted. For example,
``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
not with ``'<user@host.com'`` nor ``'user@host.com>'``.
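
The four cases named above can be verified with ``match()``:

```python
import re

email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

candidates = [
    '<user@host.com>',   # opening bracket present, so '>' is required
    'user@host.com',     # no bracket, so the end of the string is required
    '<user@host.com',    # missing closing bracket
    'user@host.com>',    # stray closing bracket
]
results = [email.match(c) is not None for c in candidates]
```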
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	321
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	322
				323	The special sequences consist of ``'\'`` and a character from the list below.
				324	If the ordinary character is not on the list, then the resulting RE will match
				325	the second character. For example, ``\$`` matches the character ``'$'``.
				326
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	327	``\number``
				328	Matches the contents of the group of the same number. Groups are numbered
				329	starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
				330	but not ``'the end'`` (note the space after the group). This special sequence
can only be used to match one of the first 99 groups. If the first digit of
*number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
a group match, but as the character with octal value *number*. Inside the
				334	``'['`` and ``']'`` of a character class, all numeric escapes are treated as
				335	characters.
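
A sketch of numeric backreferences and octal escapes. It uses ``re.fullmatch`` so that the entire string has to satisfy the pattern; a plain search could still find a shorter repeated piece such as ``'e e'`` inside ``'the end'``:

```python
import re

repeated = re.compile(r'(.+) \1')

same_word = repeated.fullmatch('the the') is not None
same_number = repeated.fullmatch('55 55') is not None
different = repeated.fullmatch('the end') is not None

# Three octal digits denote a character, not a group: \101 is 'A'.
octal_char = re.fullmatch(r'\101', 'A') is not None
```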
				336
				337	``\A``
				338	Matches only at the start of the string.
				339
				340	``\b``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	341	Matches the empty string, but only at the beginning or end of a word.
				342	A word is defined as a sequence of Unicode alphanumeric or underscore
				343	characters, so the end of a word is indicated by whitespace or a
Ezio Melotti	5a045b9	2012-02-29 11:48:44 +0200	[diff] [blame]	344	non-alphanumeric, non-underscore Unicode character. Note that formally,
				345	``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
				346	(or vice versa), or between ``\w`` and the beginning/end of the string.
				347	This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
				348	``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
				349
				350	By default Unicode alphanumerics are the ones used, but this can be changed
				351	by using the :const:`ASCII` flag. Inside a character range, ``\b``
				352	represents the backspace character, for compatibility with Python's string
				353	literals.
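
The word-boundary examples above can be checked as follows:

```python
import re

word = re.compile(r'\bfoo\b')

hits = [bool(word.search(s)) for s in ('foo', 'foo.', '(foo)', 'bar foo baz')]
misses = [bool(word.search(s)) for s in ('foobar', 'foo3')]

# Inside a character class, \b means the backspace character (0x08).
backspace = re.search(r'[\b]', 'a\x08b') is not None
```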
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	354
				355	``\B``
Ezio Melotti	5a045b9	2012-02-29 11:48:44 +0200	[diff] [blame]	356	Matches the empty string, but only when it is not at the beginning or end
				357	of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
				358	``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
				359	``\B`` is just the opposite of ``\b``, so word characters are
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	360	Unicode alphanumerics or the underscore, although this can be changed
				361	by using the :const:`ASCII` flag.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	362
				363	``\d``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	364	For Unicode (str) patterns:
Mark Dickinson	1f26828	2009-07-28 17:22:36 +0000	[diff] [blame]	365	Matches any Unicode decimal digit (that is, any character in
				366	Unicode character category [Nd]). This includes ``[0-9]``, and
				367	also many other digit characters. If the :const:`ASCII` flag is
				368	used only ``[0-9]`` is matched (but the flag affects the entire
				369	regular expression, so in such cases using an explicit ``[0-9]``
				370	may be a better choice).
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	371	For 8-bit (bytes) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	372	Matches any decimal digit; this is equivalent to ``[0-9]``.
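
A sketch of how ``\d`` behaves for the two pattern types. The sample digit U+0663 (ARABIC-INDIC DIGIT THREE) is chosen here only as an example of a non-ASCII character in category Nd:

```python
import re

sample = '4\u0663'

unicode_digits = re.findall(r'\d', sample)            # both characters
ascii_digits = re.findall(r'\d', sample, re.ASCII)    # only '4'
explicit = re.findall(r'[0-9]', sample)               # same, without a flag

# Bytes patterns match ASCII digits only.
byte_digits = re.findall(rb'\d', b'4x')
```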
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	373
				374	``\D``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	375	Matches any character which is not a Unicode decimal digit. This is
				376	the opposite of ``\d``. If the :const:`ASCII` flag is used this
				377	becomes the equivalent of ``[^0-9]`` (but the flag affects the entire
				378	regular expression, so in such cases using an explicit ``[^0-9]`` may
				379	be a better choice).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	380
				381	``\s``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	382	For Unicode (str) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	383	Matches Unicode whitespace characters (which includes
				384	``[ \t\n\r\f\v]``, and also many other characters, for example the
				385	non-breaking spaces mandated by typography rules in many
				386	languages). If the :const:`ASCII` flag is used, only
				387	``[ \t\n\r\f\v]`` is matched (but the flag affects the entire
				388	regular expression, so in such cases using an explicit
				389	``[ \t\n\r\f\v]`` may be a better choice).
				390
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	391	For 8-bit (bytes) patterns:
				392	Matches characters considered whitespace in the ASCII character set;
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	393	this is equivalent to ``[ \t\n\r\f\v]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	394
				395	``\S``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	396	Matches any character which is not a Unicode whitespace character. This is
				397	the opposite of ``\s``. If the :const:`ASCII` flag is used this
				398	becomes the equivalent of ``[^ \t\n\r\f\v]`` (but the flag affects the entire
				399	regular expression, so in such cases using an explicit ``[^ \t\n\r\f\v]`` may
				400	be a better choice).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	401
				402	``\w``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	403	For Unicode (str) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	404	Matches Unicode word characters; this includes most characters
				405	that can be part of a word in any language, as well as numbers and
				406	the underscore. If the :const:`ASCII` flag is used, only
				407	``[a-zA-Z0-9_]`` is matched (but the flag affects the entire
				408	regular expression, so in such cases using an explicit
				409	``[a-zA-Z0-9_]`` may be a better choice).
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	410	For 8-bit (bytes) patterns:
				411	Matches characters considered alphanumeric in the ASCII character set;
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	412	this is equivalent to ``[a-zA-Z0-9_]``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	413
				414	``\W``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	415	Matches any character which is not a Unicode word character. This is
				416	the opposite of ``\w``. If the :const:`ASCII` flag is used this
				417	becomes the equivalent of ``[^a-zA-Z0-9_]`` (but the flag affects the
				418	entire regular expression, so in such cases using an explicit
				419	``[^a-zA-Z0-9_]`` may be a better choice).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	420
				421	``\Z``
				422	Matches only at the end of the string.
				423
				424	Most of the standard escapes supported by Python string literals are also
				425	accepted by the regular expression parser::
				426
				427	\a \b \f \n
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	428	\r \t \u \U
				429	\v \x \\
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	430
Ezio Melotti	285e51b	2012-04-29 04:52:30 +0300	[diff] [blame]	431	(Note that ``\b`` is used to represent word boundaries, and means "backspace"
				432	only inside character classes.)
				433
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	434	``'\u'`` and ``'\U'`` escape sequences are only recognized in Unicode
				435	patterns. In bytes patterns they are not treated specially.
				436
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	437	Octal escapes are included in a limited form. If the first digit is a 0, or if
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	438	there are three octal digits, it is considered an octal escape. Otherwise, it is
				439	a group reference. As for string literals, octal escapes are always at most
				440	three digits in length.
				441
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	442	.. versionchanged:: 3.3
				443	The ``'\u'`` and ``'\U'`` escape sequences have been added.
				444
				445
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	446
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	447	.. _contents-of-module-re:
				448
				449	Module Contents
				450	---------------
				451
The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full-featured methods for compiled
regular expressions. Most non-trivial applications use the compiled
form.
				456
				457
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	458	.. function:: compile(pattern, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	459
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	460	Compile a regular expression pattern into a regular expression object, which
				461	can be used for matching using its :func:`match` and :func:`search` methods,
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	462	described below.
				463
				464	The expression's behaviour can be modified by specifying a flags value.
				465	Values can be any of the following variables, combined using bitwise OR (the
``|`` operator).
				467
				468	The sequence ::
				469
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	470	prog = re.compile(pattern)
				471	result = prog.match(string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	472
				473	is equivalent to ::
				474
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	475	result = re.match(pattern, string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	476
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	477	but using :func:`re.compile` and saving the resulting regular expression
				478	object for reuse is more efficient when the expression will be used several
				479	times in a single program.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	480
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	481	.. note::
				482
				483	The compiled versions of the most recent patterns passed to
				484	:func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
				485	programs that use only a few regular expressions at a time needn't worry
				486	about compiling regular expressions.
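
The equivalence described above, and the combination of flags with bitwise OR, can be sketched like this. The patterns and sample strings are illustrative:

```python
import re

# Compiling first and calling the method ...
prog = re.compile(r'\d+')
via_compiled = prog.match('42 apples').group()

# ... gives the same result as the module-level function.
via_module = re.match(r'\d+', '42 apples').group()

# Flags are combined with the | operator.
starts_with_b = re.compile(r'^b\w*', re.IGNORECASE | re.MULTILINE)
found = starts_with_b.findall('apple\nBanana\nberry')
```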
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	487
				488
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	489	.. data:: A
				490	ASCII
				491
Georg Brandl	4049ce0	2009-06-08 07:49:54 +0000	[diff] [blame]	492	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
				493	perform ASCII-only matching instead of full Unicode matching. This is only
				494	meaningful for Unicode patterns, and is ignored for byte patterns.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	495
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	496	Note that for backward compatibility, the :const:`re.U` flag still
				497	exists (as well as its synonym :const:`re.UNICODE` and its embedded
Georg Brandl	ebeb44d	2010-07-29 11:15:36 +0000	[diff] [blame]	498	counterpart ``(?u)``), but these are redundant in Python 3 since
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	499	matches are Unicode by default for strings (and Unicode matching
				500	isn't allowed for bytes).
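The effect of the ASCII flag on ``\w`` can be sketched as follows. The sample text is an assumption made for the example; any string containing a non-ASCII letter would do.

```python
import re

# Illustrative sample text: "café 123" with a non-ASCII letter.
text = "caf\u00e9 123"

# Default for str patterns: full Unicode matching, so the accented letter is a word character.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, only [a-zA-Z0-9_] count as word characters.
ascii_words = re.findall(r"\w+", text, re.ASCII)
```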
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	501
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	502
Sandro Tosi	da785fd	2012-01-01 12:55:20 +0100	[diff] [blame]	503	.. data:: DEBUG
				504
				505	Display debug information about the compiled expression.
				506
				507
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	508	.. data:: I
				509	IGNORECASE
				510
				511	Perform case-insensitive matching; expressions like ``[A-Z]`` will match
Mark Summerfield	8676534	2008-08-20 07:40:18 +0000	[diff] [blame]	512	lowercase letters, too. This is not affected by the current locale
				513	and works for Unicode characters as expected.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	514
				515
				516	.. data:: L
				517	LOCALE
				518
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	519	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	520	current locale. The use of this flag is discouraged as the locale mechanism
				521	is very unreliable, and it only handles one "culture" at a time anyway;
Georg Brandl	ebeb44d	2010-07-29 11:15:36 +0000	[diff] [blame]	522	you should use Unicode matching instead, which is the default in Python 3
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	523	for Unicode (str) patterns.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	524
				525
				526	.. data:: M
				527	MULTILINE
				528
				529	When specified, the pattern character ``'^'`` matches at the beginning of the
				530	string and at the beginning of each line (immediately following each newline);
				531	and the pattern character ``'$'`` matches at the end of the string and at the
				532	end of each line (immediately preceding each newline). By default, ``'^'``
				533	matches only at the beginning of the string, and ``'$'`` only at the end of the
				534	string and immediately before the newline (if any) at the end of the string.
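A short sketch of the difference, using an illustrative two-line string (the variable names are hypothetical):

```python
import re

text = "first\nsecond\n"

# Default mode: '^' matches only at the very beginning of the string.
default_starts = re.findall(r"^\w+", text)

# MULTILINE mode: '^' also matches immediately after each newline.
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)

# Default mode: '$' matches at the end of the string and just before
# a newline that ends the string, so the last word is still found.
default_end = re.search(r"\w+$", text).group()
```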
				535
				536
				537	.. data:: S
				538	DOTALL
				539
				540	Make the ``'.'`` special character match any character at all, including a
				541	newline; without this flag, ``'.'`` will match anything except a newline.
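For instance, assuming a sample string with an embedded newline, the behaviour might be checked like this:

```python
import re

text = "a\nb"

# Without DOTALL, '.' refuses to match the newline, so there is no match.
without_flag = re.match(r"a.b", text)

# With DOTALL, '.' matches the newline as well.
with_flag = re.match(r"a.b", text, re.DOTALL)
```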
				542
				543
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	544	.. data:: X
				545	VERBOSE
				546
				547	This flag allows you to write regular expressions that look nicer. Whitespace
				548	within the pattern is ignored, except when in a character class or preceded by
				549	an unescaped backslash, and, when a line contains a ``'#'`` neither in a
				550	character class nor preceded by an unescaped backslash, all characters from the
				551	leftmost such ``'#'`` through the end of the line are ignored.
				552
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	553	That means that the two following regular expression objects that match a
				554	decimal number are functionally equal::
Georg Brandl	81ac1ce	2007-08-31 17:17:17 +0000	[diff] [blame]	555
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	556	a = re.compile(r"""\d + # the integral part
				557	\. # the decimal point
				558	\d * # some fractional digits""", re.X)
				559	b = re.compile(r"\d+\.\d*")
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	560
				561
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	562
				563
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	564	.. function:: search(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	565
				566	Scan through string looking for a location where the regular expression
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	567	pattern produces a match, and return a corresponding :ref:`match object
				568	<match-objects>`. Return ``None`` if no position in the string matches the
				569	pattern; note that this is different from finding a zero-length match at some
				570	point in the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	571
				572
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	573	.. function:: match(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	574
				575	If zero or more characters at the beginning of string match the regular
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	576	expression pattern, return a corresponding :ref:`match object
				577	<match-objects>`. Return ``None`` if the string does not match the pattern;
				578	note that this is different from a zero-length match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	579
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	580	Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
				581	at the beginning of the string and not at the beginning of each line.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	582
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	583	If you want to locate a match anywhere in string, use :func:`search`
				584	instead (see also :ref:`search-vs-match`).
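The following sketch contrasts the two functions in MULTILINE mode. The sample text is invented for the example.

```python
import re

text = "one\ntwo"

# match() is anchored at the start of the string, even with MULTILINE.
at_start = re.match(r"two", text, re.MULTILINE)

# search() with '^' and MULTILINE can find a match at the start of a later line.
anywhere = re.search(r"^two", text, re.MULTILINE)
```

Here ``at_start`` is ``None``, while ``anywhere`` is a match object starting at index 4.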
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	585
				586
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	587	.. function:: split(pattern, string, maxsplit=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	588
				589	Split string by the occurrences of pattern. If capturing parentheses are
				590	used in pattern, then the text of all groups in the pattern is also returned
				591	as part of the resulting list. If maxsplit is nonzero, at most maxsplit
				592	splits occur, and the remainder of the string is returned as the final element
Georg Brandl	9647389	2008-03-06 07:09:43 +0000	[diff] [blame]	593	of the list. ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	594
				595	>>> re.split(r'\W+', 'Words, words, words.')
				596	['Words', 'words', 'words', '']
				597	>>> re.split(r'(\W+)', 'Words, words, words.')
				598	['Words', ', ', 'words', ', ', 'words', '.', '']
				599	>>> re.split(r'\W+', 'Words, words, words.', 1)
				600	['Words', 'words, words.']
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	601	>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
				602	['0', '3', '9']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	603
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	604	If there are capturing groups in the separator and it matches at the start of
				605	the string, the result will start with an empty string. The same holds for
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	606	the end of the string:
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	607
				608	>>> re.split(r'(\W+)', '...words, words...')
				609	['', '...', 'words', ', ', 'words', '...', '']
				610
				611	That way, separator components are always found at the same relative
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	612	indices within the result list.
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	613
Thomas Wouters	89d996e	2007-09-08 17:39:28 +0000	[diff] [blame]	614	Note that split will never split a string on an empty pattern match.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	615	For example:
Thomas Wouters	89d996e	2007-09-08 17:39:28 +0000	[diff] [blame]	616
				617	>>> re.split('x*', 'foo')
				618	['foo']
				619	>>> re.split("(?m)^$", "foo\n\nbar\n")
				620	['foo\n\nbar\n']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	621
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	622	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	623	Added the optional flags argument.
				624
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	625
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	626	.. function:: findall(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	627
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	628	Return all non-overlapping matches of pattern in string, as a list of
Georg Brandl	3dbca81	2008-07-23 16:10:53 +0000	[diff] [blame]	629	strings. The string is scanned left-to-right, and matches are returned in
				630	the order found. If one or more groups are present in the pattern, return a
				631	list of groups; this will be a list of tuples if the pattern has more than
				632	one group. Empty matches are included in the result unless they touch the
				633	beginning of another match.
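How the shape of the result depends on the number of groups can be sketched as follows, using a made-up input string:

```python
import re

text = "x=1, y=22, z=333"

# No groups: a list of the matched strings.
whole = re.findall(r"\d+", text)

# One group: a list of that group's text.
one_group = re.findall(r"(\w)=\d+", text)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w)=(\d+)", text)
```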
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	634
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	635
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	636	.. function:: finditer(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	637
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	638	Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
				639	all non-overlapping matches for the RE pattern in string. The string
				640	is scanned left-to-right, and matches are returned in the order found. Empty
Georg Brandl	3dbca81	2008-07-23 16:10:53 +0000	[diff] [blame]	641	matches are included in the result unless they touch the beginning of another
				642	match.
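Because each item is a match object, positions are available as well as the matched text. A minimal sketch with an illustrative input:

```python
import re

# Collect the text and position of every run of digits.
spans = [(m.group(), m.start(), m.end())
         for m in re.finditer(r"\d+", "a1bb22ccc333")]
```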
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	643
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	644
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	645	.. function:: sub(pattern, repl, string, count=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	646
				647	Return the string obtained by replacing the leftmost non-overlapping occurrences
				648	of pattern in string by the replacement repl. If the pattern isn't found,
				649	string is returned unchanged. repl can be a string or a function; if it is
				650	a string, any backslash escapes in it are processed. That is, ``\n`` is
Sandro Tosi	6a633bb	2011-08-19 22:54:50 +0200	[diff] [blame]	651	converted to a single newline character, ``\r`` is converted to a carriage return, and
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	652	so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
				653	as ``\6``, are replaced with the substring matched by group 6 in the pattern.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	654	For example:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	655
				656	>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
				657	... r'static PyObject*\npy_\1(void)\n{',
				658	... 'def myfunc():')
				659	'static PyObject*\npy_myfunc(void)\n{'
				660
				661	If repl is a function, it is called for every non-overlapping occurrence of
				662	pattern. The function takes a single match object argument, and returns the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	663	replacement string. For example:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	664
				665	>>> def dashrepl(matchobj):
				666	...     if matchobj.group(0) == '-': return ' '
				667	...     else: return '-'
				668	>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
				669	'pro--gram files'
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	670	>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
				671	'Baked Beans & Spam'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	672
Georg Brandl	1b5ab45	2009-08-13 07:56:35 +0000	[diff] [blame]	673	The pattern may be a string or an RE object.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	674
				675	The optional argument count is the maximum number of pattern occurrences to be
				676	replaced; count must be a non-negative integer. If omitted or zero, all
				677	occurrences will be replaced. Empty matches for the pattern are replaced only
				678	when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
				679	``'-a-b-c-'``.
				680
Georg Brandl	3c6780c6	2013-10-06 12:08:14 +0200	[diff] [blame^]	681	In string-type repl arguments, in addition to the character escapes and
				682	backreferences described above,
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	683	``\g<name>`` will use the substring matched by the group named ``name``, as
				684	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
				685	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
				686	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
				687	reference to group 20, not a reference to group 2 followed by the literal
				688	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
				689	substring matched by the RE.
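The ``\g<...>`` forms can be illustrated with a short sketch. The date string and group names below are hypothetical and serve only as an example.

```python
import re

date = "2013-10-06"
pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"

# Named references reorder the captured parts.
by_name = re.sub(pattern, r"\g<day>/\g<month>/\g<year>", date)

# \g<2>0 means "group 2 followed by a literal 0"; \20 would mean group 20.
followed_by_zero = re.sub(r"(a)(b)", r"\g<2>0", "ab")

# \g<0> stands for the entire match.
whole = re.sub(r"\w+", r"<\g<0>>", "hi there")
```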
				690
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	691	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	692	Added the optional flags argument.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	693
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	694
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	695	.. function:: subn(pattern, repl, string, count=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	696
				697	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
				698	number_of_subs_made)``.
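A small sketch of unpacking the returned tuple, with an invented input string:

```python
import re

# Collapse each run of whitespace into a single space and count the replacements.
new_text, count = re.subn(r"\s+", " ", "a  b \t c")
```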
				699
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	700	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	701	Added the optional flags argument.
				702
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	703
				704	.. function:: escape(string)
				705
Ezio Melotti	88fdeb4	2011-04-10 12:59:16 +0300	[diff] [blame]	706	Escape all the characters in string except ASCII letters, numbers and ``'_'``.
				707	This is useful if you want to match an arbitrary literal string that may
				708	have regular expression metacharacters in it.
				709
				710	.. versionchanged:: 3.3
				711	The ``'_'`` character is no longer escaped.
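As a sketch of why escaping matters, the example below searches for a literal string that happens to contain metacharacters. The strings are made up; the exact escaped form is not shown because it may differ between Python versions.

```python
import re

literal = "1+1=2 (really?)"
haystack = "so 1+1=2 (really?) indeed"

# Escaped: every metacharacter is matched literally.
escaped_match = re.search(re.escape(literal), haystack)

# Unescaped: '+', '(', ')' and '?' keep their special meaning,
# so the pattern no longer describes the literal text.
unescaped_match = re.search(literal, haystack)
```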
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	712
				713
R. David Murray	522c32a	2010-07-10 14:23:36 +0000	[diff] [blame]	714	.. function:: purge()
				715
				716	Clear the regular expression cache.
				717
				718
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	719	.. exception:: error
				720
				721	Exception raised when a string passed to one of the functions here is not a
				722	valid regular expression (for example, it might contain unmatched parentheses)
				723	or when some other error occurs during compilation or matching. It is never an
				724	error if a string contains no match for a pattern.
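The distinction between an invalid pattern and a pattern that simply does not match might be sketched like this (the patterns are illustrative):

```python
import re

# An unmatched parenthesis makes the pattern invalid, so compiling raises re.error.
try:
    re.compile(r"(unclosed")
    compile_failed = False
except re.error:
    compile_failed = True

# A valid pattern that finds nothing is not an error; the result is just None.
no_match = re.search(r"x", "abc")
```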
				725
				726
				727	.. _re-objects:
				728
				729	Regular Expression Objects
				730	--------------------------
				731
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	732	Compiled regular expression objects support the following methods and
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	733	attributes:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	734
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	735	.. method:: regex.search(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	736
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	737	Scan through string looking for a location where this regular expression
				738	produces a match, and return a corresponding :ref:`match object
				739	<match-objects>`. Return ``None`` if no position in the string matches the
				740	pattern; note that this is different from finding a zero-length match at some
				741	point in the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	742
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	743	The optional second parameter pos gives an index in the string where the
				744	search is to start; it defaults to ``0``. This is not completely equivalent to
				745	slicing the string; the ``'^'`` pattern character matches at the real beginning
				746	of the string and at positions just after a newline, but not necessarily at the
				747	index where the search is to start.
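The difference between passing pos and slicing the string can be sketched as follows, using a deliberately tiny example:

```python
import re

pattern = re.compile(r"^b")

# With pos=1 the search starts at "b", but '^' still refers to the real
# beginning of the string, so nothing matches.
with_pos = pattern.search("ab", 1)

# After slicing, "b" really is the beginning of the (new) string.
with_slice = pattern.search("ab"[1:])
```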
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	748
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	749	The optional parameter endpos limits how far the string will be searched; it
				750	will be as if the string is endpos characters long, so only the characters
				751	from pos to ``endpos - 1`` will be searched for a match. If endpos is less
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	752	than pos, no match will be found; otherwise, if rx is a compiled regular
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	753	expression object, ``rx.search(string, 0, 50)`` is equivalent to
				754	``rx.search(string[:50], 0)``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	755
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	756	>>> pattern = re.compile("d")
				757	>>> pattern.search("dog") # Match at index 0
				758	<_sre.SRE_Match object at ...>
				759	>>> pattern.search("dog", 1) # No match; search doesn't include the "d"
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	760
				761
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	762	.. method:: regex.match(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	763
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	764	If zero or more characters at the beginning of string match this regular
				765	expression, return a corresponding :ref:`match object <match-objects>`.
				766	Return ``None`` if the string does not match the pattern; note that this is
				767	different from a zero-length match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	768
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	769	The optional pos and endpos parameters have the same meaning as for the
				770	:meth:`~regex.search` method.
Benjamin Peterson	d7c3ed5	2010-06-27 22:32:30 +0000	[diff] [blame]	771
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	772	>>> pattern = re.compile("o")
				773	>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
				774	>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
				775	<_sre.SRE_Match object at ...>
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	776
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	777	If you want to locate a match anywhere in string, use
				778	:meth:`~regex.search` instead (see also :ref:`search-vs-match`).
				779
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	780
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	781	.. method:: regex.split(string, maxsplit=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	782
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	783	Identical to the :func:`split` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	784
				785
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	786	.. method:: regex.findall(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	787
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	788	Similar to the :func:`findall` function, using the compiled pattern, but
				789	also accepts optional pos and endpos parameters that limit the search
				790	region in the same way as for :meth:`match`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	791
				792
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	793	.. method:: regex.finditer(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	794
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	795	Similar to the :func:`finditer` function, using the compiled pattern, but
				796	also accepts optional pos and endpos parameters that limit the search
				797	region in the same way as for :meth:`match`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	798
				799
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	800	.. method:: regex.sub(repl, string, count=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	801
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	802	Identical to the :func:`sub` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	803
				804
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	805	.. method:: regex.subn(repl, string, count=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	806
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	807	Identical to the :func:`subn` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	808
				809
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	810	.. attribute:: regex.flags
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	811
Georg Brandl	3a19e54	2012-03-17 17:29:27 +0100	[diff] [blame]	812	The regex matching flags. This is a combination of the flags given to
				813	:func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
				814	flags such as :data:`UNICODE` if the pattern is a Unicode string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	815
				816
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	817	.. attribute:: regex.groups
Georg Brandl	af265f4	2008-12-07 15:06:20 +0000	[diff] [blame]	818
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	819	The number of capturing groups in the pattern.
Georg Brandl	af265f4	2008-12-07 15:06:20 +0000	[diff] [blame]	820
				821
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	822	.. attribute:: regex.groupindex
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	823
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	824	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
				825	numbers. The dictionary is empty if no symbolic groups were used in the
				826	pattern.
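As an illustration of the ``groups`` and ``groupindex`` attributes, consider this sketch. The pattern and its group names are hypothetical.

```python
import re

pattern = re.compile(r"(?P<key>\w+)=(?P<value>\w+)(;)?")

# Three capturing groups in total: two named, one unnamed.
group_count = pattern.groups

# Only the named groups appear in groupindex; copy it into a plain dict.
names = dict(pattern.groupindex)
```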
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	827
				828
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	829	.. attribute:: regex.pattern
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	830
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	831	The pattern string from which the RE object was compiled.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	832
				833
				834	.. _match-objects:
				835
				836	Match Objects
				837	-------------
				838
Ezio Melotti	b87f82f	2012-11-04 06:59:22 +0200	[diff] [blame]	839	Match objects always have a boolean value of ``True``.
				840	Since :meth:`~regex.match` and :meth:`~regex.search` return ``None``
				841	when there is no match, you can test whether there was a match with a simple
				842	``if`` statement::
				843
				844	match = re.search(pattern, string)
				845	if match:
				846	    process(match)
				847
				848	Match objects support the following methods and attributes:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	849
				850
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	851	.. method:: match.expand(template)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	852
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	853	Return the string obtained by doing backslash substitution on the template
				854	string template, as done by the :meth:`~regex.sub` method.
				855	Escapes such as ``\n`` are converted to the appropriate characters,
				856	and numeric backreferences (``\1``, ``\2``) and named backreferences
				857	(``\g<1>``, ``\g<name>``) are replaced by the contents of the
				858	corresponding group.
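A minimal sketch of template expansion, assuming an illustrative pattern with two named groups:

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# Mix a named reference with a numeric one in the same template.
formatted = m.expand(r"\g<last>, \1")
```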
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	859
				860
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	861	.. method:: match.group([group1, ...])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	862
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	863	Returns one or more subgroups of the match. If there is a single argument, the
				864	result is a single string; if there are multiple arguments, the result is a
				865	tuple with one item per argument. Without arguments, group1 defaults to zero
				866	(the whole match is returned). If a groupN argument is zero, the corresponding
				867	return value is the entire matching string; if it is in the inclusive range
				868	[1..99], it is the string matching the corresponding parenthesized group. If a
				869	group number is negative or larger than the number of groups defined in the
				870	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
				871	part of the pattern that did not match, the corresponding result is ``None``.
				872	If a group is contained in a part of the pattern that matched multiple times,
				873	the last match is returned.
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	874
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	875	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
				876	>>> m.group(0) # The entire match
				877	'Isaac Newton'
				878	>>> m.group(1) # The first parenthesized subgroup.
				879	'Isaac'
				880	>>> m.group(2) # The second parenthesized subgroup.
				881	'Newton'
				882	>>> m.group(1, 2) # Multiple arguments give us a tuple.
				883	('Isaac', 'Newton')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	884
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	885	If the regular expression uses the ``(?P<name>...)`` syntax, the groupN
				886	arguments may also be strings identifying groups by their group name. If a
				887	string argument is not used as a group name in the pattern, an :exc:`IndexError`
				888	exception is raised.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	889
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	890	A moderately complicated example:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	891
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	892	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				893	>>> m.group('first_name')
				894	'Malcolm'
				895	>>> m.group('last_name')
				896	'Reynolds'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	897
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	898	Named groups can also be referred to by their index:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	899
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	900	>>> m.group(1)
				901	'Malcolm'
				902	>>> m.group(2)
				903	'Reynolds'
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	904
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	905	If a group matches multiple times, only the last match is accessible:
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	906
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	907	>>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times.
				908	>>> m.group(1) # Returns only the last match.
				909	'c3'
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	910
Brian Curtin	48f16f9	2010-04-08 13:55:29 +0000	[diff] [blame]	911
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	912	.. method:: match.groups(default=None)
Brian Curtin	48f16f9	2010-04-08 13:55:29 +0000	[diff] [blame]	913
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	914	Return a tuple containing all the subgroups of the match, from 1 up to however
				915	many groups are in the pattern. The default argument is used for groups that
				916	did not participate in the match; it defaults to ``None``.
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	917
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	918	For example:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	919
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	920	>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
				921	>>> m.groups()
				922	('24', '1632')
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	923
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	924	If we make the decimal place and everything after it optional, not all groups
				925	might participate in the match. These groups will default to ``None`` unless
				926	the default argument is given:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	927
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	928	>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
				929	>>> m.groups() # Second group defaults to None.
				930	('24', None)
				931	>>> m.groups('0') # Now, the second group defaults to '0'.
				932	('24', '0')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	933
				934
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	935	.. method:: match.groupdict(default=None)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	936
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	937	Return a dictionary containing all the named subgroups of the match, keyed by
				938	the subgroup name. The default argument is used for groups that did not
				939	participate in the match; it defaults to ``None``. For example:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	940
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	941	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				942	>>> m.groupdict()
				943	{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
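The default argument can be sketched with a pattern whose second named group is optional. The group names here are invented for the example.

```python
import re

m = re.match(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?", "24")

# The optional group did not participate, so it maps to None by default.
with_none = m.groupdict()

# A different default can be supplied instead.
with_zero = m.groupdict("0")
```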
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	944
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	945
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	946	.. method:: match.start([group])
				947	match.end([group])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	948
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	949	Return the indices of the start and end of the substring matched by group;
				950	group defaults to zero (meaning the whole matched substring). Return ``-1`` if
				951	group exists but did not contribute to the match. For a match object m, and
				952	a group g that did contribute to the match, the substring matched by group g
				953	(equivalent to ``m.group(g)``) is ::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	954
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	955	m.string[m.start(g):m.end(g)]
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	956
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	957	Note that ``m.start(group)`` will equal ``m.end(group)`` if group matched a
				958	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
				959	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
				960	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	961
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	962	An example that will remove remove_this from email addresses:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	963
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	964	>>> email = "tony@tiremove_thisger.net"
				965	>>> m = re.search("remove_this", email)
				966	>>> email[:m.start()] + email[m.end():]
				967	'tony@tiger.net'
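
The ``-1`` return value for a group that did not contribute to the match can be seen in a short sketch. The pattern and input string here are assumptions made for illustration only.

```python
import re

# Hypothetical pattern: group 2 is an optional fractional part.
m = re.search(r"(\d+)(\.\d+)?", "width=42px")

whole = (m.start(), m.end())        # group 0, the entire match
first = (m.start(1), m.end(1))      # group 1 matched "42"
missing = (m.start(2), m.end(2))    # group 2 exists but did not participate

# Slicing the original string with start/end recovers the group text.
recovered = m.string[m.start(1):m.end(1)]

print(whole, first, missing, recovered)
```

Because ``-1`` is also a valid slice index, it is worth checking for it before slicing with the result of ``start()`` or ``end()``.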
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	968
				969
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	970	.. method:: match.span([group])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	971
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	972	For a match m, return the 2-tuple ``(m.start(group), m.end(group))``. Note
				973	that if group did not contribute to the match, this is ``(-1, -1)``.
				974	group defaults to zero, the entire match.
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	975
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	976
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	977	.. attribute:: match.pos
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	978
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	979	The value of pos which was passed to the :meth:`~regex.search` or
Georg Brandl	69c7a69	2012-03-14 08:02:43 +0100	[diff] [blame]	980	:meth:`~regex.match` method of a :ref:`regex object <re-objects>`. This is
				981	the index into the string at which the RE engine started looking for a match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	982
				983
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	984	.. attribute:: match.endpos
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	985
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	986	The value of endpos which was passed to the :meth:`~regex.search` or
Georg Brandl	69c7a69	2012-03-14 08:02:43 +0100	[diff] [blame]	987	:meth:`~regex.match` method of a :ref:`regex object <re-objects>`. This is
				988	the index into the string beyond which the RE engine will not go.
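
As a rough illustration of ``pos`` and ``endpos``, the sketch below searches only part of a string through a compiled pattern. The pattern and sample text are assumptions, not taken from the reference above.

```python
import re

pattern = re.compile(r"\d+")
text = "a1 b22 c333"

# Search only the region text[3:6]; the indices still refer to the full string.
bounded = pattern.search(text, 3, 6)

# Without explicit bounds, pos is 0 and endpos is the length of the string.
unbounded = pattern.search(text)

print(bounded.group(), bounded.pos, bounded.endpos, bounded.span())
print(unbounded.group(), unbounded.pos, unbounded.endpos)
```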
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	989
				990
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	991	.. attribute:: match.lastindex
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	992
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	993	The integer index of the last matched capturing group, or ``None`` if no group
				994	was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
				995	``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
				996	the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
				997	string.
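
The four cases named above can be checked directly. This sketch only restates them as code; the extra pattern without any group is an addition for illustration.

```python
import re

patterns = ["(a)b", "((a)(b))", "((ab))", "(a)(b)"]

# Map each pattern to the lastindex of its match against 'ab'.
last_indexes = {p: re.match(p, "ab").lastindex for p in patterns}

# A pattern with no capturing group at all yields None.
no_groups = re.match("ab", "ab").lastindex

print(last_indexes, no_groups)
```

In the nested cases the outer group 1 is the last one to be closed, which is why ``lastindex`` is 1 there even though group 2 or 3 also matched.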
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	998
				999
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1000	.. attribute:: match.lastgroup
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1001
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1002	The name of the last matched capturing group, or ``None`` if the group didn't
				1003	have a name, or if no group was matched at all.
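
A common use of ``lastgroup`` is to find out which alternative of a pattern matched. The sketch below assumes a small hypothetical pattern with two named alternatives and one unnamed one.

```python
import re

# Hypothetical pattern: named groups for numbers and words, unnamed for spaces.
pattern = re.compile(r"(?P<number>\d+)|(?P<word>[A-Za-z]+)|(\s+)")

# lastgroup is the group name for named alternatives, None for the unnamed one.
kinds = [(m.group(), m.lastgroup) for m in pattern.finditer("abc 42")]

print(kinds)
```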
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1004
				1005
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1006	.. attribute:: match.re
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1007
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1008	The regular expression object whose :meth:`~regex.match` or
				1009	:meth:`~regex.search` method produced this match instance.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1010
				1011
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1012	.. attribute:: match.string
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1013
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1014	The string passed to :meth:`~regex.match` or :meth:`~regex.search`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1015
				1016
Raymond Hettinger	1fa7682	2010-12-06 23:31:36 +0000	[diff] [blame]	1017	.. _re-examples:
				1018
				1019	Regular Expression Examples
				1020	---------------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1021
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1022
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	1023	Checking for a Pair
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1024	^^^^^^^^^^^^^^^^^^^
				1025
				1026	In this example, we'll use the following helper function to display match
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1027	objects a little more gracefully:
				1028
				1029	.. testcode::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1030
   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())
				1035
				1036	Suppose you are writing a poker program where a player's hand is represented as
				1037	a 5-character string with each character representing a card, "a" for ace, "k"
Ezio Melotti	e5b2ac8	2011-12-17 01:17:17 +0200	[diff] [blame]	1038	for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1039	representing the card with that value.
				1040
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1041	To see if a given string is a valid hand, one could do the following:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1042
Ezio Melotti	e5b2ac8	2011-12-17 01:17:17 +0200	[diff] [blame]	1043	>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
				1044	>>> displaymatch(valid.match("akt5q")) # Valid.
				1045	"<Match: 'akt5q', groups=()>"
				1046	>>> displaymatch(valid.match("akt5e")) # Invalid.
				1047	>>> displaymatch(valid.match("akt")) # Invalid.
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1048	>>> displaymatch(valid.match("727ak")) # Valid.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1049	"<Match: '727ak', groups=()>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1050
That last hand, ``"727ak"``, contained a pair, that is, two cards of the same value.
To match this with a regular expression, one could use a backreference:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1053
>>> pair = re.compile(r".*(.).*\1")
				1055	>>> displaymatch(pair.match("717ak")) # Pair of 7s.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1056	"<Match: '717', groups=('7',)>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1057	>>> displaymatch(pair.match("718ak")) # No pairs.
				1058	>>> displaymatch(pair.match("354aa")) # Pair of aces.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1059	"<Match: '354aa', groups=('a',)>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1060
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	1061	To find out what card the pair consists of, one could use the
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1062	:meth:`~match.group` method of the match object in the following manner:
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1063
				1064	.. doctest::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1065
   >>> pair.match("717ak").group(1)
   '7'

   # Error because re.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       re.match(r".*(.).*\1", "718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'
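
One way to avoid the :exc:`AttributeError` is to test the match object before calling its methods. The helper below is a minimal sketch; the name ``pair_rank`` is hypothetical and the pattern is the backreference pattern for a pair, ``.*(.).*\1``.

```python
import re

pair = re.compile(r".*(.).*\1")

def pair_rank(hand):
    """Return the rank of a pair found in hand, or None if there is no pair."""
    m = pair.match(hand)
    # match() returns None when nothing matched, so check before using it.
    return m.group(1) if m is not None else None

ranks = [pair_rank(hand) for hand in ("717ak", "718ak", "354aa")]
print(ranks)
```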
				1078
				1079
				1080	Simulating scanf()
				1081	^^^^^^^^^^^^^^^^^^
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1082
				1083	.. index:: single: scanf()
				1084
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1085	Python does not currently have an equivalent to :c:func:`scanf`. Regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1086	expressions are generally more powerful, though also more verbose, than
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1087	:c:func:`scanf` format strings. The table below offers some more-or-less
				1088	equivalent mappings between :c:func:`scanf` format tokens and regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1089	expressions.
				1090
+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+
				1112
				1113	To extract the filename and numbers from a string like ::
				1114
				1115	/usr/sbin/sendmail - 0 errors, 4 warnings
				1116
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1117	you would use a :c:func:`scanf` format like ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1118
				1119	%s - %d errors, %d warnings
				1120
				1121	The equivalent regular expression would be ::
				1122
				1123	(\S+) - (\d+) errors, (\d+) warnings
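
Applied from Python, the regular expression above might be used as in the following sketch. The variable names are illustrative; the integer conversion is written out because, unlike :c:func:`scanf`, the groups are always returned as strings.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

# Groups come back as strings; convert the numeric fields explicitly.
filename = m.group(1)
n_errors = int(m.group(2))
n_warnings = int(m.group(3))

print(filename, n_errors, n_warnings)
```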
				1124
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1125
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1126	.. _search-vs-match:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1127
				1128	search() vs. match()
				1129	^^^^^^^^^^^^^^^^^^^^
				1130
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1131	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1132
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1133	Python offers two different primitive operations based on regular expressions:
				1134	:func:`re.match` checks for a match only at the beginning of the string, while
				1135	:func:`re.search` checks for a match anywhere in the string (this is what Perl
				1136	does by default).
				1137
				1138	For example::
				1139
				1140	>>> re.match("c", "abcdef") # No match
				1141	>>> re.search("c", "abcdef") # Match
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1142	<_sre.SRE_Match object at ...>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1143
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1144	Regular expressions beginning with ``'^'`` can be used with :func:`search` to
				1145	restrict the match at the beginning of the string::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1146
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1147	>>> re.match("c", "abcdef") # No match
				1148	>>> re.search("^c", "abcdef") # No match
				1149	>>> re.search("^a", "abcdef") # Match
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1150	<_sre.SRE_Match object at ...>
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1151
				1152	Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
				1153	beginning of the string, whereas using :func:`search` with a regular expression
				1154	beginning with ``'^'`` will match at the beginning of each line.
				1155
				1156	>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
				1157	>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
				1158	<_sre.SRE_Match object at ...>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1159
				1160
				1161	Making a Phonebook
				1162	^^^^^^^^^^^^^^^^^^
				1163
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1164	:func:`split` splits a string into a list delimited by the passed pattern. The
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1165	method is invaluable for converting textual data into data structures that can be
				1166	easily read and modified by Python as demonstrated in the following example that
				1167	creates a phonebook.
				1168
First, here is the input. Normally it would come from a file; here we use
triple-quoted string syntax:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1171
Georg Brandl	557a3ec	2012-03-17 17:26:27 +0100	[diff] [blame]	1172	>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1173	...
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1174	... Ronald Heathmore: 892.345.3428 436 Finley Avenue
				1175	... Frank Burger: 925.541.7625 662 South Dogwood Way
				1176	...
				1177	...
				1178	... Heather Albrecht: 548.326.4584 919 Park Place"""
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1179
				1180	The entries are separated by one or more newlines. Now we convert the string
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1181	into a list with each nonempty line having its own entry:
				1182
				1183	.. doctest::
				1184	:options: +NORMALIZE_WHITESPACE
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1185
Georg Brandl	557a3ec	2012-03-17 17:26:27 +0100	[diff] [blame]	1186	>>> entries = re.split("\n+", text)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1187	>>> entries
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1188	['Ross McFluff: 834.345.1254 155 Elm Street',
				1189	'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
				1190	'Frank Burger: 925.541.7625 662 South Dogwood Way',
				1191	'Heather Albrecht: 548.326.4584 919 Park Place']
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1192
Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
because the address itself contains spaces, which the splitting pattern would
otherwise match:
				1196
				1197	.. doctest::
				1198	:options: +NORMALIZE_WHITESPACE
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1199
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1200	>>> [re.split(":? ", entry, 3) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1201	[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
				1202	['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
				1203	['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
				1204	['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
				1205
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1206	The ``:?`` pattern matches the colon after the last name, so that it does not
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1207	occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1208	house number from the street name:
				1209
				1210	.. doctest::
				1211	:options: +NORMALIZE_WHITESPACE
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1212
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1213	>>> [re.split(":? ", entry, 4) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1214	[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
				1215	['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
				1216	['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
				1217	['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
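
The split fields can then be stored in a data structure. The sketch below builds a dictionary keyed by full name; the dictionary layout and the shortened ``entries`` list are assumptions made for illustration.

```python
import re

# Two of the phonebook entries, already separated into lines.
entries = [
    "Ross McFluff: 834.345.1254 155 Elm Street",
    "Heather Albrecht: 548.326.4584 919 Park Place",
]

phonebook = {}
for entry in entries:
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(":? ", entry, maxsplit=3)
    phonebook[first + " " + last] = {"phone": phone, "address": address}

print(phonebook["Ross McFluff"])
```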
				1218
				1219
				1220	Text Munging
				1221	^^^^^^^^^^^^
				1222
				1223	:func:`sub` replaces every occurrence of a pattern with a string or the
				1224	result of a function. This example demonstrates using :func:`sub` with
				1225	a function to "munge" text, or randomize the order of all the characters
				1226	in each word of a sentence except for the first and last characters::
				1227
   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
				1237
				1238
				1239	Finding all Adverbs
				1240	^^^^^^^^^^^^^^^^^^^
				1241
:func:`findall` matches *all* occurrences of a pattern, not just the first
one as :func:`search` does. For example, a writer who wanted to
find all of the adverbs in some text might use :func:`findall` in
the following manner:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1246
				1247	>>> text = "He was carefully disguised but captured quickly by police."
				1248	>>> re.findall(r"\w+ly", text)
				1249	['carefully', 'quickly']
				1250
				1251
				1252	Finding all Adverbs and their Positions
				1253	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				1254
If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :ref:`match objects
<match-objects>` instead of strings. Continuing with the previous example, a
writer who wanted to find all of the adverbs *and their positions* in
some text would use :func:`finditer` in the following manner:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1260
				1261	>>> text = "He was carefully disguised but captured quickly by police."
				1262	>>> for m in re.finditer(r"\w+ly", text):
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1263	... print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1264	07-16: carefully
				1265	40-47: quickly
				1266
				1267
				1268	Raw String Notation
				1269	^^^^^^^^^^^^^^^^^^^
				1270
				1271	Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
				1272	every backslash (``'\'``) in a regular expression would have to be prefixed with
				1273	another one to escape it. For example, the two following lines of code are
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1274	functionally identical:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1275
				1276	>>> re.match(r"\W(.)\1\W", " ff ")
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1277	<_sre.SRE_Match object at ...>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1278	>>> re.match("\\W(.)\\1\\W", " ff ")
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1279	<_sre.SRE_Match object at ...>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1280
				1281	When one wants to match a literal backslash, it must be escaped in the regular
				1282	expression. With raw string notation, this means ``r"\\"``. Without raw string
				1283	notation, one must use ``"\\\\"``, making the following lines of code
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1284	functionally identical:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1285
				1286	>>> re.match(r"\\", r"\\")
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1287	<_sre.SRE_Match object at ...>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1288	>>> re.match("\\\\", r"\\")
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1289	<_sre.SRE_Match object at ...>
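
The raw prefix changes only how Python reads the literal, not the resulting string object. A small sketch, with illustrative variable names, makes this visible without involving the regular expression engine at all.

```python
# The raw and the escaped spellings produce the same pattern string.
pattern_raw = r"\W(.)\1\W"
pattern_escaped = "\\W(.)\\1\\W"
same_pattern = pattern_raw == pattern_escaped

# r"\\" and "\\\\" are both two backslashes; "\\" is a single backslash.
lengths = (len(r"\\"), len("\\\\"), len("\\"))

print(same_pattern, lengths)
```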
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1290
				1291
				1292	Writing a Tokenizer
				1293	^^^^^^^^^^^^^^^^^^^
				1294
				1295	A `tokenizer or scanner <http://en.wikipedia.org/wiki/Lexical_analysis>`_
				1296	analyzes a string to categorize groups of characters. This is a useful first
				1297	step in writing a compiler or interpreter.
				1298
				1299	The text categories are specified with regular expressions. The technique is
				1300	to combine those into a single master regular expression and to loop over
				1301	successive matches::
				1302
    import collections
    import re

    Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

    def tokenize(s):
        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
        token_specification = [
            ('NUMBER',  r'\d+(\.\d*)?'), # Integer or decimal number
            ('ASSIGN',  r':='),          # Assignment operator
            ('END',     r';'),           # Statement terminator
            ('ID',      r'[A-Za-z]+'),   # Identifiers
            ('OP',      r'[+*\/\-]'),    # Arithmetic operators
            ('NEWLINE', r'\n'),          # Line endings
            ('SKIP',    r'[ \t]'),       # Skip over spaces and tabs
        ]
        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
        get_token = re.compile(tok_regex).match
        line = 1
        pos = line_start = 0
        mo = get_token(s)
        while mo is not None:
            typ = mo.lastgroup
            if typ == 'NEWLINE':
                line_start = pos
                line += 1
            elif typ != 'SKIP':
                val = mo.group(typ)
                if typ == 'ID' and val in keywords:
                    typ = val
                yield Token(typ, val, line, mo.start()-line_start)
            pos = mo.end()
            mo = get_token(s, pos)
        if pos != len(s):
            raise RuntimeError('Unexpected character %r on line %d' %(s[pos], line))

    statements = '''
        IF quantity THEN
            total := total + price * quantity;
            tax := price * 0.05;
        ENDIF;
    '''

    for token in tokenize(statements):
        print(token)
				1348
				1349	The tokenizer produces the following output::
Raymond Hettinger	9c47d77	2011-05-13 01:03:50 -0700	[diff] [blame]	1350
Raymond Hettinger	4b244ef	2011-05-23 12:45:34 -0700	[diff] [blame]	1351	Token(typ='IF', value='IF', line=2, column=5)
				1352	Token(typ='ID', value='quantity', line=2, column=8)
				1353	Token(typ='THEN', value='THEN', line=2, column=17)
				1354	Token(typ='ID', value='total', line=3, column=9)
				1355	Token(typ='ASSIGN', value=':=', line=3, column=15)
				1356	Token(typ='ID', value='total', line=3, column=18)
				1357	Token(typ='OP', value='+', line=3, column=24)
				1358	Token(typ='ID', value='price', line=3, column=26)
				1359	Token(typ='OP', value='*', line=3, column=32)
				1360	Token(typ='ID', value='quantity', line=3, column=34)
				1361	Token(typ='END', value=';', line=3, column=42)
				1362	Token(typ='ID', value='tax', line=4, column=9)
				1363	Token(typ='ASSIGN', value=':=', line=4, column=13)
				1364	Token(typ='ID', value='price', line=4, column=16)
				1365	Token(typ='OP', value='*', line=4, column=22)
				1366	Token(typ='NUMBER', value='0.05', line=4, column=24)
				1367	Token(typ='END', value=';', line=4, column=28)
				1368	Token(typ='ENDIF', value='ENDIF', line=5, column=5)
				1369	Token(typ='END', value=';', line=5, column=10)