Blame - Doc/library/re.rst - platform/external/python/cpython3

Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1	:mod:`re` --- Regular expression operations
				2	===========================================
				3
				4	.. module:: re
				5	:synopsis: Regular expression operations.
				6	.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
				7	.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
				8
				9
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	10	This module provides regular expression matching operations similar to
Georg Brandl	ed2a1db	2009-06-08 07:48:27 +0000	[diff] [blame]	11	those found in Perl.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	12
				13	Both patterns and strings to be searched can be Unicode strings as well as
				14	8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed:
that is, you cannot match a Unicode string with a byte pattern or
Georg Brandl	ae2dbe2	2009-03-13 19:04:40 +0000	[diff] [blame]	16	vice-versa; similarly, when asking for a substitution, the replacement
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	17	string must be of the same type as both the pattern and the search string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	18
				19	Regular expressions use the backslash character (``'\'``) to indicate
				20	special forms or to allow special characters to be used without invoking
				21	their special meaning. This collides with Python's usage of the same
				22	character for the same purpose in string literals; for example, to match
				23	a literal backslash, one might have to write ``'\\\\'`` as the pattern
				24	string, because the regular expression must be ``\\``, and each
				25	backslash must be expressed as ``\\`` inside a regular Python string
				26	literal.
				27
				28	The solution is to use Python's raw string notation for regular expression
				29	patterns; backslashes are not handled in any special way in a string literal
				30	prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
				31	``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	32	newline. Usually patterns will be expressed in Python code using this raw
				33	string notation.
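
As a minimal sketch of the difference described above (the sample string ``path`` and the variable names are illustrative assumptions, not part of the module), the following compares raw and ordinary string literals and shows two spellings of a pattern that matches one literal backslash:

```python
import re

# A raw string keeps the backslash; an ordinary literal turns "\n" into a newline.
raw_len = len(r"\n")     # two characters: backslash and 'n'
plain_len = len("\n")    # one character: a newline

# Hypothetical sample text containing a single backslash: C:\temp
path = "C:\\temp"

# Both calls pass the same two-character regular expression \\ to the parser.
hit_plain = re.search("\\\\", path)
hit_raw = re.search(r"\\", path)
```

Both searches should find the same single backslash; the raw-string spelling is simply easier to read.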
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	34
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	35	It is important to note that most regular expression operations are available as
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	36	module-level functions and methods on
				37	:ref:`compiled regular expressions <re-objects>`. The functions are shortcuts
				38	that don't require you to compile a regex object first, but miss some
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	39	fine-tuning parameters.
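
A small sketch of the two calling styles, assuming an illustrative sample string: the module-level function needs no compile step, while the ``search`` method of a compiled pattern additionally accepts a start position, one of the fine-tuning parameters mentioned above.

```python
import re

text = "spam eggs spam"  # illustrative sample string

# Module-level shortcut: no explicit compile step.
first = re.search(r"spam", text)

# Compiled pattern: search() also accepts a position to start searching from.
pattern = re.compile(r"spam")
second = pattern.search(text, 1)  # skips the match that starts at index 0
```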
				40
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	41	.. seealso::
				42
				43	Mastering Regular Expressions
				44	Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	45	second edition of the book no longer covers Python at all, but the first
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	46	edition covered writing good regular expression patterns in great detail.
				47
				48
				49	.. _re-syntax:
				50
				51	Regular Expression Syntax
				52	-------------------------
				53
				54	A regular expression (or RE) specifies a set of strings that matches it; the
				55	functions in this module let you check if a particular string matches a given
				56	regular expression (or if a given regular expression matches a particular
				57	string, which comes down to the same thing).
				58
				59	Regular expressions can be concatenated to form new regular expressions; if A
				60	and B are both regular expressions, then AB is also a regular expression.
				61	In general, if a string p matches A and another string q matches B, the
				62	string pq will match AB. This holds unless A or B contain low precedence
				63	operations; boundary conditions between A and B; or have numbered group
				64	references. Thus, complex expressions can easily be constructed from simpler
				65	primitive expressions like the ones described here. For details of the theory
				66	and implementation of regular expressions, consult the Friedl book referenced
				67	above, or almost any textbook about compiler construction.
				68
				69	A brief explanation of the format of regular expressions follows. For further
Christian Heimes	2202f87	2008-02-06 14:31:34 +0000	[diff] [blame]	70	information and a gentler presentation, consult the :ref:`regex-howto`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	71
				72	Regular expressions can contain both special and ordinary characters. Most
				73	ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
				74	expressions; they simply match themselves. You can concatenate ordinary
				75	characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
				77	strings to be matched ``'in single quotes'``.)
				78
Some characters, like ``'|'`` or ``'('``, are special. Special
				80	characters either stand for classes of ordinary characters, or affect
				81	how the regular expressions around them are interpreted. Regular
				82	expression pattern strings may not contain null bytes, but can specify
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	83	the null byte using a ``\number`` notation such as ``'\x00'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	84
				85
				86	The special characters are:
				87
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	88	``'.'``
				89	(Dot.) In the default mode, this matches any character except a newline. If
				90	the :const:`DOTALL` flag has been specified, this matches any character
				91	including a newline.
				92
				93	``'^'``
				94	(Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
				95	matches immediately after each newline.
				96
				97	``'$'``
				98	Matches the end of the string or just before the newline at the end of the
				99	string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
				100	matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
				101	only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
Christian Heimes	25bb783	2008-01-11 16:17:00 +0000	[diff] [blame]	102	matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
				103	a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
				104	the newline, and one at the end of the string.
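
The behaviour of ``$`` described above can be checked with a short sketch (variable names are illustrative):

```python
import re

text = "foo1\nfoo2\n"

# Default mode: '$' matches only at the end, or just before the final newline.
default_hit = re.search(r"foo.$", text).group()

# MULTILINE mode: '$' also matches before every newline, so the first line wins.
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group()

# A lone '$' finds two empty matches in 'foo\n'.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]
```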
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	105
				106	``'*'``
				107	Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
				108	many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
				109	by any number of 'b's.
				110
				111	``'+'``
				112	Causes the resulting RE to match 1 or more repetitions of the preceding RE.
				113	``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
				114	match just 'a'.
				115
				116	``'?'``
				117	Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
				118	``ab?`` will match either 'a' or 'ab'.
				119
				120	``*?``, ``+?``, ``??``
				121	The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
				122	as much text as possible. Sometimes this behaviour isn't desired; if the RE
				123	``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
				124	string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
				125	perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
				126	characters as possible will be matched. Using ``.*?`` in the previous
				127	expression will match only ``'<H1>'``.
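
A minimal sketch of the greedy and non-greedy forms, using the example string from the text:

```python
import re

html = "<H1>title</H1>"

# Greedy: '.*' grabs as much as possible, so the whole string is matched.
greedy = re.match(r"<.*>", html).group()

# Non-greedy: '.*?' stops at the first '>' that lets the pattern succeed.
lazy = re.match(r"<.*?>", html).group()
```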
				128
				129	``{m}``
				130	Specifies that exactly m copies of the previous RE should be matched; fewer
				131	matches cause the entire RE not to match. For example, ``a{6}`` will match
				132	exactly six ``'a'`` characters, but not five.
				133
				134	``{m,n}``
				135	Causes the resulting RE to match from m to n repetitions of the preceding
				136	RE, attempting to match as many repetitions as possible. For example,
				137	``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting m specifies a
				138	lower bound of zero, and omitting n specifies an infinite upper bound. As an
				139	example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
				140	followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
				141	modifier would be confused with the previously described form.
				142
				143	``{m,n}?``
				144	Causes the resulting RE to match from m to n repetitions of the preceding
				145	RE, attempting to match as few repetitions as possible. This is the
				146	non-greedy version of the previous qualifier. For example, on the
				147	6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
				148	while ``a{3,5}?`` will only match 3 characters.
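
The counted qualifiers above can be illustrated as follows (a sketch; the variable names are arbitrary):

```python
import re

six_a = "aaaaaa"

# Greedy form takes as many repetitions as allowed (5 of the 6 characters).
greedy = re.match(r"a{3,5}", six_a).group()

# Non-greedy form takes as few as allowed (3 characters).
lazy = re.match(r"a{3,5}?", six_a).group()

# Omitting n leaves the upper bound open: four or more 'a' characters, then 'b'.
four_ok = re.match(r"a{4,}b", "aaaab") is not None
three_fails = re.match(r"a{4,}b", "aaab") is None
```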
				149
				150	``'\'``
				151	Either escapes special characters (permitting you to match characters like
				152	``'*'``, ``'?'``, and so forth), or signals a special sequence; special
				153	sequences are discussed below.
				154
				155	If you're not using a raw string to express the pattern, remember that Python
				156	also uses the backslash as an escape sequence in string literals; if the escape
				157	sequence isn't recognized by Python's parser, the backslash and subsequent
				158	character are included in the resulting string. However, if Python would
				159	recognize the resulting sequence, the backslash should be repeated twice. This
				160	is complicated and hard to understand, so it's highly recommended that you use
				161	raw strings for all but the simplest expressions.
				162
				163	``[]``
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	164	Used to indicate a set of characters. In a set:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	165
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	166	* Characters can be listed individually, e.g. ``[amk]`` will match ``'a'``,
				167	``'m'``, or ``'k'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	168
Ezio Melotti	81231d9	2011-10-20 19:38:04 +0300	[diff] [blame]	169	* Ranges of characters can be indicated by giving two characters and separating
				170	them by a ``'-'``, for example ``[a-z]`` will match any lowercase ASCII letter,
``[0-5][0-9]`` will match all the two-digit numbers from ``00`` to ``59``, and
				172	``[0-9A-Fa-f]`` will match any hexadecimal digit. If ``-`` is escaped (e.g.
				173	``[a\-z]``) or if it's placed as the first or last character (e.g. ``[a-]``),
				174	it will match a literal ``'-'``.
				175
				176	* Special characters lose their special meaning inside sets. For example,
				177	``[(+*)]`` will match any of the literal characters ``'('``, ``'+'``,
				178	``'*'``, or ``')'``.
				179
				180	* Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
inside a set, although the characters they match depend on whether
				182	:const:`ASCII` or :const:`LOCALE` mode is in force.
				183
				184	* Characters that are not within a range can be matched by :dfn:`complementing`
				185	the set. If the first character of the set is ``'^'``, all the characters
				186	that are not in the set will be matched. For example, ``[^5]`` will match
				187	any character except ``'5'``, and ``[^^]`` will match any character except
				188	``'^'``. ``^`` has no special meaning if it's not the first character in
				189	the set.
				190
* To match a literal ``']'`` inside a set, precede it with a backslash, or
place it at the beginning of the set. For example, ``[()[\]{}]`` and
``[]()[{}]`` will both match any parenthesis, square bracket, or brace.
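
The rules for sets listed above can be exercised with a short sketch; the input strings are illustrative assumptions:

```python
import re

# Ranges: two-digit numbers from 00 to 59.
is_minute = re.fullmatch(r"[0-5][0-9]", "59") is not None
not_minute = re.fullmatch(r"[0-5][0-9]", "60") is None

# Hexadecimal digits; 'x' is not in the set.
hex_digits = re.findall(r"[0-9A-Fa-f]", "0xBEEF")

# An escaped '-' is a literal hyphen rather than a range operator.
with_dash = re.findall(r"[a\-z]", "a-z b")

# Complemented set: everything except '5'.
no_five = re.findall(r"[^5]", "1525")

# Special characters are literal inside a set; ']' is escaped here.
brackets = re.findall(r"[()[\]{}]", "f(x)[0]{}")
```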
Mark Summerfield	9e670c2	2008-05-31 13:05:34 +0000	[diff] [blame]	194
``'|'``
``A|B``, where A and B can be arbitrary REs, creates a regular expression that
will match either A or B. An arbitrary number of REs can be separated by the
``'|'`` in this way. This can be used inside groups (see below) as well. As
the target string is scanned, REs separated by ``'|'`` are tried from left to
right. When one pattern completely matches, that branch is accepted. This means
that once ``A`` matches, ``B`` will not be tested further, even if it would
produce a longer overall match. In other words, the ``'|'`` operator is never
greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
character class, as in ``[|]``.
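
A brief sketch of the left-to-right behaviour of alternation (the patterns are chosen only for illustration):

```python
import re

# The first alternative that matches wins, even if a later one would be longer.
short_first = re.match(r"a|ab", "ab").group()
long_first = re.match(r"ab|a", "ab").group()

# An escaped bar, or a bar inside a set, matches the literal character.
escaped_bar = re.findall(r"\|", "a|b")
set_bar = re.findall(r"[|]", "a|b")
```

Ordering the alternatives from most to least specific is one way to obtain the longer match when that is what is wanted.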
				205
				206	``(...)``
				207	Matches whatever regular expression is inside the parentheses, and indicates the
				208	start and end of a group; the contents of a group can be retrieved after a match
				209	has been performed, and can be matched later in the string with the ``\number``
				210	special sequence, described below. To match the literals ``'('`` or ``')'``,
use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
				212
				213	``(?...)``
				214	This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
				215	otherwise). The first character after the ``'?'`` determines what the meaning
				216	and further syntax of the construct is. Extensions usually do not create a new
				217	group; ``(?P<name>...)`` is the only exception to this rule. Following are the
				218	currently supported extensions.
				219
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	220	``(?aiLmsux)``
				221	(One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
				222	``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
Andrew M. Kuchling	1c50e86	2009-06-01 00:11:36 +0000	[diff] [blame]	223	letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	224	:const:`re.I` (ignore case), :const:`re.L` (locale dependent),
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	225	:const:`re.M` (multi-line), :const:`re.S` (dot matches all),
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	226	and :const:`re.X` (verbose), for the entire regular expression. (The
				227	flags are described in :ref:`contents-of-module-re`.) This
				228	is useful if you wish to include the flags as part of the regular
				229	expression, instead of passing a flag argument to the
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	230	:func:`re.compile` function.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	231
				232	Note that the ``(?x)`` flag changes how the expression is parsed. It should be
				233	used first in the expression string, or after one or more whitespace characters.
				234	If there are non-whitespace characters before the flag, the results are
				235	undefined.
				236
				237	``(?:...)``
Georg Brandl	3122ce3	2010-10-29 06:17:38 +0000	[diff] [blame]	238	A non-capturing version of regular parentheses. Matches whatever regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	239	expression is inside the parentheses, but the substring matched by the group
				240	cannot be retrieved after performing a match or referenced later in the
				241	pattern.
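
As a small illustration (the pattern and input are assumptions made for this sketch), a non-capturing group takes part in matching but does not appear among the captured groups:

```python
import re

m = re.match(r"(?:ab)+(c)", "ababc")

whole = m.group(0)    # the full match still includes the 'ab' repetitions
captured = m.groups() # only the capturing group (c) is reported
```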
				242
				243	``(?P<name>...)``
				244	Similar to regular parentheses, but the substring matched by the group is
Benjamin Peterson	d23f822	2009-04-05 19:13:16 +0000	[diff] [blame]	245	accessible within the rest of the regular expression via the symbolic group
name *name*. Group names must be valid Python identifiers, and each group
				247	name must be defined only once within a regular expression. A symbolic group
				248	is also a numbered group, just as if the group were not named. So the group
				249	named ``id`` in the example below can also be referenced as the numbered group
				250	``1``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	251
				252	For example, if the pattern is ``(?P<id>[a-zA-Z_]\w*)``, the group can be
				253	referenced by its name in arguments to methods of match objects, such as
Benjamin Peterson	d23f822	2009-04-05 19:13:16 +0000	[diff] [blame]	254	``m.group('id')`` or ``m.end('id')``, and also by name in the regular
				255	expression itself (using ``(?P=id)``) and replacement text given to
				256	``.sub()`` (using ``\g<id>``).
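
The three ways of referring to a named group mentioned above can be sketched like this, using the ``id`` pattern from the text and some illustrative input strings:

```python
import re

pattern = re.compile(r"(?P<id>[a-zA-Z_]\w*)")

# Access by name and by number through the match object.
m = pattern.search("  total = 3")
by_name = m.group("id")
by_number = m.group(1)
end_of_id = m.end("id")

# Reference by name inside the pattern itself with (?P=id).
doubled = re.search(r"(?P<id>[a-zA-Z_]\w*) (?P=id)", "say hello hello").group()

# Reference by name in replacement text with \g<id>.
wrapped = pattern.sub(r"<\g<id>>", "x = y")
```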
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	257
				258	``(?P=name)``
Matches whatever text was matched by the earlier group named *name*.
				260
				261	``(?#...)``
				262	A comment; the contents of the parentheses are simply ignored.
				263
				264	``(?=...)``
				265	Matches if ``...`` matches next, but doesn't consume any of the string. This is
				266	called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
				267	``'Isaac '`` only if it's followed by ``'Asimov'``.
				268
				269	``(?!...)``
				270	Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
				271	For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's not
				272	followed by ``'Asimov'``.
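
A minimal sketch of both lookahead assertions, using the examples from the text plus one extra illustrative string:

```python
import re

# Positive lookahead: 'Asimov' must follow, but it is not part of the match.
followed = re.search(r"Isaac (?=Asimov)", "Isaac Asimov")
matched_text = followed.group()
match_end = followed.end()

# Negative lookahead: fails when 'Asimov' follows, succeeds otherwise.
rejected = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")
accepted = re.search(r"Isaac (?!Asimov)", "Isaac Newton")
```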
				273
				274	``(?<=...)``
				275	Matches if the current position in the string is preceded by a match for ``...``
				276	that ends at the current position. This is called a :dfn:`positive lookbehind
				277	assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
				278	lookbehind will back up 3 characters and check if the contained pattern matches.
				279	The contained pattern must only match strings of some fixed length, meaning that
``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
Ezio Melotti	0a6b541	2012-04-29 07:34:46 +0300	[diff] [blame]	281	patterns which start with positive lookbehind assertions will not match at the
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	282	beginning of the string being searched; you will most likely want to use the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	283	:func:`search` function rather than the :func:`match` function:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	284
				285	>>> import re
				286	>>> m = re.search('(?<=abc)def', 'abcdef')
				287	>>> m.group(0)
				288	'def'
				289
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	290	This example looks for a word following a hyphen:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	291
>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
				293	>>> m.group(0)
				294	'egg'
				295
				296	``(?<!...)``
				297	Matches if the current position in the string is not preceded by a match for
				298	``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
				299	positive lookbehind assertions, the contained pattern must only match strings of
				300	some fixed length. Patterns which start with negative lookbehind assertions may
				301	match at the beginning of the string being searched.
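
As a sketch of a negative lookbehind (the pattern and sample strings are illustrative assumptions), the following finds words that are not immediately preceded by a hyphen, and shows that such a pattern can match at the very start of a string:

```python
import re

# 'egg' is skipped because it directly follows a hyphen.
words = re.findall(r"(?<!-)\b\w+", "spam-egg ham")

# Nothing precedes position 0, so the assertion succeeds there.
at_start = re.match(r"(?<!-)\w+", "egg").group()
```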
				302
``(?(id/name)yes-pattern|no-pattern)``
Will try to match with ``yes-pattern`` if the group with given id or
name exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
optional and can be omitted. For example,
``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
not with ``'<user@host.com'`` nor ``'user@host.com>'``.
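
The conditional pattern above can be tried against the four strings from the text; the dictionary name ``results`` is only an illustrative choice:

```python
import re

# If group 1 matched '<', require a closing '>'; otherwise require end of string.
email = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)")

candidates = ["<user@host.com>", "user@host.com", "<user@host.com", "user@host.com>"]
results = {s: email.match(s) is not None for s in candidates}
```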
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	310
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	311
				312	The special sequences consist of ``'\'`` and a character from the list below.
				313	If the ordinary character is not on the list, then the resulting RE will match
				314	the second character. For example, ``\$`` matches the character ``'$'``.
				315
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	316	``\number``
				317	Matches the contents of the group of the same number. Groups are numbered
				318	starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
				319	but not ``'the end'`` (note the space after the group). This special sequence
				320	can only be used to match one of the first 99 groups. If the first digit of
				321	number is 0, or number is 3 octal digits long, it will not be interpreted as
				322	a group match, but as the character with octal value number. Inside the
				323	``'['`` and ``']'`` of a character class, all numeric escapes are treated as
				324	characters.
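
A short sketch of numbered backreferences and of the octal interpretation described above (sample strings are illustrative):

```python
import re

# \1 must repeat exactly what group 1 matched.
repeated_word = re.match(r"(.+) \1", "the the") is not None
repeated_number = re.match(r"(.+) \1", "55 55") is not None
different_words = re.match(r"(.+) \1", "the end") is None

# Three octal digits are read as a character code, not a group reference:
# octal 101 is decimal 65, the letter 'A'.
octal_a = re.match(r"\101", "A") is not None
```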
				325
				326	``\A``
				327	Matches only at the start of the string.
				328
				329	``\b``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	330	Matches the empty string, but only at the beginning or end of a word.
				331	A word is defined as a sequence of Unicode alphanumeric or underscore
				332	characters, so the end of a word is indicated by whitespace or a
Ezio Melotti	5a045b9	2012-02-29 11:48:44 +0200	[diff] [blame]	333	non-alphanumeric, non-underscore Unicode character. Note that formally,
				334	``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
				335	(or vice versa), or between ``\w`` and the beginning/end of the string.
				336	This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
				337	``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
				338
				339	By default Unicode alphanumerics are the ones used, but this can be changed
				340	by using the :const:`ASCII` flag. Inside a character range, ``\b``
				341	represents the backspace character, for compatibility with Python's string
				342	literals.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	343
				344	``\B``
Ezio Melotti	5a045b9	2012-02-29 11:48:44 +0200	[diff] [blame]	345	Matches the empty string, but only when it is not at the beginning or end
				346	of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
				347	``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
				348	``\B`` is just the opposite of ``\b``, so word characters are
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	349	Unicode alphanumerics or the underscore, although this can be changed
				350	by using the :const:`ASCII` flag.
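
The word-boundary escapes ``\b`` and ``\B`` described above can be checked against the example strings from the text; the helper name ``matching`` is hypothetical:

```python
import re

def matching(pattern, candidates):
    """Return the candidates in which the pattern is found somewhere."""
    return [s for s in candidates if re.search(pattern, s)]

# \b: 'foo' must stand alone as a word.
boundary = matching(r"\bfoo\b",
                    ["foo", "foo.", "(foo)", "bar foo baz", "foobar", "foo3"])

# \B: 'py' must be followed by another word character.
non_boundary = matching(r"py\B", ["python", "py3", "py2", "py", "py.", "py!"])

# Inside a set, \b means the backspace character instead.
backspaces = re.findall(r"[\b]", "a\bb")
```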
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	351
				352	``\d``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	353	For Unicode (str) patterns:
Mark Dickinson	1f26828	2009-07-28 17:22:36 +0000	[diff] [blame]	354	Matches any Unicode decimal digit (that is, any character in
				355	Unicode character category [Nd]). This includes ``[0-9]``, and
				356	also many other digit characters. If the :const:`ASCII` flag is
				357	used only ``[0-9]`` is matched (but the flag affects the entire
				358	regular expression, so in such cases using an explicit ``[0-9]``
				359	may be a better choice).
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	360	For 8-bit (bytes) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	361	Matches any decimal digit; this is equivalent to ``[0-9]``.
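
A minimal sketch of how ``\d`` differs between Unicode and ASCII matching; the sample uses U+0667 (ARABIC-INDIC DIGIT SEVEN) as an illustrative non-ASCII decimal digit:

```python
import re

sample = "7\u0667"  # ASCII '7' followed by ARABIC-INDIC DIGIT SEVEN

unicode_digits = re.findall(r"\d", sample)            # both characters
ascii_digits = re.findall(r"\d", sample, re.ASCII)    # only '7'
byte_digits = re.findall(rb"\d", b"7x")               # bytes patterns: [0-9] only
```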
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	362
				363	``\D``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	364	Matches any character which is not a Unicode decimal digit. This is
				365	the opposite of ``\d``. If the :const:`ASCII` flag is used this
				366	becomes the equivalent of ``[^0-9]`` (but the flag affects the entire
				367	regular expression, so in such cases using an explicit ``[^0-9]`` may
				368	be a better choice).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	369
				370	``\s``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	371	For Unicode (str) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	372	Matches Unicode whitespace characters (which includes
				373	``[ \t\n\r\f\v]``, and also many other characters, for example the
				374	non-breaking spaces mandated by typography rules in many
				375	languages). If the :const:`ASCII` flag is used, only
				376	``[ \t\n\r\f\v]`` is matched (but the flag affects the entire
				377	regular expression, so in such cases using an explicit
				378	``[ \t\n\r\f\v]`` may be a better choice).
				379
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	380	For 8-bit (bytes) patterns:
				381	Matches characters considered whitespace in the ASCII character set;
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	382	this is equivalent to ``[ \t\n\r\f\v]``.
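
As an illustration of the whitespace classes above, the following sketch uses a no-break space (U+00A0) as an example of non-ASCII whitespace:

```python
import re

sample = "a\u00a0b c"  # 'a', NO-BREAK SPACE, 'b', ordinary space, 'c'

unicode_spaces = re.findall(r"\s", sample)           # both space characters
ascii_spaces = re.findall(r"\s", sample, re.ASCII)   # only the ordinary space
```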
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	383
				384	``\S``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	385	Matches any character which is not a Unicode whitespace character. This is
				386	the opposite of ``\s``. If the :const:`ASCII` flag is used this
				387	becomes the equivalent of ``[^ \t\n\r\f\v]`` (but the flag affects the entire
				388	regular expression, so in such cases using an explicit ``[^ \t\n\r\f\v]`` may
				389	be a better choice).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	390
				391	``\w``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	392	For Unicode (str) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	393	Matches Unicode word characters; this includes most characters
				394	that can be part of a word in any language, as well as numbers and
				395	the underscore. If the :const:`ASCII` flag is used, only
				396	``[a-zA-Z0-9_]`` is matched (but the flag affects the entire
				397	regular expression, so in such cases using an explicit
				398	``[a-zA-Z0-9_]`` may be a better choice).
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	399	For 8-bit (bytes) patterns:
				400	Matches characters considered alphanumeric in the ASCII character set;
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	401	this is equivalent to ``[a-zA-Z0-9_]``.
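
A short sketch of ``\w`` with and without the :const:`ASCII` flag; the sample phrase is an illustrative assumption:

```python
import re

sample = "caf\u00e9 au lait"  # 'café au lait'

unicode_words = re.findall(r"\w+", sample)           # 'é' counts as a word character
ascii_words = re.findall(r"\w+", sample, re.ASCII)   # 'é' ends the first word early
```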
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	402
				403	``\W``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	404	Matches any character which is not a Unicode word character. This is
				405	the opposite of ``\w``. If the :const:`ASCII` flag is used this
				406	becomes the equivalent of ``[^a-zA-Z0-9_]`` (but the flag affects the
				407	entire regular expression, so in such cases using an explicit
				408	``[^a-zA-Z0-9_]`` may be a better choice).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	409
				410	``\Z``
				411	Matches only at the end of the string.
				412
				413	Most of the standard escapes supported by Python string literals are also
				414	accepted by the regular expression parser::
				415
				416	\a \b \f \n
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	417	\r \t \u \U
				418	\v \x \\
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	419
Ezio Melotti	285e51b	2012-04-29 04:52:30 +0300	[diff] [blame]	420	(Note that ``\b`` is used to represent word boundaries, and means "backspace"
				421	only inside character classes.)
				422
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	423	``'\u'`` and ``'\U'`` escape sequences are only recognized in Unicode
				424	patterns. In bytes patterns they are not treated specially.
				425
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	426	Octal escapes are included in a limited form. If the first digit is a 0, or if
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	427	there are three octal digits, it is considered an octal escape. Otherwise, it is
				428	a group reference. As for string literals, octal escapes are always at most
				429	three digits in length.
				430
Antoine Pitrou	463badf	2012-06-23 13:29:19 +0200	[diff] [blame]	431	.. versionchanged:: 3.3
				432	The ``'\u'`` and ``'\U'`` escape sequences have been added.
				433
				434
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	435
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	436	.. _contents-of-module-re:
				437
				438	Module Contents
				439	---------------
				440
				441	The module defines several functions, constants, and an exception. Some of the
				442	functions are simplified versions of the full featured methods for compiled
regular expressions. Most non-trivial applications use the compiled
				444	form.
				445
				446
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	447	.. function:: compile(pattern, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	448
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	449	Compile a regular expression pattern into a regular expression object, which
				450	can be used for matching using its :func:`match` and :func:`search` methods,
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	451	described below.
				452
				453	The expression's behaviour can be modified by specifying a flags value.
				454	Values can be any of the following variables, combined using bitwise OR (the
``|`` operator).
				456
				457	The sequence ::
				458
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	459	prog = re.compile(pattern)
				460	result = prog.match(string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	461
				462	is equivalent to ::
				463
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	464	result = re.match(pattern, string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	465
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	466	but using :func:`re.compile` and saving the resulting regular expression
				467	object for reuse is more efficient when the expression will be used several
				468	times in a single program.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	469
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	470	.. note::
				471
				472	The compiled versions of the most recent patterns passed to
				473	:func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
				474	programs that use only a few regular expressions at a time needn't worry
				475	about compiling regular expressions.
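
As a sketch of compiling once and reusing the result, with two flags combined by bitwise OR (the pattern, the ``log`` text and the variable names are illustrative assumptions):

```python
import re

# Compile once; combine flags with the bitwise OR operator.
prog = re.compile(r"^total:\s*(\d+)$", re.IGNORECASE | re.MULTILINE)

log = "Total: 3\nother\ntotal:   42\n"

# Reuse the compiled object; findall() returns the captured group per match.
totals = [int(n) for n in prog.findall(log)]

# The one-step module-level call gives the same kind of result.
same = re.match(r"total", "total: 1").span() == re.compile(r"total").match("total: 1").span()
```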
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	476
				477
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	478	.. data:: A
				479	ASCII
				480
Georg Brandl	4049ce0	2009-06-08 07:49:54 +0000	[diff] [blame]	481	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
				482	perform ASCII-only matching instead of full Unicode matching. This is only
				483	meaningful for Unicode patterns, and is ignored for byte patterns.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	484
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	485	Note that for backward compatibility, the :const:`re.U` flag still
				486	exists (as well as its synonym :const:`re.UNICODE` and its embedded
Georg Brandl	ebeb44d	2010-07-29 11:15:36 +0000	[diff] [blame]	487	counterpart ``(?u)``), but these are redundant in Python 3 since
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	488	matches are Unicode by default for strings (and Unicode matching
				489	isn't allowed for bytes).
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	490
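As a small sketch of the effect of the ASCII flag on ``\w``, assuming a Unicode (str) pattern and an illustrative sample string:

```python
import re

text = "na\u00efve caf\u00e9"  # the words "naive" and "cafe" with accented letters

# Default for str patterns: \w matches Unicode word characters.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", text, re.ASCII)
```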
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	491
Sandro Tosi	da785fd	2012-01-01 12:55:20 +0100	[diff] [blame]	492	.. data:: DEBUG
				493
				494	Display debug information about the compiled expression.
				495
				496
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	497	.. data:: I
				498	IGNORECASE
				499
				500	Perform case-insensitive matching; expressions like ``[A-Z]`` will match
Mark Summerfield	8676534	2008-08-20 07:40:18 +0000	[diff] [blame]	501	lowercase letters, too. This is not affected by the current locale
				502	and works for Unicode characters as expected.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	503
				504
				505	.. data:: L
				506	LOCALE
				507
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	508	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	509	current locale. The use of this flag is discouraged as the locale mechanism
				510	is very unreliable, and it only handles one "culture" at a time anyway;
Georg Brandl	ebeb44d	2010-07-29 11:15:36 +0000	[diff] [blame]	511	you should use Unicode matching instead, which is the default in Python 3
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	512	for Unicode (str) patterns.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	513
				514
				515	.. data:: M
				516	MULTILINE
				517
				518	When specified, the pattern character ``'^'`` matches at the beginning of the
				519	string and at the beginning of each line (immediately following each newline);
				520	and the pattern character ``'$'`` matches at the end of the string and at the
				521	end of each line (immediately preceding each newline). By default, ``'^'``
				522	matches only at the beginning of the string, and ``'$'`` only at the end of the
				523	string and immediately before the newline (if any) at the end of the string.
				524
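The difference between the default and MULTILINE anchoring can be sketched like this; the sample text and variable names are illustrative only.

```python
import re

text = "one\ntwo\nthree"

# Default: '^' anchors only at the start of the string, '$' only at its end.
first_only = re.findall(r"^\w+", text)
last_only = re.findall(r"\w+$", text)

# MULTILINE: '^' and '$' also anchor at the start and end of each line.
every_line_start = re.findall(r"^\w+", text, re.MULTILINE)
every_line_end = re.findall(r"\w+$", text, re.MULTILINE)
```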
				525
				526	.. data:: S
				527	DOTALL
				528
				529	Make the ``'.'`` special character match any character at all, including a
				530	newline; without this flag, ``'.'`` will match anything except a newline.
				531
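A minimal sketch of the DOTALL behaviour, using an illustrative two-line string:

```python
import re

text = "a\nb"

# Without DOTALL, '.' refuses to match the newline between 'a' and 'b'.
default_match = re.match(r"a.b", text)

# With DOTALL, '.' matches the newline as well.
dotall_match = re.match(r"a.b", text, re.DOTALL)
```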
				532
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	533	.. data:: X
				534	VERBOSE
				535
				536	This flag allows you to write regular expressions that are more readable. Whitespace
				537	within the pattern is ignored, except when in a character class or preceded by
				538	an unescaped backslash, and, when a line contains a ``'#'`` neither in a
				539	character class nor preceded by an unescaped backslash, all characters from the
				540	leftmost such ``'#'`` through the end of the line are ignored.
				541
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	542	That means that the two following regular expression objects that match a
				543	decimal number are functionally equal::
Georg Brandl	81ac1ce	2007-08-31 17:17:17 +0000	[diff] [blame]	544
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	545	a = re.compile(r"""\d + # the integral part
				546	\. # the decimal point
				547	\d * # some fractional digits""", re.X)
				548	b = re.compile(r"\d+\.\d*")
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	549
				550
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	551
				552
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	553	.. function:: search(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	554
				555	Scan through string looking for a location where the regular expression
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	556	pattern produces a match, and return a corresponding :ref:`match object
				557	<match-objects>`. Return ``None`` if no position in the string matches the
				558	pattern; note that this is different from finding a zero-length match at some
				559	point in the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	560
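The distinction between "no match" and "zero-length match" can be illustrated with a short sketch (the patterns are chosen only for the example):

```python
import re

# 'x*' can match zero characters, so it succeeds at position 0 with an empty match.
empty = re.search(r"x*", "abc")

# 'x+' needs at least one 'x', so there is no match anywhere and None is returned.
missing = re.search(r"x+", "abc")
```

Here ``empty`` is a match object whose ``group(0)`` is the empty string, while ``missing`` is ``None``.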
				561
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	562	.. function:: match(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	563
				564	If zero or more characters at the beginning of string match the regular
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	565	expression pattern, return a corresponding :ref:`match object
				566	<match-objects>`. Return ``None`` if the string does not match the pattern;
				567	note that this is different from a zero-length match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	568
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	569	Note that even in :const:`MULTILINE` mode, :func:`re.match` will only match
				570	at the beginning of the string and not at the beginning of each line.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	571
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	572	If you want to locate a match anywhere in string, use :func:`search`
				573	instead (see also :ref:`search-vs-match`).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	574
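A brief sketch of the MULTILINE point made above, with an illustrative two-line string:

```python
import re

text = "alpha\nbeta"

# match() only tries the very start of the string, even with MULTILINE.
at_start = re.match(r"^beta", text, re.MULTILINE)

# search() scans forward and finds the line that starts at index 6.
anywhere = re.search(r"^beta", text, re.MULTILINE)
```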
				575
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	576	.. function:: split(pattern, string, maxsplit=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	577
				578	Split string by the occurrences of pattern. If capturing parentheses are
				579	used in pattern, then the text of all groups in the pattern is also returned
				580	as part of the resulting list. If maxsplit is nonzero, at most maxsplit
				581	splits occur, and the remainder of the string is returned as the final element
Georg Brandl	9647389	2008-03-06 07:09:43 +0000	[diff] [blame]	582	of the list. ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	583
				584	>>> re.split('\W+', 'Words, words, words.')
				585	['Words', 'words', 'words', '']
				586	>>> re.split('(\W+)', 'Words, words, words.')
				587	['Words', ', ', 'words', ', ', 'words', '.', '']
				588	>>> re.split('\W+', 'Words, words, words.', 1)
				589	['Words', 'words, words.']
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	590	>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
				591	['0', '3', '9']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	592
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	593	If there are capturing groups in the separator and it matches at the start of
				594	the string, the result will start with an empty string. The same holds for
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	595	the end of the string:
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	596
				597	>>> re.split('(\W+)', '...words, words...')
				598	['', '...', 'words', ', ', 'words', '...', '']
				599
				600	That way, separator components are always found at the same relative
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	601	indices within the result list.
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	602
Thomas Wouters	89d996e	2007-09-08 17:39:28 +0000	[diff] [blame]	603	Note that split will never split a string on an empty pattern match.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	604	For example:
Thomas Wouters	89d996e	2007-09-08 17:39:28 +0000	[diff] [blame]	605
				606	>>> re.split('x*', 'foo')
				607	['foo']
				608	>>> re.split("(?m)^$", "foo\n\nbar\n")
				609	['foo\n\nbar\n']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	610
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	611	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	612	Added the optional flags argument.
				613
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	614
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	615	.. function:: findall(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	616
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	617	Return all non-overlapping matches of pattern in string, as a list of
Georg Brandl	3dbca81	2008-07-23 16:10:53 +0000	[diff] [blame]	618	strings. The string is scanned left-to-right, and matches are returned in
				619	the order found. If one or more groups are present in the pattern, return a
				620	list of groups; this will be a list of tuples if the pattern has more than
				621	one group. Empty matches are included in the result unless they touch the
				622	beginning of another match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	623
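How the shape of the :func:`findall` result depends on the number of groups can be sketched as follows (sample text and names are illustrative):

```python
import re

text = "a=1, b=22"

# No groups: a list of the full matches.
whole = re.findall(r"\w+=\d+", text)

# One group: a list of that group's text.
one_group = re.findall(r"(\w+)=\d+", text)

# Two or more groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w+)=(\d+)", text)
```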
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	624
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	625	.. function:: finditer(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	626
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	627	Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
				628	all non-overlapping matches for the RE pattern in string. The string
				629	is scanned left-to-right, and matches are returned in the order found. Empty
Georg Brandl	3dbca81	2008-07-23 16:10:53 +0000	[diff] [blame]	630	matches are included in the result unless they touch the beginning of another
				631	match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	632
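Since :func:`finditer` yields match objects rather than strings, positions are available for each match. A minimal sketch with an illustrative input:

```python
import re

text = "a1 bb22 ccc333"

# Collect the matched text together with its (start, end) span.
spans = [(m.group(0), m.span()) for m in re.finditer(r"\d+", text)]
```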
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	633
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	634	.. function:: sub(pattern, repl, string, count=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	635
				636	Return the string obtained by replacing the leftmost non-overlapping occurrences
				637	of pattern in string by the replacement repl. If the pattern isn't found,
				638	string is returned unchanged. repl can be a string or a function; if it is
				639	a string, any backslash escapes in it are processed. That is, ``\n`` is
Sandro Tosi	6a633bb	2011-08-19 22:54:50 +0200	[diff] [blame]	640	converted to a single newline character, ``\r`` is converted to a carriage return, and
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	641	so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
				642	as ``\6``, are replaced with the substring matched by group 6 in the pattern.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	643	For example:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	644
				645	>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
				646	... r'static PyObject*\npy_\1(void)\n{',
				647	... 'def myfunc():')
				648	'static PyObject*\npy_myfunc(void)\n{'
				649
				650	If repl is a function, it is called for every non-overlapping occurrence of
				651	pattern. The function takes a single match object argument, and returns the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	652	replacement string. For example:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	653
				654	>>> def dashrepl(matchobj):
				655	... if matchobj.group(0) == '-': return ' '
				656	... else: return '-'
				657	>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
				658	'pro--gram files'
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	659	>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
				660	'Baked Beans & Spam'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	661
Georg Brandl	1b5ab45	2009-08-13 07:56:35 +0000	[diff] [blame]	662	The pattern may be a string or an RE object.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	663
				664	The optional argument count is the maximum number of pattern occurrences to be
				665	replaced; count must be a non-negative integer. If omitted or zero, all
				666	occurrences will be replaced. Empty matches for the pattern are replaced only
				667	when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
				668	``'-a-b-c-'``.
				669
				670	In addition to character escapes and backreferences as described above,
				671	``\g<name>`` will use the substring matched by the group named ``name``, as
				672	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
				673	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
				674	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
				675	reference to group 20, not a reference to group 2 followed by the literal
				676	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
				677	substring matched by the RE.
				678
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	679	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	680	Added the optional flags argument.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	681
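The ``\g<...>`` replacement forms described for :func:`sub` can be sketched as follows. The group names ``first`` and ``last`` are chosen for the example only.

```python
import re

# Named references: swap two words.
swapped = re.sub(r"(?P<first>\w+) (?P<last>\w+)",
                 r"\g<last>, \g<first>",
                 "Isaac Newton")

# \g<1>0 means "group 1 followed by a literal 0"; \10 would mean group 10.
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")

# \g<0> stands for the entire match.
wrapped = re.sub(r"\w+", r"<\g<0>>", "hi there")
```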
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	682
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	683	.. function:: subn(pattern, repl, string, count=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	684
				685	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
				686	number_of_subs_made)``.
				687
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	688	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	689	Added the optional flags argument.
				690
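A short sketch of the tuple returned by :func:`subn`, with and without a *count* limit (the input string is illustrative):

```python
import re

# Replace every digit and report how many replacements were made.
new_string, number_of_subs = re.subn(r"\d", "#", "a1b2c3")

# With count=2 only the first two occurrences are replaced.
limited_string, limited_subs = re.subn(r"\d", "#", "a1b2c3", count=2)
```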
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	691
				692	.. function:: escape(string)
				693
Ezio Melotti	88fdeb4	2011-04-10 12:59:16 +0300	[diff] [blame]	694	Escape all the characters in string except ASCII letters, numbers and ``'_'``.
				695	This is useful if you want to match an arbitrary literal string that may
				696	have regular expression metacharacters in it.
				697
				698	.. versionchanged:: 3.3
				699	The ``'_'`` character is no longer escaped.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	700
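As a sketch of why escaping matters, the following searches for a literal string that happens to contain the metacharacter ``+``. The exact escaped form produced by :func:`escape` differs between Python versions, so the example checks only the matching behaviour.

```python
import re

literal = "1+1=2"
text = "is 1+1=2?"

# Unescaped, '+' is a repetition operator, so the literal text is not found.
unescaped = re.search(literal, text)

# Escaped, every metacharacter is matched literally.
escaped = re.search(re.escape(literal), text)
```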
				701
R. David Murray	522c32a	2010-07-10 14:23:36 +0000	[diff] [blame]	702	.. function:: purge()
				703
				704	Clear the regular expression cache.
				705
				706
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	707	.. exception:: error
				708
				709	Exception raised when a string passed to one of the functions here is not a
				710	valid regular expression (for example, it might contain unmatched parentheses)
				711	or when some other error occurs during compilation or matching. It is never an
				712	error if a string contains no match for a pattern.
				713
				714
				715	.. _re-objects:
				716
				717	Regular Expression Objects
				718	--------------------------
				719
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	720	Compiled regular expression objects support the following methods and
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	721	attributes:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	722
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	723	.. method:: regex.search(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	724
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	725	Scan through string looking for a location where this regular expression
				726	produces a match, and return a corresponding :ref:`match object
				727	<match-objects>`. Return ``None`` if no position in the string matches the
				728	pattern; note that this is different from finding a zero-length match at some
				729	point in the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	730
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	731	The optional second parameter pos gives an index in the string where the
				732	search is to start; it defaults to ``0``. This is not completely equivalent to
				733	slicing the string; the ``'^'`` pattern character matches at the real beginning
				734	of the string and at positions just after a newline, but not necessarily at the
				735	index where the search is to start.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	736
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	737	The optional parameter endpos limits how far the string will be searched; it
				738	will be as if the string is endpos characters long, so only the characters
				739	from pos to ``endpos - 1`` will be searched for a match. If endpos is less
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	740	than pos, no match will be found; otherwise, if rx is a compiled regular
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	741	expression object, ``rx.search(string, 0, 50)`` is equivalent to
				742	``rx.search(string[:50], 0)``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	743
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	744	>>> pattern = re.compile("d")
				745	>>> pattern.search("dog") # Match at index 0
				746	<_sre.SRE_Match object at ...>
				747	>>> pattern.search("dog", 1) # No match; search doesn't include the "d"
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	748
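The difference between passing *pos* and slicing, and the *endpos* equivalence stated above, can be sketched like this (patterns and strings are illustrative):

```python
import re

rx = re.compile(r"^b")

# With pos=1 the search starts at "b", but '^' still refers to the real string start.
with_pos = rx.search("ab", 1)

# Slicing creates a new string that really does start with "b".
with_slice = rx.search("ab"[1:])

digits = re.compile(r"\d+")
text = "abc 12345 def"

# endpos=6 behaves as if the string were only 6 characters long.
limited = digits.search(text, 0, 6)
sliced = digits.search(text[:6], 0)
```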
				749
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	750	.. method:: regex.match(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	751
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	752	If zero or more characters at the beginning of string match this regular
				753	expression, return a corresponding :ref:`match object <match-objects>`.
				754	Return ``None`` if the string does not match the pattern; note that this is
				755	different from a zero-length match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	756
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	757	The optional pos and endpos parameters have the same meaning as for the
				758	:meth:`~regex.search` method.
Benjamin Peterson	d7c3ed5	2010-06-27 22:32:30 +0000	[diff] [blame]	759
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	760	>>> pattern = re.compile("o")
				761	>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
				762	>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
				763	<_sre.SRE_Match object at ...>
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	764
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	765	If you want to locate a match anywhere in string, use
				766	:meth:`~regex.search` instead (see also :ref:`search-vs-match`).
				767
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	768
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	769	.. method:: regex.split(string, maxsplit=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	770
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	771	Identical to the :func:`split` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	772
				773
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	774	.. method:: regex.findall(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	775
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	776	Similar to the :func:`findall` function, using the compiled pattern, but
				777	also accepts optional pos and endpos parameters that limit the search
				778	region as for :meth:`~regex.match`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	779
				780
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	781	.. method:: regex.finditer(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	782
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	783	Similar to the :func:`finditer` function, using the compiled pattern, but
				784	also accepts optional pos and endpos parameters that limit the search
				785	region as for :meth:`~regex.match`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	786
				787
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	788	.. method:: regex.sub(repl, string, count=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	789
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	790	Identical to the :func:`sub` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	791
				792
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	793	.. method:: regex.subn(repl, string, count=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	794
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	795	Identical to the :func:`subn` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	796
				797
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	798	.. attribute:: regex.flags
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	799
Georg Brandl	3a19e54	2012-03-17 17:29:27 +0100	[diff] [blame]	800	The regex matching flags. This is a combination of the flags given to
				801	:func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
				802	flags such as :data:`UNICODE` if the pattern is a Unicode string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	803
				804
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	805	.. attribute:: regex.groups
Georg Brandl	af265f4	2008-12-07 15:06:20 +0000	[diff] [blame]	806
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	807	The number of capturing groups in the pattern.
Georg Brandl	af265f4	2008-12-07 15:06:20 +0000	[diff] [blame]	808
				809
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	810	.. attribute:: regex.groupindex
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	811
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	812	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
				813	numbers. The dictionary is empty if no symbolic groups were used in the
				814	pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	815
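The attributes described so far can be inspected as in the following sketch. The pattern and the name ``date_re`` are illustrative; ``groupindex`` is copied into a plain ``dict`` because its concrete mapping type may differ between versions.

```python
import re

date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})", re.IGNORECASE)

group_count = date_re.groups                 # all capturing groups, named or not
name_to_index = dict(date_re.groupindex)     # only the named groups

# flags combines explicit flags with implicit ones such as UNICODE for str patterns.
has_ignorecase = bool(date_re.flags & re.IGNORECASE)
has_unicode = bool(date_re.flags & re.UNICODE)
```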
				816
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	817	.. attribute:: regex.pattern
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	818
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	819	The pattern string from which the RE object was compiled.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	820
				821
				822	.. _match-objects:
				823
				824	Match Objects
				825	-------------
				826
Ezio Melotti	b87f82f	2012-11-04 06:59:22 +0200	[diff] [blame]	827	Match objects always have a boolean value of ``True``.
				828	Since :meth:`~regex.match` and :meth:`~regex.search` return ``None``
				829	when there is no match, you can test whether there was a match with a simple
				830	``if`` statement::
				831
				832	match = re.search(pattern, string)
				833	if match:
				834	process(match)
				835
				836	Match objects support the following methods and attributes:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	837
				838
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	839	.. method:: match.expand(template)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	840
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	841	Return the string obtained by doing backslash substitution on the template
				842	string template, as done by the :meth:`~regex.sub` method.
				843	Escapes such as ``\n`` are converted to the appropriate characters,
				844	and numeric backreferences (``\1``, ``\2``) and named backreferences
				845	(``\g<1>``, ``\g<name>``) are replaced by the contents of the
				846	corresponding group.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	847
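A minimal sketch of ``expand`` with numeric references, named references and a character escape; the group names are chosen for the example.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

by_number = m.expand(r"\2, \1")
by_name = m.expand(r"\g<last>, \g<first>")

# The two-character sequence \n in the template becomes a real newline.
with_escape = m.expand(r"\1\n\2")
```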
				848
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	849	.. method:: match.group([group1, ...])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	850
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	851	Return one or more subgroups of the match. If there is a single argument, the
				852	result is a single string; if there are multiple arguments, the result is a
				853	tuple with one item per argument. Without arguments, group1 defaults to zero
				854	(the whole match is returned). If a groupN argument is zero, the corresponding
				855	return value is the entire matching string; if it is in the inclusive range
				856	[1..99], it is the string matching the corresponding parenthesized group. If a
				857	group number is negative or larger than the number of groups defined in the
				858	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
				859	part of the pattern that did not match, the corresponding result is ``None``.
				860	If a group is contained in a part of the pattern that matched multiple times,
				861	the last match is returned.
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	862
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	863	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
				864	>>> m.group(0) # The entire match
				865	'Isaac Newton'
				866	>>> m.group(1) # The first parenthesized subgroup.
				867	'Isaac'
				868	>>> m.group(2) # The second parenthesized subgroup.
				869	'Newton'
				870	>>> m.group(1, 2) # Multiple arguments give us a tuple.
				871	('Isaac', 'Newton')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	872
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	873	If the regular expression uses the ``(?P<name>...)`` syntax, the groupN
				874	arguments may also be strings identifying groups by their group name. If a
				875	string argument is not used as a group name in the pattern, an :exc:`IndexError`
				876	exception is raised.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	877
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	878	A moderately complicated example:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	879
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	880	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				881	>>> m.group('first_name')
				882	'Malcolm'
				883	>>> m.group('last_name')
				884	'Reynolds'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	885
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	886	Named groups can also be referred to by their index:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	887
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	888	>>> m.group(1)
				889	'Malcolm'
				890	>>> m.group(2)
				891	'Reynolds'
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	892
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	893	If a group matches multiple times, only the last match is accessible:
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	894
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	895	>>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times.
				896	>>> m.group(1) # Returns only the last match.
				897	'c3'
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	898
Brian Curtin	48f16f9	2010-04-08 13:55:29 +0000	[diff] [blame]	899
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	900	.. method:: match.groups(default=None)
Brian Curtin	48f16f9	2010-04-08 13:55:29 +0000	[diff] [blame]	901
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	902	Return a tuple containing all the subgroups of the match, from 1 up to however
				903	many groups are in the pattern. The default argument is used for groups that
				904	did not participate in the match; it defaults to ``None``.
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	905
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	906	For example:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	907
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	908	>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
				909	>>> m.groups()
				910	('24', '1632')
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	911
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	912	If we make the decimal place and everything after it optional, not all groups
				913	might participate in the match. These groups will default to ``None`` unless
				914	the default argument is given:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	915
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	916	>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
				917	>>> m.groups() # Second group defaults to None.
				918	('24', None)
				919	>>> m.groups('0') # Now, the second group defaults to '0'.
				920	('24', '0')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	921
				922
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	923	.. method:: match.groupdict(default=None)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	924
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	925	Return a dictionary containing all the named subgroups of the match, keyed by
				926	the subgroup name. The default argument is used for groups that did not
				927	participate in the match; it defaults to ``None``. For example:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	928
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	929	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				930	>>> m.groupdict()
				931	{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	932
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	933
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	934	.. method:: match.start([group])
				935	match.end([group])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	936
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	937	Return the indices of the start and end of the substring matched by group;
				938	group defaults to zero (meaning the whole matched substring). Return ``-1`` if
				939	group exists but did not contribute to the match. For a match object m, and
				940	a group g that did contribute to the match, the substring matched by group g
				941	(equivalent to ``m.group(g)``) is ::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	942
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	943	m.string[m.start(g):m.end(g)]
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	944
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	945	Note that ``m.start(group)`` will equal ``m.end(group)`` if group matched a
				946	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
				947	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
				948	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	949
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	950	An example that will remove remove_this from email addresses:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	951
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	952	>>> email = "tony@tiremove_thisger.net"
				953	>>> m = re.search("remove_this", email)
				954	>>> email[:m.start()] + email[m.end():]
				955	'tony@tiger.net'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	956
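The ``'b(c?)'`` case described above can be checked with a short sketch:

```python
import re

m = re.search(r"b(c?)", "cba")

whole = (m.start(0), m.end(0))
empty_group = (m.start(1), m.end(1))   # group 1 matched the empty string

# Slicing the original string with start/end reproduces the group text.
substring = m.string[m.start(1):m.end(1)]

# The pattern has only one group, so asking for group 2 is an error.
try:
    m.start(2)
    raised_index_error = False
except IndexError:
    raised_index_error = True
```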
				957
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	958	.. method:: match.span([group])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	959
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	960	For a match m, return the 2-tuple ``(m.start(group), m.end(group))``. Note
				961	that if group did not contribute to the match, this is ``(-1, -1)``.
				962	group defaults to zero, the entire match.
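
As a minimal sketch of the ``-1`` convention described above (the patterns and variable names here are illustrative only), an optional group that takes no part in a match reports ``-1`` from :meth:`~match.start`, and ``(-1, -1)`` from :meth:`~match.span`, while a group that matched the empty string reports two equal, valid indices:

```python
import re

# Group 1 is optional, so it may exist without contributing to the match.
m = re.search(r"(a)?b", "xxb")
whole_span = m.span()      # span of the entire match (group 0)
group_span = m.span(1)     # group 1 did not contribute
group_start = m.start(1)

# A group that matched the empty string has start == end.
m2 = re.search(r"b(c?)", "cba")
empty_span = m2.span(1)

print(whole_span, group_span, group_start, empty_span)
```

Checking for ``-1`` before slicing ``m.string`` avoids silently picking up text from the end of the string.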
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	963
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	964
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	965	.. attribute:: match.pos
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	966
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	967	The value of pos which was passed to the :meth:`~regex.search` or
Georg Brandl	69c7a69	2012-03-14 08:02:43 +0100	[diff] [blame]	968	:meth:`~regex.match` method of a :ref:`regex object <re-objects>`. This is
				969	the index into the string at which the RE engine started looking for a match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	970
				971
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	972	.. attribute:: match.endpos
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	973
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	974	The value of endpos which was passed to the :meth:`~regex.search` or
Georg Brandl	69c7a69	2012-03-14 08:02:43 +0100	[diff] [blame]	975	:meth:`~regex.match` method of a :ref:`regex object <re-objects>`. This is
				976	the index into the string beyond which the RE engine will not go.
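
The following sketch shows how ``pos`` and ``endpos`` might be inspected after a bounded search. The sample text and variable names are assumptions made for illustration:

```python
import re

pattern = re.compile(r"\d+")
text = "a1 b22 c333"

# Restrict the search to text[3:6], which is "b22".
m = pattern.search(text, 3, 6)
found = m.group()
limits = (m.pos, m.endpos)

# With no explicit bounds, pos is 0 and endpos is len(text).
default = pattern.search(text)
default_limits = (default.pos, default.endpos)

print(found, limits, default.group(), default_limits)
```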
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	977
				978
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	979	.. attribute:: match.lastindex
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	980
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	981	The integer index of the last matched capturing group, or ``None`` if no group
				982	was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
				983	``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
				984	the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
				985	string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	986
				987
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	988	.. attribute:: match.lastgroup
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	989
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	990	The name of the last matched capturing group, or ``None`` if the group didn't
				991	have a name, or if no group was matched at all.
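
A short sketch of how ``lastindex`` and ``lastgroup`` relate; the group names ``word`` and ``number`` are hypothetical and chosen only for this illustration:

```python
import re

# Only the second alternative matches, so its group is the last one matched.
named = re.match(r"(?P<word>[a-z]+)|(?P<number>\d+)", "42")
named_result = (named.lastindex, named.lastgroup)

# Unnamed groups give a lastindex but no lastgroup.
unnamed = re.match(r"(a)(b)", "ab")
unnamed_result = (unnamed.lastindex, unnamed.lastgroup)

# A pattern with no groups at all gives None for both.
no_groups = re.match(r"ab", "ab")
no_groups_result = (no_groups.lastindex, no_groups.lastgroup)

print(named_result, unnamed_result, no_groups_result)
```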
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	992
				993
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	994	.. attribute:: match.re
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	995
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	996	The regular expression object whose :meth:`~regex.match` or
				997	:meth:`~regex.search` method produced this match instance.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	998
				999
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1000	.. attribute:: match.string
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1001
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1002	The string passed to :meth:`~regex.match` or :meth:`~regex.search`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1003
				1004
Raymond Hettinger	1fa7682	2010-12-06 23:31:36 +0000	[diff] [blame]	1005	.. _re-examples:
				1006
				1007	Regular Expression Examples
				1008	---------------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1009
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1010
Raymond Hettinger	5768e0c	2011-10-19 14:10:07 -0700	[diff] [blame]	1011	Checking for a Pair
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1012	^^^^^^^^^^^^^^^^^^^
				1013
				1014	In this example, we'll use the following helper function to display match
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1015	objects a little more gracefully:
				1016
				1017	.. testcode::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1018
   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())
				1023
				1024	Suppose you are writing a poker program where a player's hand is represented as
				1025	a 5-character string with each character representing a card, "a" for ace, "k"
Ezio Melotti	e5b2ac8	2011-12-17 01:17:17 +0200	[diff] [blame]	1026	for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1027	representing the card with that value.
				1028
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1029	To see if a given string is a valid hand, one could do the following:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1030
Ezio Melotti	e5b2ac8	2011-12-17 01:17:17 +0200	[diff] [blame]	1031	>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
				1032	>>> displaymatch(valid.match("akt5q")) # Valid.
				1033	"<Match: 'akt5q', groups=()>"
				1034	>>> displaymatch(valid.match("akt5e")) # Invalid.
				1035	>>> displaymatch(valid.match("akt")) # Invalid.
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1036	>>> displaymatch(valid.match("727ak")) # Valid.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1037	"<Match: '727ak', groups=()>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1038
				1039	That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1040	To match this with a regular expression, one could use backreferences as such:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1041
>>> pair = re.compile(r".*(.).*\1")
				1043	>>> displaymatch(pair.match("717ak")) # Pair of 7s.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1044	"<Match: '717', groups=('7',)>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1045	>>> displaymatch(pair.match("718ak")) # No pairs.
				1046	>>> displaymatch(pair.match("354aa")) # Pair of aces.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1047	"<Match: '354aa', groups=('a',)>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1048
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	1049	To find out what card the pair consists of, one could use the
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1050	:meth:`~match.group` method of the match object in the following manner:
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1051
				1052	.. doctest::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1053
				1054	>>> pair.match("717ak").group(1)
				1055	'7'
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1056
# Error because pair.match() returns None, which doesn't have a group() method:
>>> pair.match("718ak").group(1)
Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    pair.match("718ak").group(1)
AttributeError: 'NoneType' object has no attribute 'group'
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1063
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1064	>>> pair.match("354aa").group(1)
				1065	'a'
				1066
				1067
				1068	Simulating scanf()
				1069	^^^^^^^^^^^^^^^^^^
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1070
				1071	.. index:: single: scanf()
				1072
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1073	Python does not currently have an equivalent to :c:func:`scanf`. Regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1074	expressions are generally more powerful, though also more verbose, than
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1075	:c:func:`scanf` format strings. The table below offers some more-or-less
				1076	equivalent mappings between :c:func:`scanf` format tokens and regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1077	expressions.
				1078
+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``[-+]?[0-7]+``                             |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``[-+]?(0[xX])?[\dA-Fa-f]+``                |
+--------------------------------+---------------------------------------------+
				1100
				1101	To extract the filename and numbers from a string like ::
				1102
				1103	/usr/sbin/sendmail - 0 errors, 4 warnings
				1104
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1105	you would use a :c:func:`scanf` format like ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1106
				1107	%s - %d errors, %d warnings
				1108
				1109	The equivalent regular expression would be ::
				1110
				1111	(\S+) - (\d+) errors, (\d+) warnings
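
Applied in Python, the expression above could be used roughly as follows. This is a sketch: the variable names are illustrative, and the conversion to :class:`int` is an added step that mirrors what ``%d`` does in C:

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

# Groups are always returned as strings; convert the numeric fields by hand.
filename = m.group(1)
errors = int(m.group(2))
warnings = int(m.group(3))

print(filename, errors, warnings)
```

Unlike :c:func:`scanf`, the match either succeeds as a whole or returns ``None``, so real code would check ``m`` before reading the groups.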
				1112
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1113
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1114	.. _search-vs-match:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1115
				1116	search() vs. match()
				1117	^^^^^^^^^^^^^^^^^^^^
				1118
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1119	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1120
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1121	Python offers two different primitive operations based on regular expressions:
				1122	:func:`re.match` checks for a match only at the beginning of the string, while
				1123	:func:`re.search` checks for a match anywhere in the string (this is what Perl
				1124	does by default).
				1125
				1126	For example::
				1127
				1128	>>> re.match("c", "abcdef") # No match
				1129	>>> re.search("c", "abcdef") # Match
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1130	<_sre.SRE_Match object at ...>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1131
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1132	Regular expressions beginning with ``'^'`` can be used with :func:`search` to
				1133	restrict the match at the beginning of the string::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1134
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1135	>>> re.match("c", "abcdef") # No match
				1136	>>> re.search("^c", "abcdef") # No match
				1137	>>> re.search("^a", "abcdef") # Match
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1138	<_sre.SRE_Match object at ...>
Ezio Melotti	443f000	2012-02-29 13:39:05 +0200	[diff] [blame]	1139
				1140	Note however that in :const:`MULTILINE` mode :func:`match` only matches at the
				1141	beginning of the string, whereas using :func:`search` with a regular expression
				1142	beginning with ``'^'`` will match at the beginning of each line.
				1143
				1144	>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match
				1145	>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match
				1146	<_sre.SRE_Match object at ...>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1147
				1148
				1149	Making a Phonebook
				1150	^^^^^^^^^^^^^^^^^^
				1151
:func:`split` splits a string into a list delimited by the passed pattern. The
function is invaluable for converting textual data into data structures that can
be easily read and modified by Python, as demonstrated in the following example
that creates a phonebook.

First, here is the input. Normally it may come from a file; here we are using
triple-quoted string syntax:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1159
Georg Brandl	557a3ec	2012-03-17 17:26:27 +0100	[diff] [blame]	1160	>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1161	...
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1162	... Ronald Heathmore: 892.345.3428 436 Finley Avenue
				1163	... Frank Burger: 925.541.7625 662 South Dogwood Way
				1164	...
				1165	...
				1166	... Heather Albrecht: 548.326.4584 919 Park Place"""
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1167
				1168	The entries are separated by one or more newlines. Now we convert the string
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1169	into a list with each nonempty line having its own entry:
				1170
				1171	.. doctest::
				1172	:options: +NORMALIZE_WHITESPACE
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1173
Georg Brandl	557a3ec	2012-03-17 17:26:27 +0100	[diff] [blame]	1174	>>> entries = re.split("\n+", text)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1175	>>> entries
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1176	['Ross McFluff: 834.345.1254 155 Elm Street',
				1177	'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
				1178	'Frank Burger: 925.541.7625 662 South Dogwood Way',
				1179	'Heather Albrecht: 548.326.4584 919 Park Place']
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1180
				1181	Finally, split each entry into a list with first name, last name, telephone
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1182	number, and address. We use the ``maxsplit`` parameter of :func:`split`
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1183	because the address has spaces, our splitting pattern, in it:
				1184
				1185	.. doctest::
				1186	:options: +NORMALIZE_WHITESPACE
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1187
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1188	>>> [re.split(":? ", entry, 3) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1189	[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
				1190	['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
				1191	['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
				1192	['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
				1193
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1194	The ``:?`` pattern matches the colon after the last name, so that it does not
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1195	occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1196	house number from the street name:
				1197
				1198	.. doctest::
				1199	:options: +NORMALIZE_WHITESPACE
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1200
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1201	>>> [re.split(":? ", entry, 4) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1202	[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
				1203	['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
				1204	['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
				1205	['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
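
To finish the phonebook, the split fields could be gathered into a dictionary. The sketch below is one possible arrangement and not part of the original example: the shortened input, the ``phonebook`` name, and the choice of a ``(first, last)`` tuple as the key are all assumptions:

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(r":? ", entry, maxsplit=3)
    phonebook[(first, last)] = {"phone": phone, "address": address}

print(phonebook[("Ross", "McFluff")]["phone"])
```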
				1206
				1207
				1208	Text Munging
				1209	^^^^^^^^^^^^
				1210
				1211	:func:`sub` replaces every occurrence of a pattern with a string or the
				1212	result of a function. This example demonstrates using :func:`sub` with
				1213	a function to "munge" text, or randomize the order of all the characters
				1214	in each word of a sentence except for the first and last characters::
				1215
>>> import random
>>> def repl(m):
...     inner_word = list(m.group(2))
...     random.shuffle(inner_word)
...     return m.group(1) + "".join(inner_word) + m.group(3)
				1220	>>> text = "Professor Abdolmalek, please report your absences promptly."
Georg Brandl	db4e939	2010-07-12 09:06:13 +0000	[diff] [blame]	1221	>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1222	'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
Georg Brandl	db4e939	2010-07-12 09:06:13 +0000	[diff] [blame]	1223	>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1224	'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
				1225
				1226
				1227	Finding all Adverbs
				1228	^^^^^^^^^^^^^^^^^^^
				1229
:func:`findall` matches all occurrences of a pattern, not just the first
one as :func:`search` does. For example, a writer who wants to find all of
the adverbs in some text might use :func:`findall` in the following manner:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1234
				1235	>>> text = "He was carefully disguised but captured quickly by police."
				1236	>>> re.findall(r"\w+ly", text)
				1237	['carefully', 'quickly']
				1238
				1239
				1240	Finding all Adverbs and their Positions
				1241	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				1242
If one wants more information about all matches of a pattern than the matched
text, :func:`finditer` is useful as it provides :ref:`match objects
<match-objects>` instead of strings. Continuing with the previous example, a
writer who wants to find all of the adverbs and their positions in some text
would use :func:`finditer` in the following manner:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1248
				1249	>>> text = "He was carefully disguised but captured quickly by police."
				1250	>>> for m in re.finditer(r"\w+ly", text):
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1251	... print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1252	07-16: carefully
				1253	40-47: quickly
				1254
				1255
				1256	Raw String Notation
				1257	^^^^^^^^^^^^^^^^^^^
				1258
				1259	Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
				1260	every backslash (``'\'``) in a regular expression would have to be prefixed with
				1261	another one to escape it. For example, the two following lines of code are
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1262	functionally identical:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1263
				1264	>>> re.match(r"\W(.)\1\W", " ff ")
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1265	<_sre.SRE_Match object at ...>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1266	>>> re.match("\\W(.)\\1\\W", " ff ")
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1267	<_sre.SRE_Match object at ...>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1268
				1269	When one wants to match a literal backslash, it must be escaped in the regular
				1270	expression. With raw string notation, this means ``r"\\"``. Without raw string
				1271	notation, one must use ``"\\\\"``, making the following lines of code
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1272	functionally identical:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1273
				1274	>>> re.match(r"\\", r"\\")
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1275	<_sre.SRE_Match object at ...>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1276	>>> re.match("\\\\", r"\\")
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1277	<_sre.SRE_Match object at ...>
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1278
				1279
				1280	Writing a Tokenizer
				1281	^^^^^^^^^^^^^^^^^^^
				1282
				1283	A `tokenizer or scanner <http://en.wikipedia.org/wiki/Lexical_analysis>`_
				1284	analyzes a string to categorize groups of characters. This is a useful first
				1285	step in writing a compiler or interpreter.
				1286
				1287	The text categories are specified with regular expressions. The technique is
				1288	to combine those into a single master regular expression and to loop over
				1289	successive matches::
				1290
    import collections
    import re

    Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

    def tokenize(s):
        keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
        token_specification = [
            ('NUMBER',  r'\d+(\.\d*)?'), # Integer or decimal number
            ('ASSIGN',  r':='),          # Assignment operator
            ('END',     r';'),           # Statement terminator
            ('ID',      r'[A-Za-z]+'),   # Identifiers
            ('OP',      r'[+*\/\-]'),    # Arithmetic operators
            ('NEWLINE', r'\n'),          # Line endings
            ('SKIP',    r'[ \t]'),       # Skip over spaces and tabs
        ]
        tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
        get_token = re.compile(tok_regex).match
        line = 1
        pos = line_start = 0
        mo = get_token(s)
        while mo is not None:
            typ = mo.lastgroup
            if typ == 'NEWLINE':
                line_start = pos
                line += 1
            elif typ != 'SKIP':
                val = mo.group(typ)
                if typ == 'ID' and val in keywords:
                    typ = val
                yield Token(typ, val, line, mo.start()-line_start)
            pos = mo.end()
            mo = get_token(s, pos)
        if pos != len(s):
            raise RuntimeError('Unexpected character %r on line %d' %(s[pos], line))

    statements = '''
        IF quantity THEN
            total := total + price * quantity;
            tax := price * 0.05;
        ENDIF;
    '''

    for token in tokenize(statements):
        print(token)
				1336
				1337	The tokenizer produces the following output::
Raymond Hettinger	9c47d77	2011-05-13 01:03:50 -0700	[diff] [blame]	1338
Raymond Hettinger	4b244ef	2011-05-23 12:45:34 -0700	[diff] [blame]	1339	Token(typ='IF', value='IF', line=2, column=5)
				1340	Token(typ='ID', value='quantity', line=2, column=8)
				1341	Token(typ='THEN', value='THEN', line=2, column=17)
				1342	Token(typ='ID', value='total', line=3, column=9)
				1343	Token(typ='ASSIGN', value=':=', line=3, column=15)
				1344	Token(typ='ID', value='total', line=3, column=18)
				1345	Token(typ='OP', value='+', line=3, column=24)
				1346	Token(typ='ID', value='price', line=3, column=26)
				1347	Token(typ='OP', value='*', line=3, column=32)
				1348	Token(typ='ID', value='quantity', line=3, column=34)
				1349	Token(typ='END', value=';', line=3, column=42)
				1350	Token(typ='ID', value='tax', line=4, column=9)
				1351	Token(typ='ASSIGN', value=':=', line=4, column=13)
				1352	Token(typ='ID', value='price', line=4, column=16)
				1353	Token(typ='OP', value='*', line=4, column=22)
				1354	Token(typ='NUMBER', value='0.05', line=4, column=24)
				1355	Token(typ='END', value=';', line=4, column=28)
				1356	Token(typ='ENDIF', value='ENDIF', line=5, column=5)
				1357	Token(typ='END', value=';', line=5, column=10)