Blame - Doc/library/re.rst - platform/external/python/cpython2

Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1	:mod:`re` --- Regular expression operations
				2	===========================================
				3
				4	.. module:: re
				5	:synopsis: Regular expression operations.
				6	.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
				7	.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
				8
				9
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	10	This module provides regular expression matching operations similar to
Georg Brandl	ed2a1db	2009-06-08 07:48:27 +0000	[diff] [blame]	11	those found in Perl.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	12
				13	Both patterns and strings to be searched can be Unicode strings as well as
				14	8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed:
				15	that is, you cannot match a Unicode string with a byte pattern or
Georg Brandl	ae2dbe2	2009-03-13 19:04:40 +0000	[diff] [blame]	16	vice-versa; similarly, when asking for a substitution, the replacement
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	17	string must be of the same type as both the pattern and the search string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	18
				19	Regular expressions use the backslash character (``'\'``) to indicate
				20	special forms or to allow special characters to be used without invoking
				21	their special meaning. This collides with Python's usage of the same
				22	character for the same purpose in string literals; for example, to match
				23	a literal backslash, one might have to write ``'\\\\'`` as the pattern
				24	string, because the regular expression must be ``\\``, and each
				25	backslash must be expressed as ``\\`` inside a regular Python string
				26	literal.
				27
				28	The solution is to use Python's raw string notation for regular expression
				29	patterns; backslashes are not handled in any special way in a string literal
				30	prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
				31	``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	32	newline. Usually patterns will be expressed in Python code using this raw
				33	string notation.
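As a minimal sketch of the point above, the following compares the two spellings of the same pattern and the two newline literals. The sample path ``C:\temp`` and all variable names are made up for illustration.

```python
import re

# Both literals build the same two-character pattern: backslash, backslash.
plain_pattern = '\\\\'
raw_pattern = r'\\'

# r"\n" keeps the backslash; "\n" is a single newline character.
raw_newline = r"\n"
real_newline = "\n"

# Hypothetical sample text containing one literal backslash: C:\temp
sample = 'C:\\temp'
backslash_at = re.search(raw_pattern, sample).start()
```

The raw form is usually easier to read because the pattern seen in the source is the pattern the regular expression parser receives.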
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	34
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	35	It is important to note that most regular expression operations are available as
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	36	module-level functions and methods on
				37	:ref:`compiled regular expressions <re-objects>`. The functions are shortcuts
				38	that don't require you to compile a regex object first, but miss some
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	39	fine-tuning parameters.
				40
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	41	.. seealso::
				42
				43	Mastering Regular Expressions
				44	Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	45	second edition of the book no longer covers Python at all, but the first
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	46	edition covered writing good regular expression patterns in great detail.
				47
				48
				49	.. _re-syntax:
				50
				51	Regular Expression Syntax
				52	-------------------------
				53
				54	A regular expression (or RE) specifies a set of strings that matches it; the
				55	functions in this module let you check if a particular string matches a given
				56	regular expression (or if a given regular expression matches a particular
				57	string, which comes down to the same thing).
				58
				59	Regular expressions can be concatenated to form new regular expressions; if A
				60	and B are both regular expressions, then AB is also a regular expression.
				61	In general, if a string p matches A and another string q matches B, the
				62	string pq will match AB. This holds unless A or B contain low precedence
				63	operations; boundary conditions between A and B; or have numbered group
				64	references. Thus, complex expressions can easily be constructed from simpler
				65	primitive expressions like the ones described here. For details of the theory
				66	and implementation of regular expressions, consult the Friedl book referenced
				67	above, or almost any textbook about compiler construction.
				68
				69	A brief explanation of the format of regular expressions follows. For further
Christian Heimes	2202f87	2008-02-06 14:31:34 +0000	[diff] [blame]	70	information and a gentler presentation, consult the :ref:`regex-howto`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	71
				72	Regular expressions can contain both special and ordinary characters. Most
				73	ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
				74	expressions; they simply match themselves. You can concatenate ordinary
				75	characters, so ``last`` matches the string ``'last'``. (In the rest of this
				76	section, we'll write REs in ``this special style``, usually without quotes, and
				77	strings to be matched ``'in single quotes'``.)
				78
				79	Some characters, like ``'|'`` or ``'('``, are special. Special
				80	characters either stand for classes of ordinary characters, or affect
				81	how the regular expressions around them are interpreted. Regular
				82	expression pattern strings may not contain null bytes, but can specify
				83	the null byte using the ``\number`` notation, e.g., ``'\x00'``.
				84
				85
				86	The special characters are:
				87
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	88	``'.'``
				89	(Dot.) In the default mode, this matches any character except a newline. If
				90	the :const:`DOTALL` flag has been specified, this matches any character
				91	including a newline.
				92
				93	``'^'``
				94	(Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
				95	matches immediately after each newline.
				96
				97	``'$'``
				98	Matches the end of the string or just before the newline at the end of the
				99	string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
				100	matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
				101	only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
Christian Heimes	25bb783	2008-01-11 16:17:00 +0000	[diff] [blame]	102	matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
				103	a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
				104	the newline, and one at the end of the string.
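The ``'$'`` cases described above can be checked with a short sketch. It uses only :func:`re.search` and :func:`re.finditer`; the variable names are illustrative.

```python
import re

text = 'foo1\nfoo2\n'

# Default mode: '$' matches only at the very end or before a final newline.
default_hit = re.search(r'foo.$', text).group(0)

# MULTILINE mode: '$' also matches before every newline.
multiline_hit = re.search(r'foo.$', text, re.MULTILINE).group(0)

# A lone '$' finds two empty matches in 'foo\n'.
dollar_positions = [m.start() for m in re.finditer(r'$', 'foo\n')]
```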
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	105
				106	``'*'``
				107	Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
				108	many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
				109	by any number of 'b's.
				110
				111	``'+'``
				112	Causes the resulting RE to match 1 or more repetitions of the preceding RE.
				113	``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
				114	match just 'a'.
				115
				116	``'?'``
				117	Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
				118	``ab?`` will match either 'a' or 'ab'.
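A small sketch of the three basic qualifiers follows. It assumes a Python version that provides :func:`re.fullmatch` (3.4 or later), used here so that the whole candidate string must match; the candidate list is made up.

```python
import re

candidates = ['a', 'ab', 'abbb', 'b']

# 'ab*': an 'a' followed by zero or more 'b's.
star = [s for s in candidates if re.fullmatch(r'ab*', s)]

# 'ab+': an 'a' followed by at least one 'b'.
plus = [s for s in candidates if re.fullmatch(r'ab+', s)]

# 'ab?': an 'a' followed by at most one 'b'.
optional = [s for s in candidates if re.fullmatch(r'ab?', s)]
```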
				119
				120	``*?``, ``+?``, ``??``
				121	The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
				122	as much text as possible. Sometimes this behaviour isn't desired; if the RE
				123	``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
				124	string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
				125	perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
				126	characters as possible will be matched. Using ``.*?`` in the previous
				127	expression will match only ``'<H1>'``.
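The difference between the greedy and non-greedy forms can be sketched with the same string used above:

```python
import re

html = '<H1>title</H1>'

# Greedy: '.*' runs to the last '>' in the string.
greedy = re.match(r'<.*>', html).group(0)

# Non-greedy: '.*?' stops at the first '>' that lets the match succeed.
minimal = re.match(r'<.*?>', html).group(0)
```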
				128
				129	``{m}``
				130	Specifies that exactly m copies of the previous RE should be matched; fewer
				131	matches cause the entire RE not to match. For example, ``a{6}`` will match
				132	exactly six ``'a'`` characters, but not five.
				133
				134	``{m,n}``
				135	Causes the resulting RE to match from m to n repetitions of the preceding
				136	RE, attempting to match as many repetitions as possible. For example,
				137	``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting m specifies a
				138	lower bound of zero, and omitting n specifies an infinite upper bound. As an
				139	example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
				140	followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
				141	modifier would be confused with the previously described form.
				142
				143	``{m,n}?``
				144	Causes the resulting RE to match from m to n repetitions of the preceding
				145	RE, attempting to match as few repetitions as possible. This is the
				146	non-greedy version of the previous qualifier. For example, on the
				147	6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
				148	while ``a{3,5}?`` will only match 3 characters.
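The bounded repetition forms can be exercised as follows; this is an illustrative sketch using the strings from the text above.

```python
import re

# Greedy and non-greedy bounded repetition on six 'a' characters.
greedy = re.match(r'a{3,5}', 'aaaaaa').group(0)
minimal = re.match(r'a{3,5}?', 'aaaaaa').group(0)

# Open upper bound: at least four 'a' characters, then 'b'.
four_ok = re.match(r'a{4,}b', 'aaaab') is not None
many_ok = re.match(r'a{4,}b', 'a' * 1000 + 'b') is not None
three_ok = re.match(r'a{4,}b', 'aaab') is not None
```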
				149
				150	``'\'``
				151	Either escapes special characters (permitting you to match characters like
				152	``'*'``, ``'?'``, and so forth), or signals a special sequence; special
				153	sequences are discussed below.
				154
				155	If you're not using a raw string to express the pattern, remember that Python
				156	also uses the backslash as an escape sequence in string literals; if the escape
				157	sequence isn't recognized by Python's parser, the backslash and subsequent
				158	character are included in the resulting string. However, if Python would
				159	recognize the resulting sequence, the backslash should be repeated twice. This
				160	is complicated and hard to understand, so it's highly recommended that you use
				161	raw strings for all but the simplest expressions.
				162
				163	``[]``
				164	Used to indicate a set of characters. Characters can be listed individually, or
				165	a range of characters can be indicated by giving two characters and separating
				166	them by a ``'-'``. Special characters are not active inside sets. For example,
				167	``[akm$]`` will match any of the characters ``'a'``, ``'k'``,
				168	``'m'``, or ``'$'``; ``[a-z]`` will match any lowercase letter, and
				169	``[a-zA-Z0-9]`` matches any letter or digit. Character classes such
				170	as ``\w`` or ``\S`` (defined below) are also acceptable inside a
Mark Summerfield	8676534	2008-08-20 07:40:18 +0000	[diff] [blame]	171	range, although the characters they match depend on whether
				172	:const:`ASCII` or :const:`LOCALE` mode is in force. If you want to
				173	include a ``']'`` or a ``'-'`` inside a set, precede it with a
				174	backslash, or place it as the first character. The pattern ``[]]``
				175	will match ``']'``, for example.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	176
				177	You can match the characters not within a range by :dfn:`complementing` the set.
				178	This is indicated by including a ``'^'`` as the first character of the set;
				179	``'^'`` elsewhere will simply match the ``'^'`` character. For example,
				180	``[^5]`` will match any character except ``'5'``, and ``[^^]`` will match any
				181	character except ``'^'``.
				182
Mark Summerfield	9e670c2	2008-05-31 13:05:34 +0000	[diff] [blame]	183	Note that inside ``[]`` the special forms and special characters lose
				184	their meanings and only the syntaxes described here are valid. For
				185	example, ``+``, ``*``, ``(``, ``)``, and so on are treated as
				186	literals inside ``[]``, and backreferences cannot be used inside
				187	``[]``.
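The set rules above can be illustrated with a few calls. The input strings are made up for the example.

```python
import re

# Listed characters, including '$', which is not special inside a set.
listed = re.findall(r'[akm$]', 'makes $5')

# Complemented set: everything except '5'.
not_five = re.findall(r'[^5]', '1525')

# ']' placed first in the set is taken literally.
bracket = re.match(r'[]]', ']') is not None

# Operators and parentheses are plain literals inside a set.
literals = re.findall(r'[+*()]', '(1+2)*3')
```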
				188
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	189	``'|'``
				190	``A|B``, where A and B can be arbitrary REs, creates a regular expression that
				191	will match either A or B. An arbitrary number of REs can be separated by the
				192	``'|'`` in this way. This can be used inside groups (see below) as well. As
				193	the target string is scanned, REs separated by ``'|'`` are tried from left to
				194	right. When one pattern completely matches, that branch is accepted. This means
				195	that once ``A`` matches, ``B`` will not be tested further, even if it would
				196	produce a longer overall match. In other words, the ``'|'`` operator is never
				197	greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
				198	character class, as in ``[|]``.
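The left-to-right behaviour of alternation can be seen in a short sketch; the order of the branches decides which one wins.

```python
import re

# The first branch that matches is accepted, even if a later one is longer.
short_first = re.match(r'a|ab', 'ab').group(0)
long_first = re.match(r'ab|a', 'ab').group(0)

# Two ways to match a literal vertical bar.
escaped = re.findall(r'\|', 'a|b|c')
in_class = re.findall(r'[|]', 'a|b|c')
```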
				199
				200	``(...)``
				201	Matches whatever regular expression is inside the parentheses, and indicates the
				202	start and end of a group; the contents of a group can be retrieved after a match
				203	has been performed, and can be matched later in the string with the ``\number``
				204	special sequence, described below. To match the literals ``'('`` or ``')'``,
				205	use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
				206
				207	``(?...)``
				208	This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
				209	otherwise). The first character after the ``'?'`` determines what the meaning
				210	and further syntax of the construct is. Extensions usually do not create a new
				211	group; ``(?P<name>...)`` is the only exception to this rule. Following are the
				212	currently supported extensions.
				213
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	214	``(?aiLmsux)``
				215	(One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
				216	``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
Andrew M. Kuchling	1c50e86	2009-06-01 00:11:36 +0000	[diff] [blame]	217	letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	218	:const:`re.I` (ignore case), :const:`re.L` (locale dependent),
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	219	:const:`re.M` (multi-line), :const:`re.S` (dot matches all),
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	220	and :const:`re.X` (verbose), for the entire regular expression. (The
				221	flags are described in :ref:`contents-of-module-re`.) This
				222	is useful if you wish to include the flags as part of the regular
				223	expression, instead of passing a flag argument to the
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	224	:func:`re.compile` function.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	225
				226	Note that the ``(?x)`` flag changes how the expression is parsed. It should be
				227	used first in the expression string, or after one or more whitespace characters.
				228	If there are non-whitespace characters before the flag, the results are
				229	undefined.
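As a sketch, an inline flag placed at the start of the pattern should behave like the corresponding ``flags`` argument:

```python
import re

# Flag embedded in the pattern itself.
inline = re.match(r'(?i)hello', 'HELLO')

# Same flag passed as an argument instead.
by_argument = re.match(r'hello', 'HELLO', re.IGNORECASE)

# Without the flag the match fails.
no_flag = re.match(r'hello', 'HELLO')
```

Keeping the inline flag at the very start of the pattern avoids the undefined cases mentioned above.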
				230
				231	``(?:...)``
Georg Brandl	3122ce3	2010-10-29 06:17:38 +0000	[diff] [blame]	232	A non-capturing version of regular parentheses. Matches whatever regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	233	expression is inside the parentheses, but the substring matched by the group
				234	cannot be retrieved after performing a match or referenced later in the
				235	pattern.
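The effect on group numbering can be sketched by comparing a capturing and a non-capturing version of the same pattern:

```python
import re

# Capturing parentheses: the repeated group is numbered 1.
capturing = re.match(r'(ab)+(c)', 'ababc').groups()

# Non-capturing parentheses: only '(c)' produces a group.
non_capturing = re.match(r'(?:ab)+(c)', 'ababc').groups()
```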
				236
				237	``(?P<name>...)``
				238	Similar to regular parentheses, but the substring matched by the group is
Benjamin Peterson	d23f822	2009-04-05 19:13:16 +0000	[diff] [blame]	239	accessible within the rest of the regular expression via the symbolic group
				240	name ``name``. Group names must be valid Python identifiers, and each group
				241	name must be defined only once within a regular expression. A symbolic group
				242	is also a numbered group, just as if the group were not named. So the group
				243	named ``id`` in the example below can also be referenced as the numbered group
				244	``1``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	245
				246	For example, if the pattern is ``(?P<id>[a-zA-Z_]\w*)``, the group can be
				247	referenced by its name in arguments to methods of match objects, such as
Benjamin Peterson	d23f822	2009-04-05 19:13:16 +0000	[diff] [blame]	248	``m.group('id')`` or ``m.end('id')``, and also by name in the regular
				249	expression itself (using ``(?P=id)``) and replacement text given to
				250	``.sub()`` (using ``\g<id>``).
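The three ways of referring to a named group can be sketched as follows. The sample strings and variable names are made up for illustration.

```python
import re

ident = re.compile(r'(?P<id>[a-zA-Z_]\w*)')

# 1. From a match object, by name or by number.
m = ident.match('spam = 1')
by_name = m.group('id')
by_number = m.group(1)
end_offset = m.end('id')

# 2. Inside the pattern itself, with (?P=id).
doubled = re.search(r'(?P<id>[a-zA-Z_]\w*) (?P=id)', 'say hello hello').group(0)

# 3. In replacement text, with \g<id>.
wrapped = ident.sub(r'<\g<id>>', 'a b')
```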
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	251
				252	``(?P=name)``
				253	Matches whatever text was matched by the earlier group named ``name``.
				254
				255	``(?#...)``
				256	A comment; the contents of the parentheses are simply ignored.
				257
				258	``(?=...)``
				259	Matches if ``...`` matches next, but doesn't consume any of the string. This is
				260	called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
				261	``'Isaac '`` only if it's followed by ``'Asimov'``.
				262
				263	``(?!...)``
				264	Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
				265	For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's not
				266	followed by ``'Asimov'``.
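Both lookahead assertions can be tried on the examples from the text. Note that the assertion itself is not part of the matched substring.

```python
import re

# Positive lookahead: 'Isaac ' only when 'Asimov' follows.
positive = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov').group(0)

# Negative lookahead: 'Isaac ' only when 'Asimov' does not follow.
rejected = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')
accepted = re.search(r'Isaac (?!Asimov)', 'Isaac Newton').group(0)
```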
				267
				268	``(?<=...)``
				269	Matches if the current position in the string is preceded by a match for ``...``
				270	that ends at the current position. This is called a :dfn:`positive lookbehind
				271	assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
				272	lookbehind will back up 3 characters and check if the contained pattern matches.
				273	The contained pattern must only match strings of some fixed length, meaning that
				274	``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
				275	patterns which start with positive lookbehind assertions will never match at the
				276	beginning of the string being searched; you will most likely want to use the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	277	:func:`search` function rather than the :func:`match` function:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	278
				279	>>> import re
				280	>>> m = re.search('(?<=abc)def', 'abcdef')
				281	>>> m.group(0)
				282	'def'
				283
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	284	This example looks for a word following a hyphen:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	285
				286	>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
				287	>>> m.group(0)
				288	'egg'
				289
				290	``(?<!...)``
				291	Matches if the current position in the string is not preceded by a match for
				292	``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
				293	positive lookbehind assertions, the contained pattern must only match strings of
				294	some fixed length. Patterns which start with negative lookbehind assertions may
				295	match at the beginning of the string being searched.
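A minimal sketch of a negative lookbehind, using a made-up input string: it collects words that are not immediately preceded by a hyphen.

```python
import re

# '\b\w+' finds whole words; '(?<!-)' rejects those that follow a hyphen.
words = re.findall(r'(?<!-)\b\w+', 'spam-egg ham')
```

The first word is found even though nothing precedes it, which shows that a negative lookbehind can succeed at the start of the string.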
				296
				297	``(?(id/name)yes-pattern|no-pattern)``
Senthil Kumaran	abd4a05	2011-03-12 11:40:25 +0800	[diff] [blame]	298	Will try to match with ``yes-pattern`` if the group with given id or
				299	name exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is
				300	optional and can be omitted. For example,
				301	``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)`` is a poor email matching pattern, which
				302	will match with ``'<user@host.com>'`` as well as ``'user@host.com'``, but
				303	not with ``'<user@host.com'`` nor ``'user@host.com>'``.
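The conditional pattern above can be checked against the four addresses with :func:`re.match`. This is only a sketch of the construct, not a recommended way to validate email addresses.

```python
import re

# If group 1 ('<') took part in the match, require '>'; otherwise require the end.
email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

samples = ['<user@host.com>', 'user@host.com', '<user@host.com', 'user@host.com>']
results = {s: email.match(s) is not None for s in samples}
```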
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	304
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	305
				306	The special sequences consist of ``'\'`` and a character from the list below.
				307	If the ordinary character is not on the list, then the resulting RE will match
				308	the second character. For example, ``\$`` matches the character ``'$'``.
				309
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	310	``\number``
				311	Matches the contents of the group of the same number. Groups are numbered
				312	starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
				313	but not ``'the end'`` (note the space after the group). This special sequence
				314	can only be used to match one of the first 99 groups. If the first digit of
				315	number is 0, or number is 3 octal digits long, it will not be interpreted as
				316	a group match, but as the character with octal value number. Inside the
				317	``'['`` and ``']'`` of a character class, all numeric escapes are treated as
				318	characters.
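A short sketch of a numbered backreference, using the strings mentioned above:

```python
import re

repeated = re.compile(r'(.+) \1')

# '\1' must repeat exactly what group 1 matched.
the_the = repeated.match('the the') is not None
digits = repeated.match('55 55') is not None
the_end = repeated.match('the end') is not None
```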
				319
				320	``\A``
				321	Matches only at the start of the string.
				322
				323	``\b``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	324	Matches the empty string, but only at the beginning or end of a word.
				325	A word is defined as a sequence of Unicode alphanumeric or underscore
				326	characters, so the end of a word is indicated by whitespace or a
				327	non-alphanumeric, non-underscore Unicode character. Note that
				328	formally, ``\b`` is defined as the boundary between a ``\w`` and a
				329	``\W`` character (or vice versa). By default Unicode alphanumerics
				330	are the ones used, but this can be changed by using the :const:`ASCII`
				331	flag. Inside a character range, ``\b`` represents the backspace
				332	character, for compatibility with Python's string literals.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	333
				334	``\B``
				335	Matches the empty string, but only when it is not at the beginning or end of a
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	336	word. This is just the opposite of ``\b``, so word characters are
				337	Unicode alphanumerics or the underscore, although this can be changed
				338	by using the :const:`ASCII` flag.
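The following sketch contrasts ``\b`` and ``\B`` on made-up input, and shows the backspace meaning of ``\b`` inside a character class.

```python
import re

# '\bfoo\b' matches 'foo' only as a whole word.
whole_words = re.findall(r'\bfoo\b', 'foo foobar (foo) barfoo')

# 'py\B' matches 'py' only when another word character follows.
inside_words = re.findall(r'py\B', 'python py py3')

# Inside a character class, '\b' is the backspace character.
backspace = re.match(r'[\b]', '\x08') is not None
```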
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	339
				340	``\d``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	341	For Unicode (str) patterns:
Mark Dickinson	1f26828	2009-07-28 17:22:36 +0000	[diff] [blame]	342	Matches any Unicode decimal digit (that is, any character in
				343	Unicode character category [Nd]). This includes ``[0-9]``, and
				344	also many other digit characters. If the :const:`ASCII` flag is
				345	used only ``[0-9]`` is matched (but the flag affects the entire
				346	regular expression, so in such cases using an explicit ``[0-9]``
				347	may be a better choice).
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	348	For 8-bit (bytes) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	349	Matches any decimal digit; this is equivalent to ``[0-9]``.
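As an illustration, the sketch below applies ``\d`` to a string holding an ASCII digit, an Arabic-Indic digit (U+0662) and a Devanagari digit (U+0969), then repeats the search with the ASCII flag and with a bytes pattern. The sample data is made up.

```python
import re

sample = '1\u0662\u0969'

# Unicode (str) pattern: every character in category Nd matches.
unicode_digits = re.findall(r'\d', sample)

# With the ASCII flag only [0-9] matches.
ascii_digits = re.findall(r'\d', sample, re.ASCII)

# Bytes pattern: equivalent to [0-9].
byte_digits = re.findall(rb'\d', b'12a')
```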
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	350
				351	``\D``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	352	Matches any character which is not a Unicode decimal digit. This is
				353	the opposite of ``\d``. If the :const:`ASCII` flag is used this
				354	becomes the equivalent of ``[^0-9]`` (but the flag affects the entire
				355	regular expression, so in such cases using an explicit ``[^0-9]`` may
				356	be a better choice).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	357
				358	``\s``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	359	For Unicode (str) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	360	Matches Unicode whitespace characters (which includes
				361	``[ \t\n\r\f\v]``, and also many other characters, for example the
				362	non-breaking spaces mandated by typography rules in many
				363	languages). If the :const:`ASCII` flag is used, only
				364	``[ \t\n\r\f\v]`` is matched (but the flag affects the entire
				365	regular expression, so in such cases using an explicit
				366	``[ \t\n\r\f\v]`` may be a better choice).
				367
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	368	For 8-bit (bytes) patterns:
				369	Matches characters considered whitespace in the ASCII character set;
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	370	this is equivalent to ``[ \t\n\r\f\v]``.
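The non-breaking space (U+00A0) is a convenient test case for the difference between the Unicode and ASCII definitions of whitespace; a minimal sketch:

```python
import re

nbsp = '\xa0'

# Unicode (str) pattern: the non-breaking space counts as whitespace.
unicode_ws = re.match(r'\s', nbsp) is not None

# ASCII flag: only [ \t\n\r\f\v] counts.
ascii_ws = re.match(r'\s', nbsp, re.ASCII) is not None

# Bytes pattern: byte 0xA0 is not ASCII whitespace.
bytes_ws = re.match(rb'\s', b'\xa0') is not None
```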
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	371
				372	``\S``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	373	Matches any character which is not a Unicode whitespace character. This is
				374	the opposite of ``\s``. If the :const:`ASCII` flag is used this
				375	becomes the equivalent of ``[^ \t\n\r\f\v]`` (but the flag affects the entire
				376	regular expression, so in such cases using an explicit ``[^ \t\n\r\f\v]`` may
				377	be a better choice).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	378
				379	``\w``
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	380	For Unicode (str) patterns:
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	381	Matches Unicode word characters; this includes most characters
				382	that can be part of a word in any language, as well as numbers and
				383	the underscore. If the :const:`ASCII` flag is used, only
				384	``[a-zA-Z0-9_]`` is matched (but the flag affects the entire
				385	regular expression, so in such cases using an explicit
				386	``[a-zA-Z0-9_]`` may be a better choice).
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	387	For 8-bit (bytes) patterns:
				388	Matches characters considered alphanumeric in the ASCII character set;
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	389	this is equivalent to ``[a-zA-Z0-9_]``.
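The effect of the ASCII flag on ``\w`` can be sketched with accented letters. The sample text is made up and written with escapes to avoid depending on the source file encoding.

```python
import re

sample = 'na\u00efve caf\u00e9_1'  # "naive cafe_1" with accented i and e

# Unicode (str) pattern: accented letters are word characters.
unicode_words = re.findall(r'\w+', sample)

# ASCII flag: only [a-zA-Z0-9_] are word characters, so words are split.
ascii_words = re.findall(r'\w+', sample, re.ASCII)
```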
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	390
				391	``\W``
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	392	Matches any character which is not a Unicode word character. This is
				393	the opposite of ``\w``. If the :const:`ASCII` flag is used this
				394	becomes the equivalent of ``[^a-zA-Z0-9_]`` (but the flag affects the
				395	entire regular expression, so in such cases using an explicit
				396	``[^a-zA-Z0-9_]`` may be a better choice).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	397
				398	``\Z``
				399	Matches only at the end of the string.
				400
				401	Most of the standard escapes supported by Python string literals are also
				402	accepted by the regular expression parser::
				403
				404	\a \b \f \n
				405	\r \t \v \x
				406	\\
				407
				408	Octal escapes are included in a limited form: If the first digit is a 0, or if
				409	there are three octal digits, it is considered an octal escape. Otherwise, it is
				410	a group reference. As for string literals, octal escapes are always at most
				411	three digits in length.
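The distinction between octal escapes and group references can be sketched as follows (octal 101 is the character ``'A'``):

```python
import re

# Three octal digits: an octal escape, here for 'A'.
octal_three = re.match(r'\101', 'A') is not None

# Leading zero: an octal escape, here for the null character.
octal_zero = re.match(r'\0', '\x00') is not None

# A single non-zero digit: a reference to group 1.
group_ref = re.match(r'(a)\1', 'aa') is not None

# Hexadecimal escape accepted by the regular expression parser.
hex_escape = re.match(r'\x41', 'A') is not None
```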
				412
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	413
				414	.. _matching-searching:
				415
				416	Matching vs Searching
				417	---------------------
				418
				419	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
				420
				421
				422	Python offers two different primitive operations based on regular expressions:
Guido van Rossum	04110fb	2007-08-24 16:32:05 +0000	[diff] [blame]	423	match checks for a match only at the beginning of the string, while
				424	search checks for a match anywhere in the string (this is what Perl does
				425	by default).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	426
Guido van Rossum	04110fb	2007-08-24 16:32:05 +0000	[diff] [blame]	427	Note that match may differ from search even when using a regular expression
				428	beginning with ``'^'``: ``'^'`` matches only at the start of the string, or in
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	429	:const:`MULTILINE` mode also immediately following a newline. The "match"
				430	operation succeeds only if the pattern matches at the start of the string
				431	regardless of mode, or at the starting position given by the optional pos
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	432	argument regardless of whether a newline precedes it.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	433
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	434	>>> re.match("c", "abcdef") # No match
				435	>>> re.search("c", "abcdef") # Match
				436	<_sre.SRE_Match object at ...>
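The interaction with ``'^'`` and :const:`MULTILINE` described above can be sketched like this; the two-line sample string is made up.

```python
import re

text = 'A\nX'

# match() is anchored at the start of the string, whatever the mode.
match_caret = re.match(r'^X', text, re.MULTILINE)

# search() may use '^' after the newline in MULTILINE mode.
search_caret = re.search(r'^X', text, re.MULTILINE)

# The match() method of a compiled pattern can start at a given position.
at_pos = re.compile(r'X').match(text, 2)
```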
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	437
				438
				439	.. _contents-of-module-re:
				440
				441	Module Contents
				442	---------------
				443
				444	The module defines several functions, constants, and an exception. Some of the
				445	functions are simplified versions of the full featured methods for compiled
				446	regular expressions. Most non-trivial applications always use the compiled
				447	form.
				448
				449
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	450	.. function:: compile(pattern, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	451
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	452	Compile a regular expression pattern into a regular expression object, which
				453	can be used for matching using its :func:`match` and :func:`search` methods,
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	454	described below.
				455
				456	The expression's behaviour can be modified by specifying a flags value.
				457	Values can be any of the following variables, combined using bitwise OR (the
				458	``|`` operator).
				459
				460	The sequence ::
				461
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	462	prog = re.compile(pattern)
				463	result = prog.match(string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	464
				465	is equivalent to ::
				466
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	467	result = re.match(pattern, string)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	468
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	469	but using :func:`re.compile` and saving the resulting regular expression
				470	object for reuse is more efficient when the expression will be used several
				471	times in a single program.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	472
Gregory P. Smith	4221c74	2009-03-02 05:04:04 +0000	[diff] [blame]	473	.. note::
				474
				475	The compiled versions of the most recent patterns passed to
				476	:func:`re.match`, :func:`re.search` or :func:`re.compile` are cached, so
				477	programs that use only a few regular expressions at a time needn't worry
				478	about compiling regular expressions.
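A minimal sketch of compiling once with combined flags and reusing the object; the pattern, the report text and the names are hypothetical.

```python
import re

# Flags are combined with bitwise OR.
total_re = re.compile(r'^total:\s*(\d+)$', re.IGNORECASE | re.MULTILINE)

report = 'items: 3\nTOTAL: 42\n'

# Using the compiled object.
found = total_re.search(report)
total = found.group(1)

# Equivalent module-level call with the same pattern and flags.
same = re.search(r'^total:\s*(\d+)$', report, re.IGNORECASE | re.MULTILINE).group(1)
```

Keeping the compiled object in a module-level name is one common way to reuse it across calls.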
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	479
				480
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	481	.. data:: A
				482	ASCII
				483
Georg Brandl	4049ce0	2009-06-08 07:49:54 +0000	[diff] [blame]	484	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
				485	perform ASCII-only matching instead of full Unicode matching. This is only
				486	meaningful for Unicode patterns, and is ignored for byte patterns.
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	487
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	488	Note that for backward compatibility, the :const:`re.U` flag still
				489	exists (as well as its synonym :const:`re.UNICODE` and its embedded
Georg Brandl	ebeb44d	2010-07-29 11:15:36 +0000	[diff] [blame]	490	counterpart ``(?u)``), but these are redundant in Python 3 since
Mark Summerfield	6c4f617	2008-08-20 07:34:41 +0000	[diff] [blame]	491	matches are Unicode by default for strings (and Unicode matching
				492	isn't allowed for bytes).
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	493
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	494
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	495	.. data:: I
				496	IGNORECASE
				497
				498	Perform case-insensitive matching; expressions like ``[A-Z]`` will match
Mark Summerfield	8676534	2008-08-20 07:40:18 +0000	[diff] [blame]	499	lowercase letters, too. This is not affected by the current locale
				500	and works for Unicode characters as expected.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	501
				502
				503	.. data:: L
				504	LOCALE
				505
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	506	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	507	current locale. The use of this flag is discouraged as the locale mechanism
				508	is very unreliable, and it only handles one "culture" at a time anyway;
Georg Brandl	ebeb44d	2010-07-29 11:15:36 +0000	[diff] [blame]	509	you should use Unicode matching instead, which is the default in Python 3
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	510	for Unicode (str) patterns.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	511
				512
				513	.. data:: M
				514	MULTILINE
				515
				516	When specified, the pattern character ``'^'`` matches at the beginning of the
				517	string and at the beginning of each line (immediately following each newline);
				518	and the pattern character ``'$'`` matches at the end of the string and at the
				519	end of each line (immediately preceding each newline). By default, ``'^'``
				520	matches only at the beginning of the string, and ``'$'`` only at the end of the
				521	string and immediately before the newline (if any) at the end of the string.
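A minimal sketch of the difference, using a made-up three-line string:

```python
import re

text = "first\nsecond\nthird\n"

# Default: '^' matches only at the very beginning of the string.
default_starts = re.findall(r"^\w+", text)

# re.MULTILINE: '^' also matches right after each newline.
multiline_starts = re.findall(r"^\w+", text, re.MULTILINE)

# Default: '$' matches at the end of the string and just before a
# trailing newline, so only the last line is found.
default_ends = re.findall(r"\w+$", text)

# re.MULTILINE: '$' also matches just before each newline.
multiline_ends = re.findall(r"\w+$", text, re.MULTILINE)

print(default_starts)    # ['first']
print(multiline_starts)  # ['first', 'second', 'third']
print(default_ends)      # ['third']
print(multiline_ends)    # ['first', 'second', 'third']
```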
				522
				523
				524	.. data:: S
				525	DOTALL
				526
				527	Make the ``'.'`` special character match any character at all, including a
				528	newline; without this flag, ``'.'`` will match anything except a newline.
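For instance, with an illustrative two-line string:

```python
import re

text = "a\nb"

# Without the flag, '.' refuses to match the newline between "a" and "b".
without_flag = re.match(r"a.b", text)

# With re.DOTALL (re.S), '.' matches the newline as well.
with_flag = re.match(r"a.b", text, re.DOTALL)

print(without_flag)       # None
print(with_flag.group())  # the whole three-character string 'a\nb'
```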
				529
				530
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	531	.. data:: X
				532	VERBOSE
				533
				534	This flag allows you to write regular expressions that look nicer. Whitespace
				535	within the pattern is ignored, except when in a character class or preceded by
				536	an unescaped backslash, and, when a line contains a ``'#'`` neither in a
				537	character class or preceded by an unescaped backslash, all characters from the
				538	leftmost such ``'#'`` through the end of the line are ignored.
				539
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	540	That means that the two following regular expression objects that match a
				541	decimal number are functionally equal::
Georg Brandl	81ac1ce	2007-08-31 17:17:17 +0000	[diff] [blame]	542
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	543	a = re.compile(r"""\d + # the integral part
				544	\. # the decimal point
				545	\d * # some fractional digits""", re.X)
				546	b = re.compile(r"\d+\.\d*")
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	547
				548
Antoine Pitrou	fd03645	2008-08-19 17:56:33 +0000	[diff] [blame]	549
				550
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	551	.. function:: search(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	552
				553	Scan through string looking for a location where the regular expression
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	554	pattern produces a match, and return a corresponding :ref:`match object
				555	<match-objects>`. Return ``None`` if no position in the string matches the
				556	pattern; note that this is different from finding a zero-length match at some
				557	point in the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	558
				559
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	560	.. function:: match(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	561
				562	If zero or more characters at the beginning of string match the regular
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	563	expression pattern, return a corresponding :ref:`match object
				564	<match-objects>`. Return ``None`` if the string does not match the pattern;
				565	note that this is different from a zero-length match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	566
				567	.. note::
				568
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	569	If you want to locate a match anywhere in string, use :func:`search`
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	570	instead.
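A short sketch of the difference between the two functions (the sample string is arbitrary):

```python
import re

text = "abcdef"

# match() only succeeds if the pattern matches at the start of the string.
at_start = re.match(r"c", text)

# search() scans forward and finds the first position that matches.
anywhere = re.search(r"c", text)

# A zero-length match is still a match object, not None.
empty = re.match(r"x*", text)

print(at_start)          # None
print(anywhere.start())  # 2
print(empty.group())     # '' (empty string)
```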
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	571
				572
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	573	.. function:: split(pattern, string, maxsplit=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	574
				575	Split string by the occurrences of pattern. If capturing parentheses are
				576	used in pattern, then the text of all groups in the pattern are also returned
				577	as part of the resulting list. If maxsplit is nonzero, at most maxsplit
				578	splits occur, and the remainder of the string is returned as the final element
Georg Brandl	9647389	2008-03-06 07:09:43 +0000	[diff] [blame]	579	of the list. ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	580
				581	>>> re.split('\W+', 'Words, words, words.')
				582	['Words', 'words', 'words', '']
				583	>>> re.split('(\W+)', 'Words, words, words.')
				584	['Words', ', ', 'words', ', ', 'words', '.', '']
				585	>>> re.split('\W+', 'Words, words, words.', 1)
				586	['Words', 'words, words.']
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	587	>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
				588	['0', '3', '9']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	589
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	590	If there are capturing groups in the separator and it matches at the start of
				591	the string, the result will start with an empty string. The same holds for
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	592	the end of the string:
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	593
				594	>>> re.split('(\W+)', '...words, words...')
				595	['', '...', 'words', ', ', 'words', '...', '']
				596
				597	That way, separator components are always found at the same relative
				598	indices within the result list (e.g., if there's one capturing group
				599	in the separator, at indices 1, 3 and so forth).
				600
Thomas Wouters	89d996e	2007-09-08 17:39:28 +0000	[diff] [blame]	601	Note that split will never split a string on an empty pattern match.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	602	For example:
Thomas Wouters	89d996e	2007-09-08 17:39:28 +0000	[diff] [blame]	603
				604	>>> re.split('x*', 'foo')
				605	['foo']
				606	>>> re.split("(?m)^$", "foo\n\nbar\n")
				607	['foo\n\nbar\n']
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	608
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	609	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	610	Added the optional flags argument.
				611
Christian Heimes	dd15f6c	2008-03-16 00:07:10 +0000	[diff] [blame]	612
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	613	.. function:: findall(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	614
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	615	Return all non-overlapping matches of pattern in string, as a list of
Georg Brandl	3dbca81	2008-07-23 16:10:53 +0000	[diff] [blame]	616	strings. The string is scanned left-to-right, and matches are returned in
				617	the order found. If one or more groups are present in the pattern, return a
				618	list of groups; this will be a list of tuples if the pattern has more than
				619	one group. Empty matches are included in the result unless they touch the
				620	beginning of another match.
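The effect of groups on the return value can be sketched as follows (the ``key=value`` input is a made-up example):

```python
import re

text = "width=80, height=24"

# No groups: a list of the full matches.
whole = re.findall(r"\w+=\d+", text)

# One group: a list of that group's text for each match.
values = re.findall(r"\w+=(\d+)", text)

# Several groups: a list of tuples, one tuple per match.
pairs = re.findall(r"(\w+)=(\d+)", text)

print(whole)   # ['width=80', 'height=24']
print(values)  # ['80', '24']
print(pairs)   # [('width', '80'), ('height', '24')]
```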
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	621
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	622
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	623	.. function:: finditer(pattern, string, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	624
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	625	Return an :term:`iterator` yielding :ref:`match objects <match-objects>` over
				626	all non-overlapping matches for the RE pattern in string. The string
				627	is scanned left-to-right, and matches are returned in the order found. Empty
Georg Brandl	3dbca81	2008-07-23 16:10:53 +0000	[diff] [blame]	628	matches are included in the result unless they touch the beginning of another
				629	match.
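Because each item is a match object, positions are available as well as the matched text. A small sketch with an arbitrary input string:

```python
import re

text = "a1bb22ccc333"

# Collect (matched text, start index, end index) for every run of digits.
found = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]

print(found)  # [('1', 1, 2), ('22', 4, 6), ('333', 9, 12)]
```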
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	630
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	631
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	632	.. function:: sub(pattern, repl, string, count=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	633
				634	Return the string obtained by replacing the leftmost non-overlapping occurrences
				635	of pattern in string by the replacement repl. If the pattern isn't found,
				636	string is returned unchanged. repl can be a string or a function; if it is
				637	a string, any backslash escapes in it are processed. That is, ``\n`` is
				638	converted to a single newline character, ``\r`` is converted to a carriage return, and
				639	so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
				640	as ``\6``, are replaced with the substring matched by group 6 in the pattern.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	641	For example:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	642
				643	>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
				644	... r'static PyObject*\npy_\1(void)\n{',
				645	... 'def myfunc():')
				646	'static PyObject*\npy_myfunc(void)\n{'
				647
				648	If repl is a function, it is called for every non-overlapping occurrence of
				649	pattern. The function takes a single match object argument, and returns the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	650	replacement string. For example:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	651
				652	>>> def dashrepl(matchobj):
				653	...     if matchobj.group(0) == '-': return ' '
				654	...     else: return '-'
				655	>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
				656	'pro--gram files'
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	657	>>> re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
				658	'Baked Beans & Spam'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	659
Georg Brandl	1b5ab45	2009-08-13 07:56:35 +0000	[diff] [blame]	660	The pattern may be a string or an RE object.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	661
				662	The optional argument count is the maximum number of pattern occurrences to be
				663	replaced; count must be a non-negative integer. If omitted or zero, all
				664	occurrences will be replaced. Empty matches for the pattern are replaced only
				665	when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
				666	``'-a-b-c-'``.
				667
				668	In addition to character escapes and backreferences as described above,
				669	``\g<name>`` will use the substring matched by the group named ``name``, as
				670	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
				671	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
				672	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
				673	reference to group 20, not a reference to group 2 followed by the literal
				674	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
				675	substring matched by the RE.
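The ``\g<...>`` forms can be tried out with a few small substitutions. This is only a sketch; the group names ``year`` and ``month`` are chosen for the example:

```python
import re

# Named groups referenced with \g<name>.
swapped = re.sub(r"(?P<year>\d{4})-(?P<month>\d{2})",
                 r"\g<month>/\g<year>", "2010-07")

# \g<2>0 means "group 2, then a literal 0"; writing \20 here would
# instead refer to a (nonexistent) group 20.
times_ten = re.sub(r"(\d)(\d)", r"\g<2>0", "12")

# \g<0> stands for the entire match.
bracketed = re.sub(r"\w+", r"[\g<0>]", "a b")

print(swapped)    # 07/2010
print(times_ten)  # 20
print(bracketed)  # [a] [b]
```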
				676
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	677	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	678	Added the optional flags argument.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	679
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	680
Georg Brandl	1824415	2009-09-02 20:34:52 +0000	[diff] [blame]	681	.. function:: subn(pattern, repl, string, count=0, flags=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	682
				683	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
				684	number_of_subs_made)``.
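For example, collapsing runs of whitespace in an arbitrary sample string:

```python
import re

text = "a  b \t c"

# Replace every run of whitespace with a single space and count the changes.
collapsed, n_all = re.subn(r"\s+", " ", text)

# With count=1 only the first occurrence is replaced.
first_only, n_one = re.subn(r"\s+", " ", text, count=1)

print(collapsed, n_all)  # a b c 2
print(repr(first_only), n_one)
```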
				685
Jeroen Ruigrok van der Werven	b70ccc3	2009-04-27 08:07:12 +0000	[diff] [blame]	686	.. versionchanged:: 3.1
Gregory P. Smith	ccc5ae7	2009-03-02 05:21:55 +0000	[diff] [blame]	687	Added the optional flags argument.
				688
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	689
				690	.. function:: escape(string)
				691
				692	Return string with all non-alphanumerics backslashed; this is useful if you
				693	want to match an arbitrary literal string that may have regular expression
				694	metacharacters in it.
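The exact set of characters that gets escaped has varied between Python versions, so the sketch below checks only the behaviour that matters: the escaped pattern matches the literal text. The sample strings are made up.

```python
import re

literal = "1+1=2 (really?)"
haystack = "we know 1+1=2 (really?) holds"

# Unescaped, '+', '(', ')' and '?' act as metacharacters, so the
# literal text is not found.
unescaped = re.search(literal, haystack)

# Escaped, the pattern matches the text exactly as written.
escaped = re.search(re.escape(literal), haystack)

print(unescaped)        # None
print(escaped.group())  # 1+1=2 (really?)
```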
				695
				696
R. David Murray	522c32a	2010-07-10 14:23:36 +0000	[diff] [blame]	697	.. function:: purge()
				698
				699	Clear the regular expression cache.
				700
				701
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	702	.. exception:: error
				703
				704	Exception raised when a string passed to one of the functions here is not a
				705	valid regular expression (for example, it might contain unmatched parentheses)
				706	or when some other error occurs during compilation or matching. It is never an
				707	error if a string contains no match for a pattern.
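A minimal sketch of handling an invalid pattern; the helper name ``try_compile`` is hypothetical and exists only for this example:

```python
import re

def try_compile(source):
    """Return a compiled pattern, or None if source is not a valid RE."""
    try:
        return re.compile(source)
    except re.error:
        return None

bad = try_compile("(unclosed")   # unmatched parenthesis raises re.error
good = try_compile("(closed)")

# A failed match is not an error: the result is simply None.
no_match = good.search("nothing to see here")

print(bad)       # None
print(no_match)  # None
```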
				708
				709
				710	.. _re-objects:
				711
				712	Regular Expression Objects
				713	--------------------------
				714
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	715	Compiled regular expression objects support the following methods and
				716	attributes.
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	717
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	718	.. method:: regex.search(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	719
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	720	Scan through string looking for a location where this regular expression
				721	produces a match, and return a corresponding :ref:`match object
				722	<match-objects>`. Return ``None`` if no position in the string matches the
				723	pattern; note that this is different from finding a zero-length match at some
				724	point in the string.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	725
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	726	The optional second parameter pos gives an index in the string where the
				727	search is to start; it defaults to ``0``. This is not completely equivalent to
				728	slicing the string; the ``'^'`` pattern character matches at the real beginning
				729	of the string and at positions just after a newline, but not necessarily at the
				730	index where the search is to start.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	731
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	732	The optional parameter endpos limits how far the string will be searched; it
				733	will be as if the string is endpos characters long, so only the characters
				734	from pos to ``endpos - 1`` will be searched for a match. If endpos is less
				735	than pos, no match will be found; otherwise, if rx is a compiled regular
				736	expression object, ``rx.search(string, 0, 50)`` is equivalent to
				737	``rx.search(string[:50], 0)``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	738
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	739	>>> pattern = re.compile("d")
				740	>>> pattern.search("dog") # Match at index 0
				741	<_sre.SRE_Match object at ...>
				742	>>> pattern.search("dog", 1) # No match; search doesn't include the "d"
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	743
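The difference between passing pos and slicing the string shows up with ``'^'``. The following sketch uses made-up patterns and strings to contrast the two, and also shows endpos acting like a truncated string:

```python
import re

starts_with_b = re.compile(r"^b")

# With pos=1 the search starts at "b", but '^' still refers to the
# real beginning of the string, so nothing matches.
with_pos = starts_with_b.search("ab", 1)

# Slicing creates a new string whose real beginning is "b".
with_slice = starts_with_b.search("ab"[1:])

# endpos makes the string behave as if it ended at that index,
# so '$' can match there.
trailing_digits = re.compile(r"\d+$")
with_endpos = trailing_digits.search("123abc", 0, 3)

print(with_pos)             # None
print(with_slice.group())   # b
print(with_endpos.group())  # 123
```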
				744
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	745	.. method:: regex.match(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	746
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	747	If zero or more characters at the beginning of string match this regular
				748	expression, return a corresponding :ref:`match object <match-objects>`.
				749	Return ``None`` if the string does not match the pattern; note that this is
				750	different from a zero-length match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	751
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	752	The optional pos and endpos parameters have the same meaning as for the
				753	:meth:`~regex.search` method.
Benjamin Peterson	d7c3ed5	2010-06-27 22:32:30 +0000	[diff] [blame]	754
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	755	.. note::
Benjamin Peterson	d7c3ed5	2010-06-27 22:32:30 +0000	[diff] [blame]	756
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	757	If you want to locate a match anywhere in string, use
				758	:meth:`~regex.search` instead.
Benjamin Peterson	d7c3ed5	2010-06-27 22:32:30 +0000	[diff] [blame]	759
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	760	>>> pattern = re.compile("o")
				761	>>> pattern.match("dog") # No match as "o" is not at the start of "dog".
				762	>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
				763	<_sre.SRE_Match object at ...>
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	764
				765
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	766	.. method:: regex.split(string, maxsplit=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	767
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	768	Identical to the :func:`split` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	769
				770
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	771	.. method:: regex.findall(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	772
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	773	Similar to the :func:`findall` function, using the compiled pattern, but
				774	also accepts optional pos and endpos parameters that limit the search
				775	region like for :meth:`match`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	776
				777
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	778	.. method:: regex.finditer(string[, pos[, endpos]])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	779
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	780	Similar to the :func:`finditer` function, using the compiled pattern, but
				781	also accepts optional pos and endpos parameters that limit the search
				782	region like for :meth:`match`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	783
				784
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	785	.. method:: regex.sub(repl, string, count=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	786
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	787	Identical to the :func:`sub` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	788
				789
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	790	.. method:: regex.subn(repl, string, count=0)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	791
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	792	Identical to the :func:`subn` function, using the compiled pattern.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	793
				794
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	795	.. attribute:: regex.flags
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	796
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	797	The flags argument used when the RE object was compiled, or ``0`` if no flags
				798	were provided.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	799
				800
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	801	.. attribute:: regex.groups
Georg Brandl	af265f4	2008-12-07 15:06:20 +0000	[diff] [blame]	802
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	803	The number of capturing groups in the pattern.
Georg Brandl	af265f4	2008-12-07 15:06:20 +0000	[diff] [blame]	804
				805
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	806	.. attribute:: regex.groupindex
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	807
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	808	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
				809	numbers. The dictionary is empty if no symbolic groups were used in the
				810	pattern.
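As a quick sketch, the attributes above can be inspected on a compiled pattern. The pattern and its group names are invented for the example:

```python
import re

rx = re.compile(r"(?P<key>\w+)=(?P<value>\d+)(;)?")

# Three capturing groups in total, two of them named.
group_count = rx.groups
name_to_number = dict(rx.groupindex)

print(group_count)     # 3
print(name_to_number)  # {'key': 1, 'value': 2}
```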
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	811
				812
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	813	.. attribute:: regex.pattern
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	814
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	815	The pattern string from which the RE object was compiled.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	816
				817
				818	.. _match-objects:
				819
				820	Match Objects
				821	-------------
				822
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	823	Match objects always have a boolean value of :const:`True`, so that you can test
				824	whether e.g. :func:`match` resulted in a match with a simple if statement. They
				825	support the following methods and attributes:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	826
				827
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	828	.. method:: match.expand(template)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	829
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	830	Return the string obtained by doing backslash substitution on the template
				831	string template, as done by the :meth:`~regex.sub` method.
				832	Escapes such as ``\n`` are converted to the appropriate characters,
				833	and numeric backreferences (``\1``, ``\2``) and named backreferences
				834	(``\g<1>``, ``\g<name>``) are replaced by the contents of the
				835	corresponding group.
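For instance, a template can mix numeric and named references. The group names ``first`` and ``last`` are assumptions made for this sketch:

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# \g<last> refers to the named group, \1 to the first group by number.
formatted = m.expand(r"\g<last>, \1")

print(formatted)  # Newton, Isaac
```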
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	836
				837
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	838	.. method:: match.group([group1, ...])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	839
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	840	Returns one or more subgroups of the match. If there is a single argument, the
				841	result is a single string; if there are multiple arguments, the result is a
				842	tuple with one item per argument. Without arguments, group1 defaults to zero
				843	(the whole match is returned). If a groupN argument is zero, the corresponding
				844	return value is the entire matching string; if it is in the inclusive range
				845	[1..99], it is the string matching the corresponding parenthesized group. If a
				846	group number is negative or larger than the number of groups defined in the
				847	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
				848	part of the pattern that did not match, the corresponding result is ``None``.
				849	If a group is contained in a part of the pattern that matched multiple times,
				850	the last match is returned.
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	851
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	852	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
				853	>>> m.group(0) # The entire match
				854	'Isaac Newton'
				855	>>> m.group(1) # The first parenthesized subgroup.
				856	'Isaac'
				857	>>> m.group(2) # The second parenthesized subgroup.
				858	'Newton'
				859	>>> m.group(1, 2) # Multiple arguments give us a tuple.
				860	('Isaac', 'Newton')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	861
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	862	If the regular expression uses the ``(?P<name>...)`` syntax, the groupN
				863	arguments may also be strings identifying groups by their group name. If a
				864	string argument is not used as a group name in the pattern, an :exc:`IndexError`
				865	exception is raised.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	866
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	867	A moderately complicated example:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	868
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	869	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				870	>>> m.group('first_name')
				871	'Malcolm'
				872	>>> m.group('last_name')
				873	'Reynolds'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	874
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	875	Named groups can also be referred to by their index:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	876
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	877	>>> m.group(1)
				878	'Malcolm'
				879	>>> m.group(2)
				880	'Reynolds'
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	881
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	882	If a group matches multiple times, only the last match is accessible:
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	883
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	884	>>> m = re.match(r"(..)+", "a1b2c3") # Matches 3 times.
				885	>>> m.group(1) # Returns only the last match.
				886	'c3'
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	887
Brian Curtin	48f16f9	2010-04-08 13:55:29 +0000	[diff] [blame]	888
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	889	.. method:: match.groups(default=None)
Brian Curtin	48f16f9	2010-04-08 13:55:29 +0000	[diff] [blame]	890
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	891	Return a tuple containing all the subgroups of the match, from 1 up to however
				892	many groups are in the pattern. The default argument is used for groups that
				893	did not participate in the match; it defaults to ``None``.
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	894
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	895	For example:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	896
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	897	>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
				898	>>> m.groups()
				899	('24', '1632')
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	900
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	901	If we make the decimal place and everything after it optional, not all groups
				902	might participate in the match. These groups will default to ``None`` unless
				903	the default argument is given:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	904
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	905	>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
				906	>>> m.groups() # Second group defaults to None.
				907	('24', None)
				908	>>> m.groups('0') # Now, the second group defaults to '0'.
				909	('24', '0')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	910
				911
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	912	.. method:: match.groupdict(default=None)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	913
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	914	Return a dictionary containing all the named subgroups of the match, keyed by
				915	the subgroup name. The default argument is used for groups that did not
				916	participate in the match; it defaults to ``None``. For example:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	917
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	918	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
				919	>>> m.groupdict()
				920	{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
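The default argument matters when a named group is optional. A small sketch, with group names ``whole`` and ``frac`` invented for the example:

```python
import re

# The fractional part is optional, so "frac" may not participate.
m = re.match(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?", "24")

without_default = m.groupdict()
with_default = m.groupdict("0")

print(without_default)  # {'whole': '24', 'frac': None}
print(with_default)     # {'whole': '24', 'frac': '0'}
```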
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	921
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	922
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	923	.. method:: match.start([group])
				924	match.end([group])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	925
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	926	Return the indices of the start and end of the substring matched by group;
				927	group defaults to zero (meaning the whole matched substring). Return ``-1`` if
				928	group exists but did not contribute to the match. For a match object m, and
				929	a group g that did contribute to the match, the substring matched by group g
				930	(equivalent to ``m.group(g)``) is ::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	931
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	932	m.string[m.start(g):m.end(g)]
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	933
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	934	Note that ``m.start(group)`` will equal ``m.end(group)`` if group matched a
				935	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
				936	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
				937	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	938
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	939	An example that will remove remove_this from email addresses:
Brian Curtin	027e478	2010-03-26 00:39:56 +0000	[diff] [blame]	940
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	941	>>> email = "tony@tiremove_thisger.net"
				942	>>> m = re.search("remove_this", email)
				943	>>> email[:m.start()] + email[m.end():]
				944	'tony@tiger.net'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	945
				946
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	947	.. method:: match.span([group])
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	948
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	949	For a match m, return the 2-tuple ``(m.start(group), m.end(group))``. Note
				950	that if group did not contribute to the match, this is ``(-1, -1)``.
				951	group defaults to zero, the entire match.
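The three cases (whole match, empty group, non-participating group) can be seen in one small sketch with a made-up pattern:

```python
import re

m = re.search(r"b(c?)(d)?", "abe")

whole = m.span()        # the entire match, "b"
empty_group = m.span(1) # group 1 matched the empty string after "b"
unused_group = m.span(2) # group 2 did not contribute to the match

print(whole)         # (1, 2)
print(empty_group)   # (2, 2)
print(unused_group)  # (-1, -1)
```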
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	952
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	953
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	954	.. attribute:: match.pos
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	955
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	956	The value of pos which was passed to the :meth:`~regex.search` or
				957	:meth:`~regex.match` method of a :ref:`regex object <re-objects>`. This
				958	is the index into the string at which the RE engine started looking for a
				959	match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	960
				961
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	962	.. attribute:: match.endpos
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	963
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	964	The value of endpos which was passed to the :meth:`~regex.search` or
				965	:meth:`~regex.match` method of a :ref:`regex object <re-objects>`. This
				966	is the index into the string beyond which the RE engine will not go.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	967
				968
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	969	.. attribute:: match.lastindex
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	970
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	971	The integer index of the last matched capturing group, or ``None`` if no group
				972	was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
				973	``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
				974	the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
				975	string.
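
The four expressions mentioned above can be checked directly. This is only a quick sketch that applies each of them to ``'ab'`` and collects ``lastindex``.

```python
import re

patterns = [r"(a)b", r"((a)(b))", r"((ab))", r"(a)(b)"]

# Map each pattern to the lastindex of its match against 'ab'.
last = {p: re.match(p, "ab").lastindex for p in patterns}

for p in patterns:
    print(p, last[p])

# A pattern with no capturing groups leaves lastindex as None.
no_groups = re.match(r"ab", "ab").lastindex
```

In the nested cases the outermost group is the last one to close, which is why ``((a)(b))`` reports ``1`` rather than ``3``.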
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	976
				977
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	978	.. attribute:: match.lastgroup
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	979
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	980	The name of the last matched capturing group, or ``None`` if the group didn't
				981	have a name, or if no group was matched at all.
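
A minimal sketch of ``lastgroup``, using hypothetical group names ``word`` and ``number`` chosen for this example:

```python
import re

named = re.compile(r"(?P<word>[a-z]+)|(?P<number>\d+)")

print(named.match("42").lastgroup)     # the branch that matched is 'number'
print(named.match("spam").lastgroup)   # the branch that matched is 'word'

# The group matched, but it has no name, so lastgroup is None.
unnamed = re.match(r"(\d+)", "42").lastgroup
```

This behaviour is what makes ``lastgroup`` convenient for finding out which alternative of a combined pattern matched.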
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	982
				983
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	984	.. attribute:: match.re
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	985
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	986	The regular expression object whose :meth:`~regex.match` or
				987	:meth:`~regex.search` method produced this match instance.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	988
				989
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	990	.. attribute:: match.string
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	991
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	992	The string passed to :meth:`~regex.match` or :meth:`~regex.search`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	993
				994
Raymond Hettinger	1fa7682	2010-12-06 23:31:36 +0000	[diff] [blame]	995	.. _re-examples:
				996
				997	Regular Expression Examples
				998	---------------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	999
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1000
				1001	Checking For a Pair
				1002	^^^^^^^^^^^^^^^^^^^
				1003
				1004	In this example, we'll use the following helper function to display match
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1005	objects a little more gracefully:
				1006
				1007	.. testcode::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1008
   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())
				1013
Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "0" for 10, and "1" through "9"
representing the card with that value.
				1018
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1019	To see if a given string is a valid hand, one could do the following:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1020
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1021	>>> valid = re.compile(r"[0-9akqj]{5}$")
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1022	>>> displaymatch(valid.match("ak05q")) # Valid.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1023	"<Match: 'ak05q', groups=()>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1024	>>> displaymatch(valid.match("ak05e")) # Invalid.
				1025	>>> displaymatch(valid.match("ak0")) # Invalid.
				1026	>>> displaymatch(valid.match("727ak")) # Valid.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1027	"<Match: '727ak', groups=()>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1028
				1029	That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1030	To match this with a regular expression, one could use backreferences as such:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1031
   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1038
Georg Brandl	f346ac0	2009-07-26 15:03:49 +0000	[diff] [blame]	1039	To find out what card the pair consists of, one could use the
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1040	:meth:`~match.group` method of the match object in the following manner:
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1041
				1042	.. doctest::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1043
   >>> pair.match("717ak").group(1)
   '7'

   # Error because pair.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       pair.match("718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'
				1056
				1057
				1058	Simulating scanf()
				1059	^^^^^^^^^^^^^^^^^^
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1060
				1061	.. index:: single: scanf()
				1062
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1063	Python does not currently have an equivalent to :c:func:`scanf`. Regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1064	expressions are generally more powerful, though also more verbose, than
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1065	:c:func:`scanf` format strings. The table below offers some more-or-less
				1066	equivalent mappings between :c:func:`scanf` format tokens and regular
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1067	expressions.
				1068
+--------------------------------+---------------------------------------------+
| :c:func:`scanf` Token          | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``0[0-7]*``                                 |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``0[xX][\dA-Fa-f]+``                        |
+--------------------------------+---------------------------------------------+
				1090
				1091	To extract the filename and numbers from a string like ::
				1092
				1093	/usr/sbin/sendmail - 0 errors, 4 warnings
				1094
Georg Brandl	60203b4	2010-10-06 10:11:56 +0000	[diff] [blame]	1095	you would use a :c:func:`scanf` format like ::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1096
				1097	%s - %d errors, %d warnings
				1098
				1099	The equivalent regular expression would be ::
				1100
				1101	(\S+) - (\d+) errors, (\d+) warnings
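
Applied with :func:`match`, that expression yields the three fields as strings, which then need converting by hand (unlike :c:func:`scanf`, no conversion is performed). The variable names below are illustrative only.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"

m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

filename = m.group(1)
n_errors = int(m.group(2))     # groups are strings; convert explicitly
n_warnings = int(m.group(3))

print(filename, n_errors, n_warnings)
```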
				1102
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1103
				1104	Avoiding recursion
				1105	^^^^^^^^^^^^^^^^^^
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1106
If you create regular expressions that require the engine to perform a lot of
recursion, you may encounter a :exc:`RuntimeError` exception with the message
``maximum recursion limit exceeded``. For example, ::

   >>> s = 'Begin ' + 1000*'a very long string ' + 'end'
   >>> re.match('Begin (\w| )*? end', s).end()
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
     File "/usr/local/lib/python3.2/re.py", line 132, in match
       return _compile(pattern, flags).match(string)
   RuntimeError: maximum recursion limit exceeded
				1118
				1119	You can often restructure your regular expression to avoid recursion.
				1120
Georg Brandl	e6bcc91	2008-05-12 18:05:20 +0000	[diff] [blame]	1121	Simple uses of the ``*?`` pattern are special-cased to avoid recursion. Thus,
				1122	the above regular expression can avoid recursion by being recast as ``Begin
				1123	[a-zA-Z0-9_ ]*?end``. As a further benefit, such regular expressions will run
				1124	faster than their recursive equivalents.
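
As a sketch of the recast form, the example below applies the character-class version of the pattern to the same long string. Whether the original alternation actually hits the recursion limit depends on the interpreter version, so only the recast pattern is exercised here.

```python
import re

s = 'Begin ' + 1000 * 'a very long string ' + 'end'

# A lazily repeated character class replaces the alternation inside a group.
recast = re.compile(r'Begin [a-zA-Z0-9_ ]*?end')

m = recast.match(s)
end_index = m.end()

print(end_index == len(s))
```

The match runs to the end of the string because ``'end'`` first appears at the very end of ``s``.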
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1125
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1126
				1127	search() vs. match()
				1128	^^^^^^^^^^^^^^^^^^^^
				1129
In a nutshell, :func:`match` only attempts to match a pattern at the beginning
of a string, whereas :func:`search` will match a pattern anywhere in a string.
For example:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1133
				1134	>>> re.match("o", "dog") # No match as "o" is not the first letter of "dog".
				1135	>>> re.search("o", "dog") # Match as search() looks everywhere in the string.
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1136	<_sre.SRE_Match object at ...>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1137
				1138	.. note::
				1139
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1140	The following applies only to regular expression objects like those created
				1141	with ``re.compile("pattern")``, not the primitives ``re.match(pattern,
				1142	string)`` or ``re.search(pattern, string)``.
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1143
				1144	:func:`match` has an optional second parameter that gives an index in the string
Benjamin Peterson	f07d002	2009-03-21 17:31:58 +0000	[diff] [blame]	1145	where the search is to start::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1146
				1147	>>> pattern = re.compile("o")
				1148	>>> pattern.match("dog") # No match as "o" is not at the start of "dog."
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1149
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1150	# Equivalent to the above expression as 0 is the default starting index:
				1151	>>> pattern.match("dog", 0)
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1152
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1153	# Match as "o" is the 2nd character of "dog" (index 0 is the first):
				1154	>>> pattern.match("dog", 1)
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1155	<_sre.SRE_Match object at ...>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1156	>>> pattern.match("dog", 2) # No match as "o" is not the 3rd character of "dog."
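
One related point that the comparison above suggests: in :const:`MULTILINE` mode, :func:`match` still only tries the start of the string, while :func:`search` with a leading ``'^'`` can match at the start of any line. The following sketch, with a made-up three-line string, shows the difference.

```python
import re

text = "A\nB\nX"

# match() is anchored at the very beginning of the string, even in MULTILINE mode.
anchored = re.match("X", text, re.MULTILINE)

# search() with '^' may match at the beginning of any line in MULTILINE mode.
found = re.search("^X", text, re.MULTILINE)

print(anchored)
print(found.start())
```

Here ``anchored`` is ``None`` and ``found`` starts at index 4, the beginning of the third line.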
				1157
				1158
				1159	Making a Phonebook
				1160	^^^^^^^^^^^^^^^^^^
				1161
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1162	:func:`split` splits a string into a list delimited by the passed pattern. The
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1163	method is invaluable for converting textual data into data structures that can be
				1164	easily read and modified by Python as demonstrated in the following example that
				1165	creates a phonebook.
				1166
First, here is the input. Normally it would come from a file; here we are using
triple-quoted string syntax:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1169
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1170	>>> input = """Ross McFluff: 834.345.1254 155 Elm Street
Georg Brandl	48310cd	2009-01-03 21:18:54 +0000	[diff] [blame]	1171	...
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1172	... Ronald Heathmore: 892.345.3428 436 Finley Avenue
				1173	... Frank Burger: 925.541.7625 662 South Dogwood Way
				1174	...
				1175	...
				1176	... Heather Albrecht: 548.326.4584 919 Park Place"""
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1177
				1178	The entries are separated by one or more newlines. Now we convert the string
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1179	into a list with each nonempty line having its own entry:
				1180
				1181	.. doctest::
				1182	:options: +NORMALIZE_WHITESPACE
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1183
				1184	>>> entries = re.split("\n+", input)
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1185	>>> entries
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1186	['Ross McFluff: 834.345.1254 155 Elm Street',
				1187	'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
				1188	'Frank Burger: 925.541.7625 662 South Dogwood Way',
				1189	'Heather Albrecht: 548.326.4584 919 Park Place']
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1190
				1191	Finally, split each entry into a list with first name, last name, telephone
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1192	number, and address. We use the ``maxsplit`` parameter of :func:`split`
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1193	because the address has spaces, our splitting pattern, in it:
				1194
				1195	.. doctest::
				1196	:options: +NORMALIZE_WHITESPACE
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1197
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1198	>>> [re.split(":? ", entry, 3) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1199	[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
				1200	['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
				1201	['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
				1202	['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
				1203
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1204	The ``:?`` pattern matches the colon after the last name, so that it does not
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1205	occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1206	house number from the street name:
				1207
				1208	.. doctest::
				1209	:options: +NORMALIZE_WHITESPACE
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1210
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1211	>>> [re.split(":? ", entry, 4) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1212	[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
				1213	['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
				1214	['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
				1215	['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
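
Going one step further than the text does, the split fields could be collected into a dictionary keyed by full name. This is only a sketch: the ``phonebook`` name and the choice of key are assumptions made for this example, and only two of the entries are repeated to keep it short.

```python
import re

entries = [
    "Ross McFluff: 834.345.1254 155 Elm Street",
    "Heather Albrecht: 548.326.4584 919 Park Place",
]

phonebook = {}
for entry in entries:
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(":? ", entry, maxsplit=3)
    phonebook[first + " " + last] = (phone, address)

print(phonebook["Ross McFluff"])
```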
				1216
				1217
				1218	Text Munging
				1219	^^^^^^^^^^^^
				1220
				1221	:func:`sub` replaces every occurrence of a pattern with a string or the
				1222	result of a function. This example demonstrates using :func:`sub` with
				1223	a function to "munge" text, or randomize the order of all the characters
				1224	in each word of a sentence except for the first and last characters::
				1225
   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
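
Because the shuffle is random, the output differs from run to run. A variant that takes its own random number generator makes the result repeatable, which can help when testing. The ``munge`` helper and the seed value below are hypothetical and not part of the module.

```python
import random
import re

def munge(text, rng):
    """Shuffle the interior letters of every word of three or more characters."""
    def repl(m):
        inner = list(m.group(2))
        rng.shuffle(inner)
        return m.group(1) + "".join(inner) + m.group(3)
    return re.sub(r"(\w)(\w+)(\w)", repl, text)

rng = random.Random(0)  # seeded, so repeated runs give the same result
text = "Professor Abdolmalek, please report your absences promptly."
munged = munge(text, rng)

print(munged)
```

Whatever the seed, each word keeps its first and last characters and the same set of letters.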
				1235
				1236
				1237	Finding all Adverbs
				1238	^^^^^^^^^^^^^^^^^^^
				1239
Christian Heimes	c3f30c4	2008-02-22 16:37:40 +0000	[diff] [blame]	1240	:func:`findall` matches all occurrences of a pattern, not just the first
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1241	one as :func:`search` does. For example, if one was a writer and wanted to
				1242	find all of the adverbs in some text, he or she might use :func:`findall` in
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1243	the following manner:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1244
				1245	>>> text = "He was carefully disguised but captured quickly by police."
				1246	>>> re.findall(r"\w+ly", text)
				1247	['carefully', 'quickly']
				1248
				1249
				1250	Finding all Adverbs and their Positions
				1251	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				1252
				1253	If one wants more information about all matches of a pattern than the matched
Georg Brandl	c62a704	2010-07-29 11:49:05 +0000	[diff] [blame]	1254	text, :func:`finditer` is useful as it provides :ref:`match objects
				1255	<match-objects>` instead of strings. Continuing with the previous example, if
				1256	one was a writer who wanted to find all of the adverbs and their positions in
				1257	some text, he or she would use :func:`finditer` in the following manner:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1258
   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly
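
When the positions are needed later rather than printed, the same loop can collect them into a list. The ``found`` name and the tuple layout are choices made for this sketch.

```python
import re

text = "He was carefully disguised but captured quickly by police."

# Each entry is (start, end, matched text).
found = [(m.start(), m.end(), m.group(0))
         for m in re.finditer(r"\w+ly", text)]

print(found)
```

Slicing ``text`` with the stored start and end indices gives back each adverb.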
				1264
				1265
				1266	Raw String Notation
				1267	^^^^^^^^^^^^^^^^^^^
				1268
				1269	Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
				1270	every backslash (``'\'``) in a regular expression would have to be prefixed with
				1271	another one to escape it. For example, the two following lines of code are
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1272	functionally identical:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1273
				1274	>>> re.match(r"\W(.)\1\W", " ff ")
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1275	<_sre.SRE_Match object at ...>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1276	>>> re.match("\\W(.)\\1\\W", " ff ")
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1277	<_sre.SRE_Match object at ...>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1278
				1279	When one wants to match a literal backslash, it must be escaped in the regular
				1280	expression. With raw string notation, this means ``r"\\"``. Without raw string
				1281	notation, one must use ``"\\\\"``, making the following lines of code
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1282	functionally identical:
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1283
				1284	>>> re.match(r"\\", r"\\")
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1285	<_sre.SRE_Match object at ...>
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1286	>>> re.match("\\\\", r"\\")
Christian Heimes	fe337bf	2008-03-23 21:54:12 +0000	[diff] [blame]	1287	<_sre.SRE_Match object at ...>
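
The equivalence can also be checked on the string objects themselves, independently of the regular expression engine. A short sketch:

```python
import re

raw = r"\\"         # raw literal: two backslash characters
escaped = "\\\\"    # ordinary literal: also two backslash characters

print(raw == escaped, len(raw))

# As a pattern, those two characters match one literal backslash.
m = re.match(raw, "\\")
print(len(m.group()))
```

Both literals produce the same two-character pattern, and that pattern matches a single backslash in the subject string.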
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1288
				1289
				1290	Writing a Tokenizer
				1291	^^^^^^^^^^^^^^^^^^^
				1292
				1293	A `tokenizer or scanner <http://en.wikipedia.org/wiki/Lexical_analysis>`_
				1294	analyzes a string to categorize groups of characters. This is a useful first
				1295	step in writing a compiler or interpreter.
				1296
				1297	The text categories are specified with regular expressions. The technique is
				1298	to combine those into a single master regular expression and to loop over
				1299	successive matches::
				1300
    import collections
    import re

    Token = collections.namedtuple('Token', 'typ value line column')

    def tokenize(s):
        keywords = {'IF', 'THEN', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
        tok_spec = [
            ('NUMBER',  r'\d+(\.\d*)?'), # Integer or decimal number
            ('ASSIGN',  r':='),          # Assignment operator
            ('END',     ';'),            # Statement terminator
            ('ID',      r'[A-Za-z]+'),   # Identifiers
            ('OP',      r'[+*\/\-]'),    # Arithmetic operators
            ('NEWLINE', r'\n'),          # Line endings
            ('SKIP',    r'[ \t]'),       # Skip over spaces and tabs
        ]
        tok_re = '|'.join('(?P<%s>%s)' % pair for pair in tok_spec)
        gettok = re.compile(tok_re).match
        line = 1
        pos = line_start = 0
        mo = gettok(s)
        while mo is not None:
            typ = mo.lastgroup
            if typ == 'NEWLINE':
                line_start = pos
                line += 1
            elif typ != 'SKIP':
                val = mo.group(typ)
                if typ == 'ID' and val in keywords:
                    typ = val
                yield Token(typ, val, line, mo.start()-line_start)
            pos = mo.end()
            mo = gettok(s, pos)
        if pos != len(s):
            raise RuntimeError('Unexpected character %r on line %d' %(s[pos], line))
				1333
				1334	>>> statements = '''\
				1335	total := total + price * quantity;
				1336	tax := price * 0.05;
				1337	'''
    >>> for token in tokenize(statements):
    ...     print(token)
    ...
Raymond Hettinger	37ade9c	2010-09-16 12:02:17 +0000	[diff] [blame]	1341	Token(typ='ID', value='total', line=1, column=8)
				1342	Token(typ='ASSIGN', value=':=', line=1, column=14)
				1343	Token(typ='ID', value='total', line=1, column=17)
				1344	Token(typ='OP', value='+', line=1, column=23)
				1345	Token(typ='ID', value='price', line=1, column=25)
				1346	Token(typ='OP', value='*', line=1, column=31)
				1347	Token(typ='ID', value='quantity', line=1, column=33)
				1348	Token(typ='END', value=';', line=1, column=41)
				1349	Token(typ='ID', value='tax', line=2, column=9)
				1350	Token(typ='ASSIGN', value=':=', line=2, column=13)
				1351	Token(typ='ID', value='price', line=2, column=16)
				1352	Token(typ='OP', value='*', line=2, column=22)
				1353	Token(typ='NUMBER', value='0.05', line=2, column=24)
				1354	Token(typ='END', value=';', line=2, column=28)
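
The core of the technique, one named group per category joined into a master pattern and read back through ``lastgroup``, can be sketched more compactly with :func:`finditer`. This is a simplified variant rather than a replacement for the loop above: the ``TOKEN_SPEC``, ``MASTER``, and ``simple_tokens`` names are made up for this example, line and column tracking is omitted, and :func:`finditer` silently skips characters that match no category instead of raising an error.

```python
import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d*)?"),  # Integer or decimal number
    ("ID",     r"[A-Za-z]+"),      # Identifiers
    ("OP",     r"[+\-*/]"),        # Arithmetic operators
    ("SKIP",   r"[ \t]+"),         # Spaces and tabs, dropped below
]

# One named group per category, joined into a single master pattern.
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def simple_tokens(s):
    """Return (category, text) pairs for every non-SKIP match in s."""
    return [(mo.lastgroup, mo.group())
            for mo in MASTER.finditer(s)
            if mo.lastgroup != "SKIP"]

tokens = simple_tokens("price * 0.05")
print(tokens)
```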