:mod:`re` --- Regular expression operations
===========================================

.. module:: re
   :synopsis: Regular expression operations.
.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>


This module provides regular expression matching operations similar to
those found in Perl. Both patterns and strings to be searched can be
Unicode strings as well as 8-bit strings. The :mod:`re` module is
always available.

Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
their special meaning. This collides with Python's usage of the same
character for the same purpose in string literals; for example, to match
a literal backslash, one might have to write ``'\\\\'`` as the pattern
string, because the regular expression must be ``\\``, and each
backslash must be expressed as ``\\`` inside a regular Python string
literal.

The solution is to use Python's raw string notation for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
newline. Usually patterns will be expressed in Python code using this raw
string notation.
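
The difference between the two notations can be checked directly. The following is a minimal sketch, assuming a current Python 3 interpreter; the variable names are illustrative only.

```python
import re

# A raw string keeps the backslash: two characters, '\' and 'n'.
raw_len = len(r"\n")

# A regular string literal turns \n into a single newline character.
plain_len = len("\n")

# Both spellings produce the same two-character pattern, which matches
# one literal backslash.
same_pattern = (r"\\" == "\\\\")

# The subject r"a\b" holds 'a', a backslash and 'b'.
found = re.search(r"\\", r"a\b").group(0)
```

The raw form is usually easier to read because the pattern text is exactly what the regular expression parser sees.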

Most regular expression operations are available both as module-level
functions and as :class:`RegexObject` methods. The functions are shortcuts
that don't require you to compile a regex object first, but they lack some
fine-tuning parameters.

.. seealso::

   Mastering Regular Expressions
      Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
      second edition of the book no longer covers Python at all, but the first
      edition covered writing good regular expression patterns in great detail.


.. _re-syntax:

Regular Expression Syntax
-------------------------

A regular expression (or RE) specifies a set of strings that matches it; the
functions in this module let you check if a particular string matches a given
regular expression (or if a given regular expression matches a particular
string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match AB. This holds unless *A* or *B* contain low-precedence
operations, boundary conditions between *A* and *B*, or numbered group
references. Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here. For details of the theory
and implementation of regular expressions, consult the Friedl book referenced
above, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further
information and a gentler presentation, consult the Regular Expression HOWTO,
accessible from http://www.python.org/doc/howto/.

Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted. Regular
expression pattern strings may not contain null bytes, but can specify
the null byte using the ``\number`` notation, e.g., ``'\x00'``.


The special characters are:

``'.'``
   (Dot.) In the default mode, this matches any character except a newline. If
   the :const:`DOTALL` flag has been specified, this matches any character
   including a newline.

``'^'``
   (Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
   matches immediately after each newline.

``'$'``
   Matches the end of the string or just before the newline at the end of the
   string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
   matches both ``'foo'`` and ``'foobar'``, while the regular expression ``foo$``
   matches only ``'foo'``. More interestingly, searching for ``foo.$`` in
   ``'foo1\nfoo2\n'`` matches ``'foo2'`` normally, but ``'foo1'`` in
   :const:`MULTILINE` mode; searching for a single ``$`` in ``'foo\n'`` will find
   two (empty) matches: one just before the newline, and one at the end of the
   string.
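
The behaviour of ``'$'`` described above can be reproduced with a short sketch, assuming a current Python 3 interpreter; the variable names are illustrative only.

```python
import re

text = "foo1\nfoo2\n"

# Default mode: '$' matches at the end of the string, or just before a
# trailing newline, so only the last line qualifies.
default_hit = re.search(r"foo.$", text).group(0)

# MULTILINE mode: '$' also matches before every newline, so the first
# line is found first.
multiline_hit = re.search(r"foo.$", text, re.MULTILINE).group(0)

# A bare '$' matches twice in 'foo\n': before the newline and at the end.
dollar_positions = [m.start() for m in re.finditer("$", "foo\n")]
```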

``'*'``
   Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
   many repetitions as are possible. ``ab*`` will match ``'a'``, ``'ab'``, or
   ``'a'`` followed by any number of ``'b'``\ s.

``'+'``
   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
   ``ab+`` will match ``'a'`` followed by any non-zero number of ``'b'``\ s; it
   will not match just ``'a'``.

``'?'``
   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
   ``ab?`` will match either ``'a'`` or ``'ab'``.

``*?``, ``+?``, ``??``
   The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
   as much text as possible. Sometimes this behaviour isn't desired; if the RE
   ``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
   string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
   perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
   characters as possible will be matched. Using ``.*?`` in the previous
   expression will match only ``'<H1>'``.
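
A small sketch of the greedy and non-greedy forms, using the ``'<H1>title</H1>'`` example from the text; the variable names are illustrative only.

```python
import re

html = "<H1>title</H1>"

# Greedy: '.*' runs to the last '>' in the string.
greedy = re.match(r"<.*>", html).group(0)

# Non-greedy: '.*?' stops at the first '>' that lets the match succeed.
minimal = re.match(r"<.*?>", html).group(0)
```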

``{m}``
   Specifies that exactly *m* copies of the previous RE should be matched; fewer
   matches cause the entire RE not to match. For example, ``a{6}`` will match
   exactly six ``'a'`` characters, but not five.

``{m,n}``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as many repetitions as possible. For example,
   ``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting *m* specifies a
   lower bound of zero, and omitting *n* specifies an infinite upper bound. As an
   example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
   followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
   modifier would be confused with the previously described form.

``{m,n}?``
   Causes the resulting RE to match from *m* to *n* repetitions of the preceding
   RE, attempting to match as *few* repetitions as possible. This is the
   non-greedy version of the previous qualifier. For example, on the
   6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
   while ``a{3,5}?`` will only match 3 characters.
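
The counted forms can be tried the same way. This is a minimal sketch; ``re.fullmatch`` is used here only to test a whole string at once and is a choice of this example, not something the text above prescribes.

```python
import re

subject = "aaaaaa"

# Greedy counted repetition takes as many 'a' characters as allowed.
greedy = re.match(r"a{3,5}", subject).group(0)

# The non-greedy variant stops at the lower bound.
lazy = re.match(r"a{3,5}?", subject).group(0)

# With the upper bound omitted, four or more 'a' characters are required.
four_or_more_ok = re.fullmatch(r"a{4,}b", "aaaab") is not None
three_rejected = re.fullmatch(r"a{4,}b", "aaab") is None
```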

``'\'``
   Either escapes special characters (permitting you to match characters like
   ``'*'``, ``'?'``, and so forth), or signals a special sequence; special
   sequences are discussed below.

   If you're not using a raw string to express the pattern, remember that Python
   also uses the backslash as an escape sequence in string literals; if the escape
   sequence isn't recognized by Python's parser, the backslash and subsequent
   character are included in the resulting string. However, if Python would
   recognize the resulting sequence, the backslash should be repeated twice. This
   is complicated and hard to understand, so it's highly recommended that you use
   raw strings for all but the simplest expressions.

``[]``
   Used to indicate a set of characters. Characters can be listed individually, or
   a range of characters can be indicated by giving two characters and separating
   them by a ``'-'``. Special characters are not active inside sets. For example,
   ``[akm$]`` will match any of the characters ``'a'``, ``'k'``,
   ``'m'``, or ``'$'``; ``[a-z]`` will match any lowercase letter, and
   ``[a-zA-Z0-9]`` matches any letter or digit. Character classes such
   as ``\w`` or ``\S`` (defined below) are also acceptable inside a
   set, although the characters they match depend on whether :const:`LOCALE`
   or :const:`UNICODE` mode is in force. If you want to include a
   ``']'`` or a ``'-'`` inside a set, precede it with a backslash, or
   place it as the first character. The pattern ``[]]`` will match
   ``']'``, for example.

   You can match the characters not within a range by :dfn:`complementing` the set.
   This is indicated by including a ``'^'`` as the first character of the set;
   ``'^'`` elsewhere will simply match the ``'^'`` character. For example,
   ``[^5]`` will match any character except ``'5'``, and ``[^^]`` will match any
   character except ``'^'``.
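
The set rules above can be illustrated with a few ``findall`` calls. This is a sketch with made-up subject strings, assuming a current Python 3 interpreter.

```python
import re

# '$' is an ordinary character inside a set.
set_hits = re.findall(r"[akm$]", "mask $5")

# A ']' placed first in the set is taken literally.
bracket_hits = re.findall(r"[]]", "a]b]")

# A leading '^' complements the set: everything except '5'.
not_five = re.findall(r"[^5]", "1525")

# A '^' that is not first is an ordinary member: everything except '^'.
not_caret = re.findall(r"[^^]", "a^b")
```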

``'|'``
   ``A|B``, where A and B can be arbitrary REs, creates a regular expression that
   will match either A or B. An arbitrary number of REs can be separated by the
   ``'|'`` in this way. This can be used inside groups (see below) as well. As
   the target string is scanned, REs separated by ``'|'`` are tried from left to
   right. When one pattern completely matches, that branch is accepted. This means
   that once ``A`` matches, ``B`` will not be tested further, even if it would
   produce a longer overall match. In other words, the ``'|'`` operator is never
   greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
   character class, as in ``[|]``.
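
A short sketch of the left-to-right rule for alternation; the patterns and subject strings are illustrative only.

```python
import re

# The left branch 'a' succeeds first, so the longer 'ab' is never tried.
first_wins = re.match(r"a|ab", "ab").group(0)

# Listing the longer alternative first changes the result.
longer_first = re.match(r"ab|a", "ab").group(0)

# Two ways of matching a literal '|'.
escaped = re.findall(r"\|", "a|b|c")
in_class = re.findall(r"[|]", "a|b|c")
```

When alternatives share a prefix, ordering them from longest to shortest is one way to get the longest match.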

``(...)``
   Matches whatever regular expression is inside the parentheses, and indicates the
   start and end of a group; the contents of a group can be retrieved after a match
   has been performed, and can be matched later in the string with the ``\number``
   special sequence, described below. To match the literals ``'('`` or ``')'``,
   use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.

``(?...)``
   This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
   otherwise). The first character after the ``'?'`` determines what the meaning
   and further syntax of the construct is. Extensions usually do not create a new
   group; ``(?P<name>...)`` is the only exception to this rule. Following are the
   currently supported extensions.

``(?iLmsux)``
   (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
   ``'u'``, ``'x'``.) The group matches the empty string; the letters
   set the corresponding flags: :const:`re.I` (ignore case),
   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
   :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
   and :const:`re.X` (verbose), for the entire regular expression. (The
   flags are described in :ref:`contents-of-module-re`.) This
   is useful if you wish to include the flags as part of the regular
   expression, instead of passing a *flags* argument to the
   :func:`compile` function.

   Note that the ``(?x)`` flag changes how the expression is parsed. It should be
   used first in the expression string, or after one or more whitespace characters.
   If there are non-whitespace characters before the flag, the results are
   undefined.

``(?:...)``
   A non-grouping version of regular parentheses. Matches whatever regular
   expression is inside the parentheses, but the substring matched by the group
   *cannot* be retrieved after performing a match or referenced later in the
   pattern.

``(?P<name>...)``
   Similar to regular parentheses, but the substring matched by the group is
   accessible via the symbolic group name *name*. Group names must be valid Python
   identifiers, and each group name must be defined only once within a regular
   expression. A symbolic group is also a numbered group, just as if the group
   were not named. So the group named ``id`` in the example below can also be
   referenced as the numbered group ``1``.

   For example, if the pattern is ``(?P<id>[a-zA-Z_]\w*)``, the group can be
   referenced by its name in arguments to methods of match objects, such as
   ``m.group('id')`` or ``m.end('id')``, and also by name in pattern text (for
   example, ``(?P=id)``) and replacement text (such as ``\g<id>``).
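
The three ways of referring to a named group can be sketched as follows. The subject strings and variable names are illustrative assumptions.

```python
import re

identifier = re.compile(r"(?P<id>[a-zA-Z_]\w*)")
m = identifier.match("spam_1 = 42")

# The same group is reachable by name and by number.
by_name = m.group("id")
by_number = m.group(1)
end_of_id = m.end("id")

# (?P=id) refers back to the named group inside the pattern itself.
doubled = re.search(r"(?P<id>[a-zA-Z_]\w*) (?P=id)", "say hello hello").group(0)

# \g<id> refers to the named group in replacement text.
wrapped = identifier.sub(r"<\g<id>>", "a b")
```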

``(?P=name)``
   Matches whatever text was matched by the earlier group named *name*.

``(?#...)``
   A comment; the contents of the parentheses are simply ignored.

``(?=...)``
   Matches if ``...`` matches next, but doesn't consume any of the string. This is
   called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
   ``'Isaac '`` only if it's followed by ``'Asimov'``.

``(?!...)``
   Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
   For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's *not*
   followed by ``'Asimov'``.
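
A minimal sketch of both lookahead forms, using the ``Isaac`` patterns from the text; the subject strings are illustrative.

```python
import re

followed = re.compile(r"Isaac (?=Asimov)")
not_followed = re.compile(r"Isaac (?!Asimov)")

# The lookahead is checked but not consumed, so only 'Isaac ' is matched.
positive_hit = followed.search("Isaac Asimov").group(0)
positive_miss = followed.search("Isaac Newton")

negative_hit = not_followed.search("Isaac Newton").group(0)
negative_miss = not_followed.search("Isaac Asimov")
```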

``(?<=...)``
   Matches if the current position in the string is preceded by a match for ``...``
   that ends at the current position. This is called a :dfn:`positive lookbehind
   assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
   lookbehind will back up 3 characters and check if the contained pattern matches.
   The contained pattern must only match strings of some fixed length, meaning that
   ``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
   patterns which start with positive lookbehind assertions will never match at the
   beginning of the string being searched; you will most likely want to use the
   :func:`search` function rather than the :func:`match` function::

      >>> import re
      >>> m = re.search('(?<=abc)def', 'abcdef')
      >>> m.group(0)
      'def'

   This example looks for a word following a hyphen::

      >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
      >>> m.group(0)
      'egg'

``(?<!...)``
   Matches if the current position in the string is not preceded by a match for
   ``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
   positive lookbehind assertions, the contained pattern must only match strings of
   some fixed length. Patterns which start with negative lookbehind assertions may
   match at the beginning of the string being searched.

``(?(id/name)yes-pattern|no-pattern)``
   Will try to match with ``yes-pattern`` if the group with given *id* or *name*
   exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
   can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
   matching pattern, which will match with ``'<user@host.com>'`` as well as
   ``'user@host.com'``, but not with ``'<user@host.com'``.
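
The conditional pattern from the last entry can be exercised as follows. This sketch uses ``match`` so that every attempt is anchored at the start of the string; the variable names are illustrative.

```python
import re

email = re.compile(r"(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)")

# Group 1 matched '<', so the conditional requires a closing '>'.
bracketed = email.match("<user@host.com>").group(0)

# Group 1 did not take part, so nothing more is required.
bare = email.match("user@host.com").group(0)

# An opening '<' without the closing '>' is rejected.
unbalanced = email.match("<user@host.com")
```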


The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not on the list, then the resulting RE will match
the second character. For example, ``\$`` matches the character ``'$'``.

``\number``
   Matches the contents of the group of the same number. Groups are numbered
   starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
   but not ``'the end'`` (note the space after the group). This special sequence
   can only be used to match one of the first 99 groups. If the first digit of
   *number* is 0, or *number* is 3 octal digits long, it will not be interpreted as
   a group match, but as the character with octal value *number*. Inside the
   ``'['`` and ``']'`` of a character class, all numeric escapes are treated as
   characters.
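
A sketch of the ``(.+) \1`` example. ``re.fullmatch`` is used here so that the whole subject string has to fit the pattern; that choice belongs to this example and is not prescribed by the text above.

```python
import re

repeated = re.compile(r"(.+) \1")

# The text after the space must repeat what group 1 matched.
the_the = repeated.fullmatch("the the") is not None
digits = repeated.fullmatch("55 55") is not None
the_end = repeated.fullmatch("the end") is None
```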

``\A``
   Matches only at the start of the string.

``\b``
   Matches the empty string, but only at the beginning or end of a word. A word is
   defined as a sequence of alphanumeric or underscore characters, so the end of a
   word is indicated by whitespace or a non-alphanumeric, non-underscore character.
   Note that ``\b`` is defined as the boundary between ``\w`` and ``\W``, so the
   precise set of characters deemed to be alphanumeric depends on the values of the
   ``UNICODE`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
   the backspace character, for compatibility with Python's string literals.
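
The two meanings of ``\b`` can be seen in a short sketch; the subject strings are illustrative only.

```python
import re

# Only the stand-alone occurrences of 'foo' sit between word boundaries;
# 'foobar' and 'foo_1' continue with word characters.
whole_words = re.findall(r"\bfoo\b", "foo foobar (foo) foo_1")

# Inside a character class, \b stands for the backspace character.
backspaces = re.findall(r"[\b]", "a\bb")
```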

``\B``
   Matches the empty string, but only when it is *not* at the beginning or end of a
   word. This is just the opposite of ``\b``, so is also subject to the settings
   of ``LOCALE`` and ``UNICODE``.

``\d``
   When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
   is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
   whatever is classified as a digit in the Unicode character properties database.

``\D``
   When the :const:`UNICODE` flag is not specified, matches any non-digit
   character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
   will match anything other than characters marked as digits in the Unicode
   character properties database.

``\s``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v]``. With
   :const:`LOCALE`, it will match this set plus whatever characters are defined as
   space for the current locale. If :const:`UNICODE` is set, this will match the
   characters ``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
   character properties database.

``\S``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v]``.
   With :const:`LOCALE`, it will match any character not in this set, and not
   defined as space in the current locale. If :const:`UNICODE` is set, this will
   match anything other than ``[ \t\n\r\f\v]`` and characters marked as space in
   the Unicode character properties database.

``\w``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any alphanumeric character and the underscore; this is equivalent to the set
   ``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
   whatever characters are defined as alphanumeric for the current locale. If
   :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
   is classified as alphanumeric in the Unicode character properties database.

``\W``
   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
   any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
   With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
   not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
   this will match anything other than ``[0-9_]`` and characters marked as
   alphanumeric in the Unicode character properties database.

``\Z``
   Matches only at the end of the string.

Most of the standard escapes supported by Python string literals are also
accepted by the regular expression parser::

   \a      \b      \f      \n
   \r      \t      \v      \x
   \\

Octal escapes are included in a limited form: If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
a group reference. As for string literals, octal escapes are always at most
three digits in length.


.. _matching-searching:

Matching vs Searching
---------------------

.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>


Python offers two different primitive operations based on regular expressions:
**match** checks for a match only at the beginning of the string, while
**search** checks for a match anywhere in the string (this is what Perl does
by default).

Note that match may differ from search even when using a regular expression
beginning with ``'^'``: ``'^'`` matches only at the start of the string, or in
:const:`MULTILINE` mode also immediately following a newline. The "match"
operation succeeds only if the pattern matches at the start of the string
regardless of mode, or at the starting position given by the optional *pos*
argument regardless of whether a newline precedes it. ::

   >>> re.match("c", "abcdef")   # No match
   >>> re.search("c", "abcdef")  # Match
   <_sre.SRE_Match object at 0x827e9c0>
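
The point about ``'^'`` in :const:`MULTILINE` mode and the *pos* argument can be sketched as follows, assuming compiled pattern objects; the subject string and variable names are illustrative.

```python
import re

text = "first\nsecond"
caret = re.compile(r"^second", re.MULTILINE)

# search() may use the '^' that follows the newline.
via_search = caret.search(text)

# match() only tries the very start of the string, whatever the mode.
via_match = caret.match(text)

# With an explicit starting position, match() is anchored there instead.
plain = re.compile(r"second")
at_pos = plain.match(text, 6)
```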


.. _contents-of-module-re:

Module Contents
---------------

The module defines several functions, constants, and an exception. Some of the
functions are simplified versions of the full featured methods for compiled
regular expressions. Most non-trivial applications always use the compiled
form.


.. function:: compile(pattern[, flags])

   Compile a regular expression pattern into a regular expression object, which
   can be used for matching using its :func:`match` and :func:`search` methods,
   described below.

   The expression's behaviour can be modified by specifying a *flags* value.
   Values can be any of the following variables, combined using bitwise OR (the
   ``|`` operator).

   The sequence ::

      prog = re.compile(pat)
      result = prog.match(str)

   is equivalent to ::

      result = re.match(pat, str)

   but the version using :func:`compile` is more efficient when the expression
   will be used several times in a single program.
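
A minimal sketch of compiling once with combined flags and reusing the result; the pattern and input are illustrative assumptions.

```python
import re

# Compile once, combining two flags with bitwise OR.
word_line = re.compile(r"^[a-z]+$", re.IGNORECASE | re.MULTILINE)

# The compiled object can then be reused for any number of operations.
lines = "Alpha\n42\nbeta"
alphabetic_lines = word_line.findall(lines)

# The module-level shortcut and the compiled form give the same result.
shortcut = re.match(r"\d+", "42abc").group(0)
compiled = re.compile(r"\d+").match("42abc").group(0)
```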

.. (The compiled version of the last pattern passed to :func:`re.match` or
   :func:`re.search` is cached, so programs that use only a single regular
   expression at a time needn't worry about compiling regular expressions.)


.. data:: I
          IGNORECASE

   Perform case-insensitive matching; expressions like ``[A-Z]`` will match
   lowercase letters, too. This is not affected by the current locale.


.. data:: L
          LOCALE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
   current locale.


.. data:: M
          MULTILINE

   When specified, the pattern character ``'^'`` matches at the beginning of the
   string and at the beginning of each line (immediately following each newline);
   and the pattern character ``'$'`` matches at the end of the string and at the
   end of each line (immediately preceding each newline). By default, ``'^'``
   matches only at the beginning of the string, and ``'$'`` only at the end of the
   string and immediately before the newline (if any) at the end of the string.


.. data:: S
          DOTALL

   Make the ``'.'`` special character match any character at all, including a
   newline; without this flag, ``'.'`` will match anything *except* a newline.


.. data:: U
          UNICODE

   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
   on the Unicode character properties database.


.. data:: X
          VERBOSE

   This flag allows you to write regular expressions that look nicer. Whitespace
   within the pattern is ignored, except when in a character class or preceded by
   an unescaped backslash, and, when a line contains a ``'#'`` neither in a
   character class nor preceded by an unescaped backslash, all characters from the
   leftmost such ``'#'`` through the end of the line are ignored.

   This means that the following two regular expression objects, which match a
   decimal number, are functionally equal::

      a = re.compile(r"""\d +  # the integral part
                         \.    # the decimal point
                         \d *  # some fractional digits""", re.X)
      b = re.compile(r"\d+\.\d*")
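
The equivalence of the two objects can be checked on a few inputs. This is a sketch; the sample strings are illustrative assumptions.

```python
import re

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

samples = ["3.14", "10.", "abc", "7"]

# Both objects accept and reject exactly the same samples.
results_a = [bool(a.match(s)) for s in samples]
results_b = [bool(b.match(s)) for s in samples]
```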


.. function:: search(pattern, string[, flags])

   Scan through *string* looking for a location where the regular expression
   *pattern* produces a match, and return a corresponding :class:`MatchObject`
   instance. Return ``None`` if no position in the string matches the pattern; note
   that this is different from finding a zero-length match at some point in the
   string.


.. function:: match(pattern, string[, flags])

   If zero or more characters at the beginning of *string* match the regular
   expression *pattern*, return a corresponding :class:`MatchObject` instance.
   Return ``None`` if the string does not match the pattern; note that this is
   different from a zero-length match.

   .. note::

      If you want to locate a match anywhere in *string*, use :meth:`search`
      instead.


.. function:: split(pattern, string[, maxsplit=0])

   Split *string* by the occurrences of *pattern*. If capturing parentheses are
   used in *pattern*, then the text of all groups in the pattern is also returned
   as part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit*
   splits occur, and the remainder of the string is returned as the final element
   of the list. (Incompatibility note: in the original Python 1.5 release,
   *maxsplit* was ignored. This has been fixed in later releases.) ::

      >>> re.split(r'\W+', 'Words, words, words.')
      ['Words', 'words', 'words', '']
      >>> re.split(r'(\W+)', 'Words, words, words.')
      ['Words', ', ', 'words', ', ', 'words', '.', '']
      >>> re.split(r'\W+', 'Words, words, words.', 1)
      ['Words', 'words, words.']

   Note that *split* will never split a string on an empty pattern match.
   For example ::

      >>> re.split('x*', 'foo')
      ['foo']
      >>> re.split("(?m)^$", "foo\n\nbar\n")
      ['foo\n\nbar\n']

.. function:: findall(pattern, string[, flags])

   Return all non-overlapping matches of *pattern* in *string*, as a list of
   strings. If one or more groups are present in the pattern, return a list of
   groups; this will be a list of tuples if the pattern has more than one group.
   Empty matches are included in the result unless they touch the beginning of
   another match.
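
How the shape of the result depends on the number of groups can be sketched as follows; the pattern and subject string are illustrative.

```python
import re

text = "a1 b2 c3"

# No groups: a list of the whole matches.
whole = re.findall(r"\w\d", text)

# One group: a list of that group's contents.
one_group = re.findall(r"(\w)\d", text)

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w)(\d)", text)
```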


.. function:: finditer(pattern, string[, flags])

   Return an :term:`iterator` yielding :class:`MatchObject` instances over all
   non-overlapping matches for the RE *pattern* in *string*. Empty matches are
   included in the result unless they touch the beginning of another match.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	566
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	567
				568	.. function:: sub(pattern, repl, string[, count])
				569
				570	Return the string obtained by replacing the leftmost non-overlapping occurrences
				571	of pattern in string by the replacement repl. If the pattern isn't found,
				572	string is returned unchanged. repl can be a string or a function; if it is
				573	a string, any backslash escapes in it are processed. That is, ``\n`` is
converted to a single newline character, ``\r`` is converted to a carriage return, and
				575	so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
				576	as ``\6``, are replaced with the substring matched by group 6 in the pattern.
				577	For example::
				578
   >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
   ...        r'static PyObject*\npy_\1(void)\n{',
   ...        'def myfunc():')
   'static PyObject*\npy_myfunc(void)\n{'
				583
				584	If repl is a function, it is called for every non-overlapping occurrence of
				585	pattern. The function takes a single match object argument, and returns the
				586	replacement string. For example::
				587
   >>> def dashrepl(matchobj):
   ...     if matchobj.group(0) == '-': return ' '
   ...     else: return '-'
   >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
   'pro--gram files'
				593
				594	The pattern may be a string or an RE object; if you need to specify regular
				595	expression flags, you must use a RE object, or use embedded modifiers in a
				596	pattern; for example, ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``.
				597
				598	The optional argument count is the maximum number of pattern occurrences to be
				599	replaced; count must be a non-negative integer. If omitted or zero, all
				600	occurrences will be replaced. Empty matches for the pattern are replaced only
				601	when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
				602	``'-a-b-c-'``.
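A small sketch of the effect of count follows. The input string is invented for illustration, and count is passed by keyword here purely for readability.

```python
import re

# count=1 replaces only the leftmost occurrence.
first_only = re.sub(r"\d", "#", "a1b2c3", count=1)

# Omitting count (or passing 0) replaces every occurrence.
all_digits = re.sub(r"\d", "#", "a1b2c3")

print(first_only)  # a#b2c3
print(all_digits)  # a#b#c#
```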
				603
				604	In addition to character escapes and backreferences as described above,
				605	``\g<name>`` will use the substring matched by the group named ``name``, as
				606	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
				607	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
				608	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
				609	reference to group 20, not a reference to group 2 followed by the literal
				610	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
				611	substring matched by the RE.
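The ``\g<...>`` forms described above might be used as in the following sketch. The group names ``first`` and ``last`` and the sample strings are assumptions chosen for the example.

```python
import re

# Swap two named groups with \g<name>.
swapped = re.sub(r"(?P<first>\w+) (?P<last>\w+)",
                 r"\g<last>, \g<first>",
                 "Isaac Newton")

# \g<1>0 means "group 1, then a literal 0"; r"\10" would refer to group 10.
padded = re.sub(r"(\d)", r"\g<1>0", "a5")

# \g<0> stands for the entire matched substring.
bracketed = re.sub(r"\w+", r"<\g<0>>", "hi there")

print(swapped)    # Newton, Isaac
print(padded)     # a50
print(bracketed)  # <hi> <there>
```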
				612
				613
				614	.. function:: subn(pattern, repl, string[, count])
				615
				616	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
				617	number_of_subs_made)``.
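For instance, collapsing runs of whitespace while also learning how many runs were replaced could look like this (the input string is illustrative):

```python
import re

# subn returns both the new string and the number of substitutions made.
new_string, number_of_subs_made = re.subn(r"\s+", " ", "too   many    spaces")

print(new_string)           # too many spaces
print(number_of_subs_made)  # 2
```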
				618
				619
				620	.. function:: escape(string)
				621
				622	Return string with all non-alphanumerics backslashed; this is useful if you
				623	want to match an arbitrary literal string that may have regular expression
				624	metacharacters in it.
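A minimal sketch of why escaping matters is shown below. The literal string is made up, and the exact escaped form is deliberately not printed because which characters get a backslash may differ between Python versions.

```python
import re

literal = "cost: $5.00 (approx.)"  # contains '$', '.', '(' and ')'

# Escaped, the pattern matches the literal text character for character.
escaped_match = re.match(re.escape(literal), literal)

# Unescaped, '$' is an end-of-string anchor, so the pattern cannot
# match its own source text.
unescaped_match = re.match(literal, literal)

print(escaped_match.group(0) == literal)  # True
print(unescaped_match)                    # None
```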
				625
				626
				627	.. exception:: error
				628
				629	Exception raised when a string passed to one of the functions here is not a
				630	valid regular expression (for example, it might contain unmatched parentheses)
				631	or when some other error occurs during compilation or matching. It is never an
				632	error if a string contains no match for a pattern.
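One way to handle this exception is sketched below. ``is_valid_pattern`` is a hypothetical helper written for this example, not part of the module.

```python
import re

def is_valid_pattern(pattern):
    """Return True if pattern compiles, False if re.error is raised."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print(is_valid_pattern(r"\d+"))        # True
print(is_valid_pattern("(unclosed"))   # False

# A pattern that simply finds nothing is not an error; the result is None.
print(re.search("x", "abc"))           # None
```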
				633
				634
				635	.. _re-objects:
				636
				637	Regular Expression Objects
				638	--------------------------
				639
				640	Compiled regular expression objects support the following methods and
				641	attributes:
				642
				643
				644	.. method:: RegexObject.match(string[, pos[, endpos]])
				645
				646	If zero or more characters at the beginning of string match this regular
				647	expression, return a corresponding :class:`MatchObject` instance. Return
				648	``None`` if the string does not match the pattern; note that this is different
				649	from a zero-length match.
				650
				651	.. note::
				652
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	653	If you want to locate a match anywhere in string, use :meth:`search`
				654	instead.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	655
				656	The optional second parameter pos gives an index in the string where the
				657	search is to start; it defaults to ``0``. This is not completely equivalent to
				658	slicing the string; the ``'^'`` pattern character matches at the real beginning
				659	of the string and at positions just after a newline, but not necessarily at the
				660	index where the search is to start.
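The difference between passing pos and slicing can be seen with an anchored pattern. This is a small sketch with an invented pattern and string:

```python
import re

anchored = re.compile("^o")

# With pos=1 the search starts at "o", but '^' still refers to the real
# beginning of "dog", so there is no match.
with_pos = anchored.match("dog", 1)

# Slicing creates a new string "og" whose real beginning is "o".
with_slice = anchored.match("dog"[1:])

print(with_pos)            # None
print(with_slice.group())  # o
```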
				661
				662	The optional parameter endpos limits how far the string will be searched; it
				663	will be as if the string is endpos characters long, so only the characters
				664	from pos to ``endpos - 1`` will be searched for a match. If endpos is less
than pos, no match will be found; otherwise, if rx is a compiled regular
				666	expression object, ``rx.match(string, 0, 50)`` is equivalent to
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	667	``rx.match(string[:50], 0)``. ::
				668
				669	>>> pattern = re.compile("o")
				670	>>> pattern.match("dog") # No match as "o" is not at the start of "dog."
				671	>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
				672	<_sre.SRE_Match object at 0x827eb10>
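The endpos behaviour described above can be sketched as follows, using an invented digit pattern and input:

```python
import re

digits = re.compile(r"\d+")

# endpos=3 treats the string as if it were only 3 characters long.
limited = digits.match("1234567", 0, 3)
sliced = digits.match("1234567"[:3], 0)

# When endpos is less than pos, no match is found.
empty_range = digits.match("1234567", 4, 2)

print(limited.group())  # 123
print(sliced.group())   # 123
print(empty_range)      # None
```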
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	673
				674
				675	.. method:: RegexObject.search(string[, pos[, endpos]])
				676
				677	Scan through string looking for a location where this regular expression
				678	produces a match, and return a corresponding :class:`MatchObject` instance.
				679	Return ``None`` if no position in the string matches the pattern; note that this
				680	is different from finding a zero-length match at some point in the string.
				681
				682	The optional pos and endpos parameters have the same meaning as for the
				683	:meth:`match` method.
				684
				685
				686	.. method:: RegexObject.split(string[, maxsplit=0])
				687
				688	Identical to the :func:`split` function, using the compiled pattern.
				689
				690
				691	.. method:: RegexObject.findall(string[, pos[, endpos]])
				692
				693	Identical to the :func:`findall` function, using the compiled pattern.
				694
				695
				696	.. method:: RegexObject.finditer(string[, pos[, endpos]])
				697
				698	Identical to the :func:`finditer` function, using the compiled pattern.
				699
				700
				701	.. method:: RegexObject.sub(repl, string[, count=0])
				702
				703	Identical to the :func:`sub` function, using the compiled pattern.
				704
				705
				706	.. method:: RegexObject.subn(repl, string[, count=0])
				707
				708	Identical to the :func:`subn` function, using the compiled pattern.
				709
				710
				711	.. attribute:: RegexObject.flags
				712
				713	The flags argument used when the RE object was compiled, or ``0`` if no flags
				714	were provided.
				715
				716
				717	.. attribute:: RegexObject.groupindex
				718
				719	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
				720	numbers. The dictionary is empty if no symbolic groups were used in the
				721	pattern.
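A short sketch of inspecting the mapping follows. The group names are illustrative, and the mapping is copied into a plain ``dict`` here only so that it prints uniformly.

```python
import re

name_re = re.compile(r"(?P<first>\w+) (?P<last>\w+)")
named = dict(name_re.groupindex)

# A pattern without symbolic groups yields an empty mapping.
unnamed = dict(re.compile(r"(\w+) (\w+)").groupindex)

print(named)    # {'first': 1, 'last': 2}
print(unnamed)  # {}
```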
				722
				723
				724	.. attribute:: RegexObject.pattern
				725
				726	The pattern string from which the RE object was compiled.
				727
				728
				729	.. _match-objects:
				730
				731	Match Objects
				732	-------------
				733
Georg Brandl	9afde1c	2007-11-01 20:32:30 +0000	[diff] [blame]	734	Match objects always have a boolean value of :const:`True`, so that you can test
				735	whether e.g. :func:`match` resulted in a match with a simple if statement. They
				736	support the following methods and attributes:
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	737
				738
				739	.. method:: MatchObject.expand(template)
				740
				741	Return the string obtained by doing backslash substitution on the template
				742	string template, as done by the :meth:`sub` method. Escapes such as ``\n`` are
				743	converted to the appropriate characters, and numeric backreferences (``\1``,
				744	``\2``) and named backreferences (``\g<1>``, ``\g<name>``) are replaced by the
				745	contents of the corresponding group.
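For example, a template can mix numeric references, named references and ordinary escapes. This is a sketch with invented group names and input:

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

by_number = m.expand(r"\2, \1")
by_name = m.expand(r"\g<last>, \g<first>")
with_newline = m.expand(r"\1\n\2")  # \n in the template becomes a newline

print(by_number)           # Newton, Isaac
print(by_name)             # Newton, Isaac
print(repr(with_newline))  # 'Isaac\nNewton'
```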
				746
				747
				748	.. method:: MatchObject.group([group1, ...])
				749
				750	Returns one or more subgroups of the match. If there is a single argument, the
				751	result is a single string; if there are multiple arguments, the result is a
				752	tuple with one item per argument. Without arguments, group1 defaults to zero
				753	(the whole match is returned). If a groupN argument is zero, the corresponding
				754	return value is the entire matching string; if it is in the inclusive range
				755	[1..99], it is the string matching the corresponding parenthesized group. If a
				756	group number is negative or larger than the number of groups defined in the
				757	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
				758	part of the pattern that did not match, the corresponding result is ``None``.
				759	If a group is contained in a part of the pattern that matched multiple times,
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	760	the last match is returned. ::
				761
				762	>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
				763	>>> m.group(0)
				764	'Isaac Newton' # The entire match
				765	>>> m.group(1)
				766	'Isaac' # The first parenthesized subgroup.
				767	>>> m.group(2)
				768	'Newton' # The second parenthesized subgroup.
				769	>>> m.group(1, 2)
				770	('Isaac', 'Newton') # Multiple arguments give us a tuple.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	771
				772	If the regular expression uses the ``(?P<name>...)`` syntax, the groupN
				773	arguments may also be strings identifying groups by their group name. If a
				774	string argument is not used as a group name in the pattern, an :exc:`IndexError`
				775	exception is raised.
				776
				777	A moderately complicated example::
				778
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	779	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
				780	>>> m.group('first_name')
				781	'Malcom'
				782	>>> m.group('last_name')
				783	'Reynolds'
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	784
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	785	Named groups can also be referred to by their index::
				786
				787	>>> m.group(1)
				788	'Malcom'
				789	>>> m.group(2)
				790	'Reynolds'
				791
If a group matches multiple times, only the last match is accessible::

   >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
   >>> m.group(1)                        # Returns only the last match.
   'c3'
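The :exc:`IndexError` cases mentioned earlier for unknown group names and out-of-range group numbers can be handled as in this sketch. The helper name ``safe_group`` is hypothetical.

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

def safe_group(match, key, default=None):
    """Return match.group(key), or default if the group does not exist."""
    try:
        return match.group(key)
    except IndexError:
        return default

print(safe_group(m, "last"))    # Newton
print(safe_group(m, "middle"))  # None  (no group with that name)
print(safe_group(m, 3))         # None  (only two groups are defined)
```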
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	796
				797
				798	.. method:: MatchObject.groups([default])
				799
				800	Return a tuple containing all the subgroups of the match, from 1 up to however
				801	many groups are in the pattern. The default argument is used for groups that
				802	did not participate in the match; it defaults to ``None``. (Incompatibility
				803	note: in the original Python 1.5 release, if the tuple was one element long, a
				804	string would be returned instead. In later versions (from 1.5.1 on), a
				805	singleton tuple is returned in such cases.)
				806
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	807	For example::
				808
				809	>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
				810	>>> m.groups()
				811	('24', '1632')
				812
				813	If we make the decimal place and everything after it optional, not all groups
				814	might participate in the match. These groups will default to ``None`` unless
				815	the default argument is given::
				816
				817	>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
				818	>>> m.groups()
				819	('24', None) # Second group defaults to None.
				820	>>> m.groups('0')
				821	('24', '0') # Now, the second group defaults to '0'.
				822
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	823
				824	.. method:: MatchObject.groupdict([default])
				825
				826	Return a dictionary containing all the named subgroups of the match, keyed by
				827	the subgroup name. The default argument is used for groups that did not
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	828	participate in the match; it defaults to ``None``. For example::
				829
				830	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
				831	>>> m.groupdict()
				832	{'first_name': 'Malcom', 'last_name': 'Reynolds'}
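The default argument only matters when a named group does not participate in the match. A sketch with hypothetical group names ``whole`` and ``frac``:

```python
import re

m = re.match(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?", "24")

without_default = m.groupdict()
with_default = m.groupdict("0")

print(without_default)  # {'whole': '24', 'frac': None}
print(with_default)     # {'whole': '24', 'frac': '0'}
```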
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	833
				834
				835	.. method:: MatchObject.start([group])
				836	MatchObject.end([group])
				837
				838	Return the indices of the start and end of the substring matched by group;
				839	group defaults to zero (meaning the whole matched substring). Return ``-1`` if
				840	group exists but did not contribute to the match. For a match object m, and
				841	a group g that did contribute to the match, the substring matched by group g
				842	(equivalent to ``m.group(g)``) is ::
				843
				844	m.string[m.start(g):m.end(g)]
				845
				846	Note that ``m.start(group)`` will equal ``m.end(group)`` if group matched a
				847	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
				848	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
				849	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
				850
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	851	An example that will remove remove_this from email addresses::
				852
				853	>>> email = "tony@tiremove_thisger.net"
				854	>>> m = re.search("remove_this", email)
				855	>>> email[:m.start()] + email[m.end():]
				856	'tony@tiger.net'
				857
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	858
				859	.. method:: MatchObject.span([group])
				860
				861	For :class:`MatchObject` m, return the 2-tuple ``(m.start(group),
				862	m.end(group))``. Note that if group did not contribute to the match, this is
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	863	``(-1, -1)``. group defaults to zero, the entire match.
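The ``-1`` values for a group that did not contribute can be observed with an alternation, as in this small sketch:

```python
import re

# Only the first alternative matches, so group 2 does not contribute.
m = re.match(r"(a)|(b)", "a")

print(m.span())    # (0, 1)
print(m.span(1))   # (0, 1)
print(m.start(2))  # -1
print(m.end(2))    # -1
print(m.span(2))   # (-1, -1)
```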
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	864
				865
				866	.. attribute:: MatchObject.pos
				867
				868	The value of pos which was passed to the :func:`search` or :func:`match`
				869	method of the :class:`RegexObject`. This is the index into the string at which
				870	the RE engine started looking for a match.
				871
				872
				873	.. attribute:: MatchObject.endpos
				874
				875	The value of endpos which was passed to the :func:`search` or :func:`match`
				876	method of the :class:`RegexObject`. This is the index into the string beyond
				877	which the RE engine will not go.
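A brief sketch of how these two attributes reflect the arguments given to a compiled pattern (the string and indices are invented for illustration):

```python
import re

pattern = re.compile("o")
text = "dog food"

# Search only within text[2:6]; the first "o" in that range is at index 5.
bounded = pattern.search(text, 2, 6)

# Without explicit arguments, pos is 0 and endpos is the string length.
unbounded = pattern.search(text)

print(bounded.pos, bounded.endpos, bounded.start())        # 2 6 5
print(unbounded.pos, unbounded.endpos, unbounded.start())  # 0 8 1
```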
				878
				879
				880	.. attribute:: MatchObject.lastindex
				881
				882	The integer index of the last matched capturing group, or ``None`` if no group
				883	was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
				884	``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
				885	the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
				886	string.
				887
				888
				889	.. attribute:: MatchObject.lastgroup
				890
				891	The name of the last matched capturing group, or ``None`` if the group didn't
				892	have a name, or if no group was matched at all.
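One common use of ``lastgroup`` is to learn which alternative of a pattern matched. The sketch below is a toy tokenizer; the group names ``number``, ``name`` and ``op`` are assumptions made for this example.

```python
import re

token_re = re.compile(r"(?P<number>\d+)|(?P<name>[A-Za-z_]\w*)|(?P<op>[-+*/=])")

# Each match comes from exactly one alternative, so lastgroup names it.
tokens = [(m.lastgroup, m.group()) for m in token_re.finditer("x = 42 + y")]
print(tokens)
# [('name', 'x'), ('op', '='), ('number', '42'), ('op', '+'), ('name', 'y')]

# lastindex for the examples given above:
print(re.match(r"(a)(b)", "ab").lastindex)    # 2
print(re.match(r"((a)(b))", "ab").lastindex)  # 1
```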
				893
				894
				895	.. attribute:: MatchObject.re
				896
				897	The regular expression object whose :meth:`match` or :meth:`search` method
				898	produced this :class:`MatchObject` instance.
				899
				900
				901	.. attribute:: MatchObject.string
				902
				903	The string passed to :func:`match` or :func:`search`.
				904
				905
				906	Examples
				907	--------
				908
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	909
				910	Checking For a Pair
				911	^^^^^^^^^^^^^^^^^^^
				912
				913	In this example, we'll use the following helper function to display match
				914	objects a little more gracefully::
				915
   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())
				920
				921	Suppose you are writing a poker program where a player's hand is represented as
				922	a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "0" for 10, and "1" through "9"
				924	representing the card with that value.
				925
				926	To see if a given string is a valid hand, one could do the following::
				927
   >>> valid = re.compile(r"[0-9akqj]{5}$")
   >>> displaymatch(valid.match("ak05q"))  # Valid.
   "<Match: 'ak05q', groups=()>"
   >>> displaymatch(valid.match("ak05e"))  # Invalid.
   >>> displaymatch(valid.match("ak0"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"
				935
				936	That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
				937	To match this with a regular expression, one could use backreferences as such::
				938
   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"
				945
				946	To find out what card the pair consists of, one could use the :func:`group`
				947	method of :class:`MatchObject` in the following manner::
				948
   >>> pair.match("717ak").group(1)
   '7'

   # Error because re.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       re.match(r".*(.).*\1", "718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'
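To avoid the :exc:`AttributeError` shown above, the result can be checked for ``None`` before calling ``group()``. A minimal sketch, where ``pair_card`` is a hypothetical helper:

```python
import re

pair = re.compile(r".*(.).*\1")

def pair_card(hand):
    """Return the card that forms a pair in hand, or None if there is no pair."""
    m = pair.match(hand)
    if m is None:
        return None
    return m.group(1)

print(pair_card("717ak"))  # 7
print(pair_card("718ak"))  # None
print(pair_card("354aa"))  # a
```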
				961
				962
				963	Simulating scanf()
				964	^^^^^^^^^^^^^^^^^^
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	965
				966	.. index:: single: scanf()
				967
				968	Python does not currently have an equivalent to :cfunc:`scanf`. Regular
				969	expressions are generally more powerful, though also more verbose, than
				970	:cfunc:`scanf` format strings. The table below offers some more-or-less
				971	equivalent mappings between :cfunc:`scanf` format tokens and regular
				972	expressions.
				973
+--------------------------------+---------------------------------------------+
| :cfunc:`scanf` Token           | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``0[0-7]*``                                 |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``0[xX][\dA-Fa-f]+``                        |
+--------------------------------+---------------------------------------------+
				995
				996	To extract the filename and numbers from a string like ::
				997
				998	/usr/sbin/sendmail - 0 errors, 4 warnings
				999
				1000	you would use a :cfunc:`scanf` format like ::
				1001
				1002	%s - %d errors, %d warnings
				1003
				1004	The equivalent regular expression would be ::
				1005
				1006	(\S+) - (\d+) errors, (\d+) warnings
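Applied from Python, that regular expression might be used as in the sketch below. The conversion to ``int`` is an addition for illustration, standing in for what ``%d`` would do, and the variable names are arbitrary.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"

m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)
filename = m.group(1)
n_errors = int(m.group(2))    # groups are strings; convert explicitly
n_warnings = int(m.group(3))

print(filename, n_errors, n_warnings)  # /usr/sbin/sendmail 0 4
```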
				1007
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1008
				1009	Avoiding recursion
				1010	^^^^^^^^^^^^^^^^^^
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1011
				1012	If you create regular expressions that require the engine to perform a lot of
				1013	recursion, you may encounter a :exc:`RuntimeError` exception with the message
				1014	``maximum recursion limit`` exceeded. For example, ::
				1015
   >>> import re
   >>> s = 'Begin ' + 1000*'a very long string ' + 'end'
   >>> re.match('Begin (\w| )*? end', s).end()
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
     File "/usr/local/lib/python2.5/re.py", line 132, in match
       return _compile(pattern, flags).match(string)
   RuntimeError: maximum recursion limit exceeded
				1024
				1025	You can often restructure your regular expression to avoid recursion.
				1026
				1027	Starting with Python 2.3, simple uses of the ``*?`` pattern are special-cased to
				1028	avoid recursion. Thus, the above regular expression can avoid recursion by
				1029	being recast as ``Begin [a-zA-Z0-9_ ]*?end``. As a further benefit, such
				1030	regular expressions will run faster than their recursive equivalents.
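The recast expression can be tried on the same input. In this sketch the match is expected to run to the end of the string, since ``end`` first occurs there:

```python
import re

s = 'Begin ' + 1000 * 'a very long string ' + 'end'

# A character class replaces the (\w| ) alternation from the example above.
m = re.match(r'Begin [a-zA-Z0-9_ ]*?end', s)

print(m.end() == len(s))  # True
```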
				1031
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1032
				1033	search() vs. match()
				1034	^^^^^^^^^^^^^^^^^^^^
				1035
				1036	In a nutshell, :func:`match` only attempts to match a pattern at the beginning
of a string, whereas :func:`search` will match a pattern anywhere in a string.
				1038	For example::
				1039
				1040	>>> re.match("o", "dog") # No match as "o" is not the first letter of "dog".
				1041	>>> re.search("o", "dog") # Match as search() looks everywhere in the string.
				1042	<_sre.SRE_Match object at 0x827e9f8>
				1043
				1044	.. note::
				1045
				1046	The following applies only to regular expression objects like those created
				1047	with ``re.compile("pattern")``, not the primitives
				1048	``re.match(pattern, string)`` or ``re.search(pattern, string)``.
				1049
				1050	:func:`match` has an optional second parameter that gives an index in the string
				1051	where the search is to start::
				1052
				1053	>>> pattern = re.compile("o")
				1054	>>> pattern.match("dog") # No match as "o" is not at the start of "dog."
				1055	# Equivalent to the above expression as 0 is the default starting index:
				1056	>>> pattern.match("dog", 0)
				1057	# Match as "o" is the 2nd character of "dog" (index 0 is the first):
				1058	>>> pattern.match("dog", 1)
				1059	<_sre.SRE_Match object at 0x827eb10>
				1060	>>> pattern.match("dog", 2) # No match as "o" is not the 3rd character of "dog."
				1061
				1062
				1063	Making a Phonebook
				1064	^^^^^^^^^^^^^^^^^^
				1065
				1066	:func:`split` splits a string into a list delimited by the passed pattern. The
				1067	method is invaluable for converting textual data into data structures that can be
				1068	easily read and modified by Python as demonstrated in the following example that
				1069	creates a phonebook.
				1070
First, here is the input. Normally it may come from a file; here we are using
triple-quoted string syntax::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1073
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1074	>>> input = """Ross McFluff: 834.345.1254 155 Elm Street
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1075
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1076	Ronald Heathmore: 892.345.3428 436 Finley Avenue
				1077	Frank Burger: 925.541.7625 662 South Dogwood Way
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1078
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1079
				1080	Heather Albrecht: 548.326.4584 919 Park Place"""
				1081
				1082	The entries are separated by one or more newlines. Now we convert the string
				1083	into a list with each nonempty line having its own entry::
				1084
   >>> entries = re.split("\n+", input)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']
				1091
				1092	Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
				1094	because the address has spaces, our splitting pattern, in it::
				1095
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1096	>>> [re.split(":? ", entry, 3) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1097	[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
				1098	['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
				1099	['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
				1100	['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
				1101
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1102	The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
				1104	house number from the street name::
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1105
Christian Heimes	255f53b	2007-12-08 15:33:56 +0000	[diff] [blame]	1106	>>> [re.split(":? ", entry, 4) for entry in entries]
Christian Heimes	b9eccbf	2007-12-05 20:18:38 +0000	[diff] [blame]	1107	[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
				1108	['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
				1109	['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
				1110	['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
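As one possible next step, the split fields could be collected into a dictionary. The sketch below uses a shortened copy of the input; keying the dictionary on ``(last, first)`` is an arbitrary choice made for illustration.

```python
import re

text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # maxsplit=3 keeps the address, which contains spaces, in one piece.
    first, last, phone, address = re.split(r":? ", entry, maxsplit=3)
    phonebook[(last, first)] = (phone, address)

print(phonebook[("McFluff", "Ross")])
# ('834.345.1254', '155 Elm Street')
```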
				1111
				1112
				1113	Text Munging
				1114	^^^^^^^^^^^^
				1115
				1116	:func:`sub` replaces every occurrence of a pattern with a string or the
				1117	result of a function. This example demonstrates using :func:`sub` with
				1118	a function to "munge" text, or randomize the order of all the characters
				1119	in each word of a sentence except for the first and last characters::
				1120
   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
				1130
				1131
				1132	Finding all Adverbs
				1133	^^^^^^^^^^^^^^^^^^^
				1134
:func:`findall` matches all occurrences of a pattern, not just the first
one as :func:`search` does. For example, if one were a writer and wanted to
find all of the adverbs in some text, he or she might use :func:`findall` in
				1138	the following manner::
				1139
				1140	>>> text = "He was carefully disguised but captured quickly by police."
				1141	>>> re.findall(r"\w+ly", text)
				1142	['carefully', 'quickly']
				1143
				1144
				1145	Finding all Adverbs and their Positions
				1146	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				1147
				1148	If one wants more information about all matches of a pattern than the matched
				1149	text, :func:`finditer` is useful as it provides instances of
				1150	:class:`MatchObject` instead of strings. Continuing with the previous example,
if one were a writer who wanted to find all of the adverbs and their positions
				1152	in some text, he or she would use :func:`finditer` in the following manner::
				1153
   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly
				1159
				1160
				1161	Raw String Notation
				1162	^^^^^^^^^^^^^^^^^^^
				1163
				1164	Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
				1165	every backslash (``'\'``) in a regular expression would have to be prefixed with
				1166	another one to escape it. For example, the two following lines of code are
				1167	functionally identical::
				1168
				1169	>>> re.match(r"\W(.)\1\W", " ff ")
				1170	<_sre.SRE_Match object at 0x8262760>
				1171	>>> re.match("\\W(.)\\1\\W", " ff ")
				1172	<_sre.SRE_Match object at 0x82627a0>
				1173
				1174	When one wants to match a literal backslash, it must be escaped in the regular
				1175	expression. With raw string notation, this means ``r"\\"``. Without raw string
				1176	notation, one must use ``"\\\\"``, making the following lines of code
				1177	functionally identical::
				1178
				1179	>>> re.match(r"\\", r"\\")
				1180	<_sre.SRE_Match object at 0x827eb48>
				1181	>>> re.match("\\\\", r"\\")
				1182	<_sre.SRE_Match object at 0x827ec60>
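The equivalences above can also be checked directly on the string literals themselves, without involving the regular expression engine. A small sketch:

```python
# A raw string keeps the backslash; a regular string interprets the escape.
raw = r"\n"
cooked = "\n"
print(len(raw), len(cooked))  # 2 1

# The two spellings of each pattern are the same string object contents.
print(r"\W(.)\1\W" == "\\W(.)\\1\\W")  # True
print(r"\\" == "\\\\")                 # True
```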