Blame - Doc/library/re.rst - platform/external/python/cpython3

Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1
				2	:mod:`re` --- Regular expression operations
				3	===========================================
				4
				5	.. module:: re
				6	:synopsis: Regular expression operations.
				7	.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
				8	.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
				9
				10
				11
				12
				13	This module provides regular expression matching operations similar to
				14	those found in Perl. Both patterns and strings to be searched can be
				15	Unicode strings as well as 8-bit strings. The :mod:`re` module is
				16	always available.
				17
				18	Regular expressions use the backslash character (``'\'``) to indicate
				19	special forms or to allow special characters to be used without invoking
				20	their special meaning. This collides with Python's usage of the same
				21	character for the same purpose in string literals; for example, to match
				22	a literal backslash, one might have to write ``'\\\\'`` as the pattern
				23	string, because the regular expression must be ``\\``, and each
				24	backslash must be expressed as ``\\`` inside a regular Python string
				25	literal.
				26
				27	The solution is to use Python's raw string notation for regular expression
				28	patterns; backslashes are not handled in any special way in a string literal
				29	prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
				30	``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	31	newline. Usually patterns will be expressed in Python code using this raw
				32	string notation.
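
The difference between the two spellings can be checked directly. The following is a minimal sketch, assuming a Python 3 interpreter; the variable names and the sample string are made up for illustration.

```python
import re

raw = r"\n"       # two characters: a backslash and 'n'
cooked = "\n"     # one character: a newline
lengths = (len(raw), len(cooked))

# Both literals below spell the same pattern text: two backslashes,
# which the regular expression engine reads as one literal backslash.
same_pattern = (r"\\" == "\\\\")

# The subject string starts with a single backslash.
matched = re.match(r"\\", "\\path").group(0)
```

The raw form keeps the pattern readable because only one layer of escaping (the regular expression's own) remains visible.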
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	33
Most regular expression operations are available both as module-level functions
and as :class:`RegexObject` methods. The functions are shortcuts that don't
require you to compile a regex object first, but they lack some fine-tuning
parameters.
				38
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	39	.. seealso::
				40
				41	Mastering Regular Expressions
				42	Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	43	second edition of the book no longer covers Python at all, but the first
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	44	edition covered writing good regular expression patterns in great detail.
				45
				46
				47	.. _re-syntax:
				48
				49	Regular Expression Syntax
				50	-------------------------
				51
				52	A regular expression (or RE) specifies a set of strings that matches it; the
				53	functions in this module let you check if a particular string matches a given
				54	regular expression (or if a given regular expression matches a particular
				55	string, which comes down to the same thing).
				56
				57	Regular expressions can be concatenated to form new regular expressions; if A
				58	and B are both regular expressions, then AB is also a regular expression.
				59	In general, if a string p matches A and another string q matches B, the
string pq will match AB. This holds unless A or B contain low-precedence
operations, boundary conditions between A and B, or numbered group
references. Thus, complex expressions can easily be constructed from simpler
				63	primitive expressions like the ones described here. For details of the theory
				64	and implementation of regular expressions, consult the Friedl book referenced
				65	above, or almost any textbook about compiler construction.
				66
				67	A brief explanation of the format of regular expressions follows. For further
				68	information and a gentler presentation, consult the Regular Expression HOWTO,
				69	accessible from http://www.python.org/doc/howto/.
				70
				71	Regular expressions can contain both special and ordinary characters. Most
				72	ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
				73	expressions; they simply match themselves. You can concatenate ordinary
				74	characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
				76	strings to be matched ``'in single quotes'``.)
				77
Some characters, like ``'|'`` or ``'('``, are special. Special
				79	characters either stand for classes of ordinary characters, or affect
				80	how the regular expressions around them are interpreted. Regular
				81	expression pattern strings may not contain null bytes, but can specify
				82	the null byte using the ``\number`` notation, e.g., ``'\x00'``.
				83
				84
				85	The special characters are:
				86
				87	.. %
				88
				89	``'.'``
				90	(Dot.) In the default mode, this matches any character except a newline. If
				91	the :const:`DOTALL` flag has been specified, this matches any character
				92	including a newline.
				93
				94	``'^'``
				95	(Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
				96	matches immediately after each newline.
				97
				98	``'$'``
				99	Matches the end of the string or just before the newline at the end of the
				100	string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
				101	matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
				102	only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
				103	matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode.
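
The ``foo.$`` case described above can be reproduced as follows. This is a small sketch assuming Python 3; the variable names are illustrative.

```python
import re

text = 'foo1\nfoo2\n'

# Default mode: '$' matches only at the very end of the string or just
# before a trailing newline, so the first hit is on the last line.
default_hit = re.search(r'foo.$', text).group(0)

# MULTILINE mode: '$' also matches before every newline, so the first
# line already satisfies the pattern.
multiline_hit = re.search(r'foo.$', text, re.MULTILINE).group(0)
```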
				104
				105	``'*'``
				106	Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
				107	many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
				108	by any number of 'b's.
				109
				110	``'+'``
				111	Causes the resulting RE to match 1 or more repetitions of the preceding RE.
				112	``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
				113	match just 'a'.
				114
				115	``'?'``
				116	Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
				117	``ab?`` will match either 'a' or 'ab'.
				118
				119	``*?``, ``+?``, ``??``
				120	The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
				121	as much text as possible. Sometimes this behaviour isn't desired; if the RE
				122	``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
				123	string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
				124	perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
				125	characters as possible will be matched. Using ``.*?`` in the previous
				126	expression will match only ``'<H1>'``.
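
A short sketch of the greedy and non-greedy behaviour described above, assuming Python 3 and using the same sample string:

```python
import re

html = '<H1>title</H1>'

# Greedy: '.*' runs to the last '>' it can reach.
greedy = re.match(r'<.*>', html).group(0)

# Non-greedy: '.*?' stops at the first '>' that lets the match succeed.
minimal = re.match(r'<.*?>', html).group(0)
```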
				127
				128	``{m}``
				129	Specifies that exactly m copies of the previous RE should be matched; fewer
				130	matches cause the entire RE not to match. For example, ``a{6}`` will match
				131	exactly six ``'a'`` characters, but not five.
				132
				133	``{m,n}``
				134	Causes the resulting RE to match from m to n repetitions of the preceding
				135	RE, attempting to match as many repetitions as possible. For example,
				136	``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting m specifies a
				137	lower bound of zero, and omitting n specifies an infinite upper bound. As an
				138	example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
				139	followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
				140	modifier would be confused with the previously described form.
				141
				142	``{m,n}?``
				143	Causes the resulting RE to match from m to n repetitions of the preceding
				144	RE, attempting to match as few repetitions as possible. This is the
				145	non-greedy version of the previous qualifier. For example, on the
				146	6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
				147	while ``a{3,5}?`` will only match 3 characters.
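
The bounded repetition forms can be compared side by side. This sketch assumes Python 3; the variable names are illustrative.

```python
import re

subject = 'aaaaaa'   # six 'a' characters

greedy = re.match(r'a{3,5}', subject).group(0)    # takes as many as allowed
minimal = re.match(r'a{3,5}?', subject).group(0)  # takes as few as allowed

# Omitting n leaves the upper bound open.
four_or_more = re.match(r'a{4,}b', 'aaaab') is not None
too_few = re.match(r'a{4,}b', 'aaab') is None
```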
				148
				149	``'\'``
				150	Either escapes special characters (permitting you to match characters like
				151	``'*'``, ``'?'``, and so forth), or signals a special sequence; special
				152	sequences are discussed below.
				153
				154	If you're not using a raw string to express the pattern, remember that Python
				155	also uses the backslash as an escape sequence in string literals; if the escape
				156	sequence isn't recognized by Python's parser, the backslash and subsequent
				157	character are included in the resulting string. However, if Python would
				158	recognize the resulting sequence, the backslash should be repeated twice. This
				159	is complicated and hard to understand, so it's highly recommended that you use
				160	raw strings for all but the simplest expressions.
				161
				162	``[]``
				163	Used to indicate a set of characters. Characters can be listed individually, or
				164	a range of characters can be indicated by giving two characters and separating
				165	them by a ``'-'``. Special characters are not active inside sets. For example,
				166	``[akm$]`` will match any of the characters ``'a'``, ``'k'``,
				167	``'m'``, or ``'$'``; ``[a-z]`` will match any lowercase letter, and
				168	``[a-zA-Z0-9]`` matches any letter or digit. Character classes such
				169	as ``\w`` or ``\S`` (defined below) are also acceptable inside a
range, although the characters they match depend on whether :const:`LOCALE`
				171	or :const:`UNICODE` mode is in force. If you want to include a
				172	``']'`` or a ``'-'`` inside a set, precede it with a backslash, or
				173	place it as the first character. The pattern ``[]]`` will match
				174	``']'``, for example.
				175
				176	You can match the characters not within a range by :dfn:`complementing` the set.
				177	This is indicated by including a ``'^'`` as the first character of the set;
				178	``'^'`` elsewhere will simply match the ``'^'`` character. For example,
				179	``[^5]`` will match any character except ``'5'``, and ``[^^]`` will match any
				180	character except ``'^'``.
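
The set rules above can be illustrated with a few calls to ``findall``. This is a sketch assuming Python 3; the sample strings are made up.

```python
import re

# '$' has no special meaning inside a set.
listed = re.findall(r'[akm$]', 'make $5')

# A leading '^' complements the set.
not_five = re.findall(r'[^5]', '1525')

# A ']' placed first in the set is taken literally.
brackets = re.findall(r'[]]', 'a]b]')

# A '^' that is not first is an ordinary character, so this is
# "any character except '^'".
not_caret = re.findall(r'[^^]', 'a^b')
```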
				181
``'|'``
``A|B``, where A and B can be arbitrary REs, creates a regular expression that
will match either A or B. An arbitrary number of REs can be separated by the
``'|'`` in this way. This can be used inside groups (see below) as well. As
the target string is scanned, REs separated by ``'|'`` are tried from left to
right. When one pattern completely matches, that branch is accepted. This means
that once ``A`` matches, ``B`` will not be tested further, even if it would
produce a longer overall match. In other words, the ``'|'`` operator is never
greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
character class, as in ``[|]``.
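
Because alternatives are tried from left to right, their order matters. The sketch below assumes Python 3; the patterns and strings are illustrative.

```python
import re

# The first alternative that matches wins, even if a later one is longer.
first_wins = re.match(r'a|ab', 'ab').group(0)
reordered = re.match(r'ab|a', 'ab').group(0)

# Two ways to match a literal vertical bar.
escaped = re.findall(r'\|', 'x|y|z')
in_class = re.findall(r'[|]', 'x|y|z')
```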
				192
				193	``(...)``
				194	Matches whatever regular expression is inside the parentheses, and indicates the
				195	start and end of a group; the contents of a group can be retrieved after a match
				196	has been performed, and can be matched later in the string with the ``\number``
				197	special sequence, described below. To match the literals ``'('`` or ``')'``,
use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
				199
				200	``(?...)``
				201	This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
				202	otherwise). The first character after the ``'?'`` determines what the meaning
				203	and further syntax of the construct is. Extensions usually do not create a new
				204	group; ``(?P<name>...)`` is the only exception to this rule. Following are the
				205	currently supported extensions.
				206
				207	``(?iLmsux)``
				208	(One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
				209	``'u'``, ``'x'``.) The group matches the empty string; the letters
				210	set the corresponding flags: :const:`re.I` (ignore case),
				211	:const:`re.L` (locale dependent), :const:`re.M` (multi-line),
				212	:const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
				213	and :const:`re.X` (verbose), for the entire regular expression. (The
				214	flags are described in :ref:`contents-of-module-re`.) This
				215	is useful if you wish to include the flags as part of the regular
				216	expression, instead of passing a flag argument to the
				217	:func:`compile` function.
				218
				219	Note that the ``(?x)`` flag changes how the expression is parsed. It should be
				220	used first in the expression string, or after one or more whitespace characters.
				221	If there are non-whitespace characters before the flag, the results are
				222	undefined.
				223
				224	``(?:...)``
				225	A non-grouping version of regular parentheses. Matches whatever regular
				226	expression is inside the parentheses, but the substring matched by the group
				227	cannot be retrieved after performing a match or referenced later in the
				228	pattern.
				229
				230	``(?P<name>...)``
				231	Similar to regular parentheses, but the substring matched by the group is
accessible via the symbolic group name *name*. Group names must be valid Python
				233	identifiers, and each group name must be defined only once within a regular
				234	expression. A symbolic group is also a numbered group, just as if the group
				235	were not named. So the group named 'id' in the example below can also be
				236	referenced as the numbered group 1.
				237
				238	For example, if the pattern is ``(?P<id>[a-zA-Z_]\w*)``, the group can be
				239	referenced by its name in arguments to methods of match objects, such as
				240	``m.group('id')`` or ``m.end('id')``, and also by name in pattern text (for
				241	example, ``(?P=id)``) and replacement text (such as ``\g<id>``).
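
The different ways of referring to a named group can be sketched as follows, assuming Python 3. The subject strings and variable names are made up for illustration.

```python
import re

m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam_1 = 42')
by_name = m.group('id')     # access by symbolic name
by_number = m.group(1)      # the same group, by number
end_of_id = m.end('id')     # index just past the group

# Reference by name inside the pattern: a repeated identifier.
repeated = re.search(r'(?P<id>[a-zA-Z_]\w*) (?P=id)', 'say the the word')

# Reference by name inside replacement text.
wrapped = re.sub(r'(?P<id>[a-zA-Z_]\w*)', r'<\g<id>>', 'a b')
```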
				242
				243	``(?P=name)``
Matches whatever text was matched by the earlier group named *name*.
				245
				246	``(?#...)``
				247	A comment; the contents of the parentheses are simply ignored.
				248
				249	``(?=...)``
				250	Matches if ``...`` matches next, but doesn't consume any of the string. This is
				251	called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
				252	``'Isaac '`` only if it's followed by ``'Asimov'``.
				253
				254	``(?!...)``
				255	Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
				256	For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's not
				257	followed by ``'Asimov'``.
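
Both lookahead forms test what follows without consuming it. A minimal sketch, assuming Python 3; ``'Isaac Newton'`` is an illustrative counterexample that does not appear in the text above.

```python
import re

# Positive lookahead: 'Asimov' must follow, but is not part of the match.
positive = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov').group(0)
positive_miss = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')

# Negative lookahead: 'Asimov' must not follow.
negative = re.search(r'Isaac (?!Asimov)', 'Isaac Newton').group(0)
negative_miss = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')
```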
				258
				259	``(?<=...)``
				260	Matches if the current position in the string is preceded by a match for ``...``
				261	that ends at the current position. This is called a :dfn:`positive lookbehind
				262	assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
				263	lookbehind will back up 3 characters and check if the contained pattern matches.
				264	The contained pattern must only match strings of some fixed length, meaning that
``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
				266	patterns which start with positive lookbehind assertions will never match at the
				267	beginning of the string being searched; you will most likely want to use the
				268	:func:`search` function rather than the :func:`match` function::
				269
				270	>>> import re
				271	>>> m = re.search('(?<=abc)def', 'abcdef')
				272	>>> m.group(0)
				273	'def'
				274
				275	This example looks for a word following a hyphen::
				276
				277	>>> m = re.search('(?<=-)\w+', 'spam-egg')
				278	>>> m.group(0)
				279	'egg'
				280
				281	``(?<!...)``
				282	Matches if the current position in the string is not preceded by a match for
				283	``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
				284	positive lookbehind assertions, the contained pattern must only match strings of
				285	some fixed length. Patterns which start with negative lookbehind assertions may
				286	match at the beginning of the string being searched.
				287
``(?(id/name)yes-pattern|no-pattern)``
				289	Will try to match with ``yes-pattern`` if the group with given id or name
				290	exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
				291	can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
				292	matching pattern, which will match with ``'<user@host.com>'`` as well as
				293	``'user@host.com'``, but not with ``'<user@host.com'``.
				294
				295	.. versionadded:: 2.4
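
The conditional email pattern given above can be exercised like this. The sketch assumes Python 3 and uses ``match``, so each attempt is anchored at the start of the string.

```python
import re

pattern = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')

# Group 1 matched '<', so the closing '>' is required and found.
bracketed = pattern.match('<user@host.com>').group(0)

# Group 1 did not participate, so nothing more is required.
bare = pattern.match('user@host.com').group(0)

# Group 1 matched '<' but no '>' follows, so the match fails.
unbalanced = pattern.match('<user@host.com')
```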
				296
				297	The special sequences consist of ``'\'`` and a character from the list below.
				298	If the ordinary character is not on the list, then the resulting RE will match
				299	the second character. For example, ``\$`` matches the character ``'$'``.
				300
				301	.. %
				302
				303	``\number``
				304	Matches the contents of the group of the same number. Groups are numbered
				305	starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
				306	but not ``'the end'`` (note the space after the group). This special sequence
				307	can only be used to match one of the first 99 groups. If the first digit of
				308	number is 0, or number is 3 octal digits long, it will not be interpreted as
				309	a group match, but as the character with octal value number. Inside the
				310	``'['`` and ``']'`` of a character class, all numeric escapes are treated as
				311	characters.
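
A backreference must repeat the exact text the group captured. The sketch below assumes Python 3 and uses ``match`` so that each attempt starts at the beginning of the string.

```python
import re

pattern = re.compile(r'(.+) \1')

# True where the text before the space is repeated after it.
results = [pattern.match(s) is not None
           for s in ('the the', '55 55', 'the end')]
```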
				312
				313	``\A``
				314	Matches only at the start of the string.
				315
				316	``\b``
				317	Matches the empty string, but only at the beginning or end of a word. A word is
				318	defined as a sequence of alphanumeric or underscore characters, so the end of a
				319	word is indicated by whitespace or a non-alphanumeric, non-underscore character.
Note that ``\b`` is defined as the boundary between ``\w`` and ``\W``, so the
				321	precise set of characters deemed to be alphanumeric depends on the values of the
				322	``UNICODE`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
				323	the backspace character, for compatibility with Python's string literals.
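
A brief sketch of word boundaries, assuming Python 3; the word ``foo`` and the sample strings are illustrative only.

```python
import re

word = re.compile(r'\bfoo\b')

standalone = word.search('bar (foo) baz') is not None  # punctuation ends a word
embedded = word.search('foobar') is not None           # no boundary after 'foo'
with_digit = word.search('foo3') is not None           # '3' is a word character

# Inside a character class, \b means the backspace character instead.
backspace = re.search(r'[\b]', 'a\bc') is not None
```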
				324
				325	``\B``
				326	Matches the empty string, but only when it is not at the beginning or end of a
				327	word. This is just the opposite of ``\b``, so is also subject to the settings
				328	of ``LOCALE`` and ``UNICODE``.
				329
				330	``\d``
				331	When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
				332	is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
				333	whatever is classified as a digit in the Unicode character properties database.
				334
				335	``\D``
				336	When the :const:`UNICODE` flag is not specified, matches any non-digit
				337	character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
will match anything other than characters marked as digits in the Unicode
				339	character properties database.
				340
				341	``\s``
				342	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				343	any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v]``. With
				344	:const:`LOCALE`, it will match this set plus whatever characters are defined as
				345	space for the current locale. If :const:`UNICODE` is set, this will match the
				346	characters ``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
				347	character properties database.
				348
				349	``\S``
				350	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v]``.
				352	With :const:`LOCALE`, it will match any character not in this set, and not
				353	defined as space in the current locale. If :const:`UNICODE` is set, this will
				354	match anything other than ``[ \t\n\r\f\v]`` and characters marked as space in
				355	the Unicode character properties database.
				356
				357	``\w``
				358	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				359	any alphanumeric character and the underscore; this is equivalent to the set
				360	``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
				361	whatever characters are defined as alphanumeric for the current locale. If
				362	:const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
				363	is classified as alphanumeric in the Unicode character properties database.
				364
				365	``\W``
				366	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				367	any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
				368	With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
				369	not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
				370	this will match anything other than ``[0-9_]`` and characters marked as
				371	alphanumeric in the Unicode character properties database.
				372
				373	``\Z``
				374	Matches only at the end of the string.
				375
				376	Most of the standard escapes supported by Python string literals are also
				377	accepted by the regular expression parser::
				378
				379	\a \b \f \n
				380	\r \t \v \x
				381	\\
				382
				383	Octal escapes are included in a limited form: If the first digit is a 0, or if
				384	there are three octal digits, it is considered an octal escape. Otherwise, it is
				385	a group reference. As for string literals, octal escapes are always at most
				386	three digits in length.
				387
				388	.. % Note the lack of a period in the section title; it causes problems
				389	.. % with readers of the GNU info version. See http://www.python.org/sf/581414.
				390
				391
				392	.. _matching-searching:
				393
				394	Matching vs Searching
				395	---------------------
				396
				397	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
				398
				399
				400	Python offers two different primitive operations based on regular expressions:
Georg Brandl	604c121	2007-08-23 21:36:05 +0000	[diff] [blame]	401	match checks for a match only at the beginning of the string, while
				402	search checks for a match anywhere in the string (this is what Perl does
				403	by default).
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	404
Georg Brandl	604c121	2007-08-23 21:36:05 +0000	[diff] [blame]	405	Note that match may differ from search even when using a regular expression
				406	beginning with ``'^'``: ``'^'`` matches only at the start of the string, or in
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	407	:const:`MULTILINE` mode also immediately following a newline. The "match"
				408	operation succeeds only if the pattern matches at the start of the string
				409	regardless of mode, or at the starting position given by the optional pos
				410	argument regardless of whether a newline precedes it.
				411
				412	.. % Examples from Tim Peters:
				413
				414	::
				415
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	416	>>> re.match("c", "abcdef") # No match
				417	>>> re.search("c", "abcdef")
				418	<_sre.SRE_Match object at 0x827e9c0> # Match
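
The interaction with ``'^'`` and :const:`MULTILINE` described above can be sketched as follows, assuming Python 3. The two-line sample string is made up.

```python
import re

text = 'A\nX'

# match() only tries the start of the string, even in MULTILINE mode.
anchored = re.match(r'^X', text, re.MULTILINE)

# search() scans forward and finds the '^' position after the newline.
scanned = re.search(r'^X', text, re.MULTILINE).group(0)

# With a compiled pattern, match() can start at an explicit position.
at_pos = re.compile(r'X').match(text, 2).group(0)
```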
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	419
				420
				421	.. _contents-of-module-re:
				422
				423	Module Contents
				424	---------------
				425
				426	The module defines several functions, constants, and an exception. Some of the
				427	functions are simplified versions of the full featured methods for compiled
regular expressions. Most non-trivial applications use the compiled
form.
				430
				431
				432	.. function:: compile(pattern[, flags])
				433
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	434	Compile a regular expression pattern into a regular expression object, which
				435	can be used for matching using its :func:`match` and :func:`search` methods,
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	436	described below.
				437
				438	The expression's behaviour can be modified by specifying a flags value.
				439	Values can be any of the following variables, combined using bitwise OR (the
``|`` operator).
				441
				442	The sequence ::
				443
				444	prog = re.compile(pat)
				445	result = prog.match(str)
				446
				447	is equivalent to ::
				448
				449	result = re.match(pat, str)
				450
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	451	but the version using :func:`compile` is more efficient when the expression
				452	will be used several times in a single program.
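
A compiled pattern is typically created once and reused. The sketch below assumes Python 3; the pattern, the input list and the variable names are made up for illustration.

```python
import re

pat = r'\d+'
prog = re.compile(pat)          # compile once
lines = ['10 apples', 'no digits', '7 pears']

compiled_results = [bool(prog.match(s)) for s in lines]
direct_results = [bool(re.match(pat, s)) for s in lines]

# Flags are combined with bitwise OR when compiling.
spam = re.compile(r'^spam', re.IGNORECASE | re.MULTILINE)
spam_hits = spam.findall('Spam\nSPAM\neggs')
```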
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	453
				454	.. % (The compiled version of the last pattern passed to
				455	.. % \function{re.match()} or \function{re.search()} is cached, so
				456	.. % programs that use only a single regular expression at a time needn't
				457	.. % worry about compiling regular expressions.)
				458
				459
				460	.. data:: I
				461	IGNORECASE
				462
				463	Perform case-insensitive matching; expressions like ``[A-Z]`` will match
				464	lowercase letters, too. This is not affected by the current locale.
				465
				466
				467	.. data:: L
				468	LOCALE
				469
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	470	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
				471	current locale.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	472
				473
				474	.. data:: M
				475	MULTILINE
				476
				477	When specified, the pattern character ``'^'`` matches at the beginning of the
				478	string and at the beginning of each line (immediately following each newline);
				479	and the pattern character ``'$'`` matches at the end of the string and at the
				480	end of each line (immediately preceding each newline). By default, ``'^'``
				481	matches only at the beginning of the string, and ``'$'`` only at the end of the
				482	string and immediately before the newline (if any) at the end of the string.
				483
				484
				485	.. data:: S
				486	DOTALL
				487
				488	Make the ``'.'`` special character match any character at all, including a
				489	newline; without this flag, ``'.'`` will match anything except a newline.
				490
				491
				492	.. data:: U
				493	UNICODE
				494
				495	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
				496	on the Unicode character properties database.
				497
				498	.. versionadded:: 2.0
				499
				500
				501	.. data:: X
				502	VERBOSE
				503
				504	This flag allows you to write regular expressions that look nicer. Whitespace
				505	within the pattern is ignored, except when in a character class or preceded by
an unescaped backslash, and, when a line contains a ``'#'`` neither in a
character class nor preceded by an unescaped backslash, all characters from the
				508	leftmost such ``'#'`` through the end of the line are ignored.
				509
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	510	That means that the two following regular expression objects that match a
				511	decimal number are functionally equal::
				512
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	517
				518
				519	.. function:: search(pattern, string[, flags])
				520
				521	Scan through string looking for a location where the regular expression
				522	pattern produces a match, and return a corresponding :class:`MatchObject`
				523	instance. Return ``None`` if no position in the string matches the pattern; note
				524	that this is different from finding a zero-length match at some point in the
				525	string.
				526
				527
				528	.. function:: match(pattern, string[, flags])
				529
				530	If zero or more characters at the beginning of string match the regular
				531	expression pattern, return a corresponding :class:`MatchObject` instance.
				532	Return ``None`` if the string does not match the pattern; note that this is
				533	different from a zero-length match.
				534
				535	.. note::
				536
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	537	If you want to locate a match anywhere in string, use :meth:`search`
				538	instead.
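
The distinction between a zero-length match and no match at all can be seen with a pattern that is allowed to match nothing. A minimal sketch, assuming Python 3:

```python
import re

# 'x*' may match zero characters, so this succeeds with an empty match.
empty = re.match(r'x*', 'abc')
empty_text = empty.group(0)

# 'x+' needs at least one 'x', so there is no match and None is returned.
missing = re.match(r'x+', 'abc')

# search() finds a match that match() cannot, because it scans ahead.
found_later = re.search(r'c', 'abc').start()
```

Testing the result with ``is None`` rather than inspecting the matched text keeps the two cases apart.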
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	539
				540
				541	.. function:: split(pattern, string[, maxsplit=0])
				542
				543	Split string by the occurrences of pattern. If capturing parentheses are
				544	used in pattern, then the text of all groups in the pattern are also returned
				545	as part of the resulting list. If maxsplit is nonzero, at most maxsplit
				546	splits occur, and the remainder of the string is returned as the final element
				547	of the list. (Incompatibility note: in the original Python 1.5 release,
				548	maxsplit was ignored. This has been fixed in later releases.) ::
				549
				550	>>> re.split('\W+', 'Words, words, words.')
				551	['Words', 'words', 'words', '']
				552	>>> re.split('(\W+)', 'Words, words, words.')
				553	['Words', ', ', 'words', ', ', 'words', '.', '']
				554	>>> re.split('\W+', 'Words, words, words.', 1)
				555	['Words', 'words, words.']
				556
Skip Montanaro	222907d	2007-09-01 17:40:03 +0000	[diff] [blame]	557	Note that split will never split a string on an empty pattern match.
				558	For example ::
				559
				560	>>> re.split('x*', 'foo')
				561	['foo']
				562	>>> re.split("(?m)^$", "foo\n\nbar\n")
				563	['foo\n\nbar\n']
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	564
				565	.. function:: findall(pattern, string[, flags])
				566
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	567	Return all non-overlapping matches of pattern in string, as a list of
				568	strings. If one or more groups are present in the pattern, return a list of
				569	groups; this will be a list of tuples if the pattern has more than one group.
				570	Empty matches are included in the result unless they touch the beginning of
				571	another match.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	572
				573	.. versionadded:: 1.5.2
				574
				575	.. versionchanged:: 2.4
				576	Added the optional flags argument.
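
The shape of the result depends on how many groups the pattern contains. The sketch below assumes Python 3; the sample strings are illustrative.

```python
import re

# No groups: a list of whole matches.
no_groups = re.findall(r'\d+', 'a1 b22 c333')

# One group: a list of that group's text.
one_group = re.findall(r'(\w)=\d', 'a=1, b=2')

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(\w)=(\d)', 'a=1, b=2')
```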
				577
				578
				579	.. function:: finditer(pattern, string[, flags])
				580
Georg Brandl	e7a0990	2007-10-21 12:10:28 +0000	[diff] [blame]	581	Return an :term:`iterator` yielding :class:`MatchObject` instances over all
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	582	non-overlapping matches for the RE pattern in string. Empty matches are
				583	included in the result unless they touch the beginning of another match.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	584
				585	.. versionadded:: 2.2
				586
				587	.. versionchanged:: 2.4
				588	Added the optional flags argument.
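
Since each item is a match object, positions are available as well as text. A small sketch, assuming Python 3 and a made-up sample string:

```python
import re

# Collect the text and span of every run of digits.
spans = [(m.group(0), m.start(), m.end())
         for m in re.finditer(r'\d+', 'a1 b22')]
```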
				589
				590
				591	.. function:: sub(pattern, repl, string[, count])
				592
				593	Return the string obtained by replacing the leftmost non-overlapping occurrences
				594	of pattern in string by the replacement repl. If the pattern isn't found,
				595	string is returned unchanged. repl can be a string or a function; if it is
				596	a string, any backslash escapes in it are processed. That is, ``\n`` is
converted to a single newline character, ``\r`` is converted to a carriage return, and
				598	so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
				599	as ``\6``, are replaced with the substring matched by group 6 in the pattern.
				600	For example::
				601
>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
				605	'static PyObject*\npy_myfunc(void)\n{'
				606
				607	If repl is a function, it is called for every non-overlapping occurrence of
				608	pattern. The function takes a single match object argument, and returns the
				609	replacement string. For example::
				610
   >>> def dashrepl(matchobj):
   ...     if matchobj.group(0) == '-': return ' '
   ...     else: return '-'
   >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
   'pro--gram files'
				616
				617	The pattern may be a string or an RE object; if you need to specify regular
				618	expression flags, you must use a RE object, or use embedded modifiers in a
				619	pattern; for example, ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``.
				620
				621	The optional argument count is the maximum number of pattern occurrences to be
				622	replaced; count must be a non-negative integer. If omitted or zero, all
				623	occurrences will be replaced. Empty matches for the pattern are replaced only
				624	when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
				625	``'-a-b-c-'``.
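
The effect of count can be sketched as follows. The sample string is
hypothetical, and count is passed by keyword here purely for readability.

```python
import re

text = "a-b-c-d"   # hypothetical sample string

# Replace at most two occurrences, scanning from the left.
first_two = re.sub("-", "+", text, count=2)

# Omitting count, or passing 0, replaces every occurrence.
everything = re.sub("-", "+", text)
also_everything = re.sub("-", "+", text, count=0)

print(first_two)    # a+b+c-d
print(everything)   # a+b+c+d
```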
				626
				627	In addition to character escapes and backreferences as described above,
				628	``\g<name>`` will use the substring matched by the group named ``name``, as
				629	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
				630	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
				631	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
				632	reference to group 20, not a reference to group 2 followed by the literal
				633	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
				634	substring matched by the RE.
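
A minimal sketch of the ``\g<...>`` forms, assuming a made-up two-group pattern
for version strings (the group names ``major`` and ``minor`` are illustrative
only):

```python
import re

# Hypothetical pattern with two named groups.
pattern = re.compile(r"(?P<major>\d+)\.(?P<minor>\d+)")

# Named references: swap the two components.
swapped = pattern.sub(r"\g<minor>.\g<major>", "3.11")

# \g<2>0 means group 2 followed by a literal '0'.
# Writing \20 instead would ask for group 20, which this pattern lacks.
padded = pattern.sub(r"\g<1>.\g<2>0", "3.11")

# \g<0> stands for the entire match.
wrapped = pattern.sub(r"[\g<0>]", "version 3.11")

print(swapped)   # 11.3
print(padded)    # 3.110
print(wrapped)   # version [3.11]
```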
				635
				636
				637	.. function:: subn(pattern, repl, string[, count])
				638
				639	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
				640	number_of_subs_made)``.
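
For instance, a short sketch that collapses runs of whitespace and reports how
many runs were replaced (the input string is invented for the example):

```python
import re

# Collapse each run of whitespace into a single space.
new_string, number_of_subs_made = re.subn(r"\s+", " ", "too   many    spaces")

print(new_string)            # too many spaces
print(number_of_subs_made)   # 2
```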
				641
				642
				643	.. function:: escape(string)
				644
				645	Return string with all non-alphanumerics backslashed; this is useful if you
				646	want to match an arbitrary literal string that may have regular expression
				647	metacharacters in it.
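
The difference escaping makes can be sketched like this. The literal text is a
hypothetical example containing several metacharacters.

```python
import re

# Hypothetical literal text containing '+', '(', ')' and '?'.
literal = "1+1=2 (really?)"
haystack = "proof: 1+1=2 (really?)"

# Escaped, the text is matched character for character.
escaped_hit = re.search(re.escape(literal), haystack)

# Unescaped, the metacharacters keep their special meaning,
# so the same text fails to match itself.
unescaped_hit = re.search(literal, haystack)

print(escaped_hit is not None)   # True
print(unescaped_hit is None)     # True
```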
				648
				649
				650	.. exception:: error
				651
				652	Exception raised when a string passed to one of the functions here is not a
				653	valid regular expression (for example, it might contain unmatched parentheses)
				654	or when some other error occurs during compilation or matching. It is never an
				655	error if a string contains no match for a pattern.
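
One way to use this exception is to validate a pattern that comes from outside
the program. The helper name ``is_valid_pattern`` below is hypothetical and not
part of the module.

```python
import re

def is_valid_pattern(source):
    """Return True if source compiles as a regular expression."""
    try:
        re.compile(source)
    except re.error:
        return False
    return True

print(is_valid_pattern("(abc"))    # False: unmatched parenthesis
print(is_valid_pattern("(abc)"))   # True

# A pattern that simply finds nothing is not an error; the result is None.
print(re.search("x", "abc"))       # None
```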
				656
				657
				658	.. _re-objects:
				659
				660	Regular Expression Objects
				661	--------------------------
				662
				663	Compiled regular expression objects support the following methods and
				664	attributes:
				665
				666
				667	.. method:: RegexObject.match(string[, pos[, endpos]])
				668
				669	If zero or more characters at the beginning of string match this regular
				670	expression, return a corresponding :class:`MatchObject` instance. Return
				671	``None`` if the string does not match the pattern; note that this is different
				672	from a zero-length match.
				673
				674	.. note::
				675
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	676	If you want to locate a match anywhere in string, use :meth:`search`
				677	instead.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	678
				679	The optional second parameter pos gives an index in the string where the
				680	search is to start; it defaults to ``0``. This is not completely equivalent to
				681	slicing the string; the ``'^'`` pattern character matches at the real beginning
				682	of the string and at positions just after a newline, but not necessarily at the
				683	index where the search is to start.
				684
The optional parameter endpos limits how far the string will be searched; it
will be as if the string is endpos characters long, so only the characters
from pos to ``endpos - 1`` will be searched for a match. If endpos is less
than pos, no match will be found; otherwise, if rx is a compiled regular
expression object, ``rx.match(string, 0, 50)`` is equivalent to
``rx.match(string[:50], 0)``. ::
				691
				692	>>> pattern = re.compile("o")
				693	>>> pattern.match("dog") # No match as "o" is not at the start of "dog."
				694	>>> pattern.match("dog", 1) # Match as "o" is the 2nd character of "dog".
				695	<_sre.SRE_Match object at 0x827eb10>
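
The difference between pos and slicing, and the way endpos shortens the string,
can be sketched as follows. The patterns and the sample string are chosen only
for illustration.

```python
import re

text = "abcdef"   # hypothetical sample string

starts_with_b = re.compile("^b")

# Slicing creates a new string that really does begin with "b".
sliced = starts_with_b.match(text[1:])

# With pos=1 the search starts at "b", but '^' still refers to the
# real beginning of the string, so there is no match.
offset = starts_with_b.match(text, 1)

# With endpos=2 the string is treated as if it were "ab",
# so '$' can match right after the "b".
ends_with_ab = re.compile("ab$")
truncated = ends_with_ab.match(text, 0, 2)

print(sliced is not None)   # True
print(offset)               # None
print(truncated.group())    # ab
```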
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	696
				697
				698	.. method:: RegexObject.search(string[, pos[, endpos]])
				699
				700	Scan through string looking for a location where this regular expression
				701	produces a match, and return a corresponding :class:`MatchObject` instance.
				702	Return ``None`` if no position in the string matches the pattern; note that this
				703	is different from finding a zero-length match at some point in the string.
				704
				705	The optional pos and endpos parameters have the same meaning as for the
				706	:meth:`match` method.
				707
				708
				709	.. method:: RegexObject.split(string[, maxsplit=0])
				710
				711	Identical to the :func:`split` function, using the compiled pattern.
				712
				713
				714	.. method:: RegexObject.findall(string[, pos[, endpos]])
				715
				716	Identical to the :func:`findall` function, using the compiled pattern.
				717
				718
				719	.. method:: RegexObject.finditer(string[, pos[, endpos]])
				720
				721	Identical to the :func:`finditer` function, using the compiled pattern.
				722
				723
				724	.. method:: RegexObject.sub(repl, string[, count=0])
				725
				726	Identical to the :func:`sub` function, using the compiled pattern.
				727
				728
				729	.. method:: RegexObject.subn(repl, string[, count=0])
				730
				731	Identical to the :func:`subn` function, using the compiled pattern.
				732
				733
				734	.. attribute:: RegexObject.flags
				735
				736	The flags argument used when the RE object was compiled, or ``0`` if no flags
				737	were provided.
				738
				739
				740	.. attribute:: RegexObject.groupindex
				741
				742	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
				743	numbers. The dictionary is empty if no symbolic groups were used in the
				744	pattern.
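
As a small sketch, assuming a made-up date pattern in which only the first two
groups are named:

```python
import re

# Hypothetical pattern: two named groups followed by one unnamed group.
rx = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")

# Only named groups appear in the mapping; the third group has no name.
name_to_number = dict(rx.groupindex)

print(name_to_number == {"year": 1, "month": 2})   # True

# A pattern without named groups yields an empty mapping.
print(dict(re.compile(r"(\d+)").groupindex))       # {}
```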
				745
				746
				747	.. attribute:: RegexObject.pattern
				748
				749	The pattern string from which the RE object was compiled.
				750
				751
				752	.. _match-objects:
				753
				754	Match Objects
				755	-------------
				756
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	757	Match objects always have a boolean value of :const:`True`, so that you can test
				758	whether e.g. :func:`match` resulted in a match with a simple if statement. They
				759	support the following methods and attributes:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	760
				761
				762	.. method:: MatchObject.expand(template)
				763
				764	Return the string obtained by doing backslash substitution on the template
				765	string template, as done by the :meth:`sub` method. Escapes such as ``\n`` are
				766	converted to the appropriate characters, and numeric backreferences (``\1``,
				767	``\2``) and named backreferences (``\g<1>``, ``\g<name>``) are replaced by the
				768	contents of the corresponding group.
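
A brief sketch of both kinds of backreference in a template, using an invented
pattern with the hypothetical group names ``first`` and ``last``:

```python
import re

m = re.match(r"(?P<first>\w+) (?P<last>\w+)", "Isaac Newton")

# Numeric backreferences.
by_number = m.expand(r"\2, \1")

# Named backreferences, plus an escape that becomes a real newline.
by_name = m.expand(r"\g<last>, \g<first>\n")

print(by_number)       # Newton, Isaac
print(repr(by_name))   # 'Newton, Isaac\n'
```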
				769
				770
				771	.. method:: MatchObject.group([group1, ...])
				772
				773	Returns one or more subgroups of the match. If there is a single argument, the
				774	result is a single string; if there are multiple arguments, the result is a
				775	tuple with one item per argument. Without arguments, group1 defaults to zero
				776	(the whole match is returned). If a groupN argument is zero, the corresponding
				777	return value is the entire matching string; if it is in the inclusive range
				778	[1..99], it is the string matching the corresponding parenthesized group. If a
				779	group number is negative or larger than the number of groups defined in the
				780	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
				781	part of the pattern that did not match, the corresponding result is ``None``.
				782	If a group is contained in a part of the pattern that matched multiple times,
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	783	the last match is returned. ::
				784
   >>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
   >>> m.group(0)       # The entire match.
   'Isaac Newton'
   >>> m.group(1)       # The first parenthesized subgroup.
   'Isaac'
   >>> m.group(2)       # The second parenthesized subgroup.
   'Newton'
   >>> m.group(1, 2)    # Multiple arguments give us a tuple.
   ('Isaac', 'Newton')
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	794
				795	If the regular expression uses the ``(?P<name>...)`` syntax, the groupN
				796	arguments may also be strings identifying groups by their group name. If a
				797	string argument is not used as a group name in the pattern, an :exc:`IndexError`
				798	exception is raised.
				799
				800	A moderately complicated example::
				801
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	802	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
				803	>>> m.group('first_name')
				804	'Malcom'
				805	>>> m.group('last_name')
				806	'Reynolds'
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	807
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	808	Named groups can also be referred to by their index::
				809
				810	>>> m.group(1)
				811	'Malcom'
				812	>>> m.group(2)
				813	'Reynolds'
				814
If a group matches multiple times, only the last match is accessible::

   >>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
   >>> m.group(1)                        # Returns only the last match.
   'c3'
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	819
				820
				821	.. method:: MatchObject.groups([default])
				822
				823	Return a tuple containing all the subgroups of the match, from 1 up to however
				824	many groups are in the pattern. The default argument is used for groups that
				825	did not participate in the match; it defaults to ``None``. (Incompatibility
				826	note: in the original Python 1.5 release, if the tuple was one element long, a
				827	string would be returned instead. In later versions (from 1.5.1 on), a
				828	singleton tuple is returned in such cases.)
				829
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	830	For example::
				831
				832	>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
				833	>>> m.groups()
				834	('24', '1632')
				835
				836	If we make the decimal place and everything after it optional, not all groups
				837	might participate in the match. These groups will default to ``None`` unless
				838	the default argument is given::
				839
   >>> m = re.match(r"(\d+)\.?(\d+)?", "24")
   >>> m.groups()      # Second group defaults to None.
   ('24', None)
   >>> m.groups('0')   # Now, the second group defaults to '0'.
   ('24', '0')
				845
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	846
				847	.. method:: MatchObject.groupdict([default])
				848
				849	Return a dictionary containing all the named subgroups of the match, keyed by
				850	the subgroup name. The default argument is used for groups that did not
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	851	participate in the match; it defaults to ``None``. For example::
				852
				853	>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcom Reynolds")
				854	>>> m.groupdict()
				855	{'first_name': 'Malcom', 'last_name': 'Reynolds'}
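
The default argument matters when a named group is optional. The sketch below
assumes a made-up pattern with the hypothetical group names ``whole`` and
``frac``:

```python
import re

# The fractional part, and therefore the 'frac' group, is optional.
m = re.match(r"(?P<whole>\d+)(?:\.(?P<frac>\d+))?", "24")

without_default = m.groupdict()
with_default = m.groupdict("0")

print(without_default == {"whole": "24", "frac": None})   # True
print(with_default == {"whole": "24", "frac": "0"})       # True
```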
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	856
				857
				858	.. method:: MatchObject.start([group])
				859	MatchObject.end([group])
				860
				861	Return the indices of the start and end of the substring matched by group;
				862	group defaults to zero (meaning the whole matched substring). Return ``-1`` if
				863	group exists but did not contribute to the match. For a match object m, and
				864	a group g that did contribute to the match, the substring matched by group g
				865	(equivalent to ``m.group(g)``) is ::
				866
				867	m.string[m.start(g):m.end(g)]
				868
				869	Note that ``m.start(group)`` will equal ``m.end(group)`` if group matched a
				870	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
				871	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
				872	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
				873
An example that will remove the text ``remove_this`` from email addresses::

   >>> email = "tony@tiremove_thisger.net"
   >>> m = re.search("remove_this", email)
   >>> email[:m.start()] + email[m.end():]
   'tony@tiger.net'
				880
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	881
				882	.. method:: MatchObject.span([group])
				883
				884	For :class:`MatchObject` m, return the 2-tuple ``(m.start(group),
				885	m.end(group))``. Note that if group did not contribute to the match, this is
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	886	``(-1, -1)``. group defaults to zero, the entire match.
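
The three cases (whole match, empty group, non-participating group) can be
sketched with an invented pattern:

```python
import re

# Group 1 may match the empty string; group 2 may not participate at all.
m = re.search(r"b(c?)(d)?", "abx")

print(m.span())     # (1, 2): the whole match is "b"
print(m.span(1))    # (2, 2): group 1 matched an empty string
print(m.span(2))    # (-1, -1): group 2 did not contribute
print(m.group(2))   # None
```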
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	887
				888
				889	.. attribute:: MatchObject.pos
				890
				891	The value of pos which was passed to the :func:`search` or :func:`match`
				892	method of the :class:`RegexObject`. This is the index into the string at which
				893	the RE engine started looking for a match.
				894
				895
				896	.. attribute:: MatchObject.endpos
				897
				898	The value of endpos which was passed to the :func:`search` or :func:`match`
				899	method of the :class:`RegexObject`. This is the index into the string beyond
				900	which the RE engine will not go.
				901
				902
				903	.. attribute:: MatchObject.lastindex
				904
				905	The integer index of the last matched capturing group, or ``None`` if no group
				906	was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
				907	``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
				908	the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
				909	string.
				910
				911
				912	.. attribute:: MatchObject.lastgroup
				913
				914	The name of the last matched capturing group, or ``None`` if the group didn't
				915	have a name, or if no group was matched at all.
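
The ``lastindex`` cases listed above, together with ``lastgroup``, can be
checked with a short sketch. The group name ``word`` is hypothetical.

```python
import re

# The lastindex examples from the text, all applied to the string 'ab'.
cases = [("(a)b", 1), ("((a)(b))", 1), ("((ab))", 1), ("(a)(b)", 2)]
results = [(p, re.match(p, "ab").lastindex) for p, _ in cases]
print(results == cases)   # True

# lastgroup is the name of that group, or None if it is unnamed.
named = re.match(r"(?P<word>\w+)", "ab")
unnamed = re.match(r"(\w+)", "ab")
no_groups = re.match(r"\w+", "ab")

print(named.lastgroup)       # word
print(unnamed.lastgroup)     # None
print(no_groups.lastindex)   # None
```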
				916
				917
				918	.. attribute:: MatchObject.re
				919
				920	The regular expression object whose :meth:`match` or :meth:`search` method
				921	produced this :class:`MatchObject` instance.
				922
				923
				924	.. attribute:: MatchObject.string
				925
				926	The string passed to :func:`match` or :func:`search`.
				927
				928
				929	Examples
				930	--------
				931
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	932
				933	Checking For a Pair
				934	^^^^^^^^^^^^^^^^^^^
				935
				936	In this example, we'll use the following helper function to display match
				937	objects a little more gracefully::
				938
   def displaymatch(match):
       if match is None:
           return None
       return '<Match: %r, groups=%r>' % (match.group(), match.groups())
				943
Suppose you are writing a poker program where a player's hand is represented as
a 5-character string with each character representing a card, "a" for ace, "k"
for king, "q" for queen, "j" for jack, "0" for 10, and "1" through "9"
representing the card with that value.
				948
				949	To see if a given string is a valid hand, one could do the following::
				950
   >>> valid = re.compile(r"[0-9akqj]{5}$")
   >>> displaymatch(valid.match("ak05q"))  # Valid.
   "<Match: 'ak05q', groups=()>"
   >>> displaymatch(valid.match("ak05e"))  # Invalid.
   >>> displaymatch(valid.match("ak0"))    # Invalid.
   >>> displaymatch(valid.match("727ak"))  # Valid.
   "<Match: '727ak', groups=()>"
				958
				959	That last hand, ``"727ak"``, contained a pair, or two of the same valued cards.
				960	To match this with a regular expression, one could use backreferences as such::
				961
   >>> pair = re.compile(r".*(.).*\1")
   >>> displaymatch(pair.match("717ak"))     # Pair of 7s.
   "<Match: '717', groups=('7',)>"
   >>> displaymatch(pair.match("718ak"))     # No pairs.
   >>> displaymatch(pair.match("354aa"))     # Pair of aces.
   "<Match: '354aa', groups=('a',)>"
				968
To find out what card the pair consists of, one could use the :meth:`group`
method of :class:`MatchObject` in the following manner::

   >>> pair.match("717ak").group(1)
   '7'

   # Error because re.match() returns None, which doesn't have a group() method:
   >>> pair.match("718ak").group(1)
   Traceback (most recent call last):
     File "<pyshell#23>", line 1, in <module>
       re.match(r".*(.).*\1", "718ak").group(1)
   AttributeError: 'NoneType' object has no attribute 'group'

   >>> pair.match("354aa").group(1)
   'a'
				984
				985
				986	Simulating scanf()
				987	^^^^^^^^^^^^^^^^^^
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	988
				989	.. index:: single: scanf()
				990
				991	Python does not currently have an equivalent to :cfunc:`scanf`. Regular
				992	expressions are generally more powerful, though also more verbose, than
				993	:cfunc:`scanf` format strings. The table below offers some more-or-less
				994	equivalent mappings between :cfunc:`scanf` format tokens and regular
				995	expressions.
				996
+--------------------------------+---------------------------------------------+
| :cfunc:`scanf` Token           | Regular Expression                          |
+================================+=============================================+
| ``%c``                         | ``.``                                       |
+--------------------------------+---------------------------------------------+
| ``%5c``                        | ``.{5}``                                    |
+--------------------------------+---------------------------------------------+
| ``%d``                         | ``[-+]?\d+``                                |
+--------------------------------+---------------------------------------------+
| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
+--------------------------------+---------------------------------------------+
| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
+--------------------------------+---------------------------------------------+
| ``%o``                         | ``0[0-7]*``                                 |
+--------------------------------+---------------------------------------------+
| ``%s``                         | ``\S+``                                     |
+--------------------------------+---------------------------------------------+
| ``%u``                         | ``\d+``                                     |
+--------------------------------+---------------------------------------------+
| ``%x``, ``%X``                 | ``0[xX][\dA-Fa-f]+``                        |
+--------------------------------+---------------------------------------------+
				1018
				1019	To extract the filename and numbers from a string like ::
				1020
				1021	/usr/sbin/sendmail - 0 errors, 4 warnings
				1022
				1023	you would use a :cfunc:`scanf` format like ::
				1024
				1025	%s - %d errors, %d warnings
				1026
				1027	The equivalent regular expression would be ::
				1028
				1029	(\S+) - (\d+) errors, (\d+) warnings
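
Applied to the sample line, the regular expression above could be used roughly
as follows. The variable names are chosen for this sketch; unlike
:cfunc:`scanf`, the numeric fields come back as strings and are converted
explicitly.

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"

m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

# Groups are returned as strings, so convert the counts by hand.
filename = m.group(1)
errors = int(m.group(2))
warnings = int(m.group(3))

print(filename)   # /usr/sbin/sendmail
print(errors)     # 0
print(warnings)   # 4
```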
				1030
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1031
				1032	Avoiding recursion
				1033	^^^^^^^^^^^^^^^^^^
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1034
If you create regular expressions that require the engine to perform a lot of
recursion, you may encounter a :exc:`RuntimeError` exception with the message
``maximum recursion limit`` exceeded. For example, ::

   >>> import re
   >>> s = 'Begin ' + 1000*'a very long string ' + 'end'
   >>> re.match('Begin (\w| )*? end', s).end()
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
     File "/usr/local/lib/python2.5/re.py", line 132, in match
       return _compile(pattern, flags).match(string)
   RuntimeError: maximum recursion limit exceeded
				1047
				1048	You can often restructure your regular expression to avoid recursion.
				1049
				1050	Starting with Python 2.3, simple uses of the ``*?`` pattern are special-cased to
				1051	avoid recursion. Thus, the above regular expression can avoid recursion by
				1052	being recast as ``Begin [a-zA-Z0-9_ ]*?end``. As a further benefit, such
				1053	regular expressions will run faster than their recursive equivalents.
				1054
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1055
				1056	search() vs. match()
				1057	^^^^^^^^^^^^^^^^^^^^
				1058
In a nutshell, :func:`match` only attempts to match a pattern at the beginning
of a string, whereas :func:`search` will match a pattern anywhere in a string.
For example::
				1062
				1063	>>> re.match("o", "dog") # No match as "o" is not the first letter of "dog".
				1064	>>> re.search("o", "dog") # Match as search() looks everywhere in the string.
				1065	<_sre.SRE_Match object at 0x827e9f8>
				1066
				1067	.. note::
				1068
				1069	The following applies only to regular expression objects like those created
				1070	with ``re.compile("pattern")``, not the primitives
				1071	``re.match(pattern, string)`` or ``re.search(pattern, string)``.
				1072
				1073	:func:`match` has an optional second parameter that gives an index in the string
				1074	where the search is to start::
				1075
				1076	>>> pattern = re.compile("o")
				1077	>>> pattern.match("dog") # No match as "o" is not at the start of "dog."
				1078	# Equivalent to the above expression as 0 is the default starting index:
				1079	>>> pattern.match("dog", 0)
				1080	# Match as "o" is the 2nd character of "dog" (index 0 is the first):
				1081	>>> pattern.match("dog", 1)
				1082	<_sre.SRE_Match object at 0x827eb10>
				1083	>>> pattern.match("dog", 2) # No match as "o" is not the 3rd character of "dog."
				1084
				1085
				1086	Making a Phonebook
				1087	^^^^^^^^^^^^^^^^^^
				1088
				1089	:func:`split` splits a string into a list delimited by the passed pattern. The
				1090	method is invaluable for converting textual data into data structures that can be
				1091	easily read and modified by Python as demonstrated in the following example that
				1092	creates a phonebook.
				1093
First, here is the input. Normally it may come from a file; here we are using
triple-quoted string syntax::
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1096
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame^]	1097	>>> input = """Ross McFluff: 834.345.1254 155 Elm Street
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1098
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame^]	1099	Ronald Heathmore: 892.345.3428 436 Finley Avenue
				1100	Frank Burger: 925.541.7625 662 South Dogwood Way
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1101
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame^]	1102
				1103	Heather Albrecht: 548.326.4584 919 Park Place"""
				1104
				1105	The entries are separated by one or more newlines. Now we convert the string
				1106	into a list with each nonempty line having its own entry::
				1107
   >>> entries = re.split("\n+", input)
   >>> entries
   ['Ross McFluff: 834.345.1254 155 Elm Street',
   'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
   'Frank Burger: 925.541.7625 662 South Dogwood Way',
   'Heather Albrecht: 548.326.4584 919 Park Place']
				1114
Finally, split each entry into a list with first name, last name, telephone
number, and address. We use the ``maxsplit`` parameter of :func:`split`
because the address has spaces, our splitting pattern, in it::
				1118
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame^]	1119	>>> [re.split(":? ", entry, 3) for entry in entries]
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1120	[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
				1121	['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
				1122	['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
				1123	['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]
				1124
The ``:?`` pattern matches the colon after the last name, so that it does not
occur in the result list. With a ``maxsplit`` of ``4``, we could separate the
house number from the street name::
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1128
Georg Brandl	d6b20dc	2007-12-06 09:45:39 +0000	[diff] [blame^]	1129	>>> [re.split(":? ", entry, 4) for entry in entries]
Georg Brandl	b8df156	2007-12-05 18:30:48 +0000	[diff] [blame]	1130	[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
				1131	['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
				1132	['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
				1133	['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]
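
Going one step further, the split entries could be collected into a dictionary
keyed by full name. This is only a sketch: the shortened input, the
``phonebook`` name, and the choice of a ``(phone, address)`` tuple as the value
are all assumptions made for the example.

```python
import re

# A shortened copy of the input used above.
text = """Ross McFluff: 834.345.1254 155 Elm Street

Ronald Heathmore: 892.345.3428 436 Finley Avenue"""

phonebook = {}
for entry in re.split(r"\n+", text):
    # At most three splits, so the address keeps its spaces.
    first, last, phone, address = re.split(r":? ", entry, maxsplit=3)
    phonebook[first + " " + last] = (phone, address)

print(phonebook["Ross McFluff"])   # ('834.345.1254', '155 Elm Street')
```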
				1134
				1135
				1136	Text Munging
				1137	^^^^^^^^^^^^
				1138
				1139	:func:`sub` replaces every occurrence of a pattern with a string or the
				1140	result of a function. This example demonstrates using :func:`sub` with
				1141	a function to "munge" text, or randomize the order of all the characters
				1142	in each word of a sentence except for the first and last characters::
				1143
   >>> import random
   >>> def repl(m):
   ...     inner_word = list(m.group(2))
   ...     random.shuffle(inner_word)
   ...     return m.group(1) + "".join(inner_word) + m.group(3)
   >>> text = "Professor Abdolmalek, please report your absences promptly."
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
   >>> re.sub(r"(\w)(\w+)(\w)", repl, text)
   'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
				1153
				1154
				1155	Finding all Adverbs
				1156	^^^^^^^^^^^^^^^^^^^
				1157
:func:`findall` matches all occurrences of a pattern, not just the first
one as :func:`search` does. For example, if one were a writer and wanted to
find all of the adverbs in some text, he or she might use :func:`findall` in
the following manner::
				1162
				1163	>>> text = "He was carefully disguised but captured quickly by police."
				1164	>>> re.findall(r"\w+ly", text)
				1165	['carefully', 'quickly']
				1166
				1167
				1168	Finding all Adverbs and their Positions
				1169	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				1170
				1171	If one wants more information about all matches of a pattern than the matched
				1172	text, :func:`finditer` is useful as it provides instances of
				1173	:class:`MatchObject` instead of strings. Continuing with the previous example,
				1174	if one was a writer who wanted to find all of the adverbs and their positions
				1175	in some text, he or she would use :func:`finditer` in the following manner::
				1176
   >>> text = "He was carefully disguised but captured quickly by police."
   >>> for m in re.finditer(r"\w+ly", text):
   ...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
   07-16: carefully
   40-47: quickly
				1182
				1183
				1184	Raw String Notation
				1185	^^^^^^^^^^^^^^^^^^^
				1186
				1187	Raw string notation (``r"text"``) keeps regular expressions sane. Without it,
				1188	every backslash (``'\'``) in a regular expression would have to be prefixed with
				1189	another one to escape it. For example, the two following lines of code are
				1190	functionally identical::
				1191
				1192	>>> re.match(r"\W(.)\1\W", " ff ")
				1193	<_sre.SRE_Match object at 0x8262760>
				1194	>>> re.match("\\W(.)\\1\\W", " ff ")
				1195	<_sre.SRE_Match object at 0x82627a0>
				1196
				1197	When one wants to match a literal backslash, it must be escaped in the regular
				1198	expression. With raw string notation, this means ``r"\\"``. Without raw string
				1199	notation, one must use ``"\\\\"``, making the following lines of code
				1200	functionally identical::
				1201
				1202	>>> re.match(r"\\", r"\\")
				1203	<_sre.SRE_Match object at 0x827eb48>
				1204	>>> re.match("\\\\", r"\\")
				1205	<_sre.SRE_Match object at 0x827ec60>
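
The equivalences described in this section can be checked directly. The
variable names below are invented for the sketch.

```python
import re

# A raw string keeps the backslash; a regular string interprets it.
raw_length = len(r"\n")       # backslash and 'n'
cooked_length = len("\n")     # a single newline character

# Both spellings produce the same two-character pattern string.
same_pattern = (r"\\" == "\\\\")

# That pattern matches one literal backslash.
hit = re.match(r"\\", "\\")

print(raw_length)        # 2
print(cooked_length)     # 1
print(same_pattern)      # True
print(hit is not None)   # True
```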