Blame - Doc/library/re.rst - platform/external/python/cpython3

Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1
				2	:mod:`re` --- Regular expression operations
				3	===========================================
				4
				5	.. module:: re
				6	:synopsis: Regular expression operations.
				7	.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
				8	.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
				9
				10
				11
				12
				13	This module provides regular expression matching operations similar to
				14	those found in Perl. Both patterns and strings to be searched can be
				15	Unicode strings as well as 8-bit strings. The :mod:`re` module is
				16	always available.
				17
				18	Regular expressions use the backslash character (``'\'``) to indicate
				19	special forms or to allow special characters to be used without invoking
				20	their special meaning. This collides with Python's usage of the same
				21	character for the same purpose in string literals; for example, to match
				22	a literal backslash, one might have to write ``'\\\\'`` as the pattern
				23	string, because the regular expression must be ``\\``, and each
				24	backslash must be expressed as ``\\`` inside a regular Python string
				25	literal.
				26
				27	The solution is to use Python's raw string notation for regular expression
				28	patterns; backslashes are not handled in any special way in a string literal
				29	prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
				30	``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	31	newline. Usually patterns will be expressed in Python code using this raw
				32	string notation.
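As a small illustration of the two paragraphs above (the variable names are made up for this sketch), the following compares raw and regular string literals and matches a single literal backslash:

```python
import re

# r"\n" keeps the backslash: two characters. "\n" is a single newline.
raw_len = len(r"\n")
cooked_len = len("\n")

# The regular expression \\ matches one literal backslash. Written as a
# raw string it is r"\\"; as a regular string literal it is "\\\\".
same_pattern = (r"\\" == "\\\\")

path = "C:\\temp"  # the text C:\temp, which contains one backslash
m = re.search(r"\\", path)
```

Both spellings produce the same two-character pattern; the raw form is simply easier to read.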
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	33
				34	.. seealso::
				35
				36	Mastering Regular Expressions
				37	Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	38	second edition of the book no longer covers Python at all, but the first
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	39	edition covered writing good regular expression patterns in great detail.
				40
				41
				42	.. _re-syntax:
				43
				44	Regular Expression Syntax
				45	-------------------------
				46
				47	A regular expression (or RE) specifies a set of strings that matches it; the
				48	functions in this module let you check if a particular string matches a given
				49	regular expression (or if a given regular expression matches a particular
				50	string, which comes down to the same thing).
				51
				52	Regular expressions can be concatenated to form new regular expressions; if A
				53	and B are both regular expressions, then AB is also a regular expression.
				54	In general, if a string p matches A and another string q matches B, the
				55	string pq will match AB. This holds unless A or B contain low-precedence
				56	operations, impose boundary conditions between A and B, or have numbered group
				57	references. Thus, complex expressions can easily be constructed from simpler
				58	primitive expressions like the ones described here. For details of the theory
				59	and implementation of regular expressions, consult the Friedl book referenced
				60	above, or almost any textbook about compiler construction.
				61
				62	A brief explanation of the format of regular expressions follows. For further
				63	information and a gentler presentation, consult the Regular Expression HOWTO,
				64	accessible from http://www.python.org/doc/howto/.
				65
				66	Regular expressions can contain both special and ordinary characters. Most
				67	ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
				68	expressions; they simply match themselves. You can concatenate ordinary
				69	characters, so ``last`` matches the string ``'last'``. (In the rest of this
				70	section, we'll write REs in ``this special style``, usually without quotes, and
				71	strings to be matched ``'in single quotes'``.)
				72
				73	Some characters, like ``'|'`` or ``'('``, are special. Special
				74	characters either stand for classes of ordinary characters, or affect
				75	how the regular expressions around them are interpreted. Regular
				76	expression pattern strings may not contain null bytes, but can specify
				77	the null byte using the ``\number`` notation, e.g., ``'\x00'``.
				78
				79
				80	The special characters are:
				81
				82	.. %
				83
				84	``'.'``
				85	(Dot.) In the default mode, this matches any character except a newline. If
				86	the :const:`DOTALL` flag has been specified, this matches any character
				87	including a newline.
				88
				89	``'^'``
				90	(Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
				91	matches immediately after each newline.
				92
				93	``'$'``
				94	Matches the end of the string or just before the newline at the end of the
				95	string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
				96	matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
				97	only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
				98	matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode.
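The ``foo.$`` case described above can be checked directly. This is only a sketch; the variable names are illustrative:

```python
import re

text = 'foo1\nfoo2\n'

# Default mode: '$' matches only at the very end of the string, or just
# before the newline that ends the string.
default_match = re.search(r'foo.$', text).group(0)

# MULTILINE mode: '$' also matches before every newline, so the first
# line already satisfies the pattern.
multiline_match = re.search(r'foo.$', text, re.MULTILINE).group(0)
```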
				99
				100	``'*'``
				101	Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
				102	many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
				103	by any number of 'b's.
				104
				105	``'+'``
				106	Causes the resulting RE to match 1 or more repetitions of the preceding RE.
				107	``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
				108	match just 'a'.
				109
				110	``'?'``
				111	Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
				112	``ab?`` will match either 'a' or 'ab'.
				113
				114	``*?``, ``+?``, ``??``
				115	The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
				116	as much text as possible. Sometimes this behaviour isn't desired; if the RE
				117	``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
				118	string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
				119	perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
				120	characters as possible will be matched. Using ``.*?`` in the previous
				121	expression will match only ``'<H1>'``.
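A minimal sketch of the greedy and non-greedy behaviour just described, using the same ``'<H1>title</H1>'`` string:

```python
import re

text = '<H1>title</H1>'

# Greedy: '.*' grabs as much as it can, so the match runs to the last '>'.
greedy = re.match(r'<.*>', text).group(0)

# Non-greedy: '.*?' stops at the first '>' that lets the pattern succeed.
minimal = re.match(r'<.*?>', text).group(0)
```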
				122
				123	``{m}``
				124	Specifies that exactly m copies of the previous RE should be matched; fewer
				125	matches cause the entire RE not to match. For example, ``a{6}`` will match
				126	exactly six ``'a'`` characters, but not five.
				127
				128	``{m,n}``
				129	Causes the resulting RE to match from m to n repetitions of the preceding
				130	RE, attempting to match as many repetitions as possible. For example,
				131	``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting m specifies a
				132	lower bound of zero, and omitting n specifies an infinite upper bound. As an
				133	example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
				134	followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
				135	modifier would be confused with the previously described form.
				136
				137	``{m,n}?``
				138	Causes the resulting RE to match from m to n repetitions of the preceding
				139	RE, attempting to match as few repetitions as possible. This is the
				140	non-greedy version of the previous qualifier. For example, on the
				141	6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
				142	while ``a{3,5}?`` will only match 3 characters.
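The counted repetitions above can be tried out as follows (a sketch; the names are illustrative only):

```python
import re

# Greedy and non-greedy bounded repetition on a 6-character string.
greedy = re.match(r'a{3,5}', 'aaaaaa').group(0)
minimal = re.match(r'a{3,5}?', 'aaaaaa').group(0)

# Open upper bound: at least four 'a' characters followed by 'b'.
four_ok = re.match(r'a{4,}b', 'aaaab')
many_ok = re.match(r'a{4,}b', 'a' * 1000 + 'b')
three_fails = re.match(r'a{4,}b', 'aaab')
```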
				143
				144	``'\'``
				145	Either escapes special characters (permitting you to match characters like
				146	``'*'``, ``'?'``, and so forth), or signals a special sequence; special
				147	sequences are discussed below.
				148
				149	If you're not using a raw string to express the pattern, remember that Python
				150	also uses the backslash as an escape sequence in string literals; if the escape
				151	sequence isn't recognized by Python's parser, the backslash and subsequent
				152	character are included in the resulting string. However, if Python would
				153	recognize the resulting sequence, the backslash should be repeated twice. This
				154	is complicated and hard to understand, so it's highly recommended that you use
				155	raw strings for all but the simplest expressions.
				156
				157	``[]``
				158	Used to indicate a set of characters. Characters can be listed individually, or
				159	a range of characters can be indicated by giving two characters and separating
				160	them by a ``'-'``. Special characters are not active inside sets. For example,
				161	``[akm$]`` will match any of the characters ``'a'``, ``'k'``,
				162	``'m'``, or ``'$'``; ``[a-z]`` will match any lowercase letter, and
				163	``[a-zA-Z0-9]`` matches any letter or digit. Character classes such
				164	as ``\w`` or ``\S`` (defined below) are also acceptable inside a
				165	range, although the characters they match depend on whether :const:`LOCALE`
				166	or :const:`UNICODE` mode is in force. If you want to include a
				167	``']'`` or a ``'-'`` inside a set, precede it with a backslash, or
				168	place it as the first character. The pattern ``[]]`` will match
				169	``']'``, for example.
				170
				171	You can match the characters not within a range by :dfn:`complementing` the set.
				172	This is indicated by including a ``'^'`` as the first character of the set;
				173	``'^'`` elsewhere will simply match the ``'^'`` character. For example,
				174	``[^5]`` will match any character except ``'5'``, and ``[^^]`` will match any
				175	character except ``'^'``.
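The set examples above, written out as a short sketch with :func:`findall` (the input strings are invented for illustration):

```python
import re

# '$' has no special meaning inside a set.
listed = re.findall(r'[akm$]', 'make $5')

# Ranges can be combined in one set.
alnum = re.findall(r'[a-zA-Z0-9]', 'a-Z 9!')

# A leading '^' complements the set; elsewhere '^' is literal.
not_five = re.findall(r'[^5]', '1525')
not_caret = re.findall(r'[^^]', 'a^b')

# ']' placed first in the set is taken literally.
bracket = re.match(r'[]]', ']').group(0)
```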
				176
				177	``'|'``
				178	``A|B``, where A and B can be arbitrary REs, creates a regular expression that
				179	will match either A or B. An arbitrary number of REs can be separated by the
				180	``'|'`` in this way. This can be used inside groups (see below) as well. As
				181	the target string is scanned, REs separated by ``'|'`` are tried from left to
				182	right. When one pattern completely matches, that branch is accepted. This means
				183	that once ``A`` matches, ``B`` will not be tested further, even if it would
				184	produce a longer overall match. In other words, the ``'|'`` operator is never
				185	greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
				186	character class, as in ``[|]``.
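A brief sketch of the left-to-right behaviour of alternation: the order of the branches decides which one wins, not the length of the match.

```python
import re

# The first branch that matches is accepted, even if a later branch
# would match more text.
short_first = re.match(r'a|ab', 'ab').group(0)
long_first = re.match(r'ab|a', 'ab').group(0)

# Two ways to match a literal vertical bar.
escaped = re.search(r'\|', 'x|y').group(0)
in_class = re.search(r'[|]', 'x|y').group(0)
```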
				187
				188	``(...)``
				189	Matches whatever regular expression is inside the parentheses, and indicates the
				190	start and end of a group; the contents of a group can be retrieved after a match
				191	has been performed, and can be matched later in the string with the ``\number``
				192	special sequence, described below. To match the literals ``'('`` or ``')'``,
				193	use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
				194
				195	``(?...)``
				196	This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
				197	otherwise). The first character after the ``'?'`` determines what the meaning
				198	and further syntax of the construct is. Extensions usually do not create a new
				199	group; ``(?P<name>...)`` is the only exception to this rule. Following are the
				200	currently supported extensions.
				201
				202	``(?iLmsux)``
				203	(One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
				204	``'u'``, ``'x'``.) The group matches the empty string; the letters
				205	set the corresponding flags: :const:`re.I` (ignore case),
				206	:const:`re.L` (locale dependent), :const:`re.M` (multi-line),
				207	:const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
				208	and :const:`re.X` (verbose), for the entire regular expression. (The
				209	flags are described in :ref:`contents-of-module-re`.) This
				210	is useful if you wish to include the flags as part of the regular
				211	expression, instead of passing a flag argument to the
				212	:func:`compile` function.
				213
				214	Note that the ``(?x)`` flag changes how the expression is parsed. It should be
				215	used first in the expression string, or after one or more whitespace characters.
				216	If there are non-whitespace characters before the flag, the results are
				217	undefined.
				218
				219	``(?:...)``
				220	A non-grouping version of regular parentheses. Matches whatever regular
				221	expression is inside the parentheses, but the substring matched by the group
				222	cannot be retrieved after performing a match or referenced later in the
				223	pattern.
				224
				225	``(?P<name>...)``
				226	Similar to regular parentheses, but the substring matched by the group is
				227	accessible via the symbolic group name name. Group names must be valid Python
				228	identifiers, and each group name must be defined only once within a regular
				229	expression. A symbolic group is also a numbered group, just as if the group
				230	were not named. So the group named 'id' in the example below can also be
				231	referenced as the numbered group 1.
				232
				233	For example, if the pattern is ``(?P<id>[a-zA-Z_]\w*)``, the group can be
				234	referenced by its name in arguments to methods of match objects, such as
				235	``m.group('id')`` or ``m.end('id')``, and also by name in pattern text (for
				236	example, ``(?P=id)``) and replacement text (such as ``\g<id>``).
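The following sketch exercises the ``id`` group from the paragraph above in each of the places it can be referenced. The sample strings are invented for illustration:

```python
import re

m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam_1 = 2')
by_name = m.group('id')    # access by symbolic name
by_number = m.group(1)     # the same group is also group 1
end_pos = m.end('id')      # match object methods accept the name too

# Reference by name inside the pattern itself: a repeated word.
doubled = re.search(r'(?P<id>[a-zA-Z_]\w*) (?P=id)', 'the the end').group(0)

# Reference by name in replacement text.
wrapped = re.sub(r'(?P<id>[a-zA-Z_]\w*)', r'<\g<id>>', 'a b')
```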
				237
				238	``(?P=name)``
				239	Matches whatever text was matched by the earlier group named name.
				240
				241	``(?#...)``
				242	A comment; the contents of the parentheses are simply ignored.
				243
				244	``(?=...)``
				245	Matches if ``...`` matches next, but doesn't consume any of the string. This is
				246	called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
				247	``'Isaac '`` only if it's followed by ``'Asimov'``.
				248
				249	``(?!...)``
				250	Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
				251	For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's not
				252	followed by ``'Asimov'``.
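A short sketch of both lookahead forms. Note that the text inspected by the assertion is not part of the match:

```python
import re

# Positive lookahead: 'Isaac ' matches only when 'Asimov' follows.
pos_hit = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
pos_miss = re.search(r'Isaac (?=Asimov)', 'Isaac Newton')

# Negative lookahead: 'Isaac ' matches only when 'Asimov' does not follow.
neg_hit = re.search(r'Isaac (?!Asimov)', 'Isaac Newton')
neg_miss = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')
```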
				253
				254	``(?<=...)``
				255	Matches if the current position in the string is preceded by a match for ``...``
				256	that ends at the current position. This is called a :dfn:`positive lookbehind
				257	assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
				258	lookbehind will back up 3 characters and check if the contained pattern matches.
				259	The contained pattern must only match strings of some fixed length, meaning that
				260	``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
				261	patterns which start with positive lookbehind assertions will never match at the
				262	beginning of the string being searched; you will most likely want to use the
				263	:func:`search` function rather than the :func:`match` function::
				264
				265	>>> import re
				266	>>> m = re.search('(?<=abc)def', 'abcdef')
				267	>>> m.group(0)
				268	'def'
				269
				270	This example looks for a word following a hyphen::
				271
				272	>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
				273	>>> m.group(0)
				274	'egg'
				275
				276	``(?<!...)``
				277	Matches if the current position in the string is not preceded by a match for
				278	``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
				279	positive lookbehind assertions, the contained pattern must only match strings of
				280	some fixed length. Patterns which start with negative lookbehind assertions may
				281	match at the beginning of the string being searched.
				282
				283	``(?(id/name)yes-pattern|no-pattern)``
				284	Will try to match with ``yes-pattern`` if the group with given id or name
				285	exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
				286	can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
				287	matching pattern, which will match with ``'<user@host.com>'`` as well as
				288	``'user@host.com'``, but not with ``'<user@host.com'``.
				289
				290	.. versionadded:: 2.4
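The email pattern above can be tried with :func:`match`, which anchors the attempt at the start of the string. This is a sketch only; the pattern is, as noted, a poor email matcher:

```python
import re

pattern = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')

# Group 1 matched '<', so the conditional requires a closing '>'.
bracketed = pattern.match('<user@host.com>')

# Group 1 did not participate, so nothing more is required.
bare = pattern.match('user@host.com')

# Opening '<' without the closing '>' fails at the start of the string.
unbalanced = pattern.match('<user@host.com')
```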
				291
				292	The special sequences consist of ``'\'`` and a character from the list below.
				293	If the ordinary character is not on the list, then the resulting RE will match
				294	the second character. For example, ``\$`` matches the character ``'$'``.
				295
				296	.. %
				297
				298	``\number``
				299	Matches the contents of the group of the same number. Groups are numbered
				300	starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
				301	but not ``'the end'`` (note the space after the group). This special sequence
				302	can only be used to match one of the first 99 groups. If the first digit of
				303	number is 0, or number is 3 octal digits long, it will not be interpreted as
				304	a group match, but as the character with octal value number. Inside the
				305	``'['`` and ``']'`` of a character class, all numeric escapes are treated as
				306	characters.
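A sketch of the ``(.+) \1`` example. It uses :func:`match` so that the group has to begin at the start of the string; an unanchored search of ``'the end'`` would find the shorter repeat ``'e e'``.

```python
import re

repeat = re.compile(r'(.+) \1')

words = repeat.match('the the')
digits = repeat.match('55 55')
different = repeat.match('the end')
```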
				307
				308	``\A``
				309	Matches only at the start of the string.
				310
				311	``\b``
				312	Matches the empty string, but only at the beginning or end of a word. A word is
				313	defined as a sequence of alphanumeric or underscore characters, so the end of a
				314	word is indicated by whitespace or a non-alphanumeric, non-underscore character.
				315	Note that ``\b`` is defined as the boundary between ``\w`` and ``\W``, so the
				316	precise set of characters deemed to be alphanumeric depends on the values of the
				317	``UNICODE`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
				318	the backspace character, for compatibility with Python's string literals.
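The two meanings of ``\b`` can be seen in this sketch (the sample strings are invented):

```python
import re

# Outside a set, \b is a zero-width word boundary: 'cat' inside
# 'concat' is preceded by a word character, so it is skipped.
whole_words = re.findall(r'\bcat\b', 'cat concat cat.')

# Inside a set, \b is the backspace character (0x08).
backspace = re.search(r'[\b]', 'a\bc').group(0)
```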
				319
				320	``\B``
				321	Matches the empty string, but only when it is not at the beginning or end of a
				322	word. This is just the opposite of ``\b``, so is also subject to the settings
				323	of ``LOCALE`` and ``UNICODE``.
				324
				325	``\d``
				326	When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
				327	is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
				328	whatever is classified as a digit in the Unicode character properties database.
				329
				330	``\D``
				331	When the :const:`UNICODE` flag is not specified, matches any non-digit
				332	character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
				333	will match anything other than characters marked as digits in the Unicode
				334	character properties database.
				335
				336	``\s``
				337	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				338	any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v]``. With
				339	:const:`LOCALE`, it will match this set plus whatever characters are defined as
				340	space for the current locale. If :const:`UNICODE` is set, this will match the
				341	characters ``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
				342	character properties database.
				343
				344	``\S``
				345	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				346	any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v]``.
				347	With :const:`LOCALE`, it will match any character not in this set, and not
				348	defined as space in the current locale. If :const:`UNICODE` is set, this will
				349	match anything other than ``[ \t\n\r\f\v]`` and characters marked as space in
				350	the Unicode character properties database.
				351
				352	``\w``
				353	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				354	any alphanumeric character and the underscore; this is equivalent to the set
				355	``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
				356	whatever characters are defined as alphanumeric for the current locale. If
				357	:const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
				358	is classified as alphanumeric in the Unicode character properties database.
				359
				360	``\W``
				361	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				362	any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
				363	With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
				364	not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
				365	this will match anything other than ``[0-9_]`` and characters marked as
				366	alphanumeric in the Unicode character properties database.
				367
				368	``\Z``
				369	Matches only at the end of the string.
				370
				371	Most of the standard escapes supported by Python string literals are also
				372	accepted by the regular expression parser::
				373
				374	\a \b \f \n
				375	\r \t \v \x
				376	\\
				377
				378	Octal escapes are included in a limited form: If the first digit is a 0, or if
				379	there are three octal digits, it is considered an octal escape. Otherwise, it is
				380	a group reference. As for string literals, octal escapes are always at most
				381	three digits in length.
				382
				383	.. % Note the lack of a period in the section title; it causes problems
				384	.. % with readers of the GNU info version. See http://www.python.org/sf/581414.
				385
				386
				387	.. _matching-searching:
				388
				389	Matching vs Searching
				390	---------------------
				391
				392	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
				393
				394
				395	Python offers two different primitive operations based on regular expressions:
Georg Brandl	604c121	2007-08-23 21:36:05 +0000	[diff] [blame]	396	match checks for a match only at the beginning of the string, while
				397	search checks for a match anywhere in the string (this is what Perl does
				398	by default).
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	399
Georg Brandl	604c121	2007-08-23 21:36:05 +0000	[diff] [blame]	400	Note that match may differ from search even when using a regular expression
				401	beginning with ``'^'``: ``'^'`` matches only at the start of the string, or in
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	402	:const:`MULTILINE` mode also immediately following a newline. The "match"
				403	operation succeeds only if the pattern matches at the start of the string
				404	regardless of mode, or at the starting position given by the optional pos
				405	argument regardless of whether a newline precedes it.
				406
				407	.. % Examples from Tim Peters:
				408
				409	::
				410
				411	re.compile("a").match("ba", 1) # succeeds
				412	re.compile("^a").search("ba", 1) # fails; 'a' not at start
				413	re.compile("^a").search("\na", 1) # fails; 'a' not at start
				414	re.compile("^a", re.M).search("\na", 1) # succeeds
				415	re.compile("^a", re.M).search("ba", 1) # fails; no preceding \n
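Without the pos argument, the basic difference between the two operations can be sketched like this:

```python
import re

# match() only tries the pattern at the beginning of the string.
at_start = re.match(r'c', 'abcdef')

# search() scans forward until the pattern matches somewhere.
anywhere = re.search(r'c', 'abcdef')
```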
				416
				417
				418	.. _contents-of-module-re:
				419
				420	Module Contents
				421	---------------
				422
				423	The module defines several functions, constants, and an exception. Some of the
				424	functions are simplified versions of the full featured methods for compiled
				425	regular expressions. Most non-trivial applications always use the compiled
				426	form.
				427
				428
				429	.. function:: compile(pattern[, flags])
				430
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	431	Compile a regular expression pattern into a regular expression object, which
				432	can be used for matching using its :func:`match` and :func:`search` methods,
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	433	described below.
				434
				435	The expression's behaviour can be modified by specifying a flags value.
				436	Values can be any of the following variables, combined using bitwise OR (the
				437	``|`` operator).
				438
				439	The sequence ::
				440
				441	prog = re.compile(pat)
				442	result = prog.match(str)
				443
				444	is equivalent to ::
				445
				446	result = re.match(pat, str)
				447
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	448	but the version using :func:`compile` is more efficient when the expression
				449	will be used several times in a single program.
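A minimal sketch of reusing one compiled object across several strings. The pattern and inputs are illustrative; both forms give the same results:

```python
import re

inputs = ['42 apples', 'pears', '7up']

# Compile once, then reuse the object for every string.
prog = re.compile(r'\d+')
compiled = [prog.match(s) is not None for s in inputs]

# The module-level function gives the same answers.
module_level = [re.match(r'\d+', s) is not None for s in inputs]
```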
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	450
				451	.. % (The compiled version of the last pattern passed to
				452	.. % \function{re.match()} or \function{re.search()} is cached, so
				453	.. % programs that use only a single regular expression at a time needn't
				454	.. % worry about compiling regular expressions.)
				455
				456
				457	.. data:: I
				458	IGNORECASE
				459
				460	Perform case-insensitive matching; expressions like ``[A-Z]`` will match
				461	lowercase letters, too. This is not affected by the current locale.
				462
				463
				464	.. data:: L
				465	LOCALE
				466
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	467	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
				468	current locale.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	469
				470
				471	.. data:: M
				472	MULTILINE
				473
				474	When specified, the pattern character ``'^'`` matches at the beginning of the
				475	string and at the beginning of each line (immediately following each newline);
				476	and the pattern character ``'$'`` matches at the end of the string and at the
				477	end of each line (immediately preceding each newline). By default, ``'^'``
				478	matches only at the beginning of the string, and ``'$'`` only at the end of the
				479	string and immediately before the newline (if any) at the end of the string.
				480
				481
				482	.. data:: S
				483	DOTALL
				484
				485	Make the ``'.'`` special character match any character at all, including a
				486	newline; without this flag, ``'.'`` will match anything except a newline.
				487
				488
				489	.. data:: U
				490	UNICODE
				491
				492	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
				493	on the Unicode character properties database.
				494
				495	.. versionadded:: 2.0
				496
				497
				498	.. data:: X
				499	VERBOSE
				500
				501	This flag allows you to write regular expressions that look nicer. Whitespace
				502	within the pattern is ignored, except when in a character class or preceded by
				503	an unescaped backslash, and, when a line contains a ``'#'`` neither in a
				504	character class nor preceded by an unescaped backslash, all characters from the
				505	leftmost such ``'#'`` through the end of the line are ignored.
				506
				507	.. % XXX should add an example here
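As one possible illustration of this flag (the pattern is an invented example, not taken from the module), the two objects below match a decimal number in the same way:

```python
import re

# Whitespace and '#' comments inside the pattern are ignored.
number = re.compile(r"""
    \d+   # integer part
    \.    # literal decimal point
    \d*   # optional fractional digits
""", re.VERBOSE)

# The same pattern written compactly.
compact = re.compile(r"\d+\.\d*")

verbose_result = number.match('3.14 rest').group(0)
compact_result = compact.match('3.14 rest').group(0)
```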
				508
				509
				510	.. function:: search(pattern, string[, flags])
				511
				512	Scan through string looking for a location where the regular expression
				513	pattern produces a match, and return a corresponding :class:`MatchObject`
				514	instance. Return ``None`` if no position in the string matches the pattern; note
				515	that this is different from finding a zero-length match at some point in the
				516	string.
				517
				518
				519	.. function:: match(pattern, string[, flags])
				520
				521	If zero or more characters at the beginning of string match the regular
				522	expression pattern, return a corresponding :class:`MatchObject` instance.
				523	Return ``None`` if the string does not match the pattern; note that this is
				524	different from a zero-length match.
				525
				526	.. note::
				527
				528	If you want to locate a match anywhere in string, use :meth:`search` instead.
				529
				530
				531	.. function:: split(pattern, string[, maxsplit=0])
				532
				533	Split string by the occurrences of pattern. If capturing parentheses are
				534	used in pattern, then the text of all groups in the pattern are also returned
				535	as part of the resulting list. If maxsplit is nonzero, at most maxsplit
				536	splits occur, and the remainder of the string is returned as the final element
				537	of the list. (Incompatibility note: in the original Python 1.5 release,
				538	maxsplit was ignored. This has been fixed in later releases.) ::
				539
				540	>>> re.split(r'\W+', 'Words, words, words.')
				541	['Words', 'words', 'words', '']
				542	>>> re.split(r'(\W+)', 'Words, words, words.')
				543	['Words', ', ', 'words', ', ', 'words', '.', '']
				544	>>> re.split(r'\W+', 'Words, words, words.', 1)
				545	['Words', 'words, words.']
				546
Skip Montanaro	222907d	2007-09-01 17:40:03 +0000	[diff] [blame]	547	Note that split will never split a string on an empty pattern match.
				548	For example ::
				549
				550	>>> re.split('x*', 'foo')
				551	['foo']
				552	>>> re.split("(?m)^$", "foo\n\nbar\n")
				553	['foo\n\nbar\n']
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	554
				555	.. function:: findall(pattern, string[, flags])
				556
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	557	Return all non-overlapping matches of pattern in string, as a list of
				558	strings. If one or more groups are present in the pattern, return a list of
				559	groups; this will be a list of tuples if the pattern has more than one group.
				560	Empty matches are included in the result unless they touch the beginning of
				561	another match.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	562
				563	.. versionadded:: 1.5.2
				564
				565	.. versionchanged:: 2.4
				566	Added the optional flags argument.
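How the shape of the result depends on the number of groups can be sketched as follows (the input strings are invented):

```python
import re

# No groups: a list of the matched strings.
no_groups = re.findall(r'\w+', 'Words, words.')

# One group: a list of that group's text.
one_group = re.findall(r'(\w+)=', 'a=1, b=2')

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(\w+)=(\d+)', 'a=1, b=2')
```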
				567
				568
				569	.. function:: finditer(pattern, string[, flags])
				570
Georg Brandl	e7a0990	2007-10-21 12:10:28 +0000	[diff] [blame]	571	Return an :term:`iterator` yielding :class:`MatchObject` instances over all
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	572	non-overlapping matches for the RE pattern in string. Empty matches are
				573	included in the result unless they touch the beginning of another match.
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	574
				575	.. versionadded:: 2.2
				576
				577	.. versionchanged:: 2.4
				578	Added the optional flags argument.
				579
				580
				581	.. function:: sub(pattern, repl, string[, count])
				582
				583	Return the string obtained by replacing the leftmost non-overlapping occurrences
				584	of pattern in string by the replacement repl. If the pattern isn't found,
				585	string is returned unchanged. repl can be a string or a function; if it is
				586	a string, any backslash escapes in it are processed. That is, ``\n`` is
				587	converted to a single newline character, ``\r`` is converted to a carriage return, and
				588	so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such
				589	as ``\6``, are replaced with the substring matched by group 6 in the pattern.
				590	For example::
				591
				592	>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
				593	... r'static PyObject*\npy_\1(void)\n{',
				594	... 'def myfunc():')
				595	'static PyObject*\npy_myfunc(void)\n{'
				596
				597	If repl is a function, it is called for every non-overlapping occurrence of
				598	pattern. The function takes a single match object argument, and returns the
				599	replacement string. For example::
				600
				601	>>> def dashrepl(matchobj):
				602	...     if matchobj.group(0) == '-': return ' '
				603	...     else: return '-'
				604	>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
				605	'pro--gram files'
				606
				607	The pattern may be a string or an RE object; if you need to specify regular
				608	expression flags, you must use an RE object, or use embedded modifiers in a
				609	pattern; for example, ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``.
				610
				611	The optional argument count is the maximum number of pattern occurrences to be
				612	replaced; count must be a non-negative integer. If omitted or zero, all
				613	occurrences will be replaced. Empty matches for the pattern are replaced only
				614	when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
				615	``'-a-b-c-'``.
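
As a small illustration of the count argument (the sample text is invented for
this sketch), limiting the number of replacements leaves later occurrences
untouched:

```python
import re

text = 'a1 b22 c333 d4444'

# Replace only the first two runs of digits; later runs are left alone.
limited = re.sub(r'\d+', '#', text, count=2)

# Omitting count (or passing 0) replaces every occurrence.
unlimited = re.sub(r'\d+', '#', text)

print(limited)    # a# b# c333 d4444
print(unlimited)  # a# b# c# d#
```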
				616
				617	In addition to character escapes and backreferences as described above,
				618	``\g<name>`` will use the substring matched by the group named ``name``, as
				619	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
				620	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
				621	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
				622	reference to group 20, not a reference to group 2 followed by the literal
				623	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
				624	substring matched by the RE.
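
The following sketch shows the ``\g<...>`` forms in a replacement string. The
group names ``key`` and ``value`` and the sample inputs are chosen only for
illustration:

```python
import re

# Swap the two sides of "key=value" using named backreferences.
swapped = re.sub(r'(?P<key>\w+)=(?P<value>\w+)',
                 r'\g<value>=\g<key>',
                 'color=red')

# \g<1>0 means "group 1, then a literal 0"; writing \10 instead
# would be read as a reference to group 10.
padded = re.sub(r'(\d)', r'\g<1>0', 'a1b2')

print(swapped)  # red=color
print(padded)   # a10b20
```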
				625
				626
				627	.. function:: subn(pattern, repl, string[, count])
				628
				629	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
				630	number_of_subs_made)``.
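
A minimal sketch of how the returned tuple might be unpacked (the input string
is invented for the example):

```python
import re

# Collapse each run of whitespace to a single space and count the changes.
new_string, number_of_subs_made = re.subn(r'\s+', ' ', 'too   many    spaces')

print(new_string)           # too many spaces
print(number_of_subs_made)  # 2
```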
				631
				632
				633	.. function:: escape(string)
				634
				635	Return string with all non-alphanumerics backslashed; this is useful if you
				636	want to match an arbitrary literal string that may have regular expression
				637	metacharacters in it.
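
For instance, an unescaped ``.`` matches any character, while the escaped form
matches only a literal dot. The strings below are illustrative only:

```python
import re

literal = 'a.b'

# Used directly as a pattern, '.' is a metacharacter and matches 'x'.
loose = re.match(literal, 'axb')

# After escaping, only the exact text 'a.b' matches.
strict_miss = re.match(re.escape(literal), 'axb')
strict_hit = re.match(re.escape(literal), 'a.b')

print(loose is not None)       # True
print(strict_miss is None)     # True
print(strict_hit is not None)  # True
```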
				638
				639
				640	.. exception:: error
				641
				642	Exception raised when a string passed to one of the functions here is not a
				643	valid regular expression (for example, it might contain unmatched parentheses)
				644	or when some other error occurs during compilation or matching. It is never an
				645	error if a string contains no match for a pattern.
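
One way to use this exception is to validate a pattern supplied at run time.
The helper ``is_valid_pattern`` below is a hypothetical name introduced for
this sketch, not part of the module:

```python
import re

def is_valid_pattern(pattern):
    """Return True if pattern compiles as a regular expression."""
    try:
        re.compile(pattern)
    except re.error:
        # Raised for malformed patterns such as unmatched parentheses.
        return False
    return True

print(is_valid_pattern('(closed)'))   # True
print(is_valid_pattern('(unclosed'))  # False

# A pattern that merely fails to match is not an error: the result is None.
print(re.match('x', 'abc'))           # None
```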
				646
				647
				648	.. _re-objects:
				649
				650	Regular Expression Objects
				651	--------------------------
				652
				653	Compiled regular expression objects support the following methods and
				654	attributes:
				655
				656
				657	.. method:: RegexObject.match(string[, pos[, endpos]])
				658
				659	If zero or more characters at the beginning of string match this regular
				660	expression, return a corresponding :class:`MatchObject` instance. Return
				661	``None`` if the string does not match the pattern; note that this is different
				662	from a zero-length match.
				663
				664	.. note::
				665
				666	If you want to locate a match anywhere in string, use :meth:`search` instead.
				667
				668	The optional second parameter pos gives an index in the string where the
				669	search is to start; it defaults to ``0``. This is not completely equivalent to
				670	slicing the string; the ``'^'`` pattern character matches at the real beginning
				671	of the string and at positions just after a newline, but not necessarily at the
				672	index where the search is to start.
				673
				674	The optional parameter endpos limits how far the string will be searched; it
				675	will be as if the string is endpos characters long, so only the characters
				676	from pos to ``endpos - 1`` will be searched for a match. If endpos is less
				677	than pos, no match will be found; otherwise, if rx is a compiled regular
				678	expression object, ``rx.match(string, 0, 50)`` is equivalent to
				679	``rx.match(string[:50], 0)``.
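
The difference between pos and slicing, and the effect of endpos, can be
sketched as follows. The patterns and strings are invented for the example:

```python
import re

rx = re.compile(r'^b')
text = 'abc'

# '^' anchors at the real beginning of the string, so starting the
# match at index 1 does not make it succeed.
from_pos = rx.match(text, 1)

# Slicing produces a new string that really does begin with 'b'.
from_slice = rx.match(text[1:])

# endpos acts as if the string were only endpos characters long.
digits = re.compile(r'\d+')
limited = digits.match('123456', 0, 3).group()

print(from_pos)            # None
print(from_slice.group())  # b
print(limited)             # 123
```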
				680
				681
				682	.. method:: RegexObject.search(string[, pos[, endpos]])
				683
				684	Scan through string looking for a location where this regular expression
				685	produces a match, and return a corresponding :class:`MatchObject` instance.
				686	Return ``None`` if no position in the string matches the pattern; note that this
				687	is different from finding a zero-length match at some point in the string.
				688
				689	The optional pos and endpos parameters have the same meaning as for the
				690	:meth:`match` method.
				691
				692
				693	.. method:: RegexObject.split(string[, maxsplit=0])
				694
				695	Identical to the :func:`split` function, using the compiled pattern.
				696
				697
				698	.. method:: RegexObject.findall(string[, pos[, endpos]])
				699
				700	Identical to the :func:`findall` function, using the compiled pattern.
				701
				702
				703	.. method:: RegexObject.finditer(string[, pos[, endpos]])
				704
				705	Identical to the :func:`finditer` function, using the compiled pattern.
				706
				707
				708	.. method:: RegexObject.sub(repl, string[, count=0])
				709
				710	Identical to the :func:`sub` function, using the compiled pattern.
				711
				712
				713	.. method:: RegexObject.subn(repl, string[, count=0])
				714
				715	Identical to the :func:`subn` function, using the compiled pattern.
				716
				717
				718	.. attribute:: RegexObject.flags
				719
				720	The flags argument used when the RE object was compiled, or ``0`` if no flags
				721	were provided.
				722
				723
				724	.. attribute:: RegexObject.groupindex
				725
				726	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
				727	numbers. The dictionary is empty if no symbolic groups were used in the
				728	pattern.
				729
				730
				731	.. attribute:: RegexObject.pattern
				732
				733	The pattern string from which the RE object was compiled.
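
A short sketch of inspecting these attributes on a compiled object. The date
pattern and its group names are assumptions made for the example:

```python
import re

source = r'(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})'
rx = re.compile(source)

# Only named groups appear in groupindex; the unnamed third group does not.
names = dict(rx.groupindex)

print(names == {'year': 1, 'month': 2})  # True
print(rx.pattern == source)              # True
```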
				734
				735
				736	.. _match-objects:
				737
				738	Match Objects
				739	-------------
				740
Georg Brandl	ba2e519	2007-09-27 06:26:58 +0000	[diff] [blame]	741	Match objects always have a boolean value of :const:`True`, so that you can test
				742	whether e.g. :func:`match` resulted in a match with a simple if statement. They
				743	support the following methods and attributes:
Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	744
				745
				746	.. method:: MatchObject.expand(template)
				747
				748	Return the string obtained by doing backslash substitution on the template
				749	string template, as done by the :meth:`sub` method. Escapes such as ``\n`` are
				750	converted to the appropriate characters, and numeric backreferences (``\1``,
				751	``\2``) and named backreferences (``\g<1>``, ``\g<name>``) are replaced by the
				752	contents of the corresponding group.
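
As an illustration, a template can mix numeric and named backreferences. The
group names and the input are invented for this sketch:

```python
import re

m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Ada Lovelace')

# \g<last> refers to the named group, \1 to the first group.
formatted = m.expand(r'\g<last>, \1')

print(formatted)  # Lovelace, Ada
```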
				753
				754
				755	.. method:: MatchObject.group([group1, ...])
				756
				757	Returns one or more subgroups of the match. If there is a single argument, the
				758	result is a single string; if there are multiple arguments, the result is a
				759	tuple with one item per argument. Without arguments, group1 defaults to zero
				760	(the whole match is returned). If a groupN argument is zero, the corresponding
				761	return value is the entire matching string; if it is in the inclusive range
				762	[1..99], it is the string matching the corresponding parenthesized group. If a
				763	group number is negative or larger than the number of groups defined in the
				764	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
				765	part of the pattern that did not match, the corresponding result is ``None``.
				766	If a group is contained in a part of the pattern that matched multiple times,
				767	the last match is returned.
				768
				769	If the regular expression uses the ``(?P<name>...)`` syntax, the groupN
				770	arguments may also be strings identifying groups by their group name. If a
				771	string argument is not used as a group name in the pattern, an :exc:`IndexError`
				772	exception is raised.
				773
				774	A moderately complicated example::
				775
				776	m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
				777
				778	After performing this match, ``m.group(1)`` is ``'3'``, as is
				779	``m.group('int')``, and ``m.group(2)`` is ``'14'``.
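
Continuing with the same pattern, the sketch below shows the no-argument and
multiple-argument forms, and the error raised for a group that does not exist:

```python
import re

m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')

whole = m.group()         # no arguments: the entire match
pair = m.group(1, 2)      # several arguments: a tuple
mixed = m.group('int', 2) # names and numbers can be combined

try:
    m.group(3)            # the pattern defines only two groups
    missing_raises = False
except IndexError:
    missing_raises = True

print(whole)           # 3.14
print(pair)            # ('3', '14')
print(mixed)           # ('3', '14')
print(missing_raises)  # True
```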
				780
				781
				782	.. method:: MatchObject.groups([default])
				783
				784	Return a tuple containing all the subgroups of the match, from 1 up to however
				785	many groups are in the pattern. The default argument is used for groups that
				786	did not participate in the match; it defaults to ``None``. (Incompatibility
				787	note: in the original Python 1.5 release, if the tuple was one element long, a
				788	string would be returned instead. In later versions (from 1.5.1 on), a
				789	singleton tuple is returned in such cases.)
				790
				791
				792	.. method:: MatchObject.groupdict([default])
				793
				794	Return a dictionary containing all the named subgroups of the match, keyed by
				795	the subgroup name. The default argument is used for groups that did not
				796	participate in the match; it defaults to ``None``.
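
The effect of the default argument on both :meth:`groups` and
:meth:`groupdict` can be seen when an optional group does not participate in
the match. The pattern and group names here are illustrative:

```python
import re

# The fractional part is optional, so group 'frac' may not participate.
m = re.match(r'(?P<int>\d+)\.?(?P<frac>\d+)?', '24')

print(m.groups())        # ('24', None)
print(m.groups('0'))     # ('24', '0')
print(m.groupdict('0') == {'int': '24', 'frac': '0'})  # True
```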
				797
				798
				799	.. method:: MatchObject.start([group])
				800	MatchObject.end([group])
				801
				802	Return the indices of the start and end of the substring matched by group;
				803	group defaults to zero (meaning the whole matched substring). Return ``-1`` if
				804	group exists but did not contribute to the match. For a match object m, and
				805	a group g that did contribute to the match, the substring matched by group g
				806	(equivalent to ``m.group(g)``) is ::
				807
				808	m.string[m.start(g):m.end(g)]
				809
				810	Note that ``m.start(group)`` will equal ``m.end(group)`` if group matched a
				811	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
				812	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
				813	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
				814
				815
				816	.. method:: MatchObject.span([group])
				817
				818	For :class:`MatchObject` m, return the 2-tuple ``(m.start(group),
				819	m.end(group))``. Note that if group did not contribute to the match, this is
				820	``(-1, -1)``. Again, group defaults to zero.
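
One use of these indices is cutting the matched text out of the original
string. The address below is a made-up example:

```python
import re

email = 'tony@tiremove_thisger.net'
m = re.search('remove_this', email)

# Keep everything before and after the matched substring.
cleaned = email[:m.start()] + email[m.end():]

# A group that did not contribute to the match reports (-1, -1).
unused = re.match(r'(a)|(b)', 'b').span(1)

print(m.span())  # (7, 18)
print(cleaned)   # tony@tiger.net
print(unused)    # (-1, -1)
```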
				821
				822
				823	.. attribute:: MatchObject.pos
				824
				825	The value of pos which was passed to the :func:`search` or :func:`match`
				826	method of the :class:`RegexObject`. This is the index into the string at which
				827	the RE engine started looking for a match.
				828
				829
				830	.. attribute:: MatchObject.endpos
				831
				832	The value of endpos which was passed to the :func:`search` or :func:`match`
				833	method of the :class:`RegexObject`. This is the index into the string beyond
				834	which the RE engine will not go.
				835
				836
				837	.. attribute:: MatchObject.lastindex
				838
				839	The integer index of the last matched capturing group, or ``None`` if no group
				840	was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
				841	``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
				842	the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
				843	string.
				844
				845
				846	.. attribute:: MatchObject.lastgroup
				847
				848	The name of the last matched capturing group, or ``None`` if the group didn't
				849	have a name, or if no group was matched at all.
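
The two attributes can be used together to find out which alternative of a
pattern matched. The group names ``word`` and ``number`` are assumptions made
for this sketch:

```python
import re

m = re.match(r'(?P<word>[a-z]+)|(?P<number>\d+)', '42')

print(m.lastindex)  # 2
print(m.lastgroup)  # number

# Nested groups: the outermost group closes last, so lastindex is 1,
# and because that group has no name, lastgroup is None.
nested = re.match(r'((a)(b))', 'ab')
print(nested.lastindex)  # 1
print(nested.lastgroup)  # None
```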
				850
				851
				852	.. attribute:: MatchObject.re
				853
				854	The regular expression object whose :meth:`match` or :meth:`search` method
				855	produced this :class:`MatchObject` instance.
				856
				857
				858	.. attribute:: MatchObject.string
				859
				860	The string passed to :func:`match` or :func:`search`.
				861
				862
				863	Examples
				864	--------
				865
				866	Simulating scanf()
				867	^^^^^^^^^^^^^^^^^^
				868	.. index:: single: scanf()
				869
				870	Python does not currently have an equivalent to :cfunc:`scanf`. Regular
				871	expressions are generally more powerful, though also more verbose, than
				872	:cfunc:`scanf` format strings. The table below offers some more-or-less
				873	equivalent mappings between :cfunc:`scanf` format tokens and regular
				874	expressions.
				875
				876	+--------------------------------+---------------------------------------------+
				877	| :cfunc:`scanf` Token           | Regular Expression                          |
				878	+================================+=============================================+
				879	| ``%c``                         | ``.``                                       |
				880	+--------------------------------+---------------------------------------------+
				881	| ``%5c``                        | ``.{5}``                                    |
				882	+--------------------------------+---------------------------------------------+
				883	| ``%d``                         | ``[-+]?\d+``                                |
				884	+--------------------------------+---------------------------------------------+
				885	| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
				886	+--------------------------------+---------------------------------------------+
				887	| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
				888	+--------------------------------+---------------------------------------------+
				889	| ``%o``                         | ``0[0-7]*``                                 |
				890	+--------------------------------+---------------------------------------------+
				891	| ``%s``                         | ``\S+``                                     |
				892	+--------------------------------+---------------------------------------------+
				893	| ``%u``                         | ``\d+``                                     |
				894	+--------------------------------+---------------------------------------------+
				895	| ``%x``, ``%X``                 | ``0[xX][\dA-Fa-f]+``                        |
				896	+--------------------------------+---------------------------------------------+
				897
				898	To extract the filename and numbers from a string like ::
				899
				900	/usr/sbin/sendmail - 0 errors, 4 warnings
				901
				902	you would use a :cfunc:`scanf` format like ::
				903
				904	%s - %d errors, %d warnings
				905
				906	The equivalent regular expression would be ::
				907
				908	(\S+) - (\d+) errors, (\d+) warnings
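
Applied from Python, that expression might be used as in the sketch below. The
variable names are chosen for illustration, and the numeric fields are
converted with ``int()`` because groups are always returned as strings:

```python
import re

line = '/usr/sbin/sendmail - 0 errors, 4 warnings'
m = re.match(r'(\S+) - (\d+) errors, (\d+) warnings', line)

filename = m.group(1)
n_errors = int(m.group(2))
n_warnings = int(m.group(3))

print(filename)    # /usr/sbin/sendmail
print(n_errors)    # 0
print(n_warnings)  # 4
```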
				909
				910	Avoiding recursion
				911	^^^^^^^^^^^^^^^^^^
				912	If you create regular expressions that require the engine to perform a lot of
				913	recursion, you may encounter a :exc:`RuntimeError` exception with the message
				914	``maximum recursion limit`` exceeded. For example, ::
				915
				916	>>> import re
				917	>>> s = 'Begin ' + 1000*'a very long string ' + 'end'
				918	>>> re.match('Begin (\w| )*? end', s).end()
				919	Traceback (most recent call last):
				920	  File "<stdin>", line 1, in ?
				921	  File "/usr/local/lib/python2.5/re.py", line 132, in match
				922	    return _compile(pattern, flags).match(string)
				923	RuntimeError: maximum recursion limit exceeded
				924
				925	You can often restructure your regular expression to avoid recursion.
				926
				927	Starting with Python 2.3, simple uses of the ``*?`` pattern are special-cased to
				928	avoid recursion. Thus, the above regular expression can avoid recursion by
				929	being recast as ``Begin [a-zA-Z0-9_ ]*?end``. As a further benefit, such
				930	regular expressions will run faster than their recursive equivalents.
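
A sketch of the recast expression applied to the same long string. It uses a
character class in place of the ``(\w| )`` alternation, so the repetition is a
simple one:

```python
import re

s = 'Begin ' + 1000*'a very long string ' + 'end'

# The character class replaces the group-with-alternation form.
m = re.match(r'Begin [a-zA-Z0-9_ ]*?end', s)

# The match runs to the end of the string.
print(m.end() == len(s))  # True
```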
				931