Blame - Doc/library/re.rst - platform/external/python/cpython2

Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	1
				2	:mod:`re` --- Regular expression operations
				3	===========================================
				4
				5	.. module:: re
				6	:synopsis: Regular expression operations.
				7	.. moduleauthor:: Fredrik Lundh <fredrik@pythonware.com>
				8	.. sectionauthor:: Andrew M. Kuchling <amk@amk.ca>
				9
				10
				11
				12
				13	This module provides regular expression matching operations similar to
				14	those found in Perl. Both patterns and strings to be searched can be
				15	Unicode strings as well as 8-bit strings. The :mod:`re` module is
				16	always available.
				17
				18	Regular expressions use the backslash character (``'\'``) to indicate
				19	special forms or to allow special characters to be used without invoking
				20	their special meaning. This collides with Python's usage of the same
				21	character for the same purpose in string literals; for example, to match
				22	a literal backslash, one might have to write ``'\\\\'`` as the pattern
				23	string, because the regular expression must be ``\\``, and each
				24	backslash must be expressed as ``\\`` inside a regular Python string
				25	literal.
				26
				27	The solution is to use Python's raw string notation for regular expression
				28	patterns; backslashes are not handled in any special way in a string literal
				29	prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
				30	``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
				31	newline. Usually patterns will be expressed in Python code using this raw string
				32	notation.
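The difference between the two notations can be checked directly. The following is a minimal sketch; the sample text ``path`` and the variable names are illustrative and not part of the module.

```python
import re

# A raw string keeps the backslash; a regular string turns "\n" into a newline.
raw_len = len(r"\n")       # 2: a backslash and the letter 'n'
cooked_len = len("\n")     # 1: a single newline character

# Both spellings below denote the same two-character pattern, which matches
# one literal backslash in the searched text.
path = "C:\\temp"          # the text itself contains exactly one backslash
m_raw = re.search(r"\\", path)
m_cooked = re.search("\\\\", path)

print(raw_len, cooked_len)
print(m_raw.start(), m_cooked.start())
```

The raw form needs two backslashes where the regular literal needs four, which is why raw strings are the usual choice for patterns.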
				33
				34	.. seealso::
				35
				36	Mastering Regular Expressions
				37	Book on regular expressions by Jeffrey Friedl, published by O'Reilly. The
				38	second edition of the book no longer covers Python at all, but the first
				39	edition covered writing good regular expression patterns in great detail.
				40
				41
				42	.. _re-syntax:
				43
				44	Regular Expression Syntax
				45	-------------------------
				46
				47	A regular expression (or RE) specifies a set of strings that matches it; the
				48	functions in this module let you check if a particular string matches a given
				49	regular expression (or if a given regular expression matches a particular
				50	string, which comes down to the same thing).
				51
Regular expressions can be concatenated to form new regular expressions; if *A*
and *B* are both regular expressions, then *AB* is also a regular expression.
In general, if a string *p* matches *A* and another string *q* matches *B*, the
string *pq* will match AB. This holds unless *A* or *B* contain low-precedence
operations, boundary conditions between *A* and *B*, or numbered group
references. Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here. For details of the theory
and implementation of regular expressions, consult the Friedl book referenced
above, or almost any textbook about compiler construction.
				61
				62	A brief explanation of the format of regular expressions follows. For further
				63	information and a gentler presentation, consult the Regular Expression HOWTO,
				64	accessible from http://www.python.org/doc/howto/.
				65
Regular expressions can contain both special and ordinary characters. Most
ordinary characters, like ``'A'``, ``'a'``, or ``'0'``, are the simplest regular
expressions; they simply match themselves. You can concatenate ordinary
characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write REs in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted. Regular
expression pattern strings may not contain null bytes, but can specify
the null byte using the ``\number`` notation, e.g., ``'\x00'``.
				78
				79
				80	The special characters are:
				81
				82	.. %
				83
				84	``'.'``
				85	(Dot.) In the default mode, this matches any character except a newline. If
				86	the :const:`DOTALL` flag has been specified, this matches any character
				87	including a newline.
				88
				89	``'^'``
				90	(Caret.) Matches the start of the string, and in :const:`MULTILINE` mode also
				91	matches immediately after each newline.
				92
				93	``'$'``
				94	Matches the end of the string or just before the newline at the end of the
				95	string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
				96	matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
				97	only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
				98	matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode.
				99
				100	``'*'``
				101	Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
				102	many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
				103	by any number of 'b's.
				104
				105	``'+'``
				106	Causes the resulting RE to match 1 or more repetitions of the preceding RE.
				107	``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
				108	match just 'a'.
				109
				110	``'?'``
				111	Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
				112	``ab?`` will match either 'a' or 'ab'.
				113
				114	``*?``, ``+?``, ``??``
				115	The ``'*'``, ``'+'``, and ``'?'`` qualifiers are all :dfn:`greedy`; they match
				116	as much text as possible. Sometimes this behaviour isn't desired; if the RE
				117	``<.*>`` is matched against ``'<H1>title</H1>'``, it will match the entire
				118	string, and not just ``'<H1>'``. Adding ``'?'`` after the qualifier makes it
				119	perform the match in :dfn:`non-greedy` or :dfn:`minimal` fashion; as few
				120	characters as possible will be matched. Using ``.*?`` in the previous
				121	expression will match only ``'<H1>'``.
				122
				123	``{m}``
				124	Specifies that exactly m copies of the previous RE should be matched; fewer
				125	matches cause the entire RE not to match. For example, ``a{6}`` will match
				126	exactly six ``'a'`` characters, but not five.
				127
				128	``{m,n}``
				129	Causes the resulting RE to match from m to n repetitions of the preceding
				130	RE, attempting to match as many repetitions as possible. For example,
				131	``a{3,5}`` will match from 3 to 5 ``'a'`` characters. Omitting m specifies a
				132	lower bound of zero, and omitting n specifies an infinite upper bound. As an
				133	example, ``a{4,}b`` will match ``aaaab`` or a thousand ``'a'`` characters
				134	followed by a ``b``, but not ``aaab``. The comma may not be omitted or the
				135	modifier would be confused with the previously described form.
				136
				137	``{m,n}?``
				138	Causes the resulting RE to match from m to n repetitions of the preceding
				139	RE, attempting to match as few repetitions as possible. This is the
				140	non-greedy version of the previous qualifier. For example, on the
				141	6-character string ``'aaaaaa'``, ``a{3,5}`` will match 5 ``'a'`` characters,
				142	while ``a{3,5}?`` will only match 3 characters.
				143
				144	``'\'``
				145	Either escapes special characters (permitting you to match characters like
				146	``'*'``, ``'?'``, and so forth), or signals a special sequence; special
				147	sequences are discussed below.
				148
				149	If you're not using a raw string to express the pattern, remember that Python
				150	also uses the backslash as an escape sequence in string literals; if the escape
				151	sequence isn't recognized by Python's parser, the backslash and subsequent
				152	character are included in the resulting string. However, if Python would
				153	recognize the resulting sequence, the backslash should be repeated twice. This
				154	is complicated and hard to understand, so it's highly recommended that you use
				155	raw strings for all but the simplest expressions.
				156
				157	``[]``
Used to indicate a set of characters. Characters can be listed individually, or
a range of characters can be indicated by giving two characters and separating
them by a ``'-'``. Special characters are not active inside sets. For example,
``[akm$]`` will match any of the characters ``'a'``, ``'k'``,
``'m'``, or ``'$'``; ``[a-z]`` will match any lowercase letter, and
``[a-zA-Z0-9]`` matches any letter or digit. Character classes such
as ``\w`` or ``\S`` (defined below) are also acceptable inside a
set, although the characters they match depend on whether :const:`LOCALE`
or :const:`UNICODE` mode is in force. If you want to include a
``']'`` or a ``'-'`` inside a set, precede it with a backslash, or
place it as the first character. The pattern ``[]]`` will match
``']'``, for example.
				170
				171	You can match the characters not within a range by :dfn:`complementing` the set.
				172	This is indicated by including a ``'^'`` as the first character of the set;
				173	``'^'`` elsewhere will simply match the ``'^'`` character. For example,
				174	``[^5]`` will match any character except ``'5'``, and ``[^^]`` will match any
				175	character except ``'^'``.
				176
``'|'``
``A|B``, where A and B can be arbitrary REs, creates a regular expression that
will match either A or B. An arbitrary number of REs can be separated by the
``'|'`` in this way. This can be used inside groups (see below) as well. As
the target string is scanned, REs separated by ``'|'`` are tried from left to
right. When one pattern completely matches, that branch is accepted. This means
that once ``A`` matches, ``B`` will not be tested further, even if it would
produce a longer overall match. In other words, the ``'|'`` operator is never
greedy. To match a literal ``'|'``, use ``\|``, or enclose it inside a
character class, as in ``[|]``.
				187
				188	``(...)``
Matches whatever regular expression is inside the parentheses, and indicates the
start and end of a group; the contents of a group can be retrieved after a match
has been performed, and can be matched later in the string with the ``\number``
special sequence, described below. To match the literals ``'('`` or ``')'``,
use ``\(`` or ``\)``, or enclose them inside a character class: ``[(] [)]``.
				194
				195	``(?...)``
				196	This is an extension notation (a ``'?'`` following a ``'('`` is not meaningful
				197	otherwise). The first character after the ``'?'`` determines what the meaning
				198	and further syntax of the construct is. Extensions usually do not create a new
				199	group; ``(?P<name>...)`` is the only exception to this rule. Following are the
				200	currently supported extensions.
				201
				202	``(?iLmsux)``
				203	(One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
				204	``'u'``, ``'x'``.) The group matches the empty string; the letters
				205	set the corresponding flags: :const:`re.I` (ignore case),
				206	:const:`re.L` (locale dependent), :const:`re.M` (multi-line),
				207	:const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
				208	and :const:`re.X` (verbose), for the entire regular expression. (The
				209	flags are described in :ref:`contents-of-module-re`.) This
				210	is useful if you wish to include the flags as part of the regular
				211	expression, instead of passing a flag argument to the
				212	:func:`compile` function.
				213
				214	Note that the ``(?x)`` flag changes how the expression is parsed. It should be
				215	used first in the expression string, or after one or more whitespace characters.
				216	If there are non-whitespace characters before the flag, the results are
				217	undefined.
				218
				219	``(?:...)``
				220	A non-grouping version of regular parentheses. Matches whatever regular
				221	expression is inside the parentheses, but the substring matched by the group
				222	cannot be retrieved after performing a match or referenced later in the
				223	pattern.
				224
				225	``(?P<name>...)``
Similar to regular parentheses, but the substring matched by the group is
accessible via the symbolic group name *name*. Group names must be valid Python
identifiers, and each group name must be defined only once within a regular
expression. A symbolic group is also a numbered group, just as if the group
were not named. So the group named 'id' in the example below can also be
referenced as the numbered group 1.
				232
				233	For example, if the pattern is ``(?P<id>[a-zA-Z_]\w*)``, the group can be
				234	referenced by its name in arguments to methods of match objects, such as
				235	``m.group('id')`` or ``m.end('id')``, and also by name in pattern text (for
				236	example, ``(?P=id)``) and replacement text (such as ``\g<id>``).
				237
				238	``(?P=name)``
Matches whatever text was matched by the earlier group named *name*.
				240
				241	``(?#...)``
				242	A comment; the contents of the parentheses are simply ignored.
				243
				244	``(?=...)``
				245	Matches if ``...`` matches next, but doesn't consume any of the string. This is
				246	called a lookahead assertion. For example, ``Isaac (?=Asimov)`` will match
				247	``'Isaac '`` only if it's followed by ``'Asimov'``.
				248
				249	``(?!...)``
				250	Matches if ``...`` doesn't match next. This is a negative lookahead assertion.
				251	For example, ``Isaac (?!Asimov)`` will match ``'Isaac '`` only if it's not
				252	followed by ``'Asimov'``.
				253
				254	``(?<=...)``
Matches if the current position in the string is preceded by a match for ``...``
that ends at the current position. This is called a :dfn:`positive lookbehind
assertion`. ``(?<=abc)def`` will find a match in ``abcdef``, since the
lookbehind will back up 3 characters and check if the contained pattern matches.
The contained pattern must only match strings of some fixed length, meaning that
``abc`` or ``a|b`` are allowed, but ``a*`` and ``a{3,4}`` are not. Note that
patterns which start with positive lookbehind assertions will never match at the
beginning of the string being searched; you will most likely want to use the
:func:`search` function rather than the :func:`match` function::

   >>> import re
   >>> m = re.search('(?<=abc)def', 'abcdef')
   >>> m.group(0)
   'def'

This example looks for a word following a hyphen::

   >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
   >>> m.group(0)
   'egg'
				275
				276	``(?<!...)``
				277	Matches if the current position in the string is not preceded by a match for
				278	``...``. This is called a :dfn:`negative lookbehind assertion`. Similar to
				279	positive lookbehind assertions, the contained pattern must only match strings of
				280	some fixed length. Patterns which start with negative lookbehind assertions may
				281	match at the beginning of the string being searched.
				282
``(?(id/name)yes-pattern|no-pattern)``
				284	Will try to match with ``yes-pattern`` if the group with given id or name
				285	exists, and with ``no-pattern`` if it doesn't. ``no-pattern`` is optional and
				286	can be omitted. For example, ``(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)`` is a poor email
				287	matching pattern, which will match with ``'<user@host.com>'`` as well as
				288	``'user@host.com'``, but not with ``'<user@host.com'``.
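Several of the constructs described above can be tried out in a few lines. This is only a sketch: the sample strings and variable names are illustrative, and :func:`match` is used for the conditional pattern so that the comparison starts at the beginning of each string.

```python
import re

html = '<H1>title</H1>'

# Greedy vs. non-greedy repetition.
greedy = re.match(r'<.*>', html).group(0)          # the whole string
minimal = re.match(r'<.*?>', html).group(0)        # just the opening tag

# {m,n} takes as many repetitions as it can, {m,n}? as few as it can.
most = re.match(r'a{3,5}', 'aaaaaa').group(0)
least = re.match(r'a{3,5}?', 'aaaaaa').group(0)

# Alternation is tried left to right; the first branch that matches wins,
# even though the second branch would match more text.
first_branch = re.match(r'ab|abc', 'abc').group(0)

# A named group is also a numbered group.
m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam_1 = 2')
by_name, by_number = m.group('id'), m.group(1)

# Lookahead assertions test what follows without consuming it.
followed = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov')
not_followed = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')

# Conditional: the closing '>' is required only if group 1 matched '<'.
email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>)')
bracketed = email.match('<user@host.com>')
bare = email.match('user@host.com')
unbalanced = email.match('<user@host.com')

print(greedy, minimal, most, least, first_branch)
print(by_name, by_number, followed.group(0))
print(bracketed.group(0), bare.group(0), unbalanced)
```

Note that searching with :func:`search` instead of :func:`match` would still find ``'user@host.com'`` inside the unbalanced string, because the search may start after the ``'<'``.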
				289
				291	The special sequences consist of ``'\'`` and a character from the list below.
				292	If the ordinary character is not on the list, then the resulting RE will match
				293	the second character. For example, ``\$`` matches the character ``'$'``.
				294
				295	.. %
				296
				297	``\number``
				298	Matches the contents of the group of the same number. Groups are numbered
				299	starting from 1. For example, ``(.+) \1`` matches ``'the the'`` or ``'55 55'``,
				300	but not ``'the end'`` (note the space after the group). This special sequence
				301	can only be used to match one of the first 99 groups. If the first digit of
				302	number is 0, or number is 3 octal digits long, it will not be interpreted as
				303	a group match, but as the character with octal value number. Inside the
				304	``'['`` and ``']'`` of a character class, all numeric escapes are treated as
				305	characters.
				306
				307	``\A``
				308	Matches only at the start of the string.
				309
				310	``\b``
				311	Matches the empty string, but only at the beginning or end of a word. A word is
				312	defined as a sequence of alphanumeric or underscore characters, so the end of a
				313	word is indicated by whitespace or a non-alphanumeric, non-underscore character.
Note that ``\b`` is defined as the boundary between ``\w`` and ``\W``, so the
				315	precise set of characters deemed to be alphanumeric depends on the values of the
				316	``UNICODE`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
				317	the backspace character, for compatibility with Python's string literals.
				318
				319	``\B``
				320	Matches the empty string, but only when it is not at the beginning or end of a
				321	word. This is just the opposite of ``\b``, so is also subject to the settings
				322	of ``LOCALE`` and ``UNICODE``.
				323
				324	``\d``
				325	When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
				326	is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
				327	whatever is classified as a digit in the Unicode character properties database.
				328
				329	``\D``
When the :const:`UNICODE` flag is not specified, matches any non-digit
character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
will match any character not marked as a digit in the Unicode character
properties database.
				334
				335	``\s``
				336	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				337	any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v]``. With
				338	:const:`LOCALE`, it will match this set plus whatever characters are defined as
				339	space for the current locale. If :const:`UNICODE` is set, this will match the
				340	characters ``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
				341	character properties database.
				342
				343	``\S``
When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v]``.
With :const:`LOCALE`, it will match any character not in this set, and not
defined as space in the current locale. If :const:`UNICODE` is set, this will
match anything other than ``[ \t\n\r\f\v]`` and characters marked as space in
the Unicode character properties database.
				350
				351	``\w``
				352	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				353	any alphanumeric character and the underscore; this is equivalent to the set
				354	``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
				355	whatever characters are defined as alphanumeric for the current locale. If
				356	:const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
				357	is classified as alphanumeric in the Unicode character properties database.
				358
				359	``\W``
				360	When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
				361	any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
				362	With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
				363	not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
				364	this will match anything other than ``[0-9_]`` and characters marked as
				365	alphanumeric in the Unicode character properties database.
				366
				367	``\Z``
				368	Matches only at the end of the string.
				369
				370	Most of the standard escapes supported by Python string literals are also
				371	accepted by the regular expression parser::
				372
				373	\a \b \f \n
				374	\r \t \v \x
				375	\\
				376
				377	Octal escapes are included in a limited form: If the first digit is a 0, or if
				378	there are three octal digits, it is considered an octal escape. Otherwise, it is
				379	a group reference. As for string literals, octal escapes are always at most
				380	three digits in length.
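A few of the special sequences can be exercised as follows. This is a hedged sketch with made-up sample strings; it sticks to ASCII text so that the result does not depend on the :const:`LOCALE` or :const:`UNICODE` settings.

```python
import re

# \1 must repeat exactly the text that group 1 matched.
repeated = re.match(r'(.+) \1', 'the the')
different = re.match(r'(.+) \1', 'the end')       # no match at the start

# \b matches only at a word edge; '_' counts as a word character.
words = re.findall(r'\bclass\b', 'class subclass class_ class')

# '^' follows MULTILINE, while \A always means the start of the string.
line_starts = re.findall(r'^\w+', 'one\ntwo', re.M)
string_start = re.findall(r'\A\w+', 'one\ntwo', re.M)

# '$' also matches just before a trailing newline; \Z does not.
dollar = re.search(r'two$', 'two\n')
end_only = re.search(r'two\Z', 'two\n')

# Three octal digits name a character rather than a group: \101 is 'A'.
octal = re.match(r'\101', 'A')

print(repeated.group(1), different, words)
print(line_starts, string_start, end_only)
```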
				381
				382	.. % Note the lack of a period in the section title; it causes problems
				383	.. % with readers of the GNU info version. See http://www.python.org/sf/581414.
				384
				385
				386	.. _matching-searching:
				387
				388	Matching vs Searching
				389	---------------------
				390
				391	.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>
				392
				393
Python offers two different primitive operations based on regular expressions:
match checks for a match only at the beginning of the string, while
search checks for a match anywhere in the string (this is what Perl does
by default).

Note that match may differ from search even when using a regular expression
beginning with ``'^'``: ``'^'`` matches only at the start of the string, or in
:const:`MULTILINE` mode also immediately following a newline. The "match"
operation succeeds only if the pattern matches at the start of the string
regardless of mode, or at the starting position given by the optional pos
argument regardless of whether a newline precedes it.
				405
				406	.. % Examples from Tim Peters:
				407
				408	::
				409
   import re

   re.compile("a").match("ba", 1)           # succeeds
   re.compile("^a").search("ba", 1)         # fails; 'a' not at start
   re.compile("^a").search("\na", 1)        # fails; 'a' not at start
   re.compile("^a", re.M).search("\na", 1)  # succeeds
   re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n
				415
				416
				417	.. _contents-of-module-re:
				418
				419	Module Contents
				420	---------------
				421
				422	The module defines several functions, constants, and an exception. Some of the
				423	functions are simplified versions of the full featured methods for compiled
				424	regular expressions. Most non-trivial applications always use the compiled
				425	form.
				426
				427
				428	.. function:: compile(pattern[, flags])
				429
				430	Compile a regular expression pattern into a regular expression object, which can
				431	be used for matching using its :func:`match` and :func:`search` methods,
				432	described below.
				433
The expression's behaviour can be modified by specifying a flags value.
Values can be any of the following variables, combined using bitwise OR (the
``|`` operator).
				437
				438	The sequence ::
				439
				440	prog = re.compile(pat)
				441	result = prog.match(str)
				442
				443	is equivalent to ::
				444
				445	result = re.match(pat, str)
				446
				447	but the version using :func:`compile` is more efficient when the expression will
				448	be used several times in a single program.
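As an illustration of reusing a compiled object and of combining flags, consider the sketch below. The sample text ``lines`` and the pattern are assumptions chosen for the example.

```python
import re

# Compile once, combining two flags with bitwise OR.
prog = re.compile(r'^spam', re.IGNORECASE | re.MULTILINE)

lines = 'eggs\nSpam and ham\nSPAM again'

# The compiled object can be reused for any number of operations.
hits = prog.findall(lines)
first = prog.search(lines)

# The module-level function gives the same result for a one-off call.
same = re.findall(r'^spam', lines, re.IGNORECASE | re.MULTILINE)

print(hits, first.start(), same)
```

Keeping ``prog`` around avoids passing the pattern and flags again at every call site, which also keeps the two from drifting apart.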
				449
				450	.. % (The compiled version of the last pattern passed to
				451	.. % \function{re.match()} or \function{re.search()} is cached, so
				452	.. % programs that use only a single regular expression at a time needn't
				453	.. % worry about compiling regular expressions.)
				454
				455
				456	.. data:: I
				457	IGNORECASE
				458
				459	Perform case-insensitive matching; expressions like ``[A-Z]`` will match
				460	lowercase letters, too. This is not affected by the current locale.
				461
				462
				463	.. data:: L
				464	LOCALE
				465
				466	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the current
				467	locale.
				468
				469
				470	.. data:: M
				471	MULTILINE
				472
				473	When specified, the pattern character ``'^'`` matches at the beginning of the
				474	string and at the beginning of each line (immediately following each newline);
				475	and the pattern character ``'$'`` matches at the end of the string and at the
				476	end of each line (immediately preceding each newline). By default, ``'^'``
				477	matches only at the beginning of the string, and ``'$'`` only at the end of the
				478	string and immediately before the newline (if any) at the end of the string.
				479
				480
				481	.. data:: S
				482	DOTALL
				483
				484	Make the ``'.'`` special character match any character at all, including a
				485	newline; without this flag, ``'.'`` will match anything except a newline.
				486
				487
				488	.. data:: U
				489	UNICODE
				490
				491	Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
				492	on the Unicode character properties database.
				493
				495	.. data:: X
				496	VERBOSE
				497
This flag allows you to write regular expressions that look nicer. Whitespace
within the pattern is ignored, except when in a character class or preceded by
an unescaped backslash, and, when a line contains a ``'#'`` neither in a
character class nor preceded by an unescaped backslash, all characters from the
leftmost such ``'#'`` through the end of the line are ignored.

This means that the following two regular expression objects match the same
strings::

   re.compile(r"""[a-z]+   # some letters
                  \.\.     # two dots
                  [a-z]*   # perhaps more letters""", re.X)
   re.compile(r"[a-z]+\.\.[a-z]*")
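The effect of the :const:`IGNORECASE`, :const:`MULTILINE` and :const:`DOTALL` flags can be compared side by side. The following is a small sketch; the sample ``text`` is illustrative only.

```python
import re

text = 'first line\nsecond line\n'

# MULTILINE: '^' and '$' also match at line boundaries.
starts_default = re.findall(r'^\w+', text)
starts_multiline = re.findall(r'^\w+', text, re.M)
ends_default = re.findall(r'\w+$', text)
ends_multiline = re.findall(r'\w+$', text, re.M)

# DOTALL: '.' also matches a newline.
dot_default = re.match(r'.*', text).group(0)
dot_all = re.match(r'.*', text, re.S).group(0)

# IGNORECASE: [A-Z] matches lowercase letters, too.
letters = re.findall(r'[A-Z]+', 'abc DEF', re.I)

print(starts_default, starts_multiline)
print(ends_default, ends_multiline)
print(repr(dot_default), repr(dot_all), letters)
```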
				511
				512	.. function:: search(pattern, string[, flags])
				513
				514	Scan through string looking for a location where the regular expression
				515	pattern produces a match, and return a corresponding :class:`MatchObject`
				516	instance. Return ``None`` if no position in the string matches the pattern; note
				517	that this is different from finding a zero-length match at some point in the
				518	string.
				519
				520
				521	.. function:: match(pattern, string[, flags])
				522
				523	If zero or more characters at the beginning of string match the regular
				524	expression pattern, return a corresponding :class:`MatchObject` instance.
				525	Return ``None`` if the string does not match the pattern; note that this is
				526	different from a zero-length match.
				527
				528	.. note::
				529
				530	If you want to locate a match anywhere in string, use :meth:`search` instead.
				531
				532
				533	.. function:: split(pattern, string[, maxsplit=0])
				534
				535	Split string by the occurrences of pattern. If capturing parentheses are
				536	used in pattern, then the text of all groups in the pattern are also returned
				537	as part of the resulting list. If maxsplit is nonzero, at most maxsplit
				538	splits occur, and the remainder of the string is returned as the final element
				539	of the list. (Incompatibility note: in the original Python 1.5 release,
				540	maxsplit was ignored. This has been fixed in later releases.) ::
				541
				542	>>> re.split('\W+', 'Words, words, words.')
				543	['Words', 'words', 'words', '']
				544	>>> re.split('(\W+)', 'Words, words, words.')
				545	['Words', ', ', 'words', ', ', 'words', '.', '']
				546	>>> re.split('\W+', 'Words, words, words.', 1)
				547	['Words', 'words, words.']
				548
Note that split will never split a string on an empty pattern match.
For example::

   >>> re.split('x*', 'foo')
   ['foo']
   >>> re.split("(?m)^$", "foo\n\nbar\n")
   ['foo\n\nbar\n']

.. function:: findall(pattern, string[, flags])
				558
				559	Return a list of all non-overlapping matches of pattern in string. If one
				560	or more groups are present in the pattern, return a list of groups; this will be
				561	a list of tuples if the pattern has more than one group. Empty matches are
				562	included in the result unless they touch the beginning of another match.
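How the shape of the result depends on the number of groups is easiest to see with one input and three patterns. A minimal sketch, with an illustrative sample string:

```python
import re

text = 'width=80, height=24'

# No groups: a list of the full matches.
no_groups = re.findall(r'\w+=\d+', text)

# One group: a list of that group's text.
one_group = re.findall(r'\w+=(\d+)', text)

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(\w+)=(\d+)', text)

print(no_groups)
print(one_group)
print(two_groups)
```

To get the full match back while still grouping, a non-grouping ``(?:...)`` can be used in place of regular parentheses.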
				563
				565	.. function:: finditer(pattern, string[, flags])
				566
				567	Return an iterator over all non-overlapping matches for the RE pattern in
				568	string. For each match, the iterator returns a match object. Empty matches
				569	are included in the result unless they touch the beginning of another match.
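Because each item is a match object, positions are available as well as text. A short sketch, using an illustrative sample string:

```python
import re

text = 'spam-egg ham-toast'

# Collect the text and the span of every word that follows a hyphen.
found = [(m.group(0), m.start(), m.end())
         for m in re.finditer(r'(?<=-)\w+', text)]

print(found)
```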
				570
				572	.. function:: sub(pattern, repl, string[, count])
				573
Return the string obtained by replacing the leftmost non-overlapping occurrences
of pattern in string by the replacement repl. If the pattern isn't found,
string is returned unchanged. repl can be a string or a function; if it is
a string, any backslash escapes in it are processed. That is, ``\n`` is
converted to a single newline character, ``\r`` is converted to a carriage
return, and so forth. Unknown escapes such as ``\j`` are left alone.
Backreferences, such as ``\6``, are replaced with the substring matched by
group 6 in the pattern. For example::

   >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
   ...        r'static PyObject*\npy_\1(void)\n{',
   ...        'def myfunc():')
   'static PyObject*\npy_myfunc(void)\n{'
				587
				588	If repl is a function, it is called for every non-overlapping occurrence of
				589	pattern. The function takes a single match object argument, and returns the
				590	replacement string. For example::
				591
   >>> def dashrepl(matchobj):
   ...     if matchobj.group(0) == '-': return ' '
   ...     else: return '-'
   >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
   'pro--gram files'
				597
The pattern may be a string or an RE object; if you need to specify regular
expression flags, you must use an RE object, or use embedded modifiers in a
pattern; for example, ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``.
				601
				602	The optional argument count is the maximum number of pattern occurrences to be
				603	replaced; count must be a non-negative integer. If omitted or zero, all
				604	occurrences will be replaced. Empty matches for the pattern are replaced only
				605	when not adjacent to a previous match, so ``sub('x*', '-', 'abc')`` returns
				606	``'-a-b-c-'``.
				607
				608	In addition to character escapes and backreferences as described above,
				609	``\g<name>`` will use the substring matched by the group named ``name``, as
				610	defined by the ``(?P<name>...)`` syntax. ``\g<number>`` uses the corresponding
				611	group number; ``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous
				612	in a replacement such as ``\g<2>0``. ``\20`` would be interpreted as a
				613	reference to group 20, not a reference to group 2 followed by the literal
				614	character ``'0'``. The backreference ``\g<0>`` substitutes in the entire
				615	substring matched by the RE.
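
As a rough illustration of the group references described above, the following sketch contrasts ``\g<name>``, ``\g<2>0`` and ``\g<0>``, and also shows the count argument. The sample strings and group names are invented for this example.

```python
import re

# Swap two named groups using \g<name> references.
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)',
                 r'\g<second> \g<first>', 'hello world')

# \g<2>0 means "group 2 followed by a literal 0";
# writing \20 instead would be read as a reference to group 20.
padded = re.sub(r'(a)(b)', r'\g<2>0', 'ab')

# \g<0> substitutes the entire matched substring.
wrapped = re.sub(r'\d+', r'[\g<0>]', 'id 42')

# count limits how many occurrences are replaced.
limited = re.sub(r'a', 'x', 'aaaa', count=2)

print(swapped)   # world hello
print(padded)    # b0
print(wrapped)   # id [42]
print(limited)   # xxaa
```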
				616
				617
				618	.. function:: subn(pattern, repl, string[, count])
				619
				620	Perform the same operation as :func:`sub`, but return a tuple ``(new_string,
				621	number_of_subs_made)``.
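
A minimal sketch of the returned tuple, using an invented input string:

```python
import re

# Collapse runs of whitespace and report how many runs were replaced.
new_string, number_of_subs_made = re.subn(r'\s+', ' ', 'a  b   c')

print(new_string)            # a b c
print(number_of_subs_made)   # 2
```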
				622
				623
				624	.. function:: escape(string)
				625
				626	Return string with all non-alphanumerics backslashed; this is useful if you
				627	want to match an arbitrary literal string that may have regular expression
				628	metacharacters in it.
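
The following sketch shows why escaping matters when the text to find contains metacharacters. The sample text is an assumption made for illustration. The exact set of characters that gets backslashed may differ between Python versions, so the example relies only on the escaped pattern matching its own input.

```python
import re

text = 'price: $5.00 (approx.)'

# Unescaped, '$' is an end-of-string anchor, so this finds nothing.
unescaped = re.search('$5.00', text)

# Escaped, every metacharacter is matched literally.
escaped = re.search(re.escape('$5.00'), text)

print(unescaped)          # None
print(escaped.group(0))   # $5.00
```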
				629
				630
				631	.. exception:: error
				632
				633	Exception raised when a string passed to one of the functions here is not a
				634	valid regular expression (for example, it might contain unmatched parentheses)
				635	or when some other error occurs during compilation or matching. It is never an
				636	error if a string contains no match for a pattern.
				637
				638
				639	.. _re-objects:
				640
				641	Regular Expression Objects
				642	--------------------------
				643
				644	Compiled regular expression objects support the following methods and
				645	attributes:
				646
				647
				648	.. method:: RegexObject.match(string[, pos[, endpos]])
				649
				650	If zero or more characters at the beginning of string match this regular
				651	expression, return a corresponding :class:`MatchObject` instance. Return
				652	``None`` if the string does not match the pattern; note that this is different
				653	from a zero-length match.
				654
				655	.. note::
				656
				657	If you want to locate a match anywhere in string, use :meth:`search` instead.
				658
				659	The optional second parameter pos gives an index in the string where the
				660	search is to start; it defaults to ``0``. This is not completely equivalent to
				661	slicing the string; the ``'^'`` pattern character matches at the real beginning
				662	of the string and at positions just after a newline, but not necessarily at the
				663	index where the search is to start.
				664
				665	The optional parameter endpos limits how far the string will be searched; it
				666	will be as if the string is endpos characters long, so only the characters
				667	from pos to ``endpos - 1`` will be searched for a match. If endpos is less
				668	than pos, no match will be found; otherwise, if rx is a compiled regular
				669	expression object, ``rx.match(string, 0, 50)`` is equivalent to
				670	``rx.match(string[:50], 0)``.
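
The difference between pos and slicing, and the effect of endpos, can be sketched as follows. The patterns and strings are invented for this example.

```python
import re

rx = re.compile(r'^a')
text = 'ba'

# '^' only matches at the real beginning of the string (no MULTILINE flag),
# so starting the search at index 1 does not help.
with_pos = rx.match(text, 1)

# Slicing creates a new string whose beginning is the 'a'.
with_slice = rx.match(text[1:])

word = re.compile(r'\w+')

# endpos=3 behaves as if the string were only 3 characters long.
limited = word.match('abcdef', 0, 3)
sliced = word.match('abcdef'[:3], 0)

# endpos less than pos: no match is found.
empty_range = word.match('abcdef', 4, 2)

print(with_pos)                          # None
print(with_slice.group())                # a
print(limited.group(), sliced.group())   # abc abc
print(empty_range)                       # None
```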
				671
				672
				673	.. method:: RegexObject.search(string[, pos[, endpos]])
				674
				675	Scan through string looking for a location where this regular expression
				676	produces a match, and return a corresponding :class:`MatchObject` instance.
				677	Return ``None`` if no position in the string matches the pattern; note that this
				678	is different from finding a zero-length match at some point in the string.
				679
				680	The optional pos and endpos parameters have the same meaning as for the
				681	:meth:`match` method.
				682
				683
				684	.. method:: RegexObject.split(string[, maxsplit=0])
				685
				686	Identical to the :func:`split` function, using the compiled pattern.
				687
				688
				689	.. method:: RegexObject.findall(string[, pos[, endpos]])
				690
				691	Identical to the :func:`findall` function, using the compiled pattern.
				692
				693
				694	.. method:: RegexObject.finditer(string[, pos[, endpos]])
				695
				696	Identical to the :func:`finditer` function, using the compiled pattern.
				697
				698
				699	.. method:: RegexObject.sub(repl, string[, count=0])
				700
				701	Identical to the :func:`sub` function, using the compiled pattern.
				702
				703
				704	.. method:: RegexObject.subn(repl, string[, count=0])
				705
				706	Identical to the :func:`subn` function, using the compiled pattern.
				707
				708
				709	.. attribute:: RegexObject.flags
				710
				711	The flags argument used when the RE object was compiled, or ``0`` if no flags
				712	were provided.
				713
				714
				715	.. attribute:: RegexObject.groupindex
				716
				717	A dictionary mapping any symbolic group names defined by ``(?P<id>)`` to group
				718	numbers. The dictionary is empty if no symbolic groups were used in the
				719	pattern.
				720
				721
				722	.. attribute:: RegexObject.pattern
				723
				724	The pattern string from which the RE object was compiled.
				725
				726
				727	.. _match-objects:
				728
				729	Match Objects
				730	-------------
				731
				732	:class:`MatchObject` instances support the following methods and attributes:
				733
				734
				735	.. method:: MatchObject.expand(template)
				736
				737	Return the string obtained by doing backslash substitution on the template
				738	string template, as done by the :meth:`sub` method. Escapes such as ``\n`` are
				739	converted to the appropriate characters, and numeric backreferences (``\1``,
				740	``\2``) and named backreferences (``\g<1>``, ``\g<name>``) are replaced by the
				741	contents of the corresponding group.
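
A short sketch of template expansion on a match object. The pattern and the group name ``user`` are assumptions chosen for illustration.

```python
import re

m = re.match(r'(?P<user>\w+)@(\w+)', 'alice@example')

# \2 is a numeric backreference, \g<user> a named one,
# and \n in the template becomes a newline character.
expanded = m.expand(r'\2: \g<user>\n')

print(repr(expanded))   # 'example: alice\n'
```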
				742
				743
				744	.. method:: MatchObject.group([group1, ...])
				745
				746	Returns one or more subgroups of the match. If there is a single argument, the
				747	result is a single string; if there are multiple arguments, the result is a
				748	tuple with one item per argument. Without arguments, group1 defaults to zero
				749	(the whole match is returned). If a groupN argument is zero, the corresponding
				750	return value is the entire matching string; if it is in the inclusive range
				751	[1..99], it is the string matching the corresponding parenthesized group. If a
				752	group number is negative or larger than the number of groups defined in the
				753	pattern, an :exc:`IndexError` exception is raised. If a group is contained in a
				754	part of the pattern that did not match, the corresponding result is ``None``.
				755	If a group is contained in a part of the pattern that matched multiple times,
				756	the last match is returned.
				757
				758	If the regular expression uses the ``(?P<name>...)`` syntax, the groupN
				759	arguments may also be strings identifying groups by their group name. If a
				760	string argument is not used as a group name in the pattern, an :exc:`IndexError`
				761	exception is raised.
				762
				763	A moderately complicated example::
				764
				765	m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
				766
				767	After performing this match, ``m.group(1)`` is ``'3'``, as is
				768	``m.group('int')``, and ``m.group(2)`` is ``'14'``.
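
Building on the same match, this sketch exercises the other behaviours described above: multiple arguments, mixing names and numbers, a group that matched several times, and an out-of-range group number. The second pattern is invented for this example.

```python
import re

m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')

whole = m.group()              # no arguments: the entire match
both = m.group(1, 2)           # several arguments: a tuple
by_name = m.group('int', 2)    # names and numbers can be mixed

# A group that matched several times reports only its last match.
last_repeat = re.match(r'(..)+', 'a1b2c3').group(1)

# Asking for a group that the pattern does not define raises IndexError.
try:
    m.group(3)
    raised = False
except IndexError:
    raised = True

print(whole)         # 3.14
print(both)          # ('3', '14')
print(by_name)       # ('3', '14')
print(last_repeat)   # c3
print(raised)        # True
```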
				769
				770
				771	.. method:: MatchObject.groups([default])
				772
				773	Return a tuple containing all the subgroups of the match, from 1 up to however
				774	many groups are in the pattern. The default argument is used for groups that
				775	did not participate in the match; it defaults to ``None``. (Incompatibility
				776	note: in the original Python 1.5 release, if the tuple was one element long, a
				777	string would be returned instead. In later versions (from 1.5.1 on), a
				778	singleton tuple is returned in such cases.)
				779
				780
				781	.. method:: MatchObject.groupdict([default])
				782
				783	Return a dictionary containing all the named subgroups of the match, keyed by
				784	the subgroup name. The default argument is used for groups that did not
				785	participate in the match; it defaults to ``None``.
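
The default argument of :meth:`groups` and :meth:`groupdict` can be sketched like this. The pattern, with an optional ``frac`` group, is an assumption made for illustration.

```python
import re

# The fractional part is optional, so 'frac' does not participate here.
m = re.match(r'(?P<int>\d+)(?:\.(?P<frac>\d+))?', '24')

plain = m.groups()            # non-participating groups become None
filled = m.groups('0')        # ... or the supplied default
named = m.groupdict('0')      # same idea, keyed by group name

print(plain)    # ('24', None)
print(filled)   # ('24', '0')
```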
				786
				787
				788	.. method:: MatchObject.start([group])
				789	MatchObject.end([group])
				790
				791	Return the indices of the start and end of the substring matched by group;
				792	group defaults to zero (meaning the whole matched substring). Return ``-1`` if
				793	group exists but did not contribute to the match. For a match object m, and
				794	a group g that did contribute to the match, the substring matched by group g
				795	(equivalent to ``m.group(g)``) is ::
				796
				797	m.string[m.start(g):m.end(g)]
				798
				799	Note that ``m.start(group)`` will equal ``m.end(group)`` if group matched a
				800	null string. For example, after ``m = re.search('b(c?)', 'cba')``,
				801	``m.start(0)`` is 1, ``m.end(0)`` is 2, ``m.start(1)`` and ``m.end(1)`` are both
				802	2, and ``m.start(2)`` raises an :exc:`IndexError` exception.
				803
				804
				805	.. method:: MatchObject.span([group])
				806
				807	For :class:`MatchObject` m, return the 2-tuple ``(m.start(group),
				808	m.end(group))``. Note that if group did not contribute to the match, this is
				809	``(-1, -1)``. Again, group defaults to zero.
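
A small sketch of :meth:`start`, :meth:`end` and :meth:`span` for a group that did not contribute to the match, using an invented alternation pattern:

```python
import re

# Only the second alternative participates when matching 'b'.
m = re.match(r'(a)|(b)', 'b')

unused_start = m.start(1)    # group 1 exists but did not contribute
unused_span = m.span(1)
used_span = m.span(2)
whole_span = m.span()        # group defaults to zero

# For a contributing group, slicing m.string reproduces m.group(g).
same = m.string[m.start(2):m.end(2)] == m.group(2)

print(unused_start)   # -1
print(unused_span)    # (-1, -1)
print(used_span)      # (0, 1)
print(whole_span)     # (0, 1)
print(same)           # True
```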
				810
				811
				812	.. attribute:: MatchObject.pos
				813
				814	The value of pos which was passed to the :func:`search` or :func:`match`
				815	method of the :class:`RegexObject`. This is the index into the string at which
				816	the RE engine started looking for a match.
				817
				818
				819	.. attribute:: MatchObject.endpos
				820
				821	The value of endpos which was passed to the :func:`search` or :func:`match`
				822	method of the :class:`RegexObject`. This is the index into the string beyond
				823	which the RE engine will not go.
				824
				825
				826	.. attribute:: MatchObject.lastindex
				827
				828	The integer index of the last matched capturing group, or ``None`` if no group
				829	was matched at all. For example, the expressions ``(a)b``, ``((a)(b))``, and
				830	``((ab))`` will have ``lastindex == 1`` if applied to the string ``'ab'``, while
				831	the expression ``(a)(b)`` will have ``lastindex == 2``, if applied to the same
				832	string.
				833
				834
				835	.. attribute:: MatchObject.lastgroup
				836
				837	The name of the last matched capturing group, or ``None`` if the group didn't
				838	have a name, or if no group was matched at all.
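
The ``lastindex`` and ``lastgroup`` attributes can be compared side by side in a short sketch. The named-group pattern is an assumption made for illustration.

```python
import re

# Nested groups: the outer group closes last, so lastindex is 1.
nested = re.match(r'((a)(b))', 'ab').lastindex

# Sequential groups: the second group closes last.
sequential = re.match(r'(a)(b)', 'ab').lastindex

# No capturing groups at all.
no_groups = re.match(r'ab', 'ab').lastindex

# lastgroup reports the name of the last matched group, if it has one.
named = re.match(r'(?P<word>[a-z]+)|(?P<num>\d+)', '42').lastgroup
unnamed = re.match(r'(a)(b)', 'ab').lastgroup

print(nested, sequential, no_groups)   # 1 2 None
print(named, unnamed)                  # num None
```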
				839
				840
				841	.. attribute:: MatchObject.re
				842
				843	The regular expression object whose :meth:`match` or :meth:`search` method
				844	produced this :class:`MatchObject` instance.
				845
				846
				847	.. attribute:: MatchObject.string
				848
				849	The string passed to :func:`match` or :func:`search`.
				850
				851
				852	Examples
				853	--------
				854
				855	**Simulating scanf()**
				856
				857	.. index:: single: scanf()
				858
				859	Python does not currently have an equivalent to :cfunc:`scanf`. Regular
				860	expressions are generally more powerful, though also more verbose, than
				861	:cfunc:`scanf` format strings. The table below offers some more-or-less
				862	equivalent mappings between :cfunc:`scanf` format tokens and regular
				863	expressions.
				864
				865	+--------------------------------+---------------------------------------------+
				866	| :cfunc:`scanf` Token           | Regular Expression                          |
				867	+================================+=============================================+
				868	| ``%c``                         | ``.``                                       |
				869	+--------------------------------+---------------------------------------------+
				870	| ``%5c``                        | ``.{5}``                                    |
				871	+--------------------------------+---------------------------------------------+
				872	| ``%d``                         | ``[-+]?\d+``                                |
				873	+--------------------------------+---------------------------------------------+
				874	| ``%e``, ``%E``, ``%f``, ``%g`` | ``[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?`` |
				875	+--------------------------------+---------------------------------------------+
				876	| ``%i``                         | ``[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)``     |
				877	+--------------------------------+---------------------------------------------+
				878	| ``%o``                         | ``0[0-7]*``                                 |
				879	+--------------------------------+---------------------------------------------+
				880	| ``%s``                         | ``\S+``                                     |
				881	+--------------------------------+---------------------------------------------+
				882	| ``%u``                         | ``\d+``                                     |
				883	+--------------------------------+---------------------------------------------+
				884	| ``%x``, ``%X``                 | ``0[xX][\dA-Fa-f]+``                        |
				885	+--------------------------------+---------------------------------------------+
				886
				887	To extract the filename and numbers from a string like ::
				888
				889	/usr/sbin/sendmail - 0 errors, 4 warnings
				890
				891	you would use a :cfunc:`scanf` format like ::
				892
				893	%s - %d errors, %d warnings
				894
				895	The equivalent regular expression would be ::
				896
				897	(\S+) - (\d+) errors, (\d+) warnings
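
Applied from Python, the regular expression above might be used as in the following sketch. The variable names are illustrative, and the captured numbers are converted explicitly because regular expressions always return strings.

```python
import re

line = '/usr/sbin/sendmail - 0 errors, 4 warnings'

m = re.match(r'(\S+) - (\d+) errors, (\d+) warnings', line)

filename = m.group(1)
n_errors = int(m.group(2))      # scanf's %d would yield an int directly
n_warnings = int(m.group(3))

print(filename)     # /usr/sbin/sendmail
print(n_errors)     # 0
print(n_warnings)   # 4
```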
				898
				899	**Avoiding recursion**
				900
				901	If you create regular expressions that require the engine to perform a lot of
				902	recursion, you may encounter a :exc:`RuntimeError` exception with the message
				903	``maximum recursion limit exceeded``. For example, ::
				904
				905	>>> import re
				906	>>> s = 'Begin ' + 1000*'a very long string ' + 'end'
				907	>>> re.match('Begin (\w| )*? end', s).end()
				908	Traceback (most recent call last):
				909	  File "<stdin>", line 1, in ?
				910	  File "/usr/local/lib/python2.5/re.py", line 132, in match
				911	    return _compile(pattern, flags).match(string)
				912	RuntimeError: maximum recursion limit exceeded
				913
				914	You can often restructure your regular expression to avoid recursion.
				915
				916	Starting with Python 2.3, simple uses of the ``*?`` pattern are special-cased to
				917	avoid recursion. Thus, the above regular expression can avoid recursion by
				918	being recast as ``Begin [a-zA-Z0-9_ ]*?end``. As a further benefit, such
				919	regular expressions will run faster than their recursive equivalents.
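
A minimal sketch of the recast expression applied to the same input. Whether the original alternation form actually raises :exc:`RuntimeError` depends on the Python version in use, so only the recast form is exercised here.

```python
import re

s = 'Begin ' + 1000 * 'a very long string ' + 'end'

# A character class with a non-greedy repeat replaces the (\w| )*? group.
m = re.match(r'Begin [a-zA-Z0-9_ ]*?end', s)

# The match runs to the end of the input string.
print(m.end() == len(s))   # True
```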
				920