% Doc/lib/libre.tex (platform/external/python/cpython2), blob 9bad62a72895144ab9d1ee28f1eb39157f6c2e8e

\section{\module{re} ---
         Regular expression operations}
\declaremodule{standard}{re}
\moduleauthor{Fredrik Lundh}{effbot@telia.com}
\sectionauthor{Andrew M. Kuchling}{akuchlin@mems-exchange.org}


\modulesynopsis{Regular expression search and match operations with a
                Perl-style expression syntax.}


This module provides regular expression matching operations similar to
those found in Perl. Regular expression pattern strings may not
contain null bytes, but can specify the null byte using the
\code{\e\var{number}} notation. Both patterns and strings to be
searched can be Unicode strings as well as 8-bit strings. The
\module{re} module is always available.

Regular expressions use the backslash character (\character{\e}) to
indicate special forms or to allow special characters to be used
without invoking their special meaning. This collides with Python's
usage of the same character for the same purpose in string literals;
for example, to match a literal backslash, one might have to write
\code{'\e\e\e\e'} as the pattern string, because the regular expression
must be \samp{\e\e}, and each backslash must be expressed as
\samp{\e\e} inside a regular Python string literal.

The solution is to use Python's raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with \character{r}. So \code{r"\e n"} is a
two-character string containing \character{\e} and \character{n},
while \code{"\e n"} is a one-character string containing a newline.
Usually patterns will be expressed in Python code using this raw
string notation.

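The difference between the two notations can be checked directly. The
following is only a minimal sketch; the variable names and the sample
string are illustrative assumptions, not part of the module.

```python
import re

# A regular string literal needs four backslashes to spell the
# two-character regular expression \\; a raw string needs only two.
plain_pattern = '\\\\'
raw_pattern = r'\\'
same_pattern = (plain_pattern == raw_pattern)

# Either spelling matches one literal backslash in the subject.
subject = 'C:\\temp'                      # the 7 characters C:\temp
backslash_index = re.search(raw_pattern, subject).start()

# r"\n" holds a backslash and an 'n'; "\n" holds a single newline.
raw_length = len(r"\n")
plain_length = len("\n")
```
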
\strong{Implementation note:}
The \module{re}\refstmodindex{pre} module has two distinct
implementations: \module{sre} is the default implementation and
includes Unicode support, but may run into stack limitations for some
patterns. Though this will be fixed for a future release of Python,
the older implementation (without Unicode support) is still available
as the \module{pre}\refstmodindex{pre} module.


\begin{seealso}
  \seetitle{Mastering Regular Expressions}{Book on regular expressions
            by Jeffrey Friedl, published by O'Reilly. The Python
            material in this book dates from before the \refmodule{re}
            module, but it covers writing good regular expression
            patterns in great detail.}
\end{seealso}


\subsection{Regular Expression Syntax \label{re-syntax}}

A regular expression (or RE) specifies a set of strings that matches
it; the functions in this module let you check if a particular string
matches a given regular expression (or if a given regular expression
matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular
expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression. If a string \emph{p}
matches \emph{A} and another string \emph{q} matches \emph{B}, the
string \emph{pq} will match \emph{AB} if \emph{A} and \emph{B} do not
specify boundary conditions that are no longer satisfied by \emph{pq}.
Thus, complex expressions can easily be constructed from simpler
primitive expressions like the ones described here. For details of
the theory and implementation of regular expressions, consult the
Friedl book referenced above, or almost any textbook about compiler
construction.

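The caveat about boundary conditions can be illustrated with a small
sketch. The patterns and variable names below are assumptions chosen
for the illustration: the first piece ends in \character{\$}, a
boundary condition that stops holding once more text follows.

```python
import re

# Each piece matches its own string when tested alone.
a_alone = re.match(r'a$', 'a') is not None
b_alone = re.match(r'b', 'b') is not None

# Concatenated, the $ in the first piece still demands the end of the
# string, which is no longer true after the 'a' in 'ab'.
joined_fails = re.match(r'a$b', 'ab') is None

# Without a boundary condition, concatenation behaves as expected.
plain_joined = re.match(r'ab', 'ab') is not None
```
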
A brief explanation of the format of regular expressions follows. For
further information and a gentler presentation, consult the Regular
Expression HOWTO, accessible from \url{http://www.python.org/doc/howto/}.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like \character{A}, \character{a}, or
\character{0}, are the simplest regular expressions; they simply match
themselves. You can concatenate ordinary characters, so \regexp{last}
matches the string \code{'last'}. (In the rest of this section, we'll
write REs in \regexp{this special style}, usually without quotes, and
strings to be matched \code{'in single quotes'}.)

Some characters, like \character{|} or \character{(}, are special.
Special characters either stand for classes of ordinary characters, or
affect how the regular expressions around them are interpreted.

The special characters are:

\begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}

\item[\character{.}] (Dot.) In the default mode, this matches any
character except a newline. If the \constant{DOTALL} flag has been
specified, this matches any character including a newline.

\item[\character{\textasciicircum}] (Caret.) Matches the start of the
string, and in \constant{MULTILINE} mode also matches immediately
after each newline.

\item[\character{\$}] Matches the end of the string or just before the
newline at the end of the string, and in \constant{MULTILINE} mode
also matches before a newline. \regexp{foo} matches both 'foo' and
'foobar', while the regular expression \regexp{foo\$} matches only
'foo'. More interestingly, searching for \regexp{foo.\$} in
'foo1\textbackslash nfoo2\textbackslash n' matches 'foo2' normally,
but 'foo1' in \constant{MULTILINE} mode.

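The \regexp{foo.\$} case above can be reproduced with a short sketch.
The variable names are illustrative assumptions; the subject string is
the one used in the text.

```python
import re

text = 'foo1\nfoo2\n'

# Default mode: $ matches only at the end of the string or just
# before its final newline, so the search finds 'foo2'.
default_hit = re.search(r'foo.$', text).group(0)

# MULTILINE mode: $ also matches before every newline, so the
# leftmost match is now 'foo1'.
multiline_hit = re.search(r'foo.$', text, re.MULTILINE).group(0)

# With MULTILINE, ^ matches immediately after each newline as well.
line_starts = re.findall(r'^foo', text, re.MULTILINE)
```
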
\item[\character{*}] Causes the resulting RE to
match 0 or more repetitions of the preceding RE, as many repetitions
as are possible. \regexp{ab*} will
match 'a', 'ab', or 'a' followed by any number of 'b's.

\item[\character{+}] Causes the
resulting RE to match 1 or more repetitions of the preceding RE.
\regexp{ab+} will match 'a' followed by any non-zero number of 'b's; it
will not match just 'a'.

\item[\character{?}] Causes the resulting RE to
match 0 or 1 repetitions of the preceding RE. \regexp{ab?} will
match either 'a' or 'ab'.

\item[\code{*?}, \code{+?}, \code{??}] The \character{*},
\character{+}, and \character{?} qualifiers are all \dfn{greedy}; they
match as much text as possible. Sometimes this behaviour isn't
desired; if the RE \regexp{<.*>} is matched against
\code{'<H1>title</H1>'}, it will match the entire string, and not just
\code{'<H1>'}. Adding \character{?} after the qualifier makes it
perform the match in \dfn{non-greedy} or \dfn{minimal} fashion; as
\emph{few} characters as possible will be matched. Using \regexp{.*?}
in the previous expression will match only \code{'<H1>'}.

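As a small sketch of the greedy and non-greedy behaviour just
described (the variable names are assumptions made for the example):

```python
import re

html = '<H1>title</H1>'

# Greedy: .* runs to the last '>' in the string.
greedy = re.match(r'<.*>', html).group(0)

# Non-greedy: .*? stops at the first '>' that lets the RE succeed.
minimal = re.match(r'<.*?>', html).group(0)
```
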
\item[\code{\{\var{m}\}}]
Specifies that exactly \var{m} copies of the previous RE should be
matched; fewer matches cause the entire RE not to match. For example,
\regexp{a\{6\}} will match exactly six \character{a} characters, but
not five.

\item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from
\var{m} to \var{n} repetitions of the preceding RE, attempting to
match as many repetitions as possible. For example, \regexp{a\{3,5\}}
will match from 3 to 5 \character{a} characters. Omitting \var{n}
specifies an infinite upper bound; you can't omit \var{m}. As an
example, \regexp{a\{4,\}b} will match \code{aaaab} or a thousand
\character{a} characters followed by a \code{b}, but not \code{aaab}.
The comma may not be omitted or the modifier would be confused with
the previously described form.

\item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to
match from \var{m} to \var{n} repetitions of the preceding RE,
attempting to match as \emph{few} repetitions as possible. This is
the non-greedy version of the previous qualifier. For example, on the
6-character string \code{'aaaaaa'}, \regexp{a\{3,5\}} will match 5
\character{a} characters, while \regexp{a\{3,5\}?} will only match 3
characters.

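The counted qualifiers can be exercised with the examples from the
text. This is a sketch only, and the variable names are illustrative
assumptions.

```python
import re

six_a = 'aaaaaa'

# {m,n} is greedy and takes 5; {m,n}? is minimal and takes 3.
greedy_count = len(re.match(r'a{3,5}', six_a).group(0))
minimal_count = len(re.match(r'a{3,5}?', six_a).group(0))

# {m} requires exactly m copies: six succeed, five do not.
exact_ok = re.match(r'a{6}', 'aaaaaa') is not None
exact_short = re.match(r'a{6}', 'aaaaa') is None

# Omitting n leaves the upper bound open.
open_ended = re.match(r'a{4,}b', 'a' * 1000 + 'b') is not None
too_few = re.match(r'a{4,}b', 'aaab') is None
```
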
\item[\character{\e}] Either escapes special characters (permitting
you to match characters like \character{*}, \character{?}, and so
forth), or signals a special sequence; special sequences are discussed
below.

If you're not using a raw string to
express the pattern, remember that Python also uses the
backslash as an escape sequence in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and
subsequent character are included in the resulting string. However,
if Python would recognize the resulting sequence, the backslash should
be written twice. This is complicated and hard to understand, so
it's highly recommended that you use raw strings for all but the
simplest expressions.

\item[\code{[]}] Used to indicate a set of characters. Characters can
be listed individually, or a range of characters can be indicated by
giving two characters and separating them by a \character{-}. Special
characters are not active inside sets. For example, \regexp{[akm\$]}
will match any of the characters \character{a}, \character{k},
\character{m}, or \character{\$}; \regexp{[a-z]}
will match any lowercase letter, and \code{[a-zA-Z0-9]} matches any
letter or digit. Character classes such as \code{\e w} or \code{\e S}
(defined below) are also acceptable inside a set. If you want to
include a \character{]} or a \character{-} inside a set, precede it with a
backslash, or place it as the first character. The
pattern \regexp{[]]} will match \code{']'}, for example.

You can match the characters not within a set by \dfn{complementing}
the set. This is indicated by including a
\character{\textasciicircum} as the first character of the set;
\character{\textasciicircum} elsewhere will simply match the
\character{\textasciicircum} character. For example,
\regexp{[{\textasciicircum}5]} will match
any character except \character{5}, and
\regexp{[\textasciicircum\code{\textasciicircum}]} will match any character
except \character{\textasciicircum}.

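The set notation can be tried out as follows. The sample strings and
variable names are assumptions made for this sketch.

```python
import re

# Listed characters; '$' is not special inside a set.
listed = re.findall(r'[akm$]', 'a$kz')

# A range of characters.
lowercase = re.findall(r'[a-z]', 'aB3c')

# A character class such as \w may appear inside a set.
dotted = re.findall(r'[\w.]+', 'a.b c')

# ']' placed first in the set is taken literally.
bracket = re.match(r'[]]', ']').group(0)

# A leading '^' complements the set; a later '^' is an ordinary member.
not_five = re.findall(r'[^5]', '1525')
not_caret = re.findall(r'[^^]', 'a^b')
```
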
\item[\character{|}] \code{A|B}, where A and B can be arbitrary REs,
creates a regular expression that will match either A or B. An
arbitrary number of REs can be separated by the \character{|} in this
way. This can be used inside groups (see below) as well. REs
separated by \character{|} are tried from left to right, and the first
one that allows the complete pattern to match is considered the
accepted branch. This means that if \code{A} matches, \code{B} will
never be tested, even if it would produce a longer overall match. In
other words, the \character{|} operator is never greedy. To match a
literal \character{|}, use \regexp{\e|}, or enclose it inside a
character class, as in \regexp{[|]}.

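The left-to-right rule for alternatives is easy to observe. The
following sketch uses patterns and names chosen only for illustration.

```python
import re

# The first alternative that works is accepted, even though the
# second one would give a longer match.
first_wins = re.match(r'a|ab', 'ab').group(0)

# Reordering the alternatives changes the result.
reordered = re.match(r'ab|a', 'ab').group(0)

# A branch is accepted only if the complete pattern can match: 'a'
# alone leaves 'bc', so the second branch is used here.
whole_pattern = re.match(r'(?:a|ab)c', 'abc').group(0)

# Two ways to match a literal '|'.
escaped_bars = re.findall(r'\|', 'x|y|z')
class_bars = re.findall(r'[|]', 'x|y|z')
```
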
\item[\code{(...)}] Matches whatever regular expression is inside the
parentheses, and indicates the start and end of a group; the contents
of a group can be retrieved after a match has been performed, and can
be matched later in the string with the \regexp{\e \var{number}} special
sequence, described below. To match the literals \character{(} or
\character{)}, use \regexp{\e(} or \regexp{\e)}, or enclose them
inside a character class: \regexp{[(] [)]}.

\item[\code{(?...)}] This is an extension notation (a \character{?}
following a \character{(} is not meaningful otherwise). The first
character after the \character{?}
determines what the meaning and further syntax of the construct is.
Extensions usually do not create a new group;
\regexp{(?P<\var{name}>...)} is the only exception to this rule.
Following are the currently supported extensions.

\item[\code{(?iLmsux)}] (One or more letters from the set \character{i},
\character{L}, \character{m}, \character{s}, \character{u},
\character{x}.) The group matches the empty string; the letters set
the corresponding flags (\constant{re.I}, \constant{re.L},
\constant{re.M}, \constant{re.S}, \constant{re.U}, \constant{re.X})
for the entire regular expression. This is useful if you wish to
include the flags as part of the regular expression, instead of
passing a \var{flags} argument to the \function{compile()} function.

Note that the \regexp{(?x)} flag changes how the expression is parsed.
It should be used first in the expression string, or after one or more
whitespace characters. If there are non-whitespace characters before
the flag, the results are undefined.

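A minimal sketch of flags embedded in the pattern, compared with
passing them to \function{compile()}. The sample strings and names
are assumptions; the flag group is placed first in the pattern, as
recommended above.

```python
import re

# (?i) inside the pattern has the same effect as passing re.I.
inline = re.match(r'(?i)python', 'PYTHON')
explicit = re.compile(r'python', re.I).match('PYTHON')
same_result = inline.group(0) == explicit.group(0)

# (?x) turns on verbose parsing: unescaped whitespace in the pattern
# is ignored and '#' starts a comment.
verbose = re.match(r'(?x) \d+   # one or more digits', '42 apples')
verbose_digits = verbose.group(0)
```
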
\item[\code{(?:...)}] A non-grouping version of regular parentheses.
Matches whatever regular expression is inside the parentheses, but the
substring matched by the
group \emph{cannot} be retrieved after performing a match or
referenced later in the pattern.

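The difference between regular and non-grouping parentheses shows up
in what a match object reports. This is a sketch under assumed sample
data; the variable names are illustrative.

```python
import re

# Regular parentheses create numbered groups.
capturing = re.match(r'(ab)+(c)', 'ababc')
capturing_groups = capturing.groups()

# (?:...) still groups for the '+' but is not retrievable, so the
# only numbered group left is the one around 'c'.
non_capturing = re.match(r'(?:ab)+(c)', 'ababc')
non_capturing_groups = non_capturing.groups()

# Literal parentheses: escaped, or inside a character class.
literal_parens = re.findall(r'\(|[)]', 'f(x)')
```
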
\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
the substring matched by the group is accessible via the symbolic group
name \var{name}. Group names must be valid Python identifiers, and
each group name must be defined only once within a regular expression. A
symbolic group is also a numbered group, just as if the group were not
named. So the group named 'id' in the example below can also be
referenced as the numbered group 1.

For example, if the pattern is
\regexp{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its
name in arguments to methods of match objects, such as
\code{m.group('id')} or \code{m.end('id')}, and also by name in
pattern text (for example, \regexp{(?P=id)}) and replacement text
(such as \code{\e g<id>}).

\item[\code{(?P=\var{name})}] Matches whatever text was matched by the
earlier group named \var{name}.

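The three ways of referring to a named group can be sketched as
follows. The subject strings and variable names are assumptions made
for the example; the pattern is the one given above.

```python
import re

m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam_1 = 2')

# By name or by number in match object methods.
by_name = m.group('id')
by_number = m.group(1)
end_of_id = m.end('id')

# By name later in the pattern text.
repeated = re.match(r'(?P<id>[a-zA-Z_]\w*) (?P=id)', 'spam spam')

# By name in replacement text.
wrapped = re.sub(r'(?P<id>[a-zA-Z_]\w*)', r'<\g<id>>', 'x y')
```
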
\item[\code{(?\#...)}] A comment; the contents of the parentheses are
simply ignored.

\item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't
consume any of the string. This is called a lookahead assertion. For
example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's
followed by \code{'Asimov'}.

\item[\code{(?!...)}] Matches if \regexp{...} doesn't match next. This
is a negative lookahead assertion. For example,
\regexp{Isaac (?!Asimov)} will match \code{'Isaac~'} only if it's \emph{not}
followed by \code{'Asimov'}.

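Both lookahead forms can be checked with the examples from the text.
The second subject string and the variable names are assumptions made
for this sketch. Note that the asserted text is not part of the match.

```python
import re

# Positive lookahead: succeeds only when 'Asimov' follows, and the
# match itself stops after the space.
positive = re.match(r'Isaac (?=Asimov)', 'Isaac Asimov')
positive_text = positive.group(0)
positive_miss = re.match(r'Isaac (?=Asimov)', 'Isaac Newton')

# Negative lookahead: succeeds only when 'Asimov' does not follow.
negative = re.match(r'Isaac (?!Asimov)', 'Isaac Newton')
negative_miss = re.match(r'Isaac (?!Asimov)', 'Isaac Asimov')
```
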
\item[\code{(?<=...)}] Matches if the current position in the string
is preceded by a match for \regexp{...} that ends at the current
position. This is called a \dfn{positive lookbehind assertion}.
\regexp{(?<=abc)def} will find a match in \samp{abcdef}, since the
lookbehind will back up 3 characters and check if the contained
pattern matches. The contained pattern must only match strings of
some fixed length, meaning that \regexp{abc} or \regexp{a|b} are
allowed, but \regexp{a*} and \regexp{a\{3,4\}} are not. Note that
patterns which start with positive lookbehind assertions will never
match at the beginning of the string being searched; you will most
likely want to use the \function{search()} function rather than the
\function{match()} function:

\begin{verbatim}
>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'
\end{verbatim}

This example looks for a word following a hyphen:

\begin{verbatim}
>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
>>> m.group(0)
'egg'
\end{verbatim}

\item[\code{(?<!...)}] Matches if the current position in the string
is not preceded by a match for \regexp{...}. This is called a
\dfn{negative lookbehind assertion}. Similar to positive lookbehind
assertions, the contained pattern must only match strings of some
fixed length. Patterns which start with negative lookbehind
assertions may match at the beginning of the string being
searched.

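A sketch of the negative lookbehind assertion, mirroring the positive
example above. The subject strings and variable names are assumptions.

```python
import re

# 'def' is preceded by 'abc', so the assertion rejects it.
blocked = re.search(r'(?<!abc)def', 'abcdef')

# 'def' is preceded by something else, so the search succeeds.
allowed = re.search(r'(?<!abc)def', 'xyzdef').group(0)

# Nothing precedes the start of the string, so the assertion holds
# there and match() can succeed.
at_start = re.match(r'(?<!abc)def', 'def')
```
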
\end{list}

The special sequences consist of \character{\e} and a character from the
list below. If the ordinary character is not on the list, then the
resulting RE will match the second character. For example,
\regexp{\e\$} matches the character \character{\$}.

\begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}

\item[\code{\e \var{number}}] Matches the contents of the group of the
same number. Groups are numbered starting from 1. For example,
\regexp{(.+) \e 1} matches \code{'the the'} or \code{'55 55'}, but not
\code{'the end'} (note
the space after the group). This special sequence can only be used to
match one of the first 99 groups. If the first digit of \var{number}
is 0, or \var{number} is 3 octal digits long, it will not be interpreted
as a group match, but as the character with octal value \var{number}.
(There is a group 0, which is the entire matched pattern, but it can't
be referenced with \regexp{\e 0}; instead, use \regexp{\e g<0>}.)
Inside the \character{[} and \character{]} of a character class, all numeric
escapes are treated as characters.

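The numbered backreference and the octal interpretation can be
sketched like this. The variable names are illustrative assumptions;
the octal example uses 061, the code of the character '1'.

```python
import re

# \1 must repeat exactly what group 1 matched.
doubled_word = re.match(r'(.+) \1', 'the the')
doubled_number = re.match(r'(.+) \1', '55 55')
mismatch = re.match(r'(.+) \1', 'the end')

# Three octal digits denote a character, not a group reference.
octal_one = re.match(r'\061', '1')
```
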
\item[\code{\e A}] Matches only at the start of the string.

\item[\code{\e b}] Matches the empty string, but only at the
beginning or end of a word. A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character. Inside a character range,
\regexp{\e b} represents the backspace character, for compatibility with
Python's string literals.

\item[\code{\e B}] Matches the empty string, but only when it is
\emph{not} at the beginning or end of a word.

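A short sketch of the word-boundary sequences; the sample text and the
variable names are assumptions made for the illustration.

```python
import re

text = 'foo foobar barfoo (foo)'

# \bfoo\b finds 'foo' only where it stands as a whole word: the
# first word and the one in parentheses.
whole_words = re.findall(r'\bfoo\b', text)

# \Bfoo finds 'foo' only where it does not begin a word ('barfoo').
inside_word = re.search(r'\Bfoo', text).start()

# Inside a character class, \b is the backspace character.
backspace_at = re.search(r'[\b]', 'a\bc').start()
```
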
\item[\code{\e d}] Matches any decimal digit; this is
equivalent to the set \regexp{[0-9]}.

\item[\code{\e D}] Matches any non-digit character; this is
equivalent to the set \regexp{[{\textasciicircum}0-9]}.

\item[\code{\e s}] Matches any whitespace character; this is
equivalent to the set \regexp{[ \e t\e n\e r\e f\e v]}.

\item[\code{\e S}] Matches any non-whitespace character; this is
equivalent to the set \regexp{[\textasciicircum\ \e t\e n\e r\e f\e v]}.

\item[\code{\e w}] When the \constant{LOCALE} and \constant{UNICODE}
flags are not specified,
matches any alphanumeric character; this is equivalent to the set
\regexp{[a-zA-Z0-9_]}. With \constant{LOCALE}, it will match the set
\regexp{[0-9_]} plus whatever characters are defined as letters for
the current locale. If \constant{UNICODE} is set, this will match the
characters \regexp{[0-9_]} plus whatever is classified as alphanumeric
in the Unicode character properties database.

\item[\code{\e W}] When the \constant{LOCALE} and \constant{UNICODE}
flags are not specified, matches any non-alphanumeric character; this
is equivalent to the set \regexp{[{\textasciicircum}a-zA-Z0-9_]}. With
\constant{LOCALE}, it will match any character not in the set
\regexp{[0-9_]}, and not defined as a letter for the current locale.
If \constant{UNICODE} is set, this will match anything other than
\regexp{[0-9_]} and characters marked as alphanumeric in the Unicode
character properties database.

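The character-class sequences above can be compared side by side.
This sketch assumes plain ASCII input, so the results do not depend on
the \constant{LOCALE} or \constant{UNICODE} flags; the sample strings
and variable names are illustrative.

```python
import re

# \d and \D split a string into digits and everything else.
digits = re.findall(r'\d', 'a1 b22')
non_digits = re.findall(r'\D+', 'a1 b22')

# \s covers space, tab and newline alike.
fields = re.split(r'\s+', 'one two\tthree\nfour')

# \w includes letters, digits and the underscore; \W is the rest.
words = re.findall(r'\w+', 'spam_1, eggs-2')
separators = re.findall(r'\W+', 'spam_1, eggs-2')
```
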
\item[\code{\e Z}] Matches only at the end of the string.

\item[\code{\e \e}] Matches a literal backslash.

\end{list}


\subsection{Matching vs. Searching \label{matching-searching}}
\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}

Python offers two different primitive operations based on regular
expressions: match and search. If you are accustomed to Perl's
semantics, the search operation is what you're looking for. See the
\function{search()} function and corresponding method of compiled
regular expression objects.

Note that match may differ from search using a regular expression
beginning with \character{\textasciicircum}:
\character{\textasciicircum} matches only at the
start of the string, or in \constant{MULTILINE} mode also immediately
following a newline. The ``match'' operation succeeds only if the
pattern matches at the start of the string regardless of mode, or at
the starting position given by the optional \var{pos} argument
regardless of whether a newline precedes it.

				398	% Examples from Tim Peters:
				399	\begin{verbatim}
				400	re.compile("a").match("ba", 1) # succeeds
				401	re.compile("^a").search("ba", 1) # fails; 'a' not at start
				402	re.compile("^a").search("\na", 1) # fails; 'a' not at start
				403	re.compile("^a", re.M).search("\na", 1) # succeeds
				404	re.compile("^a", re.M).search("ba", 1) # fails; no preceding \n
				405	\end{verbatim}
				406
				407
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	408	\subsection{Module Contents}
Fred Drake	78f8e98	1997-12-29 21:39:39 +0000	[diff] [blame]	409	\nodename{Contents of Module re}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	410
				411	The module defines the following functions and constants, and an exception:
				412
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	413
Fred Drake	013ad98	1998-03-08 07:38:27 +0000	[diff] [blame]	414	\begin{funcdesc}{compile}{pattern\optional{, flags}}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	415	Compile a regular expression pattern into a regular expression
Fred Drake	20e0196	1998-02-19 15:09:35 +0000	[diff] [blame]	416	object, which can be used for matching using its \function{match()} and
Tim Peters	7533587	2001-11-03 19:35:43 +0000	[diff] [blame]	417	\function{search()} methods, described below.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	418
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	419	The expression's behaviour can be modified by specifying a
				420	\var{flags} value. Values can be any of the following variables,
				421	combined using bitwise OR (the \code{|} operator).
				422
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	423	The sequence
				424
				425	\begin{verbatim}
				426	prog = re.compile(pat)
				427	result = prog.match(str)
				428	\end{verbatim}
				429
				430	is equivalent to
				431
				432	\begin{verbatim}
				433	result = re.match(pat, str)
				434	\end{verbatim}
				435
				436	but the version using \function{compile()} is more efficient when the
				437	expression will be used several times in a single program.
				438	%(The compiled version of the last pattern passed to
Fred Drake	895aa9d	2001-04-18 17:26:20 +0000	[diff] [blame]	439	%\function{re.match()} or \function{re.search()} is cached, so
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	440	%programs that use only a single regular expression at a time needn't
				441	%worry about compiling regular expressions.)
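
As a minimal sketch of combining flags and reusing a compiled object,
the following compiles one pattern with two flags joined by bitwise OR.
The pattern, the sample text, and the variable names are made up for
illustration only.

\begin{verbatim}
import re

# Combine two flags with bitwise OR; pattern and text are illustrative.
prog = re.compile(r"^error", re.IGNORECASE | re.MULTILINE)
text = "ok\nError: disk full\nERROR: retry"

# The compiled object can be reused without recompiling the pattern.
hits = prog.findall(text)          # ['Error', 'ERROR']
first = prog.search(text)          # match object for 'Error'
\end{verbatim}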
				442	\end{funcdesc}
				443
Fred Drake	013ad98	1998-03-08 07:38:27 +0000	[diff] [blame]	444	\begin{datadesc}{I}
				445	\dataline{IGNORECASE}
Fred Drake	f4bdb57	2001-07-12 14:13:43 +0000	[diff] [blame]	446	Perform case-insensitive matching; expressions like \regexp{[A-Z]}
				447	will match lowercase letters, too. This is not affected by the
				448	current locale.
Fred Drake	013ad98	1998-03-08 07:38:27 +0000	[diff] [blame]	449	\end{datadesc}
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	450
Fred Drake	013ad98	1998-03-08 07:38:27 +0000	[diff] [blame]	451	\begin{datadesc}{L}
				452	\dataline{LOCALE}
Fred Drake	e53793b	2000-09-25 17:52:40 +0000	[diff] [blame]	453	Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
Tim Peters	7533587	2001-11-03 19:35:43 +0000	[diff] [blame]	454	\regexp{\e B} dependent on the current locale.
Fred Drake	013ad98	1998-03-08 07:38:27 +0000	[diff] [blame]	455	\end{datadesc}
Guido van Rossum	a42c178	1997-12-09 20:41:47 +0000	[diff] [blame]	456
Fred Drake	013ad98	1998-03-08 07:38:27 +0000	[diff] [blame]	457	\begin{datadesc}{M}
				458	\dataline{MULTILINE}
Fred Drake	7bc6f7a	2002-02-14 15:19:30 +0000	[diff] [blame]	459	When specified, the pattern character \character{\textasciicircum}
				460	matches at the beginning of the string and at the beginning of each
				461	line (immediately following each newline); and the pattern character
Fred Drake	f4bdb57	2001-07-12 14:13:43 +0000	[diff] [blame]	462	\character{\$} matches at the end of the string and at the end of each
Fred Drake	7bc6f7a	2002-02-14 15:19:30 +0000	[diff] [blame]	463	line (immediately preceding each newline). By default,
				464	\character{\textasciicircum} matches only at the beginning of the
				465	string, and \character{\$} only at the end of the string and
				466	immediately before the newline (if any) at the end of the string.
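
The difference can be sketched with a two-line string; the sample text
and variable names below are illustrative assumptions, not part of the
module.

\begin{verbatim}
import re

text = "first\nsecond\n"

# Default mode: '^' only at the start, '$' only at the end of the
# string (or just before a trailing newline).
default_hits = re.compile(r"^\w+$").findall(text)        # []
last_line = re.compile(r"\w+$").findall(text)            # ['second']

# MULTILINE: '^' and '$' also match at every line boundary.
multi_hits = re.compile(r"^\w+$", re.M).findall(text)    # ['first', 'second']
\end{verbatim}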
Fred Drake	013ad98	1998-03-08 07:38:27 +0000	[diff] [blame]	467	\end{datadesc}
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	468
Fred Drake	013ad98	1998-03-08 07:38:27 +0000	[diff] [blame]	469	\begin{datadesc}{S}
				470	\dataline{DOTALL}
Fred Drake	e53793b	2000-09-25 17:52:40 +0000	[diff] [blame]	471	Make the \character{.} special character match any character at all,
				472	including a newline; without this flag, \character{.} will match
				473	anything \emph{except} a newline.
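
A short sketch of the effect, using an illustrative three-character
string:

\begin{verbatim}
import re

text = "a\nb"

# Without DOTALL, '.' refuses to match the newline.
plain = re.compile(r"a.b").match(text)                # None

# With DOTALL, '.' matches the newline as well.
dotall = re.compile(r"a.b", re.DOTALL).match(text)    # matches 'a\nb'
\end{verbatim}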
				474	\end{datadesc}
				475
				476	\begin{datadesc}{U}
				477	\dataline{UNICODE}
				478	Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
				479	\regexp{\e B} dependent on the Unicode character properties database.
				480	\versionadded{2.0}
Fred Drake	013ad98	1998-03-08 07:38:27 +0000	[diff] [blame]	481	\end{datadesc}
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	482
Fred Drake	013ad98	1998-03-08 07:38:27 +0000	[diff] [blame]	483	\begin{datadesc}{X}
				484	\dataline{VERBOSE}
Andrew M. Kuchling	2533281	1998-04-09 14:56:04 +0000	[diff] [blame]	485	This flag allows you to write regular expressions that look nicer.
Tim Peters	7533587	2001-11-03 19:35:43 +0000	[diff] [blame]	486	Whitespace within the pattern is ignored,
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	487	except when in a character class or preceded by an unescaped
Fred Drake	f4bdb57	2001-07-12 14:13:43 +0000	[diff] [blame]	488	backslash, and, when a line contains a \character{\#} neither in a
				489	character class nor preceded by an unescaped backslash, all characters
				490	from the leftmost such \character{\#} through the end of the line are
				491	ignored.
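
As one possible illustration, the two objects below are intended to
match the same strings; the pattern for a decimal number and the
variable names are chosen only for this sketch.

\begin{verbatim}
import re

# Compact form of a pattern for a simple decimal number.
compact = re.compile(r"\d+\.\d*")

# The same pattern spread over several lines with comments;
# whitespace and '#' comments are ignored because of re.VERBOSE.
verbose = re.compile(r"""
    \d+    # the integral part
    \.     # the decimal point
    \d*    # optional fractional digits
    """, re.VERBOSE)

m1 = compact.match("3.14")
m2 = verbose.match("3.14")
\end{verbatim}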
Andrew M. Kuchling	2533281	1998-04-09 14:56:04 +0000	[diff] [blame]	492	% XXX should add an example here
Fred Drake	013ad98	1998-03-08 07:38:27 +0000	[diff] [blame]	493	\end{datadesc}
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	494
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	495
Guido van Rossum	7d447aa	1998-10-13 16:03:52 +0000	[diff] [blame]	496	\begin{funcdesc}{search}{pattern, string\optional{, flags}}
				497	Scan through \var{string} looking for a location where the regular
				498	expression \var{pattern} produces a match, and return a
				499	corresponding \class{MatchObject} instance.
				500	Return \code{None} if no
				501	position in the string matches the pattern; note that this is
				502	different from finding a zero-length match at some point in the string.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	503	\end{funcdesc}
				504
Fred Drake	013ad98	1998-03-08 07:38:27 +0000	[diff] [blame]	505	\begin{funcdesc}{match}{pattern, string\optional{, flags}}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	506	If zero or more characters at the beginning of \var{string} match
				507	the regular expression \var{pattern}, return a corresponding
Fred Drake	20e0196	1998-02-19 15:09:35 +0000	[diff] [blame]	508	\class{MatchObject} instance. Return \code{None} if the string does not
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	509	match the pattern; note that this is different from a zero-length
				510	match.
Fred Drake	768ac6b	1998-12-22 18:19:45 +0000	[diff] [blame]	511
Fred Drake	0aa811c	2001-10-20 04:24:09 +0000	[diff] [blame]	512	\note{If you want to locate a match anywhere in
				513	\var{string}, use \method{search()} instead.}
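
The distinction between a zero-length match and no match, and between
\function{match()} and \function{search()}, can be sketched as follows
(the patterns and variable names are illustrative):

\begin{verbatim}
import re

empty = re.match(r"x*", "abc")     # succeeds with a zero-length match
missing = re.match(r"x+", "abc")   # no match at all: None

anchored = re.match(r"b", "abc")   # None: only position 0 is tried
found = re.search(r"b", "abc")     # search() scans forward and finds 'b'
\end{verbatim}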
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	514	\end{funcdesc}
				515
Fred Drake	77a6c9e	2000-09-07 14:00:51 +0000	[diff] [blame]	516	\begin{funcdesc}{split}{pattern, string\optional{, maxsplit\code{ = 0}}}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	517	Split \var{string} by the occurrences of \var{pattern}. If
Andrew M. Kuchling	d22e250	1998-08-14 14:49:20 +0000	[diff] [blame]	518	capturing parentheses are used in \var{pattern}, then the text of all
				519	groups in the pattern is also returned as part of the resulting list.
Guido van Rossum	9754639	1998-01-12 18:58:53 +0000	[diff] [blame]	520	If \var{maxsplit} is nonzero, at most \var{maxsplit} splits
				521	occur, and the remainder of the string is returned as the final
				522	element of the list. (Incompatibility note: in the original Python
				523	1.5 release, \var{maxsplit} was ignored. This has been fixed in
				524	later releases.)
Fred Drake	768ac6b	1998-12-22 18:19:45 +0000	[diff] [blame]	525
Fred Drake	1947991	1998-02-13 06:58:54 +0000	[diff] [blame]	526	\begin{verbatim}
Andrew M. Kuchling	d22e250	1998-08-14 14:49:20 +0000	[diff] [blame]	527	>>> re.split('\W+', 'Words, words, words.')
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	528	['Words', 'words', 'words', '']
Andrew M. Kuchling	d22e250	1998-08-14 14:49:20 +0000	[diff] [blame]	529	>>> re.split('(\W+)', 'Words, words, words.')
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	530	['Words', ', ', 'words', ', ', 'words', '.', '']
Andrew M. Kuchling	d22e250	1998-08-14 14:49:20 +0000	[diff] [blame]	531	>>> re.split('\W+', 'Words, words, words.', 1)
Guido van Rossum	9754639	1998-01-12 18:58:53 +0000	[diff] [blame]	532	['Words', 'words, words.']
Fred Drake	1947991	1998-02-13 06:58:54 +0000	[diff] [blame]	533	\end{verbatim}
Fred Drake	768ac6b	1998-12-22 18:19:45 +0000	[diff] [blame]	534
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	535	This function combines and extends the functionality of
Fred Drake	20e0196	1998-02-19 15:09:35 +0000	[diff] [blame]	536	the old \function{regsub.split()} and \function{regsub.splitx()}.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	537	\end{funcdesc}
				538
Guido van Rossum	6c373f7	1998-06-29 22:48:01 +0000	[diff] [blame]	539	\begin{funcdesc}{findall}{pattern, string}
Fred Drake	e74f8de	2001-08-01 16:56:51 +0000	[diff] [blame]	540	Return a list of all non-overlapping matches of \var{pattern} in
				541	\var{string}. If one or more groups are present in the pattern,
				542	return a list of groups; this will be a list of tuples if the
				543	pattern has more than one group. Empty matches are included in the
				544	result.
				545	\versionadded{1.5.2}
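
How the number of groups shapes the result can be sketched as follows;
the sample text and variable names are illustrative.

\begin{verbatim}
import re

text = "x=1, y=22"

whole = re.findall(r"\w=\d+", text)        # ['x=1', 'y=22']
one_group = re.findall(r"\w=(\d+)", text)  # ['1', '22']
pairs = re.findall(r"(\w)=(\d+)", text)    # [('x', '1'), ('y', '22')]
\end{verbatim}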
Guido van Rossum	6c373f7	1998-06-29 22:48:01 +0000	[diff] [blame]	546	\end{funcdesc}
				547
Fred Drake	e74f8de	2001-08-01 16:56:51 +0000	[diff] [blame]	548	\begin{funcdesc}{sub}{pattern, repl, string\optional{, count}}
				549	Return the string obtained by replacing the leftmost non-overlapping
				550	occurrences of \var{pattern} in \var{string} by the replacement
				551	\var{repl}. If the pattern isn't found, \var{string} is returned
				552	unchanged. \var{repl} can be a string or a function; if it is a
				553	string, any backslash escapes in it are processed. That is,
				554	\samp{\e n} is converted to a single newline character, \samp{\e r}
				555	is converted to a carriage return, and so forth. Unknown escapes such as
				556	\samp{\e j} are left alone. Backreferences, such as \samp{\e6}, are
				557	replaced with the substring matched by group 6 in the pattern. For
				558	example:
				559
				560	\begin{verbatim}
				561	>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
				562	... r'static PyObject*\npy_\1(void)\n{',
				563	... 'def myfunc():')
				564	'static PyObject*\npy_myfunc(void)\n{'
				565	\end{verbatim}
				566
				567	If \var{repl} is a function, it is called for every non-overlapping
				568	occurrence of \var{pattern}. The function takes a single match
				569	object argument, and returns the replacement string. For example:
Fred Drake	768ac6b	1998-12-22 18:19:45 +0000	[diff] [blame]	570
Fred Drake	1947991	1998-02-13 06:58:54 +0000	[diff] [blame]	571	\begin{verbatim}
Barry Warsaw	4552f3d	1997-11-20 00:15:13 +0000	[diff] [blame]	572	>>> def dashrepl(matchobj):
Guido van Rossum	e9625e8	1998-04-02 01:32:24 +0000	[diff] [blame]	573	...     if matchobj.group(0) == '-': return ' '
				574	...     else: return '-'
Barry Warsaw	4552f3d	1997-11-20 00:15:13 +0000	[diff] [blame]	575	>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
				576	'pro--gram files'
Fred Drake	1947991	1998-02-13 06:58:54 +0000	[diff] [blame]	577	\end{verbatim}
Fred Drake	768ac6b	1998-12-22 18:19:45 +0000	[diff] [blame]	578
Fred Drake	e74f8de	2001-08-01 16:56:51 +0000	[diff] [blame]	579	The pattern may be a string or an RE object; if you need to specify
				580	regular expression flags, you must use an RE object, or use embedded
				581	modifiers in a pattern; for example, \samp{sub("(?i)b+", "x", "bbbb
				582	BBBB")} returns \code{'x x'}.
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	583
Fred Drake	e74f8de	2001-08-01 16:56:51 +0000	[diff] [blame]	584	The optional argument \var{count} is the maximum number of pattern
				585	occurrences to be replaced; \var{count} must be a non-negative
				586	integer. If omitted or zero, all occurrences will be replaced.
				587	Empty matches for the pattern are replaced only when not adjacent to
				588	a previous match, so \samp{sub('x*', '-', 'abc')} returns
				589	\code{'-a-b-c-'}.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	590
Fred Drake	e74f8de	2001-08-01 16:56:51 +0000	[diff] [blame]	591	In addition to character escapes and backreferences as described
				592	above, \samp{\e g<name>} will use the substring matched by the group
				593	named \samp{name}, as defined by the \regexp{(?P<name>...)} syntax.
				594	\samp{\e g<number>} uses the corresponding group number;
				595	\samp{\e g<2>} is therefore equivalent to \samp{\e 2}, but isn't
				596	ambiguous in a replacement such as \samp{\e g<2>0}. \samp{\e 20}
				597	would be interpreted as a reference to group 20, not a reference to
Eric S. Raymond	46ccd1d	2001-08-28 12:50:03 +0000	[diff] [blame]	598	group 2 followed by the literal character \character{0}. The
				599	backreference \samp{\e g<0>} substitutes in the entire substring
				600	matched by the RE.
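
A brief sketch of named and numbered \samp{\e g<...>} references and of
the \var{count} argument; the strings and variable names are
illustrative only.

\begin{verbatim}
import re

# Named references swap the two words.
swapped = re.sub(r"(?P<first>\w+) (?P<second>\w+)",
                 r"\g<second> \g<first>",
                 "hello world")                       # 'world hello'

# \g<1>0 is group 1 followed by a literal '0', not group 10.
padded = re.sub(r"(\d)", r"\g<1>0", "a1b2")           # 'a10b20'

# Replace at most two occurrences.
limited = re.sub(r"\d", "#", "a1b2c3", count=2)       # 'a#b#c3'
\end{verbatim}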
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	601	\end{funcdesc}
				602
Fred Drake	e74f8de	2001-08-01 16:56:51 +0000	[diff] [blame]	603	\begin{funcdesc}{subn}{pattern, repl, string\optional{, count}}
				604	Perform the same operation as \function{sub()}, but return a tuple
				605	\code{(\var{new_string}, \var{number_of_subs_made})}.
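
For instance, collapsing runs of whitespace in an illustrative string:

\begin{verbatim}
import re

# Two runs of whitespace are replaced, so the count is 2.
new_string, number_of_subs_made = re.subn(r"\s+", " ", "a  b \t c")
# new_string == 'a b c', number_of_subs_made == 2
\end{verbatim}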
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	606	\end{funcdesc}
				607
Guido van Rossum	7d447aa	1998-10-13 16:03:52 +0000	[diff] [blame]	608	\begin{funcdesc}{escape}{string}
				609	Return \var{string} with all non-alphanumerics backslashed; this is
				610	useful if you want to match an arbitrary literal string that may have
				611	regular expression metacharacters in it.
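
A small sketch of why escaping matters; the literal string is an
arbitrary example, and the exact form of the escaped pattern is
deliberately not shown because it may vary between versions.

\begin{verbatim}
import re

literal = "1+1=2 (really?)"
text = "proof: 1+1=2 (really?)"

# Used directly as a pattern, '+', '(', ')' and '?' keep their special
# meanings, so the literal text is not found.
unescaped = re.search(literal, text)                 # None

# After escaping, the pattern matches the literal text itself.
escaped = re.search(re.escape(literal), text)        # matches the literal
\end{verbatim}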
				612	\end{funcdesc}
				613
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	614	\begin{excdesc}{error}
				615	Exception raised when a string passed to one of the functions here
Fred Drake	907e76b	2001-07-06 20:30:11 +0000	[diff] [blame]	616	is not a valid regular expression (for example, it might contain
				617	unmatched parentheses) or when some other error occurs during
				618	compilation or matching. It is never an error if a string contains
				619	no match for a pattern.
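
A minimal sketch of the two cases, with illustrative variable names:

\begin{verbatim}
import re

# An invalid pattern (unmatched parenthesis) raises re.error.
try:
    re.compile("(unclosed")
    compiled = True
except re.error:
    compiled = False

# A valid pattern that simply does not match returns None instead.
no_match = re.search("z", "abc")
\end{verbatim}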
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	620	\end{excdesc}
				621
Fred Drake	42de185	1998-04-20 16:28:44 +0000	[diff] [blame]	622
Fred Drake	d16d498	1998-09-10 20:21:00 +0000	[diff] [blame]	623	\subsection{Regular Expression Objects \label{re-objects}}
Fred Drake	42de185	1998-04-20 16:28:44 +0000	[diff] [blame]	624
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	625	Compiled regular expression objects support the following methods and
				626	attributes:
				627
Fred Drake	77a6c9e	2000-09-07 14:00:51 +0000	[diff] [blame]	628	\begin{methoddesc}[RegexObject]{search}{string\optional{, pos\optional{,
				629	endpos}}}
Guido van Rossum	7d447aa	1998-10-13 16:03:52 +0000	[diff] [blame]	630	Scan through \var{string} looking for a location where this regular
				631	expression produces a match, and return a
				632	corresponding \class{MatchObject} instance. Return \code{None} if no
				633	position in the string matches the pattern; note that this is
				634	different from finding a zero-length match at some point in the string.
Tim Peters	7533587	2001-11-03 19:35:43 +0000	[diff] [blame]	635
Guido van Rossum	7d447aa	1998-10-13 16:03:52 +0000	[diff] [blame]	636	The optional \var{pos} and \var{endpos} parameters have the same
				637	meaning as for the \method{match()} method.
				638	\end{methoddesc}
				639
Fred Drake	77a6c9e	2000-09-07 14:00:51 +0000	[diff] [blame]	640	\begin{methoddesc}[RegexObject]{match}{string\optional{, pos\optional{,
				641	endpos}}}
Guido van Rossum	eb53ae4	1997-10-05 18:54:07 +0000	[diff] [blame]	642	If zero or more characters at the beginning of \var{string} match
				643	this regular expression, return a corresponding
Fred Drake	20e0196	1998-02-19 15:09:35 +0000	[diff] [blame]	644	\class{MatchObject} instance. Return \code{None} if the string does not
Guido van Rossum	eb53ae4	1997-10-05 18:54:07 +0000	[diff] [blame]	645	match the pattern; note that this is different from a zero-length
				646	match.
Fred Drake	768ac6b	1998-12-22 18:19:45 +0000	[diff] [blame]	647
Fred Drake	0aa811c	2001-10-20 04:24:09 +0000	[diff] [blame]	648	\note{If you want to locate a match anywhere in
				649	\var{string}, use \method{search()} instead.}
Fred Drake	768ac6b	1998-12-22 18:19:45 +0000	[diff] [blame]	650
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	651	The optional second parameter \var{pos} gives an index in the string
Andrew M. Kuchling	65b7863	1998-06-22 15:02:42 +0000	[diff] [blame]	652	where the search is to start; it defaults to \code{0}. This is not
Fred Drake	7bc6f7a	2002-02-14 15:19:30 +0000	[diff] [blame]	653	completely equivalent to slicing the string; the
				654	\code{'\textasciicircum'} pattern
Andrew M. Kuchling	65b7863	1998-06-22 15:02:42 +0000	[diff] [blame]	655	character matches at the real beginning of the string and at positions
				656	just after a newline, but not necessarily at the index where the search
				657	is to start.
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	658
				659	The optional parameter \var{endpos} limits how far the string will
				660	be searched; it will be as if the string is \var{endpos} characters
				661	long, so only the characters from \var{pos} to \var{endpos} will be
				662	searched for a match.
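
The difference between \var{pos} and slicing, and the effect of
\var{endpos}, can be sketched as follows (strings and names are
illustrative):

\begin{verbatim}
import re

prog = re.compile(r"^a")

sliced = prog.match("ba"[1:])     # matches: the slice really starts with 'a'
with_pos = prog.match("ba", 1)    # None: '^' still means the real start

digits = re.compile(r"\d+")
# With endpos=4 only "ab12" is visible to the engine.
limited = digits.search("ab1234", 0, 4)    # matches '12'
\end{verbatim}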
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	663	\end{methoddesc}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	664
Fred Drake	77a6c9e	2000-09-07 14:00:51 +0000	[diff] [blame]	665	\begin{methoddesc}[RegexObject]{split}{string\optional{,
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	666	maxsplit\code{ = 0}}}
Fred Drake	20e0196	1998-02-19 15:09:35 +0000	[diff] [blame]	667	Identical to the \function{split()} function, using the compiled pattern.
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	668	\end{methoddesc}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	669
Guido van Rossum	6c373f7	1998-06-29 22:48:01 +0000	[diff] [blame]	670	\begin{methoddesc}[RegexObject]{findall}{string}
				671	Identical to the \function{findall()} function, using the compiled pattern.
				672	\end{methoddesc}
				673
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	674	\begin{methoddesc}[RegexObject]{sub}{repl, string\optional{, count\code{ = 0}}}
Fred Drake	20e0196	1998-02-19 15:09:35 +0000	[diff] [blame]	675	Identical to the \function{sub()} function, using the compiled pattern.
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	676	\end{methoddesc}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	677
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	678	\begin{methoddesc}[RegexObject]{subn}{repl, string\optional{,
				679	count\code{ = 0}}}
Fred Drake	20e0196	1998-02-19 15:09:35 +0000	[diff] [blame]	680	Identical to the \function{subn()} function, using the compiled pattern.
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	681	\end{methoddesc}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	682
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	683
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	684	\begin{memberdesc}[RegexObject]{flags}
Fred Drake	895aa9d	2001-04-18 17:26:20 +0000	[diff] [blame]	685	The flags argument used when the RE object was compiled, or
Fred Drake	013ad98	1998-03-08 07:38:27 +0000	[diff] [blame]	686	\code{0} if no flags were provided.
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	687	\end{memberdesc}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	688
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	689	\begin{memberdesc}[RegexObject]{groupindex}
Tim Peters	7533587	2001-11-03 19:35:43 +0000	[diff] [blame]	690	A dictionary mapping any symbolic group names defined by
Andrew M. Kuchling	2533281	1998-04-09 14:56:04 +0000	[diff] [blame]	691	\regexp{(?P<\var{id}>)} to group numbers. The dictionary is empty if no
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	692	symbolic groups were used in the pattern.
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	693	\end{memberdesc}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	694
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	695	\begin{memberdesc}[RegexObject]{pattern}
Fred Drake	895aa9d	2001-04-18 17:26:20 +0000	[diff] [blame]	696	The pattern string from which the RE object was compiled.
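
A short sketch of the \member{groupindex} and \member{pattern}
attributes, using an illustrative date-like pattern:

\begin{verbatim}
import re

prog = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Copy the mapping into a plain dictionary for inspection.
names = dict(prog.groupindex)     # {'year': 1, 'month': 2}
source = prog.pattern             # the original pattern string
\end{verbatim}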
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	697	\end{memberdesc}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	698
Fred Drake	42de185	1998-04-20 16:28:44 +0000	[diff] [blame]	699
Fred Drake	d16d498	1998-09-10 20:21:00 +0000	[diff] [blame]	700	\subsection{Match Objects \label{match-objects}}
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	701
Fred Drake	f4bdb57	2001-07-12 14:13:43 +0000	[diff] [blame]	702	\class{MatchObject} instances support the following methods and
				703	attributes:
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	704
Andrew M. Kuchling	7a90db6	2000-10-05 12:35:29 +0000	[diff] [blame]	705	\begin{methoddesc}[MatchObject]{expand}{template}
				706	Return the string obtained by doing backslash substitution on the
				707	template string \var{template}, as done by the \method{sub()} method.
				708	Escapes such as \samp{\e n} are converted to the appropriate
Fred Drake	f4bdb57	2001-07-12 14:13:43 +0000	[diff] [blame]	709	characters, and numeric backreferences (\samp{\e 1}, \samp{\e 2}) and
				710	named backreferences (\samp{\e g<1>}, \samp{\e g<name>}) are replaced
				711	by the contents of the corresponding group.
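
For example, with an illustrative pattern that has one named and one
unnamed group:

\begin{verbatim}
import re

m = re.match(r"(?P<user>\w+)@(\w+)", "guido@python")

# Mix a named reference and a numeric backreference in the template.
text = m.expand(r"\g<user> at \2")    # 'guido at python'
\end{verbatim}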
Andrew M. Kuchling	7a90db6	2000-10-05 12:35:29 +0000	[diff] [blame]	712	\end{methoddesc}
				713
Fred Drake	77a6c9e	2000-09-07 14:00:51 +0000	[diff] [blame]	714	\begin{methoddesc}[MatchObject]{group}{\optional{group1, \moreargs}}
Guido van Rossum	4650392	1998-01-19 23:14:17 +0000	[diff] [blame]	715	Returns one or more subgroups of the match. If there is a single
				716	argument, the result is a single string; if there are
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	717	multiple arguments, the result is a tuple with one item per argument.
Fred Drake	907e76b	2001-07-06 20:30:11 +0000	[diff] [blame]	718	Without arguments, \var{group1} defaults to zero (the whole match
Guido van Rossum	4650392	1998-01-19 23:14:17 +0000	[diff] [blame]	719	is returned).
				720	If a \var{groupN} argument is zero, the corresponding return value is the
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	721	entire matching string; if it is in the inclusive range [1..99], it is
Guido van Rossum	791468f	1998-04-03 20:07:37 +0000	[diff] [blame]	722	the string matching the corresponding parenthesized group. If a
				723	group number is negative or larger than the number of groups defined
				724	in the pattern, an \exception{IndexError} exception is raised.
				725	If a group is contained in a part of the pattern that did not match,
Tim Peters	7533587	2001-11-03 19:35:43 +0000	[diff] [blame]	726	the corresponding result is \code{None}. If a group is contained in a
Guido van Rossum	791468f	1998-04-03 20:07:37 +0000	[diff] [blame]	727	part of the pattern that matched multiple times, the last match is
				728	returned.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	729
Andrew M. Kuchling	2533281	1998-04-09 14:56:04 +0000	[diff] [blame]	730	If the regular expression uses the \regexp{(?P<\var{name}>...)} syntax,
Guido van Rossum	4650392	1998-01-19 23:14:17 +0000	[diff] [blame]	731	the \var{groupN} arguments may also be strings identifying groups by
Tim Peters	7533587	2001-11-03 19:35:43 +0000	[diff] [blame]	732	their group name. If a string argument is not used as a group name in
Guido van Rossum	791468f	1998-04-03 20:07:37 +0000	[diff] [blame]	733	the pattern, an \exception{IndexError} exception is raised.
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	734
				735	A moderately complicated example:
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	736
				737	\begin{verbatim}
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	738	m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	739	\end{verbatim}
				740
				741	After performing this match, \code{m.group(1)} is \code{'3'}, as is
Guido van Rossum	4650392	1998-01-19 23:14:17 +0000	[diff] [blame]	742	\code{m.group('int')}, and \code{m.group(2)} is \code{'14'}.
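
The remaining cases described above (several arguments, a group that
did not match, a group that matched repeatedly, and an invalid group
number) can be sketched as follows; the extra patterns are illustrative.

\begin{verbatim}
import re

m = re.match(r"(?P<int>\d+)\.(\d*)", "3.14")
both = m.group(1, 2)                    # ('3', '14')

# A group in a branch that did not match yields None.
opt = re.match(r"(a)|(b)", "b")
branches = opt.group(1, 2)              # (None, 'b')

# A group that matched several times reports its last match.
rep = re.match(r"(?:(\d),?)+", "1,2,3")
last = rep.group(1)                     # '3'

# A group number beyond those defined raises IndexError.
out_of_range = False
try:
    m.group(3)
except IndexError:
    out_of_range = True
\end{verbatim}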
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	743	\end{methoddesc}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	744
Guido van Rossum	6c373f7	1998-06-29 22:48:01 +0000	[diff] [blame]	745	\begin{methoddesc}[MatchObject]{groups}{\optional{default}}
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	746	Return a tuple containing all the subgroups of the match, from 1 up to
Guido van Rossum	6c373f7	1998-06-29 22:48:01 +0000	[diff] [blame]	747	however many groups are in the pattern. The \var{default} argument is
				748	used for groups that did not participate in the match; it defaults to
				749	\code{None}. (Incompatibility note: in the original Python 1.5
				750	release, if the tuple was one element long, a string would be returned
				751	instead. In later versions (from 1.5.1 on), a singleton tuple is
				752	returned in such cases.)
				753	\end{methoddesc}
				754
				755	\begin{methoddesc}[MatchObject]{groupdict}{\optional{default}}
				756	Return a dictionary containing all the \emph{named} subgroups of the
				757	match, keyed by the subgroup name. The \var{default} argument is
				758	used for groups that did not participate in the match; it defaults to
				759	\code{None}.
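
A sketch of \method{groups()} and \method{groupdict()} with and without
a \var{default}; the pattern has an optional fractional part that does
not participate in this match.

\begin{verbatim}
import re

m = re.match(r"(?P<int>\d+)(?:\.(?P<frac>\d+))?", "24")

as_tuple = m.groups()             # ('24', None)
with_default = m.groups('0')      # ('24', '0')

as_dict = m.groupdict()           # {'int': '24', 'frac': None}
dict_default = m.groupdict('0')   # {'int': '24', 'frac': '0'}
\end{verbatim}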
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	760	\end{methoddesc}
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	761
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	762	\begin{methoddesc}[MatchObject]{start}{\optional{group}}
Fred Drake	013ad98	1998-03-08 07:38:27 +0000	[diff] [blame]	763	\funcline{end}{\optional{group}}
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	764	Return the indices of the start and end of the substring
Guido van Rossum	4650392	1998-01-19 23:14:17 +0000	[diff] [blame]	765	matched by \var{group}; \var{group} defaults to zero (meaning the whole
				766	matched substring).
Fred Drake	77a6c9e	2000-09-07 14:00:51 +0000	[diff] [blame]	767	Return \code{-1} if \var{group} exists but
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	768	did not contribute to the match. For a match object
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	769	\var{m}, and a group \var{g} that did contribute to the match, the
				770	substring matched by group \var{g} (equivalent to
				771	\code{\var{m}.group(\var{g})}) is
				772
				773	\begin{verbatim}
				774	m.string[m.start(g):m.end(g)]
				775	\end{verbatim}
				776
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	777	Note that
				778	\code{m.start(\var{group})} will equal \code{m.end(\var{group})} if
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	779	\var{group} matched a null string. For example, after \code{\var{m} =
				780	re.search('b(c?)', 'cba')}, \code{\var{m}.start(0)} is 1,
				781	\code{\var{m}.end(0)} is 2, \code{\var{m}.start(1)} and
				782	\code{\var{m}.end(1)} are both 2, and \code{\var{m}.start(2)} raises
Fred Drake	20e0196	1998-02-19 15:09:35 +0000	[diff] [blame]	783	an \exception{IndexError} exception.
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	784	\end{methoddesc}
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	785
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	786	\begin{methoddesc}[MatchObject]{span}{\optional{group}}
Fred Drake	20e0196	1998-02-19 15:09:35 +0000	[diff] [blame]	787	For \class{MatchObject} \var{m}, return the 2-tuple
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	788	\code{(\var{m}.start(\var{group}), \var{m}.end(\var{group}))}.
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	789	Note that if \var{group} did not contribute to the match, this is
Fred Drake	77a6c9e	2000-09-07 14:00:51 +0000	[diff] [blame]	790	\code{(-1, -1)}. Again, \var{group} defaults to zero.
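
For example, with an illustrative pattern whose second group does not
take part in the match:

\begin{verbatim}
import re

m = re.search(r"b(c?)(d)?", "cba")

whole = m.span()        # (1, 2): the 'b'
empty = m.span(1)       # (2, 2): group 1 matched the empty string
unused = m.span(2)      # (-1, -1): group 2 did not contribute

# The span indexes directly into the searched string.
matched = m.string[m.start():m.end()]    # 'b'
\end{verbatim}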
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	791	\end{methoddesc}
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	792
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	793	\begin{memberdesc}[MatchObject]{pos}
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	794	The value of \var{pos} which was passed to the
Fred Drake	895aa9d	2001-04-18 17:26:20 +0000	[diff] [blame]	795	\method{search()} or \method{match()} method of the \class{RegexObject}. This is the index
Tim Peters	7533587	2001-11-03 19:35:43 +0000	[diff] [blame]	796	into the string at which the RE engine started looking for a match.
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	797	\end{memberdesc}
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	798
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	799	\begin{memberdesc}[MatchObject]{endpos}
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	800	The value of \var{endpos} which was passed to the
Fred Drake	895aa9d	2001-04-18 17:26:20 +0000	[diff] [blame]	801	\function{search()} or \function{match()} function. This is the index
				802	into the string beyond which the RE engine will not go.
Fred Drake	76547c5	1998-04-03 05:59:05 +0000	[diff] [blame]	803	\end{memberdesc}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	804
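The following sketch shows how \member{pos} and \member{endpos} record
the search window. The pattern, the text and the chosen window are
assumptions made for the example; the window is given to the
\method{search()} method of a compiled pattern.

```python
import re

pattern = re.compile('o')
text = 'hello world'

# Search only text[6:9], i.e. 'wor'; the 'o' at index 4 is skipped.
m = pattern.search(text, 6, 9)

print(m.pos)      # 6
print(m.endpos)   # 9
print(m.start())  # 7, an index into the whole string
```

Note that the indices reported by \method{start()} and \method{end()}
refer to the full string, not to the window.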
\begin{memberdesc}[MatchObject]{lastgroup}
The name of the last matched capturing group, or \code{None} if the
group didn't have a name, or if no group was matched at all.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{lastindex}
The integer index of the last matched capturing group, or \code{None}
if no group was matched at all.
\end{memberdesc}

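A minimal sketch of the three cases for \member{lastgroup} and
\member{lastindex}; the patterns, group names and inputs are invented
for illustration.

```python
import re

# Both groups match, and the last one is named.
m1 = re.match(r'(?P<first>\w+) (?P<second>\w+)', 'hello world')

# The optional named group does not take part in the match, so the
# last matched group is the unnamed group 1.
m2 = re.match(r'(\w+) (?P<count>\d+)?', 'abc x')

# The pattern has no capturing groups at all.
m3 = re.match(r'\w+', 'abc')

print(m1.lastindex)   # 2
print(m1.lastgroup)   # second
print(m2.lastindex)   # 1
print(m2.lastgroup)   # None
print(m3.lastindex)   # None
print(m3.lastgroup)   # None
```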
\begin{memberdesc}[MatchObject]{re}
The regular expression object whose \method{match()} or
\method{search()} method produced this \class{MatchObject} instance.
\end{memberdesc}

\begin{memberdesc}[MatchObject]{string}
The string passed to \function{match()} or \function{search()}.
\end{memberdesc}

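The \member{re} and \member{string} members make it possible to
continue working with the original pattern and input when only the
match object is at hand. A small sketch, with a made-up pattern and
text:

```python
import re

pattern = re.compile(r'\d+')
text = 'order 66, aisle 7'
m = pattern.search(text)

# The match object refers back to the pattern that produced it ...
same_pattern = m.re is pattern
# ... and to the string that was searched.
same_text = m.string == text

# Both can be reused, for example to find every number in the input.
numbers = m.re.findall(m.string)
print(numbers)   # ['66', '7']
```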
\subsection{Examples}

\leftline{\strong{Simulating \cfunction{scanf()}}}

Python does not currently have an equivalent to \cfunction{scanf()}.
\ttindex{scanf()}
Regular expressions are generally more powerful, though also more
verbose, than \cfunction{scanf()} format strings. The table below
offers some more-or-less equivalent mappings between
\cfunction{scanf()} format tokens and regular expressions.

\begin{tableii}{l|l}{textrm}{\cfunction{scanf()} Token}{Regular Expression}
  \lineii{\code{\%c}}
         {\regexp{.}}
  \lineii{\code{\%5c}}
         {\regexp{.\{5\}}}
  \lineii{\code{\%d}}
         {\regexp{[-+]?\e d+}}
  \lineii{\code{\%e}, \code{\%E}, \code{\%f}, \code{\%g}}
         {\regexp{[-+]?(\e d+(\e.\e d*)?|\e d*\e.\e d+)([eE][-+]?\e d+)?}}
  \lineii{\code{\%i}}
         {\regexp{[-+]?(0[xX][\e dA-Fa-f]+|0[0-7]*|\e d+)}}
  \lineii{\code{\%o}}
         {\regexp{0[0-7]*}}
  \lineii{\code{\%s}}
         {\regexp{\e S+}}
  \lineii{\code{\%u}}
         {\regexp{\e d+}}
  \lineii{\code{\%x}, \code{\%X}}
         {\regexp{0[xX][\e dA-Fa-f]+}}
\end{tableii}

To extract the filename and numbers from a string like

\begin{verbatim}
/usr/sbin/sendmail - 0 errors, 4 warnings
\end{verbatim}

you would use a \cfunction{scanf()} format like

\begin{verbatim}
%s - %d errors, %d warnings
\end{verbatim}

The equivalent regular expression would be

\begin{verbatim}
(\S+) - (\d+) errors, (\d+) warnings
\end{verbatim}

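Applied from Python, the expression above could be used roughly as
follows. This is a sketch only: the variable names are invented here,
and the conversion to integers stands in for what
\cfunction{scanf()} would do with \code{\%d}.

```python
import re

line = '/usr/sbin/sendmail - 0 errors, 4 warnings'

# A raw string keeps the backslashes intact for the regex engine.
m = re.match(r'(\S+) - (\d+) errors, (\d+) warnings', line)

if m is not None:
    filename = m.group(1)
    # Groups are always returned as strings, so convert explicitly.
    n_errors = int(m.group(2))
    n_warnings = int(m.group(3))
    print(filename)
    print(n_errors + n_warnings)
```

Unlike \cfunction{scanf()}, a failed match is reported by a
\code{None} result, which is why the sketch tests \code{m} before
reading the groups.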
\leftline{\strong{Avoiding backtracking}}

If you create regular expressions that require the engine to perform a lot
of backtracking, you may encounter a \exception{RuntimeError} exception
with the message \code{maximum recursion limit exceeded}. For example,

\begin{verbatim}
>>> s = "<" + "that's a very big string!"*1000 + ">"
>>> re.match('<.*?>', s)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.3/sre.py", line 132, in match
    return _compile(pattern, flags).match(string)
RuntimeError: maximum recursion limit exceeded
\end{verbatim}

You can often restructure your regular expression to avoid backtracking.
The above regular expression can be recast as
\regexp{\textless[\textasciicircum \textgreater]*\textgreater}. As a
further benefit, such regular expressions will run faster than their
backtracking equivalents.
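
The recast expression can be tried on the same input. Whether the
non-greedy form actually fails depends on the Python version and on
the size of the input, so the sketch below only checks that the
negated character class matches the whole string.

```python
import re

s = "<" + "that's a very big string!" * 1000 + ">"

# [^>]* consumes everything up to the first '>' in a single pass,
# so the engine has nothing to retry when it reaches the final '>'.
m = re.match('<[^>]*>', s)

matched_all = m is not None and m.end() == len(s)
print(matched_all)   # True
```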