Blame - Doc/libre.tex - platform/external/python/cpython3

Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	1	\section{Built-in Module \sectcode{re}}
				2	\label{module-re}
				3
				4	\bimodindex{re}
				5
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	6	This module provides regular expression matching operations similar to
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	7	those found in Perl. It's 8-bit clean: both patterns and strings may
				8	contain null bytes and characters whose high bit is set. It is always
				9	available.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	10
				11	Regular expressions use the backslash character (\code{\e}) to
				12	indicate special forms or to allow special characters to be used
				13	without invoking their special meaning. This collides with Python's
				14	usage of the same character for the same purpose in string literals;
				15	for example, to match a literal backslash, one might have to write
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	16	\code{\e\e\e\e} as the pattern string, because the regular expression
				17	must be \code{\e\e}, and each backslash must be expressed as
				18	\code{\e\e} inside a regular Python string literal.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	19
				20	The solution is to use Python's raw string notation for regular
				21	expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So \code{r"\e n"} is a
two-character string containing a backslash and the letter 'n', while
				24	\code{"\e n"} is a one-character string containing a newline. Usually
				25	patterns will be expressed in Python code using this raw string notation.
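
The difference between the two notations can be checked directly. The
following is a minimal sketch, assuming a current Python 3 interpreter
rather than the release documented here; the variable names are
illustrative only.

```python
import re

# A raw string keeps the backslash: two characters, a backslash and 'n'.
raw = r"\n"
# An ordinary string literal turns the escape into a single newline.
cooked = "\n"

raw_len = len(raw)
cooked_len = len(cooked)

# The RE for one literal backslash is two backslashes. It can be written
# as a raw string, or with every backslash doubled in an ordinary literal.
subject = "a\\b"  # the three characters: a, backslash, b
m_raw = re.search(r"\\", subject)
m_plain = re.search("\\\\", subject)
same_span = m_raw.span() == m_plain.span()
```

Both searches find the backslash at the same position; the raw form
simply needs half as many backslashes.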
				26
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	27	\subsection{Regular Expression Syntax}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	28
				29	A regular expression (or RE) specifies a set of strings that matches
				30	it; the functions in this module let you check if a particular string
				31	matches a given regular expression (or if a given regular expression
				32	matches a particular string, which comes down to the same thing).
				33
				34	Regular expressions can be concatenated to form new regular
				35	expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression. If a string \emph{p}
				37	matches A and another string \emph{q} matches B, the string \emph{pq}
				38	will match AB. Thus, complex expressions can easily be constructed
				39	from simpler primitive expressions like the ones described here. For
				40	details of the theory and implementation of regular expressions,
				41	consult the Friedl book referenced below, or almost any textbook about
				42	compiler construction.
				43
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	44	A brief explanation of the format of regular expressions follows.
				45	%For further information and a gentler presentation, consult XXX somewhere.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	46
				47	Regular expressions can contain both special and ordinary characters.
				48	Most ordinary characters, like '\code{A}', '\code{a}', or '\code{0}',
				49	are the simplest regular expressions; they simply match themselves.
				50	You can concatenate ordinary characters, so '\code{last}' matches the
characters 'last'. (In the rest of this section, we'll write REs in
				52	\code{this special font}, usually without quotes, and strings to be
				53	matched 'in single quotes'.)
				54
				55	Some characters, like \code{\|} or \code{(}, are special. Special
				56	characters either stand for classes of ordinary characters, or affect
				57	how the regular expressions around them are interpreted.
				58
				59	The special characters are:
				60	\begin{itemize}
				61	\item[\code{.}] (Dot.) In the default mode, this matches any
				62	character except a newline. If the \code{DOTALL} flag has been
				63	specified, this matches any character including a newline.
				64	\item[\code{\^}] (Caret.) Matches the start of the string, and in
				65	\code{MULTILINE} mode also immediately after each newline.
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	66	\item[\code{\$}] Matches the end of the string, and in
				67	\code{MULTILINE} mode also matches before a newline.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	68	\code{foo} matches both 'foo' and 'foobar', while the regular
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	69	expression \code{foo\$} matches only 'foo'.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	70	%
				71	\item[\code{*}] Causes the resulting RE to
				72	match 0 or more repetitions of the preceding RE, as many repetitions
				73	as are possible. \code{ab*} will
				74	match 'a', 'ab', or 'a' followed by any number of 'b's.
				75	%
				76	\item[\code{+}] Causes the
				77	resulting RE to match 1 or more repetitions of the preceding RE.
				78	\code{ab+} will match 'a' followed by any non-zero number of 'b's; it
				79	will not match just 'a'.
				80	%
				81	\item[\code{?}] Causes the resulting RE to
				82	match 0 or 1 repetitions of the preceding RE. \code{ab?} will
				83	match either 'a' or 'ab'.
\item[\code{*?}, \code{+?}, \code{??}] The \code{*}, \code{+}, and
\code{?} qualifiers are all \dfn{greedy}; they match as much text as
				86	possible. Sometimes this behaviour isn't desired; if the RE
				87	\code{<.*>} is matched against \code{<H1>title</H1>}, it will match the
				88	entire string, and not just \code{<H1>}.
				89	Adding \code{?} after the qualifier makes it perform the match in
				90	\dfn{non-greedy} or \dfn{minimal} fashion; as few characters as
				91	possible will be matched. Using \code{.*?} in the previous
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	92	expression will match only \code{<H1>}.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	93	%
Guido van Rossum	0148bbf	1997-12-22 22:41:40 +0000	[diff] [blame]	94	\item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from
				95	\var{m} to \var{n} repetitions of the preceding RE, attempting to
				96	match as many repetitions as possible. For example, \code{a\{3,5\}}
				97	will match from 3 to 5 'a' characters.
				98	%
				99	\item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to
				100	match from \var{m} to \var{n} repetitions of the preceding RE,
				101	attempting to match as \emph{few} repetitions as possible. This is
				102	the non-greedy version of the previous qualifier. For example, on the
				103	6-character string 'aaaaaa', \code{a\{3,5\}} will match 5 'a'
				104	characters, while \code{a\{3,5\}?} will only match 3 characters.
				105	%
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	106	\item[\code{\e}] Either escapes special characters (permitting you to match
				107	characters like '*?+\&\$'), or signals a special sequence; special
				108	sequences are discussed below.
				109
				110	If you're not using a raw string to
				111	express the pattern, remember that Python also uses the
				112	backslash as an escape sequence in string literals; if the escape
				113	sequence isn't recognized by Python's parser, the backslash and
				114	subsequent character are included in the resulting string. However,
				115	if Python would recognize the resulting sequence, the backslash should
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	116	be repeated twice. This is complicated and hard to understand, so
				117	it's highly recommended that you use raw strings for all but the
				118	simplest expressions.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	119	%
				120	\item[\code{[]}] Used to indicate a set of characters. Characters can
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	121	be listed individually, or a range of characters can be indicated by
				122	giving two characters and separating them by a '-'. Special
				123	characters are not active inside sets. For example, \code{[akm\$]}
				124	will match any of the characters 'a', 'k', 'm', or '\$'; \code{[a-z]}
				125	will match any lowercase letter and \code{[a-zA-Z0-9]} matches any
letter or digit. Character classes such as \code{\e w} or
\code{\e S} (defined below) are also acceptable inside a set. If you want to
				128	include a \code{]} or a \code{-} inside a set, precede it with a
				129	backslash.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	130
				131	Characters \emph{not} within a range can be matched by including a
				132	\code{\^} as the first character of the set; \code{\^} elsewhere will
				133	simply match the '\code{\^}' character.
				134	%
				135	\item[\code{\|}]\code{A\|B}, where A and B can be arbitrary REs,
				136	creates a regular expression that will match either A or B. This can
Guido van Rossum	eb0f066	1997-12-30 20:38:16 +0000	[diff] [blame]	137	be used inside groups (see below) as well. To match a literal '\code{\|}',
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	138	use \code{\e\|}, or enclose it inside a character class, like \code{[\|]}.
				139	%
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	140	\item[\code{(...)}] Matches whatever regular expression is inside the
				141	parentheses, and indicates the start and end of a group; the contents
				142	of a group can be retrieved after a match has been performed, and can
				143	be matched later in the string with the \code{\e \var{number}} special
				144	sequence, described below. To match the literals '(' or ')',
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	145	use \code{\e(} or \code{\e)}, or enclose them inside a character
				146	class: \code{[(] [)]}.
				147	%
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	148	\item[\code{(?...)}] This is an extension notation (a '?' following a
				149	'(' is not meaningful otherwise). The first character after the '?'
				150	determines what the meaning and further syntax of the construct is.
				151	Following are the currently supported extensions.
				152	%
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	153	\item[\code{(?iLmsx)}] (One or more letters from the set '\code{i}',
				154	'\code{L}', '\code{m}', '\code{s}', '\code{x}'.) The group matches
				155	the empty string; the letters set the corresponding flags
				156	(\code{re.I}, \code{re.L}, \code{re.M}, \code{re.S}, \code{re.X}) for
the entire regular expression. This is useful if you wish to include the
				158	flags as part of the regular expression, instead of passing a
				159	\var{flag} argument to the \code{compile()} function.
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	160	%
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	161	\item[\code{(?:...)}] A non-grouping version of regular parentheses.
				162	Matches whatever's inside the parentheses, but the text matched by the
				163	group \emph{cannot} be retrieved after performing a match or
				164	referenced later in the pattern.
				165	%
				166	\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
				167	the text matched by the group is accessible via the symbolic group
				168	name \var{name}. Group names must be valid Python identifiers. A
				169	symbolic group is also a numbered group, just as if the group were not
named. So the group named 'id' in the example below can also be
				171	referenced as the numbered group 1.
				172
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	173	For example, if the pattern is
				174	\code{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	175	name in arguments to methods of match objects, such as \code{m.group('id')}
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	176	or \code{m.end('id')}, and also by name in pattern text
				177	(e.g. \code{(?P=id)}) and replacement text (e.g. \code{\e g<id>}).
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	178	%
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	179	\item[\code{(?P=\var{name})}] Matches whatever text was matched by the
				180	earlier group named \var{name}.
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	181	%
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	182	\item[\code{(?\#...)}] A comment; the contents of the parentheses are
				183	simply ignored.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	184	%
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	185	\item[\code{(?=...)}] Matches if \code{...} matches next, but doesn't
				186	consume any of the string. This is called a lookahead assertion. For
				187	example, \code{Isaac (?=Asimov)} will match 'Isaac~' only if it's
				188	followed by 'Asimov'.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	189	%
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	190	\item[\code{(?!...)}] Matches if \code{...} doesn't match next. This
				191	is a negative lookahead assertion. For example,
				192	\code{Isaac (?!Asimov)} will match 'Isaac~' only if it's \emph{not}
				193	followed by 'Asimov'.
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	194
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	195	\end{itemize}
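
Several of the special characters described above can be tried out
with the patterns and strings already used as examples in the list.
This is a minimal sketch, assuming a current Python 3 interpreter; the
variable names and the subject string \code{"spam_1 = 2"} are
illustrative and do not come from the text above.

```python
import re

# Greedy versus non-greedy repetition.
html = "<H1>title</H1>"
greedy = re.match(r"<.*>", html).group(0)
minimal = re.match(r"<.*?>", html).group(0)

# Bounded repetition, greedy and non-greedy.
most = re.match(r"a{3,5}", "aaaaaa").group(0)
fewest = re.match(r"a{3,5}?", "aaaaaa").group(0)

# A named group is also reachable by its number.
m = re.match(r"(?P<id>[a-zA-Z_]\w*)", "spam_1 = 2")
by_name = m.group("id")
by_number = m.group(1)

# Lookahead assertions test what follows without consuming it.
ahead = re.match(r"Isaac (?=Asimov)", "Isaac Asimov")
not_ahead = re.match(r"Isaac (?!Asimov)", "Isaac Asimov")
```

Note that the positive lookahead match covers only \code{'Isaac~'}; the
text 'Asimov' is checked but is not part of the match.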
				196
				197	The special sequences consist of '\code{\e}' and a character from the
				198	list below. If the ordinary character is not on the list, then the
				199	resulting RE will match the second character. For example,
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	200	\code{\e\$} matches the character '\$'.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	201
				202	\begin{itemize}
				203
				204	%
				205	\item[\code{\e \var{number}}] Matches the contents of the group of the
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	206	same number. Groups are numbered starting from 1. For example,
				207	\code{(.+) \e 1} matches 'the the' or '55 55', but not 'the end' (note
				208	the space after the group). This special sequence can only be used to
				209	match one of the first 99 groups. If the first digit of \var{number}
				210	is 0, or \var{number} is 3 octal digits long, it will not be interpreted
				211	as a group match, but as the character with octal value \var{number}.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	212	%
				213	\item[\code{\e A}] Matches only at the start of the string.
				214	%
				215	\item[\code{\e b}] Matches the empty string, but only at the
				216	beginning or end of a word. A word is defined as a sequence of
				217	alphanumeric characters, so the end of a word is indicated by
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	218	whitespace or a non-alphanumeric character. Inside a character range,
				219	\code{\e b} represents the backspace character, for compatibility with
				220	Python's string literals.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	221	%
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	222	\item[\code{\e B}] Matches the empty string, but only when it is
				223	\emph{not} at the beginning or end of a word.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	224	%
				225	\item[\code{\e d}]Matches any decimal digit; this is
				226	equivalent to the set \code{[0-9]}.
				227	%
				228	\item[\code{\e D}]Matches any non-digit character; this is
Fred Drake	c458638	1998-01-06 15:46:21 +0000	[diff] [blame]	229	equivalent to the set \code{[{\^}0-9]}.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	230	%
				231	\item[\code{\e s}]Matches any whitespace character; this is
				232	equivalent to the set \code{[ \e t\e n\e r\e f\e v]}.
				233	%
				234	\item[\code{\e S}]Matches any non-whitespace character; this is
Fred Drake	78f8e98	1997-12-29 21:39:39 +0000	[diff] [blame]	235	equivalent to the set \code{[\^ \e t\e n\e r\e f\e v]}.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	236	%
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	237	\item[\code{\e w}]When the \code{LOCALE} flag is not specified,
				238	matches any alphanumeric character; this is equivalent to the set
				239	\code{[a-zA-Z0-9_]}. With \code{LOCALE}, it will match the set
				240	\code{[0-9_]} plus whatever characters are defined as letters for the
				241	current locale.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	242	%
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	243	\item[\code{\e W}]When the \code{LOCALE} flag is not specified,
				244	matches any non-alphanumeric character; this is equivalent to the set
				245	\code{[{\^}a-zA-Z0-9_]}. With \code{LOCALE}, it will match any
				246	character not in the set \code{[0-9_]}, and not defined as a letter
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	247	for the current locale.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	248
				249	\item[\code{\e Z}]Matches only at the end of the string.
				250	%
				251
				252	\item[\code{\e \e}] Matches a literal backslash.
				253
				254	\end{itemize}
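
The special sequences can be exercised in the same way. This is a
minimal sketch, assuming a current Python 3 interpreter and plain ASCII
input; the subject strings other than 'the the' and 'the end' are
illustrative.

```python
import re

# A backreference must repeat exactly what the group matched.
# match() is anchored at the start of the string.
doubled = re.match(r"(.+) \1", "the the")
not_doubled = re.match(r"(.+) \1", "the end")

# \b restricts the match to whole words.
whole_words = re.sub(r"\bcat\b", "dog", "cat concat cat.")

# \d matches a single decimal digit.
masked = re.sub(r"\d", "#", "room 101")

# \s matches any whitespace character.
fields = re.split(r"\s+", "a b\tc\nd")

# \w matches letters, digits, and the underscore.
identifier = re.match(r"\w+", "spam_1!").group(0)
```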
				255
				256	\subsection{Module Contents}
Fred Drake	78f8e98	1997-12-29 21:39:39 +0000	[diff] [blame]	257	\nodename{Contents of Module re}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	258
				259	The module defines the following functions and constants, and an exception:
				260
				261	\renewcommand{\indexsubitem}{(in module re)}
				262
				263	\begin{funcdesc}{compile}{pattern\optional{\, flags}}
				264	Compile a regular expression pattern into a regular expression
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	265	object, which can be used for matching using its \code{match()} and
				266	\code{search()} methods, described below.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	267
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	268	The expression's behaviour can be modified by specifying a
				269	\var{flags} value. Values can be any of the following variables,
				270	combined using bitwise OR (the \code{\|} operator).
				271
Fred Drake	78f8e98	1997-12-29 21:39:39 +0000	[diff] [blame]	272	\begin{description}
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	273
Fred Drake	78f8e98	1997-12-29 21:39:39 +0000	[diff] [blame]	274	% The use of \quad in the item labels is ugly but adds enough space
				275	% to the label that it doesn't get visually run-in with the text.
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	276
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	277	\item[\code{I} or \code{IGNORECASE} or \code{(?i)}\quad]
Fred Drake	78f8e98	1997-12-29 21:39:39 +0000	[diff] [blame]	278
				279	Perform case-insensitive matching; expressions like \code{[A-Z]} will match
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	280	lowercase letters, too. This is not affected by the current locale.
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	281
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	282	\item[\code{L} or \code{LOCALE} or \code{(?L)}\quad]
Fred Drake	78f8e98	1997-12-29 21:39:39 +0000	[diff] [blame]	283
Make \code{\e w}, \code{\e W}, \code{\e b}, and
\code{\e B} dependent on the current locale.
Guido van Rossum	a42c178	1997-12-09 20:41:47 +0000	[diff] [blame]	286
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	287	\item[\code{M} or \code{MULTILINE} or \code{(?m)}\quad]
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	288
Fred Drake	78f8e98	1997-12-29 21:39:39 +0000	[diff] [blame]	289	When specified, the pattern character \code{\^} matches at the
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	290	beginning of the string and at the beginning of each line
				291	(immediately following each newline); and the pattern character
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	292	\code{\$} matches at the end of the string and at the end of each line
				293	(immediately preceding each newline).
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	294	By default, \code{\^} matches only at the beginning of the string, and
				295	\code{\$} only at the end of the string and immediately before the
				296	newline (if any) at the end of the string.
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	297
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	298	\item[\code{S} or \code{DOTALL} or \code{(?s)}\quad]
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	299
Make the \code{.} special character match any character at all, including a
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	301	newline; without this flag, \code{.} will match anything \emph{except}
Fred Drake	78f8e98	1997-12-29 21:39:39 +0000	[diff] [blame]	302	a newline.
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	303
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	304	\item[\code{X} or \code{VERBOSE} or \code{(?x)}\quad]
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	305
Fred Drake	78f8e98	1997-12-29 21:39:39 +0000	[diff] [blame]	306	Ignore whitespace within the pattern
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	307	except when in a character class or preceded by an unescaped
backslash, and, when a line contains a \code{\#} neither in a character
class nor preceded by an unescaped backslash, all characters from the
Fred Drake	78f8e98	1997-12-29 21:39:39 +0000	[diff] [blame]	310	leftmost such \code{\#} through the end of the line are ignored.
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	311
Fred Drake	78f8e98	1997-12-29 21:39:39 +0000	[diff] [blame]	312	\end{description}
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	313
Fred Drake	78f8e98	1997-12-29 21:39:39 +0000	[diff] [blame]	314	The sequence
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	315	%
				316	\bcode\begin{verbatim}
				317	prog = re.compile(pat)
				318	result = prog.match(str)
				319	\end{verbatim}\ecode
				320	%
				321	is equivalent to
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	322
				323	\begin{verbatim}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	324	result = re.match(pat, str)
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	325	\end{verbatim}
				326
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	327	but the version using \code{compile()} is more efficient when the
				328	expression will be used several times in a single program.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	329	%(The compiled version of the last pattern passed to \code{regex.match()} or
				330	%\code{regex.search()} is cached, so programs that use only a single
				331	%regular expression at a time needn't worry about compiling regular
				332	%expressions.)
				333	\end{funcdesc}
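
Flags can be passed to \code{compile()} or embedded in the pattern.
This is a minimal sketch, assuming a current Python 3 interpreter; the
patterns and subject strings are illustrative.

```python
import re

# VERBOSE allows whitespace and comments inside the pattern.
number = re.compile(r"""
    \d+        # integer part
    (\.\d*)?   # optional fractional part
    """, re.VERBOSE)
number_text = number.match("3.14 rest").group(0)

# Flags are combined with bitwise OR.
text = "eggs\nSPAM\nham"
word = re.compile(r"^spam$", re.IGNORECASE | re.MULTILINE)
with_flags = word.search(text).group(0)

# The same flags written as an embedded modifier at the start of the pattern.
inline = re.search(r"(?im)^spam$", text).group(0)
```

The embedded form is convenient when the pattern is passed to a
function that takes no \var{flags} argument.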
				334
				335	\begin{funcdesc}{escape}{string}
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	336	Return \var{string} with all non-alphanumerics backslashed; this is
				337	useful if you want to match an arbitrary literal string that may have
				338	regular expression metacharacters in it.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	339	\end{funcdesc}
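
A short illustration of why escaping matters, written as a minimal
sketch for a current Python 3 interpreter; the literal \code{"a.b*c"}
and the subject strings are illustrative.

```python
import re

literal = "a.b*c"

# Used directly as a pattern, '.' and '*' keep their special meaning,
# so an unrelated string matches.
unescaped = re.match(literal, "aXbbc")

# After escaping, only the literal text itself matches.
pattern = re.escape(literal)
exact = re.match(pattern, "a.b*c")
rejected = re.match(pattern, "aXbbc")
```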
				340
				341	\begin{funcdesc}{match}{pattern\, string\optional{\, flags}}
				342	If zero or more characters at the beginning of \var{string} match
				343	the regular expression \var{pattern}, return a corresponding
Guido van Rossum	0148bbf	1997-12-22 22:41:40 +0000	[diff] [blame]	344	\code{MatchObject} instance. Return \code{None} if the string does not
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	345	match the pattern; note that this is different from a zero-length
				346	match.
				347	\end{funcdesc}
				348
				349	\begin{funcdesc}{search}{pattern\, string\optional{\, flags}}
				350	Scan through \var{string} looking for a location where the regular
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	351	expression \var{pattern} produces a match, and return a
				352	corresponding \code{MatchObject} instance.
Guido van Rossum	0148bbf	1997-12-22 22:41:40 +0000	[diff] [blame]	353	Return \code{None} if no
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	354	position in the string matches the pattern; note that this is
				355	different from finding a zero-length match at some point in the string.
				356	\end{funcdesc}
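
The difference between \code{match()} and \code{search()}, and between
\code{None} and a zero-length match, can be seen in this minimal
sketch. It assumes a current Python 3 interpreter; the subject string
is illustrative.

```python
import re

text = "say hello"

# match() only looks at the beginning of the string.
at_start = re.match(r"hello", text)

# search() scans forward until it finds a match.
anywhere = re.search(r"hello", text)
found_at = anywhere.start()

# A pattern that can match nothing still succeeds: the result is a
# match object for the empty string, not None.
empty = re.match(r"x*", text)
empty_text = empty.group(0)
```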
				357
\begin{funcdesc}{split}{pattern\, string\optional{\, maxsplit=0}}
				359	Split \var{string} by the occurrences of \var{pattern}. If
				360	capturing parentheses are used in pattern, then occurrences of
				361	patterns or subpatterns are also returned.
Guido van Rossum	9754639	1998-01-12 18:58:53 +0000	[diff] [blame]	362	If \var{maxsplit} is nonzero, at most \var{maxsplit} splits
				363	occur, and the remainder of the string is returned as the final
				364	element of the list. (Incompatibility note: in the original Python
				365	1.5 release, \var{maxsplit} was ignored. This has been fixed in
				366	later releases.)
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	367	%
				368	\bcode\begin{verbatim}
				369	>>> re.split('[\W]+', 'Words, words, words.')
				370	['Words', 'words', 'words', '']
				371	>>> re.split('([\W]+)', 'Words, words, words.')
				372	['Words', ', ', 'words', ', ', 'words', '.', '']
Guido van Rossum	9754639	1998-01-12 18:58:53 +0000	[diff] [blame]	373	>>> re.split('[\W]+', 'Words, words, words.', 1)
				374	['Words', 'words, words.']
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	375	\end{verbatim}\ecode
				376	%
				377	This function combines and extends the functionality of
Guido van Rossum	9754639	1998-01-12 18:58:53 +0000	[diff] [blame]	378	the old \code{regsub.split()} and \code{regsub.splitx()}.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	379	\end{funcdesc}
				380
				381	\begin{funcdesc}{sub}{pattern\, repl\, string\optional{, count=0}}
				382	Return the string obtained by replacing the leftmost non-overlapping
				383	occurrences of \var{pattern} in \var{string} by the replacement
Barry Warsaw	4552f3d	1997-11-20 00:15:13 +0000	[diff] [blame]	384	\var{repl}. If the pattern isn't found, \var{string} is returned
				385	unchanged. \var{repl} can be a string or a function; if a function,
it is called for every non-overlapping occurrence of \var{pattern}.
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	387	The function takes a single match object argument, and returns the
				388	replacement string. For example:
Barry Warsaw	4552f3d	1997-11-20 00:15:13 +0000	[diff] [blame]	389	%
				390	\bcode\begin{verbatim}
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
				396	\end{verbatim}\ecode
				397	%
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	398	The pattern may be a string or a
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	399	regex object; if you need to specify
				400	regular expression flags, you must use a regex object, or use
				401	embedded modifiers in a pattern; e.g.
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	402
				403	\begin{verbatim}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	404	sub("(?i)b+", "x", "bbbb BBBB") returns 'x x'.
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	405	\end{verbatim}
				406
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	407	The optional argument \var{count} is the maximum number of pattern
				408	occurrences to be replaced; count must be a non-negative integer, and
				409	the default value of 0 means to replace all occurrences.
				410
				411	Empty matches for the pattern are replaced only when not adjacent to a
				412	previous match, so \code{sub('x*', '-', 'abc')} returns '-a-b-c-'.
				413	\end{funcdesc}
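
Replacement strings may refer to groups by name, and \var{count}
limits the number of replacements. This is a minimal sketch, assuming
a current Python 3 interpreter, where \var{count} is best passed by
keyword; the group names and subject strings are illustrative.

```python
import re

# Refer to named groups in the replacement text with \g<name>.
swapped = re.sub(r"(?P<first>\w+) (?P<second>\w+)",
                 r"\g<second> \g<first>",
                 "hello world")

# Replace only the first two occurrences.
limited = re.sub(r"-", "+", "a-b-c-d", count=2)

# Empty matches are replaced when not adjacent to a previous match.
dashed = re.sub("x*", "-", "abc")
```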
				414
				415	\begin{funcdesc}{subn}{pattern\, repl\, string\optional{, count=0}}
				416	Perform the same operation as \code{sub()}, but return a tuple
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	417	\code{(\var{new_string}, \var{number_of_subs_made})}.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	418	\end{funcdesc}
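
For instance, in this minimal sketch (assuming a current Python 3
interpreter, with an illustrative subject string), the second element
of the tuple reports how many replacements were made:

```python
import re

# subn() returns the new string together with the substitution count.
new_string, number_of_subs_made = re.subn(r"\d+", "#", "10 apples, 20 pears")
```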
				419
				420	\begin{excdesc}{error}
				421	Exception raised when a string passed to one of the functions here
				422	is not a valid regular expression (e.g., unmatched parentheses) or
				423	when some other error occurs during compilation or matching. (It is
				424	never an error if a string contains no match for a pattern.)
				425	\end{excdesc}
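
A program that accepts patterns from outside can catch this exception
to reject invalid ones. This is a minimal sketch, assuming a current
Python 3 interpreter; the helper \code{is\_valid()} is hypothetical and
not part of the module.

```python
import re

def is_valid(pattern):
    """Return True if pattern compiles, False if re.error is raised."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

unbalanced = is_valid("(abc")   # unmatched parenthesis
balanced = is_valid("(abc)")

# Failing to find a match is not an error; the result is simply None.
no_match = re.search("z", "abc")
```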
				426
				427	\subsection{Regular Expression Objects}
				428	Compiled regular expression objects support the following methods and
				429	attributes:
				430
Guido van Rossum	eb53ae4	1997-10-05 18:54:07 +0000	[diff] [blame]	431	\renewcommand{\indexsubitem}{(re method)}
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	432	\begin{funcdesc}{match}{string\optional{\, pos}\optional{\, endpos}}
Guido van Rossum	eb53ae4	1997-10-05 18:54:07 +0000	[diff] [blame]	433	If zero or more characters at the beginning of \var{string} match
				434	this regular expression, return a corresponding
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	435	\code{MatchObject} instance. Return \code{None} if the string does not
Guido van Rossum	eb53ae4	1997-10-05 18:54:07 +0000	[diff] [blame]	436	match the pattern; note that this is different from a zero-length
				437	match.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	438
				439	The optional second parameter \var{pos} gives an index in the string
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	440	where the search is to start; it defaults to \code{0}. The
				441	\code{'\^'} pattern character will match at the index where the
				442	search is to start.
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	443
				444	The optional parameter \var{endpos} limits how far the string will
				445	be searched; it will be as if the string is \var{endpos} characters
				446	long, so only the characters from \var{pos} to \var{endpos} will be
				447	searched for a match.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	448	\end{funcdesc}
				449
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	450	\begin{funcdesc}{search}{string\optional{\, pos}\optional{\, endpos}}
Guido van Rossum	eb53ae4	1997-10-05 18:54:07 +0000	[diff] [blame]	451	Scan through \var{string} looking for a location where this regular
				452	expression produces a match. Return \code{None} if no
				453	position in the string matches the pattern; note that this is
				454	different from finding a zero-length match at some point in the string.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	455
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	456	The optional \var{pos} and \var{endpos} parameters have the same
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	457	meaning as for the \code{match()} method.
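
A short sketch of the scanning behaviour, using a made-up pattern and
string; it assumes nothing beyond the parameters described above:

\begin{verbatim}
import re

b = re.compile(r"b")

first = b.search("abcabc")         # finds the 'b' at index 1
later = b.search("abcabc", 2)      # scanning starts at index 2; finds index 4
absent = b.search("abcabc", 2, 4)  # only indices 2 and 3 ('c', 'a') are scanned
empty = re.compile(r"x*").search("abc")  # zero-length match at index 0

print(first.start(), later.start(), absent, repr(empty.group()))
\end{verbatim}

Note that \code{absent} is \code{None}, while \code{empty} is a match
object whose matched text happens to be the empty string.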
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	458	\end{funcdesc}
				459
				460	\begin{funcdesc}{split}{string\optional{, maxsplit=0}}
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	461	Identical to the \code{split()} function, using the compiled pattern.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	462	\end{funcdesc}
				463
				464	\begin{funcdesc}{sub}{repl\, string\optional{, count=0}}
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	465	Identical to the \code{sub()} function, using the compiled pattern.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	466	\end{funcdesc}
				467
				468	\begin{funcdesc}{subn}{repl\, string\optional{, count=0}}
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	469	Identical to the \code{subn()} function, using the compiled pattern.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	470	\end{funcdesc}
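
The three methods above might be used as follows. This is a sketch
only; the separator pattern and the sample text are invented for
illustration:

\begin{verbatim}
import re

sep = re.compile(r",\s*")
text = "a, b,c"

parts = sep.split(text)                 # ['a', 'b', 'c']
head = sep.split(text, maxsplit=1)      # ['a', 'b,c']
joined = sep.sub("-", text)             # 'a-b-c'
counted = sep.subn("-", text, count=1)  # ('a-b,c', 1)
\end{verbatim}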
				471
				472	\renewcommand{\indexsubitem}{(regex attribute)}
				473
				474	\begin{datadesc}{flags}
				475	The flags argument used when the regex object was compiled, or 0 if no
				476	flags were provided.
				477	\end{datadesc}
				478
				479	\begin{datadesc}{groupindex}
				480	A dictionary mapping any symbolic group names (defined by
				481	\code{(?P<\var{id}>...)}) to group numbers. The dictionary is empty if no
				482	symbolic groups were used in the pattern.
				483	\end{datadesc}
				484
				485	\begin{datadesc}{pattern}
				486	The pattern string from which the regex object was compiled.
				487	\end{datadesc}
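
A brief sketch of inspecting these attributes; the pattern is
illustrative. As a precaution against other bits being set in
\code{flags}, the sketch tests a single flag bit instead of comparing
the whole value:

\begin{verbatim}
import re

source = r"(?P<int>\d+)\.(?P<frac>\d*)"
number = re.compile(source, re.IGNORECASE)

same_source = number.pattern == source            # True
ignore_case = bool(number.flags & re.IGNORECASE)  # True
names = dict(number.groupindex)                   # {'int': 1, 'frac': 2}
no_names = dict(re.compile(r"(\d+)").groupindex)  # {}
\end{verbatim}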
				488
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	489	\subsection{Match Objects}
				490
				491	\code{MatchObject} instances support the following methods and attributes:
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	492
Guido van Rossum	4650392	1998-01-19 23:14:17 +0000	[diff] [blame]	493	\begin{funcdesc}{group}{\optional{group1, group2, ...}}
				494	Return one or more subgroups of the match. If there is a single
				495	argument, the result is a single string; if there are
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	496	multiple arguments, the result is a tuple with one item per argument.
Guido van Rossum	4650392	1998-01-19 23:14:17 +0000	[diff] [blame]	497	Without arguments, \var{group1} defaults to zero (i.e. the whole match
				498	is returned).
				499	If a \var{groupN} argument is zero, the corresponding return value is the
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	500	entire matching string; if it is in the inclusive range [1..99], it is
				501	the string matching the corresponding parenthesized group. If no
				502	such group exists, the corresponding result is
				503	\code{None}.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	504
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	505	If the regular expression uses the \code{(?P<\var{name}>...)} syntax,
Guido van Rossum	4650392	1998-01-19 23:14:17 +0000	[diff] [blame]	506	the \var{groupN} arguments may also be strings identifying groups by
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	507	their group name.
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	508
				509	A moderately complicated example:
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	510
				511	\begin{verbatim}
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	512	m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	513	\end{verbatim}
				514
				515	After performing this match, \code{m.group(1)} is \code{'3'}, as is
Guido van Rossum	4650392	1998-01-19 23:14:17 +0000	[diff] [blame]	516	\code{m.group('int')}, and \code{m.group(2)} is \code{'14'}.
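
Continuing with the same pattern, a short sketch of the zero-argument
and multiple-argument forms:

\begin{verbatim}
import re

m = re.match(r"(?P<int>\d+)\.(\d*)", "3.14")

whole = m.group()          # '3.14'; same as m.group(0)
pair = m.group(1, 2)       # ('3', '14')
mixed = m.group("int", 0)  # ('3', '3.14')
\end{verbatim}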
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	517	\end{funcdesc}
				518
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	519	\begin{funcdesc}{groups}{}
				520	Return a tuple containing all the subgroups of the match, from 1 up to
				521	however many groups are in the pattern. Groups that did not
Guido van Rossum	9754639	1998-01-12 18:58:53 +0000	[diff] [blame]	522	participate in the match have values of \code{None}. (Incompatibility
				523	note: in the original Python 1.5 release, if the tuple was one element
				524	long, a string would be returned instead. In later versions, a
				525	singleton tuple is returned in such cases.)
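
A small sketch with an illustrative pattern, showing a group that did
not participate and the singleton tuple returned by versions after the
original 1.5 release:

\begin{verbatim}
import re

version = re.compile(r"(\d+)(\.\d+)?")

both = version.match("1.5").groups()        # ('1', '.5')
partial = version.match("42").groups()      # ('42', None)
single = re.match(r"(\d+)", "42").groups()  # ('42',), a one-element tuple
\end{verbatim}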
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	526	\end{funcdesc}
				527
Guido van Rossum	4650392	1998-01-19 23:14:17 +0000	[diff] [blame]	528	\begin{funcdesc}{start}{\optional{group}}
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	529	\end{funcdesc}
				530
Guido van Rossum	4650392	1998-01-19 23:14:17 +0000	[diff] [blame]	531	\begin{funcdesc}{end}{\optional{group}}
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	532	Return the indices of the start and end of the substring
Guido van Rossum	4650392	1998-01-19 23:14:17 +0000	[diff] [blame]	533	matched by \var{group}; \var{group} defaults to zero (meaning the whole
				534	matched substring).
				535	Return \code{None} if \var{group} exists but
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	536	did not contribute to the match. For a match object
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	537	\var{m}, and a group \var{g} that did contribute to the match, the
				538	substring matched by group \var{g} (equivalent to
				539	\code{\var{m}.group(\var{g})}) is
				540
				541	\begin{verbatim}
				542	m.string[m.start(g):m.end(g)]
				543	\end{verbatim}
				544
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	545	Note that
				546	\code{m.start(\var{group})} will equal \code{m.end(\var{group})} if
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	547	\var{group} matched a null string. For example, after \code{\var{m} =
				548	re.search('b(c?)', 'cba')}, \code{\var{m}.start(0)} is 1,
				549	\code{\var{m}.end(0)} is 2, \code{\var{m}.start(1)} and
				550	\code{\var{m}.end(1)} are both 2, and \code{\var{m}.start(2)} raises
				551	an \code{IndexError} exception.
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	552
				553	\end{funcdesc}
				554
Guido van Rossum	4650392	1998-01-19 23:14:17 +0000	[diff] [blame]	555	\begin{funcdesc}{span}{\optional{group}}
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	556	For \code{MatchObject} \var{m}, return the 2-tuple
				557	\code{(\var{m}.start(\var{group}), \var{m}.end(\var{group}))}.
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	558	Note that if \var{group} did not contribute to the match, this is
Guido van Rossum	4650392	1998-01-19 23:14:17 +0000	[diff] [blame]	559	\code{(None, None)}. Again, \var{group} defaults to zero.
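
For instance, reusing the \code{'b(c?)'} example from the description
of \code{start()} and \code{end()}, a sketch:

\begin{verbatim}
import re

m = re.search("b(c?)", "cba")

whole = m.span()             # (1, 2)
inner = m.span(1)            # (2, 2): group 1 matched the null string
start, end = m.span()
piece = m.string[start:end]  # 'b', the same as m.group()
\end{verbatim}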
Guido van Rossum	e4eb223	1997-12-17 00:23:39 +0000	[diff] [blame]	560	\end{funcdesc}
				561
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	562	\begin{datadesc}{pos}
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	563	The value of \var{pos} which was passed to the
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	564	\code{search()} or \code{match()} function. This is the index into
				565	the string at which the regex engine started looking for a match.
Guido van Rossum	0b33410	1997-12-08 17:33:40 +0000	[diff] [blame]	566	\end{datadesc}
				567
				568	\begin{datadesc}{endpos}
				569	The value of \var{endpos} which was passed to the
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	570	\code{search()} or \code{match()} function. This is the index into
				571	the string beyond which the regex engine will not go.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	572	\end{datadesc}
				573
				574	\begin{datadesc}{re}
Guido van Rossum	48d0437	1997-12-11 20:19:08 +0000	[diff] [blame]	575	The regular expression object whose \code{match()} or \code{search()} method
				576	produced this \code{MatchObject} instance.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	577	\end{datadesc}
				578
				579	\begin{datadesc}{string}
				580	The string passed to \code{match()} or \code{search()}.
				581	\end{datadesc}
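
The four attributes above can be read back from a match object as in
the following sketch (pattern and string are illustrative):

\begin{verbatim}
import re

b = re.compile(r"b")
text = "abcabc"
m = b.search(text, 2, 6)

window = (m.pos, m.endpos)    # (2, 6)
same_regex = m.re is b        # True
same_text = m.string == text  # True
\end{verbatim}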
				582
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	583	\begin{seealso}
Fred Drake	f995181	1997-12-29 16:37:04 +0000	[diff] [blame]	584	\seetext{Jeffrey Friedl, \emph{Mastering Regular Expressions},
Fred Drake	023f87f	1998-01-12 19:16:24 +0000	[diff] [blame]	585	O'Reilly. The Python material in this book dates from before the
				586	\code{re} module, but it covers writing good regular expression
				587	patterns in great detail.}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	588	\end{seealso}