Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	1	\section{Built-in Module \sectcode{re}}
				2	\label{module-re}
				3
				4	\bimodindex{re}
				5
				6	% XXX Remove before 1.5final release.
				7	{\large\bf The \code{re} module is still in the process of being
				8	developed, and more features will be added in future 1.5 alphas and
				9	betas. This documentation is also preliminary and incomplete. If you
				10	find a bug or documentation error, or just find something unclear,
				11	please send a message to
				12	\code{string-sig@python.org}, and we'll fix it.}
				13
				14	This module provides regular expression matching operations similar to
				15	those found in Perl. It's 8-bit
				16	clean: both patterns and strings may contain null bytes and characters
				17	whose high bit is set. It is always available.
				18
				19	Regular expressions use the backslash character (\code{\e}) to
				20	indicate special forms or to allow special characters to be used
				21	without invoking their special meaning. This collides with Python's
				22	usage of the same character for the same purpose in string literals;
				23	for example, to match a literal backslash, one might have to write
				24	\code{\e\e\e\e} as the pattern string, because the regular expression must be \code{\e\e}, and each backslash must be expressed as \code{\e\e} inside a regular Python string literal.
				25
				26	The solution is to use Python's raw string notation for regular
				27	expression patterns; backslashes are not handled in any special way in
				28	a string literal prefixed with 'r'. So \code{r"\e n"} is a two
				29	character string containing a backslash and the letter 'n', while
				30	\code{"\e n"} is a one-character string containing a newline. Usually
				31	patterns will be expressed in Python code using this raw string notation.
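
The difference between the two notations can be checked directly. The
following is a minimal sketch, assuming a current Python interpreter;
the sample string \code{path} is made up for illustration.

```python
import re

# A raw literal keeps the backslash; a regular literal turns \n into a newline.
raw_len = len(r"\n")      # backslash + 'n'
plain_len = len("\n")     # a single newline character

# Both literals denote the same two-character RE, which matches one backslash.
raw_pattern = r"\\"
plain_pattern = "\\\\"

path = "C:\\temp"         # hypothetical sample text: C:\temp
m = re.search(raw_pattern, path)
print(raw_len, plain_len, raw_pattern == plain_pattern, m.start())
```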
				32
				33	% XXX Can the following section be dropped, or should it be boiled down?
				34
				35	%\strong{Please note:} There is a little-known fact about Python string
				36	%literals which means that you don't usually have to worry about
				37	%doubling backslashes, even though they are used to escape special
				38	%characters in string literals as well as in regular expressions. This
				39	%is because Python doesn't remove backslashes from string literals if
				40	%they are followed by an unrecognized escape character.
				41	%\emph{However}, if you want to include a literal \dfn{backslash} in a
				42	%regular expression represented as a string literal, you have to
				43	%\emph{quadruple} it or enclose it in a singleton character class.
				44	%E.g.\ to extract \LaTeX\ \code{\e section\{{\rm
				45	%\ldots}\}} headers from a document, you can use this pattern:
				46	%\code{'[\e ] section\{\e (.*\e )\}'}. \emph{Another exception:}
				47	%the escape sequence \code{\e b} is significant in string literals
				48	%(where it means the ASCII bell character) as well as in Emacs regular
				49	%expressions (where it stands for a word boundary), so in order to
				50	%search for a word boundary, you should use the pattern \code{'\e \e b'}.
				51	%Similarly, a backslash followed by a digit 0-7 should be doubled to
				52	%avoid interpretation as an octal escape.
				53
				54	\subsection{Regular Expressions}
				55
				56	A regular expression (or RE) specifies a set of strings that matches
				57	it; the functions in this module let you check if a particular string
				58	matches a given regular expression (or if a given regular expression
				59	matches a particular string, which comes down to the same thing).
				60
				61	Regular expressions can be concatenated to form new regular
				62	expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression. If a string \emph{p}
				64	matches A and another string \emph{q} matches B, the string \emph{pq}
				65	will match AB. Thus, complex expressions can easily be constructed
				66	from simpler primitive expressions like the ones described here. For
				67	details of the theory and implementation of regular expressions,
				68	consult the Friedl book referenced below, or almost any textbook about
				69	compiler construction.
				70
				71	A brief explanation of the format of regular expressions follows. For
				72	further information and a gentler presentation, consult XXX somewhere.
				73
				74	Regular expressions can contain both special and ordinary characters.
				75	Most ordinary characters, like '\code{A}', '\code{a}', or '\code{0}',
				76	are the simplest regular expressions; they simply match themselves.
				77	You can concatenate ordinary characters, so '\code{last}' matches the
characters 'last'. (In the rest of this section, we'll write REs in
				79	\code{this special font}, usually without quotes, and strings to be
				80	matched 'in single quotes'.)
				81
				82	Some characters, like \code{\|} or \code{(}, are special. Special
				83	characters either stand for classes of ordinary characters, or affect
				84	how the regular expressions around them are interpreted.
				85
				86	The special characters are:
				87	\begin{itemize}
				88	\item[\code{.}] (Dot.) In the default mode, this matches any
				89	character except a newline. If the \code{DOTALL} flag has been
				90	specified, this matches any character including a newline.
				91	\item[\code{\^}] (Caret.) Matches the start of the string, and in
				92	\code{MULTILINE} mode also immediately after each newline.
				93	\item[\code{\$}] Matches the end of the string.
				94	\code{foo} matches both 'foo' and 'foobar', while the regular
				95	expression '\code{foo\$}' matches only 'foo'.
				96	%
				97	\item[\code{*}] Causes the resulting RE to
				98	match 0 or more repetitions of the preceding RE, as many repetitions
				99	as are possible. \code{ab*} will
				100	match 'a', 'ab', or 'a' followed by any number of 'b's.
				101	%
				102	\item[\code{+}] Causes the
				103	resulting RE to match 1 or more repetitions of the preceding RE.
				104	\code{ab+} will match 'a' followed by any non-zero number of 'b's; it
				105	will not match just 'a'.
				106	%
				107	\item[\code{?}] Causes the resulting RE to
				108	match 0 or 1 repetitions of the preceding RE. \code{ab?} will
				109	match either 'a' or 'ab'.
\item[\code{*?}, \code{+?}, \code{??}] The \code{*}, \code{+}, and
				111	\code{?} qualifiers are all \dfn{greedy}; they match as much text as
				112	possible. Sometimes this behaviour isn't desired; if the RE
				113	\code{<.*>} is matched against \code{<H1>title</H1>}, it will match the
				114	entire string, and not just \code{<H1>}.
				115	Adding \code{?} after the qualifier makes it perform the match in
				116	\dfn{non-greedy} or \dfn{minimal} fashion; as few characters as
				117	possible will be matched. Using \code{.*?} in the previous
				118	expression, it will match only \code{<H1>}.
				119	%
				120	\item[\code{\e}] Either escapes special characters (permitting you to match
				121	characters like '*?+\&\$'), or signals a special sequence; special
				122	sequences are discussed below.
				123
				124	If you're not using a raw string to
				125	express the pattern, remember that Python also uses the
				126	backslash as an escape sequence in string literals; if the escape
				127	sequence isn't recognized by Python's parser, the backslash and
				128	subsequent character are included in the resulting string. However,
if Python would recognize the resulting sequence, the backslash should
be doubled. This is complicated and hard to understand, so
				131	it's highly recommended that you use raw strings.
				132	%
				133	\item[\code{[]}] Used to indicate a set of characters. Characters can
				134	be listed individually, or a range is indicated by giving two
				135	characters and separating them by a '-'. Special characters are not
				136	active inside sets. For example, \code{[akm\$]} will match any of the
				137	characters 'a', 'k', 'm', or '\$'; \code{[a-z]} will match any
				138	lowercase letter and \code{[a-zA-Z0-9]} matches any letter or digit.
				139	Character classes of the form \code{\e \var{X}} defined below are also acceptable.
				140	If you want to include a \code{]} or a \code{-} inside a
				141	set, precede it with a backslash.
				142
				143	Characters \emph{not} within a range can be matched by including a
				144	\code{\^} as the first character of the set; \code{\^} elsewhere will
				145	simply match the '\code{\^}' character.
				146	%
				147	\item[\code{\|}]\code{A\|B}, where A and B can be arbitrary REs,
				148	creates a regular expression that will match either A or B. This can
				149	be used inside groups (see below) as well. To match a literal '\|',
				150	use \code{\e\|}, or enclose it inside a character class, like \code{[\|]}.
				151	%
				152	\item[\code{( ... )}] Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the
				153	contents of a group can be retrieved after a match has been performed,
				154	and can be matched later in the string with the
				155	\code{\e \var{number}} special sequence, described below. To match the
				156	literals '(' or ')',
				157	use \code{\e(} or \code{\e)}, or enclose them inside a character
				158	class: \code{[(] [)]}.
				159	%
				160	\item[\code{(?:...)}] A non-grouping version of regular parentheses.
				161	Matches whatever's inside the parentheses, but the text matched by the
				162	group \emph{cannot} be retrieved after performing a match or
				163	referenced later in the pattern.
				164	%
				165	\item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
				166	the text matched by the group is accessible via the symbolic group
				167	name \var{name}. Group names must be valid Python identifiers. A
				168	symbolic group is also a numbered group, just as if the group were not
named. So the group named 'id' in the example below can also be
				170	referenced as the numbered group 1.
				171
				172	For example, if the pattern string is
				173	\code{r'(?P<id>[a-zA-Z_]\e w*)'}, the group can be referenced by its
				174	name in arguments to methods of match objects, such as \code{m.group('id')}
				175	or \code{m.end('id')}, and also by name in pattern text (e.g. \code{(?P=id)}) and
				176	replacement text (e.g. \code{\e g<id>}).
				177	%
				178	\item[\code{(?\#...)}] A comment; the contents of the parentheses are simply ignored.
				179	%
\item[\code{(?=...)}] Matches if \code{...} matches next. This is not
				181	implemented as of Python 1.5a3.
				182	%
				183	\item[\code{(?!...)}] Matches if \code{...} doesn't match next. This is not
				184	implemented as of Python 1.5a3.
				185	\end{itemize}
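
Two of the forms above, the non-greedy qualifiers and symbolic group
names, are easiest to see side by side. This is only a sketch, assuming
a current Python interpreter; the sample assignment string is invented
for the example.

```python
import re

text = "<H1>title</H1>"

# Greedy: .* runs to the last '>' in the string.
greedy = re.match(r"<.*>", text).group(0)

# Non-greedy: .*? stops at the first '>'.
minimal = re.match(r"<.*?>", text).group(0)

# A named group is also numbered group 1.
m = re.match(r"(?P<id>[a-zA-Z_]\w*)", "spam_42 = 1")
by_name = m.group("id")
by_number = m.group(1)

print(greedy, minimal, by_name, by_number)
```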
				186
				187	The special sequences consist of '\code{\e}' and a character from the
				188	list below. If the ordinary character is not on the list, then the
				189	resulting RE will match the second character. For example,
				190	\code{\e\$} matches the character '\$'. Ones where the backslash
				191	should be doubled are indicated.
				192
				193	\begin{itemize}
				194
				195	%
				196	\item[\code{\e \var{number}}] Matches the contents of the group of the
				197	same number. For example, \code{(.+) \e 1} matches 'the the' or '55
				198	55', but not 'the end' (note the space after the group). This special
				199	sequence can only be used to match one of the first 99 groups. If the
				200	first digit of \var{number} is 0, or \var{number} is 3 octal digits
long, it will not be interpreted as a group match, but as the character
				202	with octal value \var{number}.
				203	%
				204	\item[\code{\e A}] Matches only at the start of the string.
				205	%
				206	\item[\code{\e b}] Matches the empty string, but only at the
				207	beginning or end of a word. A word is defined as a sequence of
				208	alphanumeric characters, so the end of a word is indicated by
				209	whitespace or a non-alphanumeric character.
				210	%
				211	\item[\code{\e B}] Matches the empty string, but only when it is \emph{not} at the
				212	beginning or end of a word.
				213	%
				214	\item[\code{\e d}]Matches any decimal digit; this is
				215	equivalent to the set \code{[0-9]}.
				216	%
				217	\item[\code{\e D}]Matches any non-digit character; this is
				218	equivalent to the set \code{[\^0-9]}.
				219	%
				220	\item[\code{\e s}]Matches any whitespace character; this is
				221	equivalent to the set \code{[ \e t\e n\e r\e f\e v]}.
				222	%
				223	\item[\code{\e S}]Matches any non-whitespace character; this is
				224	equivalent to the set \code{[\^ \e t\e n\e r\e f\e v]}.
				225	%
				226	\item[\code{\e w}]Matches any alphanumeric character; this is
				227	equivalent to the set \code{[a-zA-Z0-9_]}.
				228	%
				229	\item[\code{\e W}] Matches any non-alphanumeric character; this is
Guido van Rossum	eb53ae4	1997-10-05 18:54:07 +0000	[diff] [blame]	230	equivalent to the set \code{[\^ a-zA-Z0-9_]}.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	231
				232	\item[\code{\e Z}]Matches only at the end of the string.
				233	%
				234
				235	\item[\code{\e \e}] Matches a literal backslash.
				236
				237	\end{itemize}
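
The back-reference and word-boundary sequences can be tried out as
follows. This is a sketch, assuming a current Python interpreter; it
uses \code{match} for the back-reference so that the group has to start
at the beginning of each string.

```python
import re

# \1 must repeat exactly what group 1 matched, after a single space.
doubled = re.compile(r"(.+) \1")
hits = [s for s in ("the the", "55 55", "the end") if doubled.match(s)]

# \b only matches at a word boundary, so the 'cat' inside 'concat' is skipped.
word = re.search(r"\bcat\b", "concat cat")

print(hits, word.start())
```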
				238
				239	\subsection{Module Contents}
				240
				241	The module defines the following functions and constants, and an exception:
				242
				243	\renewcommand{\indexsubitem}{(in module re)}
				244
				245	\begin{funcdesc}{compile}{pattern\optional{\, flags}}
				246	Compile a regular expression pattern into a regular expression
				247	object, which can be used for matching using its \code{match} and
				248	\code{search} methods, described below.
				249
				250	The sequence
				251	%
				252	\bcode\begin{verbatim}
				253	prog = re.compile(pat)
				254	result = prog.match(str)
				255	\end{verbatim}\ecode
				256	%
				257	is equivalent to
				258	%
				259	\bcode\begin{verbatim}
				260	result = re.match(pat, str)
				261	\end{verbatim}\ecode
				262	%
				263	but the version using \code{compile()} is more efficient when multiple
				264	regular expressions are used concurrently in a single program.
				265	%(The compiled version of the last pattern passed to \code{regex.match()} or
				266	%\code{regex.search()} is cached, so programs that use only a single
				267	%regular expression at a time needn't worry about compiling regular
				268	%expressions.)
				269	\end{funcdesc}
				270
				271	\begin{funcdesc}{escape}{string}
				272	Return \var{string} with all non-alphanumerics backslashed; this is
				273	useful if you want to match some variable string which may have
				274	regular expression metacharacters in it.
				275	\end{funcdesc}
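
As an illustration of \code{escape()}, the sketch below matches a string
that happens to contain metacharacters. The sample string is made up,
and the exact set of characters that gets backslashed may differ between
Python versions, so only the matching behaviour is shown.

```python
import re

literal = "1+1=2?"                 # hypothetical text containing '+' and '?'

# Used as-is, '+' and '?' act as qualifiers, so the text does not match itself.
unescaped = re.match(literal, literal)

# After escaping, every character is matched literally.
escaped = re.match(re.escape(literal), literal)

print(unescaped, escaped.group(0))
```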
				276
				277	\begin{funcdesc}{match}{pattern\, string\optional{\, flags}}
				278	If zero or more characters at the beginning of \var{string} match
				279	the regular expression \var{pattern}, return a corresponding
				280	\code{Match} object. Return \code{None} if the string does not
				281	match the pattern; note that this is different from a zero-length
				282	match.
				283	\end{funcdesc}
				284
				285	\begin{funcdesc}{search}{pattern\, string\optional{\, flags}}
				286	Scan through \var{string} looking for a location where the regular
				287	expression \var{pattern} produces a match. Return \code{None} if no
				288	position in the string matches the pattern; note that this is
				289	different from finding a zero-length match at some point in the string.
				290	\end{funcdesc}
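
The difference between \code{match()} and \code{search()}, and between
\code{None} and a zero-length match, can be sketched like this (assuming
a current Python interpreter):

```python
import re

at_start = re.match("b", "abc")       # None: 'b' is not at the beginning
anywhere = re.search("b", "abc")      # found at index 1

# A zero-length match is still a match object, not None.
empty = re.match("x*", "abc")

print(at_start, anywhere.start(), repr(empty.group(0)))
```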
				291
\begin{funcdesc}{split}{pattern\, string\optional{\, maxsplit=0}}
Split \var{string} by the occurrences of \var{pattern}. If
capturing parentheses are used in \var{pattern}, then occurrences of
patterns or subpatterns are also returned.
				296	%
				297	\bcode\begin{verbatim}
				298	>>> re.split('[\W]+', 'Words, words, words.')
				299	['Words', 'words', 'words', '']
				300	>>> re.split('([\W]+)', 'Words, words, words.')
				301	['Words', ', ', 'words', ', ', 'words', '.', '']
				302	\end{verbatim}\ecode
				303	%
				304	This function combines and extends the functionality of
				305	\code{regex.split()} and \code{regex.splitx()}.
				306	\end{funcdesc}
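
The text above does not spell out \var{maxsplit}. In current Python
releases a nonzero value limits the number of splits, and the rest of
the string is returned as the last list element. The sketch below
assumes that behaviour.

```python
import re

text = "Words, words, words."

# Default: split at every run of non-word characters.
all_parts = re.split(r"\W+", text)

# maxsplit=1: split once, keep the remainder intact.
first_only = re.split(r"\W+", text, maxsplit=1)

print(all_parts, first_only)
```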
				307
				308	\begin{funcdesc}{sub}{pattern\, repl\, string\optional{, count=0}}
				309	Return the string obtained by replacing the leftmost non-overlapping
				310	occurrences of \var{pattern} in \var{string} by the replacement
\var{repl}, which can be a string or a function that returns a string.
If the pattern isn't found, \var{string} is returned unchanged. The
				312	pattern may be a string or a regexp object; if you need to specify
				313	regular expression flags, you must use a regexp object, or use
				314	embedded modifiers in a pattern string; e.g.
				315	%
				316	\bcode\begin{verbatim}
>>> re.sub("(?i)b+", "x", "bbbb BBBB")
'x x'
				318	\end{verbatim}\ecode
				319	%
				320	The optional argument \var{count} is the maximum number of pattern
				321	occurrences to be replaced; count must be a non-negative integer, and
				322	the default value of 0 means to replace all occurrences.
				323
				324	Empty matches for the pattern are replaced only when not adjacent to a
				325	previous match, so \code{sub('x*', '-', 'abc')} returns '-a-b-c-'.
				326	\end{funcdesc}
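
A function passed as \var{repl} is called with the match object for each
occurrence. The sketch below assumes a current Python interpreter; the
helper \code{double} and the sample strings are invented for the
example.

```python
import re

def double(match):
    """Hypothetical helper: replace a number by twice its value."""
    return str(int(match.group(0)) * 2)

doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

# count limits how many occurrences are replaced.
limited = re.sub("a", "-", "banana", count=2)

print(doubled, limited)
```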
				327
				328	\begin{funcdesc}{subn}{pattern\, repl\, string\optional{, count=0}}
				329	Perform the same operation as \code{sub()}, but return a tuple
				330	\code{(new_string, number_of_subs_made)}.
				331	\end{funcdesc}
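
For instance, reusing the case-insensitive pattern from the \code{sub()}
description (a sketch, assuming a current Python interpreter):

```python
import re

# subn returns the new string together with the number of substitutions.
new_string, number_of_subs_made = re.subn("(?i)b+", "x", "bbbb BBBB")

print(new_string, number_of_subs_made)
```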
				332
				333	\begin{excdesc}{error}
				334	Exception raised when a string passed to one of the functions here
				335	is not a valid regular expression (e.g., unmatched parentheses) or
				336	when some other error occurs during compilation or matching. (It is
				337	never an error if a string contains no match for a pattern.)
				338	\end{excdesc}
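
One way to use the exception is to validate a pattern before relying on
it. The helper name \code{is_valid} below is hypothetical and not part
of the module; this is only a sketch.

```python
import re

def is_valid(pattern):
    """Return True if pattern compiles, False if re.error is raised."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# Unmatched parentheses are an error; a pattern that finds nothing is not.
print(is_valid("(unclosed"), is_valid("(closed)"), re.search("x", "abc"))
```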
				339
				340	\subsection{Regular Expression Objects}
				341	Compiled regular expression objects support the following methods and
				342	attributes:
				343
Guido van Rossum	eb53ae4	1997-10-05 18:54:07 +0000	[diff] [blame]	344	\renewcommand{\indexsubitem}{(re method)}
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	345	\begin{funcdesc}{match}{string\optional{\, pos}}
Guido van Rossum	eb53ae4	1997-10-05 18:54:07 +0000	[diff] [blame]	346	If zero or more characters at the beginning of \var{string} match
				347	this regular expression, return a corresponding
				348	\code{Match} object. Return \code{None} if the string does not
				349	match the pattern; note that this is different from a zero-length
				350	match.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	351
				352	The optional second parameter \var{pos} gives an index in the string
				353	where the search is to start; it defaults to \code{0}. This is not
				354	completely equivalent to slicing the string; the \code{'\^'} pattern
character matches at the real beginning of the string and at positions
				356	just after a newline, not necessarily at the index where the search
				357	is to start.
				358	\end{funcdesc}
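
The difference between passing \var{pos} and slicing the string shows up
with a pattern that starts with \code{\^}. A minimal sketch, assuming a
current Python interpreter:

```python
import re

prog = re.compile("^b")

# With pos=1 the '^' still refers to the real beginning of 'ab'.
with_pos = prog.match("ab", 1)

# Slicing creates a new string whose beginning is the 'b'.
with_slice = prog.match("ab"[1:])

print(with_pos, with_slice.group(0))
```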
				359
				360	\begin{funcdesc}{search}{string\optional{\, pos}}
Guido van Rossum	eb53ae4	1997-10-05 18:54:07 +0000	[diff] [blame]	361	Scan through \var{string} looking for a location where this regular
				362	expression produces a match. Return \code{None} if no
				363	position in the string matches the pattern; note that this is
				364	different from finding a zero-length match at some point in the string.
Guido van Rossum	1acceb0	1997-08-14 23:12:18 +0000	[diff] [blame]	365
				366	The optional second parameter has the same meaning as for the
				367	\code{match} method.
				368	\end{funcdesc}
				369
\begin{funcdesc}{split}{string\optional{\, maxsplit=0}}
				371	Identical to the \code{split} function, using the compiled pattern.
				372	\end{funcdesc}
				373
				374	\begin{funcdesc}{sub}{repl\, string\optional{, count=0}}
				375	Identical to the \code{sub} function, using the compiled pattern.
				376	\end{funcdesc}
				377
				378	\begin{funcdesc}{subn}{repl\, string\optional{, count=0}}
				379	Identical to the \code{subn} function, using the compiled pattern.
				380	\end{funcdesc}
				381
				382	\renewcommand{\indexsubitem}{(regex attribute)}
				383
				384	\begin{datadesc}{flags}
				385	The flags argument used when the regex object was compiled, or 0 if no
				386	flags were provided.
				387	\end{datadesc}
				388
				389	\begin{datadesc}{groupindex}
				390	A dictionary mapping any symbolic group names (defined by
				391	\code{?P<\var{id}>}) to group numbers. The dictionary is empty if no
				392	symbolic groups were used in the pattern.
				393	\end{datadesc}
				394
				395	\begin{datadesc}{pattern}
				396	The pattern string from which the regex object was compiled.
				397	\end{datadesc}
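
These attributes can be inspected as in the sketch below. The pattern is
invented for the example. Because current Python releases may add
implicit flags to \code{flags}, the example only tests for the flag that
was passed explicitly.

```python
import re

source = r"(?P<key>\w+)=(?P<value>\w+)"     # hypothetical pattern
prog = re.compile(source, re.IGNORECASE)

names = dict(prog.groupindex)               # group name -> group number
ignore_case = bool(prog.flags & re.IGNORECASE)

print(names, ignore_case, prog.pattern == source)
```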
				398
				399	\subsection{Match Objects}
				400	Match objects support the following methods and attributes:
				401
				402	\begin{funcdesc}{span}{group}
				403	Return the 2-tuple \code{(start(\var{group}), end(\var{group}))}.
				404	Note that if \var{group} did not contribute to the match, this is \code{(None,
				405	None)}.
				406	\end{funcdesc}
				407
				408	\begin{funcdesc}{start}{group}
				409	\end{funcdesc}
				410
				411	\begin{funcdesc}{end}{group}
				412	Return the indices of the start and end of the substring matched by
				413	\var{group}. Return \code{None} if \var{group} exists but did not contribute to
				414	the match. Note that for a match object \code{m}, and a group \code{g}
				415	that did contribute to the match, the substring matched by group \code{g} is
				416	\bcode\begin{verbatim}
				417	m.string[m.start(g):m.end(g)]
				418	\end{verbatim}\ecode
				419	%
				420	Note too that \code{m.start(\var{group})} will equal
				421	\code{m.end(\var{group})} if \var{group} matched a null string. For example,
				422	after \code{m = re.search('b(c?)', 'cba')}, \code{m.start(0)} is 1,
				423	\code{m.end(0)} is 2, \code{m.start(1)} and \code{m.end(1)} are both
				424	2, and \code{m.start(2)} raises an
				425	\code{IndexError} exception.
				426	\end{funcdesc}
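
The example from the description of \code{start()} and \code{end()} can
be run as follows (a sketch, assuming a current Python interpreter):

```python
import re

m = re.search("b(c?)", "cba")

# Group 0 is the whole match; group 1 matched the empty string after 'b'.
whole = m.string[m.start(0):m.end(0)]
group1_span = m.span(1)

# The pattern has only one group, so asking for group 2 is an error.
try:
    m.start(2)
    no_such_group = False
except IndexError:
    no_such_group = True

print(repr(whole), group1_span, no_such_group)
```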
				427
\begin{funcdesc}{group}{\optional{g1, g2, ...}}
				429	This method is only valid when the last call to the \code{match}
				430	or \code{search} method found a match. It returns one or more
				431	groups of the match. If there is a single \var{index} argument,
				432	the result is a single string; if there are multiple arguments, the
				433	result is a tuple with one item per argument. If the \var{index} is
				434	zero, the corresponding return value is the entire matching string; if
it is in the inclusive range [1..99], it is the string matching
the corresponding parenthesized group (groups are delimited by
\code{(} and \code{)}). If no
such group exists, the corresponding result is \code{None}.

If the regular expression uses the \code{(?P<\var{name}>...)} syntax,
the \var{index} arguments may also be strings
identifying groups by their group name.
				443	\end{funcdesc}
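
The different ways of calling \code{group()} can be sketched as follows,
assuming a current Python interpreter. The group names \code{lo} and
\code{hi} are made up for the example.

```python
import re

m = re.match(r"(?P<lo>\d+)-(?P<hi>\d+)", "10-20")

entire = m.group(0)          # index 0: the entire matching string
single = m.group(1)          # one index: a single string
several = m.group(1, 2)      # several indices: a tuple
named = m.group("lo")        # a group name instead of a number

print(entire, single, several, named)
```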
				444
				445	\begin{datadesc}{pos}
				446	The index at which the search or match began.
				447	\end{datadesc}
				448
				449	\begin{datadesc}{re}
The regular expression object whose \code{match()} or \code{search()} method
				451	produced this match object.
				452	\end{datadesc}
				453
				454	\begin{datadesc}{string}
				455	The string passed to \code{match()} or \code{search()}.
				456	\end{datadesc}
				457
				458
				459
				460	\begin{seealso}
				461	\seetext Jeffrey Friedl, \emph{Mastering Regular Expressions}.
				462	\end{seealso}
				463