\section{Built-in Module \sectcode{regex}}

\bimodindex{regex}
This module provides regular expression matching operations similar to
those found in Emacs. It is always available.

By default the patterns are Emacs-style regular expressions; there is
a way to change the syntax to match that of several well-known
\UNIX{} utilities.

This module is 8-bit clean: both patterns and strings may contain null
bytes and characters whose high bit is set.

\strong{Please note:} A little-known property of Python string
literals means that you don't usually have to worry about
doubling backslashes, even though they are used to escape special
characters in string literals as well as in regular expressions. This
is because Python doesn't remove backslashes from string literals if
they are followed by an unrecognized escape character.
\emph{However}, if you want to include a literal \dfn{backslash} in a
regular expression represented as a string literal, you have to
\emph{quadruple} it. For example, to extract \LaTeX\ \samp{\e section\{{\rm
\ldots}\}} headers from a document, you can use this pattern:
\code{'\e \e \e\e section\{\e (.*\e )\}'}.
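
As a rough check of the backslash counts, here is a minimal sketch.
It uses the \code{re} module of current Python releases as a stand-in,
on the assumption that \code{regex} itself may not be importable.
Note that \code{re} writes groups as plain parentheses, so the pattern
differs from the one above. The names \code{pattern_text},
\code{header} and \code{title} are invented for the example.

\bcode\begin{verbatim}
import re

# Four backslashes in the literal become two characters in the string;
# the regular expression engine reads those two as one literal backslash.
pattern_text = '\\\\section\\{(.*)\\}'
backslashes = pattern_text.count('\\')   # 2 for "\\", 1 each before { and }

header = '\\section{Introduction}'       # the text \section{Introduction}
found = re.match(pattern_text, header)
title = found.group(1)                   # the text between the braces
\end{verbatim}\ecode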

The module defines these functions, an exception and a data item:

\renewcommand{\indexsubitem}{(in module regex)}

\begin{funcdesc}{match}{pattern\, string}
Return how many characters at the beginning of \var{string} match
the regular expression \var{pattern}. Return \code{-1} if the
string does not match the pattern (this is different from a
zero-length match!).
\end{funcdesc}
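
The difference between a zero-length match and no match can be
sketched as follows. This is only an illustration: it assumes the
\code{re} module of current Python as a stand-in for \code{regex}, and
\code{match_length} is a hypothetical helper that mimics the return
convention described above.

\bcode\begin{verbatim}
import re

def match_length(pattern, string):
    """Hypothetical helper: length of the match at the start, or -1."""
    found = re.match(pattern, string)
    if found is None:
        return -1           # the string does not match at all
    return found.end()      # number of characters matched (may be 0)

empty_ok = match_length('a*', 'bbb')   # zero-length match at the start
no_match = match_length('a+', 'bbb')   # no match
three = match_length('a+', 'aaab')     # three characters matched
\end{verbatim}\ecode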

\begin{funcdesc}{search}{pattern\, string}
Return the first position in \var{string} that matches the regular
expression \var{pattern}. Return \code{-1} if no position in the string
matches the pattern (this is different from a zero-length match
anywhere!).
\end{funcdesc}
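
The same convention for positions can be sketched with a small
wrapper. Again \code{re} is assumed as a stand-in for \code{regex},
and \code{search_position} is a hypothetical name.

\bcode\begin{verbatim}
import re

def search_position(pattern, string):
    """Hypothetical helper: first matching position, or -1."""
    found = re.search(pattern, string)
    return -1 if found is None else found.start()

first_b = search_position('b+', 'aabbb')   # match starts at index 2
empty = search_position('x*', 'aabbb')     # zero-length match at index 0
missing = search_position('x+', 'aabbb')   # no match anywhere
\end{verbatim}\ecode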

\begin{funcdesc}{compile}{pattern\optional{\, translate}}
Compile a regular expression pattern into a regular expression
object, which can be used for matching using its \code{match} and
\code{search} methods, described below. The optional argument
\var{translate}, if present, must be a 256-character string
indicating how characters (both of the pattern and of the strings to
be matched) are translated before comparing them; the \code{i}-th
element of the string gives the translation for the character with
\ASCII{} code \code{i}. This can be used to implement
case-insensitive matching; see the \code{casefold} data item below.

The sequence

\bcode\begin{verbatim}
prog = regex.compile(pat)
result = prog.match(str)
\end{verbatim}\ecode

is equivalent to

\bcode\begin{verbatim}
result = regex.match(pat, str)
\end{verbatim}\ecode

but the version using \code{compile()} is more efficient when multiple
regular expressions are used concurrently in a single program. (The
compiled version of the last pattern passed to \code{regex.match()} or
\code{regex.search()} is cached, so programs that use only a single
regular expression at a time needn't worry about compiling regular
expressions.)
\end{funcdesc}

\begin{funcdesc}{set_syntax}{flags}
Set the syntax to be used by future calls to \code{compile},
\code{match} and \code{search}. (Already compiled expression objects
are not affected.) The argument is an integer which is the OR of
several flag bits. The return value is the previous value of
the syntax flags. Names for the flags are defined in the standard
module \code{regex_syntax}; read the file \file{regex_syntax.py} for
more information.
\end{funcdesc}

\begin{funcdesc}{symcomp}{pattern\optional{\, translate}}
This is like \code{compile}, but supports symbolic group names: if a
parenthesis-enclosed group begins with a group name in angle
brackets, e.g.\ \code{'\e(<id>[a-z][a-z0-9]*\e)'}, the group can
be referenced by its name in arguments to the \code{group} method of
the resulting compiled regular expression object, like this:
\code{p.group('id')}. Group names may contain alphanumeric characters
and \code{'_'} only.
\end{funcdesc}

\begin{excdesc}{error}
Exception raised when a string passed to one of the functions here
is not a valid regular expression (e.g., unmatched parentheses) or
when some other error occurs during compilation or matching. (It is
never an error if a string contains no match for a pattern.)
\end{excdesc}

\begin{datadesc}{casefold}
A string suitable to pass as the \var{translate} argument to
\code{compile} to map all upper case characters to their lower case
equivalents.
\end{datadesc}
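
To make the shape of such a translation string concrete, the sketch
below builds a 256-character table by hand. It is an illustration
only: it folds just the \ASCII{} letters, which is an assumption about
what \code{casefold} covers, and \code{fold_table} and \code{fold} are
hypothetical names.

\bcode\begin{verbatim}
import string

# Start from the identity table: the i-th character translates chr(i).
table = [chr(i) for i in range(256)]
# Redirect each ASCII upper case letter to its lower case equivalent.
for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase):
    table[ord(upper)] = lower
fold_table = ''.join(table)

def fold(text):
    """Apply the table by hand to 8-bit text (illustration only)."""
    return ''.join(fold_table[ord(ch)] for ch in text)

folded = fold('Hello, World')
\end{verbatim}\ecode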

\noindent
Compiled regular expression objects support these methods:

\renewcommand{\indexsubitem}{(regex method)}
\begin{funcdesc}{match}{string\optional{\, pos}}
Return how many characters at the beginning of \var{string} match
the compiled regular expression. Return \code{-1} if the string
does not match the pattern (this is different from a zero-length
match!).

The optional second parameter \var{pos} gives an index in the string
where the search is to start; it defaults to \code{0}. This is not
completely equivalent to slicing the string; the \code{'\^{}'} pattern
character matches at the real beginning of the string and at positions
just after a newline, not necessarily at the index where the search
is to start.
\end{funcdesc}
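
The difference between a start index and a slice can be sketched as
follows, assuming the \code{re} module of current Python as a stand-in
(its compiled patterns accept a start index in the same spirit). The
variable names are invented for the example.

\bcode\begin{verbatim}
import re

prog = re.compile('^b')
text = 'ab'

# With a start index, '^' still refers to the real beginning of text,
# so nothing matches at index 1.
with_pos = prog.search(text, 1)

# With a slice, the new string really does begin with 'b'.
with_slice = prog.search(text[1:])
\end{verbatim}\ecode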

\begin{funcdesc}{search}{string\optional{\, pos}}
Return the first position in \var{string} that matches the compiled
regular expression. Return \code{-1} if no position in the
string matches the pattern (this is different from a zero-length
match anywhere!).

The optional second parameter has the same meaning as for the
\code{match} method.
\end{funcdesc}

\begin{funcdesc}{group}{index\, index\, ...}
This method is only valid when the last call to the \code{match}
or \code{search} method found a match. It returns one or more
groups of the match. If there is a single \var{index} argument,
the result is a single string; if there are multiple arguments, the
result is a tuple with one item per argument. If the \var{index} is
zero, the corresponding return value is the entire matching string; if
it is in the inclusive range [1..99], it is the string matching the
corresponding parenthesized group (using the default syntax,
groups are parenthesized using \code{\e(} and \code{\e)}). If no
such group exists, the corresponding result is \code{None}.

If the regular expression was compiled by \code{symcomp} instead of
\code{compile}, the \var{index} arguments may also be strings
identifying groups by their group name.
\end{funcdesc}
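
The single-argument and multiple-argument forms can be sketched as
below. The example assumes the \code{re} module of current Python as a
stand-in; there the \code{group} method belongs to the object returned
by a successful match rather than to the compiled expression, and
asking for a group that does not exist raises an exception instead of
returning \code{None}. Groups are written with plain parentheses.

\bcode\begin{verbatim}
import re

prog = re.compile('([a-z]+)=([0-9]+)')
found = prog.match('width=80')

whole = found.group(0)      # index 0: the entire matching string
name = found.group(1)       # a single index gives a single string
pair = found.group(1, 2)    # several indices give a tuple
\end{verbatim}\ecode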

\noindent
Compiled regular expressions support these data attributes:

\renewcommand{\indexsubitem}{(regex attribute)}

\begin{datadesc}{regs}
When the last call to the \code{match} or \code{search} method found a
match, this is a tuple of pairs of indices corresponding to the
beginning and end of all parenthesized groups in the pattern. Indices
are relative to the string argument passed to \code{match} or
\code{search}. The 0-th pair gives the beginning and end of the
text matched by the whole pattern. When the last match or search
failed, this is \code{None}.
\end{datadesc}
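
A tuple of this shape can be sketched as follows. The example assumes
the \code{re} module of current Python as a stand-in and assembles the
pairs from the \code{span} method of its match objects; the name
\code{regs} is reused here only for illustration.

\bcode\begin{verbatim}
import re

prog = re.compile('([a-z]+)=([0-9]+)')
text = 'set width=80'
found = prog.search(text)

# One (begin, end) pair per group; pair 0 covers the whole match.
# Indices are relative to text, not to the start of the match.
regs = tuple(found.span(i) for i in range(prog.groups + 1))
matched = text[regs[0][0]:regs[0][1]]
\end{verbatim}\ecode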

\begin{datadesc}{last}
When the last call to the \code{match} or \code{search} method found a
match, this is the string argument passed to that method. When the
last match or search failed, this is \code{None}.
\end{datadesc}

\begin{datadesc}{translate}
This is the value of the \var{translate} argument to
\code{regex.compile} that created this regular expression object. If
the \var{translate} argument was omitted in the \code{regex.compile}
call, this is \code{None}.
\end{datadesc}

\begin{datadesc}{givenpat}
The regular expression pattern as passed to \code{compile} or
\code{symcomp}.
\end{datadesc}

\begin{datadesc}{realpat}
The regular expression after stripping the group names for regular
expressions compiled with \code{symcomp}. Same as \code{givenpat}
otherwise.
\end{datadesc}

\begin{datadesc}{groupindex}
A dictionary giving the mapping from symbolic group names to numerical
group indices for regular expressions compiled with \code{symcomp}.
\code{None} otherwise.
\end{datadesc}
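
The relation between \code{givenpat}, \code{realpat} and
\code{groupindex} can be sketched in a few lines. This is not the
module's own code: \code{strip_group_names} is a hypothetical helper,
it assumes the default syntax (groups written with \code{\e(} and
\code{\e)}), and it deliberately ignores bracket expressions.

\bcode\begin{verbatim}
def strip_group_names(pattern):
    """Return (realpat, groupindex) for \\(<name>...\\) style groups."""
    out = []
    groupindex = {}
    group_count = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == '\\' and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            out.append(ch + nxt)          # keep the escaped pair as is
            i += 2
            if nxt == '(':
                group_count += 1          # groups are numbered from 1
                if i < len(pattern) and pattern[i] == '<':
                    close = pattern.find('>', i)
                    name = pattern[i + 1:close] if close != -1 else ''
                    # Names may hold alphanumerics and '_' only.
                    if name and all(c.isalnum() or c == '_' for c in name):
                        groupindex[name] = group_count
                        i = close + 1     # drop <name> from the pattern
        else:
            out.append(ch)
            i += 1
    return ''.join(out), groupindex

realpat, groupindex = strip_group_names('\\(<id>[a-z][a-z0-9]*\\)')
\end{verbatim}\ecode

Because the scan consumes each backslash together with the character
after it, an escaped backslash followed by a parenthesis is not
mistaken for the start of a group.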