Blame - Doc/libregex.tex - platform/external/python/cpython3

Guido van Rossum	5fdeeea	1994-01-02 01:22:07 +0000	[diff] [blame]	1	\section{Built-in Module \sectcode{regex}}
				2
				3	\bimodindex{regex}
				4	This module provides regular expression matching operations similar to
				5	those found in Emacs. It is always available.
				6
Guido van Rossum	fe4254e	1995-08-11 00:31:57 +0000	[diff] [blame^]	7	By default the patterns are Emacs-style regular expressions,
				8	with one exception. There is
Guido van Rossum	5fdeeea	1994-01-02 01:22:07 +0000	[diff] [blame]	9	a way to change the syntax to match that of several well-known
Guido van Rossum	fe4254e	1995-08-11 00:31:57 +0000	[diff] [blame^]	10	\UNIX{} utilities. The exception is that Emacs' \samp{\e s}
				11	pattern is not supported, since the original implementation references
				12	the Emacs syntax tables.
Guido van Rossum	5fdeeea	1994-01-02 01:22:07 +0000	[diff] [blame]	13
				14	This module is 8-bit clean: both patterns and strings may contain null
				15	bytes and characters whose high bit is set.
				16
Guido van Rossum	326c0bc	1994-01-03 00:00:31 +0000	[diff] [blame]	17	\strong{Please note:} There is a little-known fact about Python string
				18	literals which means that you don't usually have to worry about
				19	doubling backslashes, even though they are used to escape special
				20	characters in string literals as well as in regular expressions. This
				21	is because Python doesn't remove backslashes from string literals if
				22	they are followed by an unrecognized escape character.
				23	\emph{However}, if you want to include a literal \dfn{backslash} in a
				24	regular expression represented as a string literal, you have to
Guido van Rossum	6c4f003	1995-03-07 10:14:09 +0000	[diff] [blame]	25	\emph{quadruple} it. E.g.\ to extract \LaTeX\ \samp{\e section\{{\rm
Guido van Rossum	326c0bc	1994-01-03 00:00:31 +0000	[diff] [blame]	26	\ldots}\}} headers from a document, you can use this pattern:
				27	\code{'\e \e \e\e section\{\e (.*\e )\}'}.
Guido van Rossum	5fdeeea	1994-01-02 01:22:07 +0000	[diff] [blame]	28
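As a sketch of the quadrupling rule, the following prints the characters that actually reach the regular expression compiler. It assumes a current Python interpreter, which goes beyond what this section describes. It also doubles the backslashes before the parentheses explicitly, because newer interpreters warn about unrecognized escapes.

```python
# Four backslashes in the source literal become two backslash characters
# in the string object; the regular expression compiler then reads those
# two characters as one literal backslash.
pattern = '\\\\section{\\(.*\\)}'

print(pattern)         # the pattern text as the compiler sees it
print(len('\\\\'))     # number of characters produced by four backslashes
```

The printed pattern starts with two backslashes, not four, which is why the literal needs four.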
				29	The module defines these functions, and an exception:
				30
				31	\renewcommand{\indexsubitem}{(in module regex)}
Guido van Rossum	326c0bc	1994-01-03 00:00:31 +0000	[diff] [blame]	32
Guido van Rossum	5fdeeea	1994-01-02 01:22:07 +0000	[diff] [blame]	33	\begin{funcdesc}{match}{pattern\, string}
				34	Return how many characters at the beginning of \var{string} match
				35	the regular expression \var{pattern}. Return \code{-1} if the
				36	string does not match the pattern (this is different from a
				37	zero-length match!).
				38	\end{funcdesc}
				39
				40	\begin{funcdesc}{search}{pattern\, string}
				41	Return the first position in \var{string} that matches the regular
				42	expression \var{pattern}. Return \code{-1} if no position in the string
				43	matches the pattern (this is different from a zero-length match
				44	anywhere!).
				45	\end{funcdesc}
				46
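The difference between a failed match and a zero-length match can be sketched as follows. Current Python interpreters have no \code{regex} module, so this sketch assumes the standard \code{re} module as a stand-in; the two helper functions are hypothetical, and the patterns use \code{re} syntax, not the Emacs-style syntax described here.

```python
import re

def match_length(pattern, string):
    """Length of the match at the start of string, or -1 if there is none."""
    m = re.match(pattern, string)
    return -1 if m is None else m.end()

def search_position(pattern, string):
    """Index where the first match starts, or -1 if there is none."""
    m = re.search(pattern, string)
    return -1 if m is None else m.start()

print(match_length('a*', 'bbb'))     # zero-length match at the start
print(match_length('a+', 'bbb'))     # no match at all
print(search_position('b+', 'aab'))  # first match further into the string
```

The first call succeeds with a match of length zero, while the second fails, so the two results must be told apart by comparing against \code{-1}, not by testing for truth.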
Guido van Rossum	16d6e71	1994-08-08 12:30:22 +0000	[diff] [blame]	47	\begin{funcdesc}{compile}{pattern\optional{\, translate}}
Guido van Rossum	5fdeeea	1994-01-02 01:22:07 +0000	[diff] [blame]	48	Compile a regular expression pattern into a regular expression
				49	object, which can be used for matching using its \code{match} and
Guido van Rossum	470be14	1995-03-17 16:07:09 +0000	[diff] [blame]	50	\code{search} methods, described below. The optional argument
Guido van Rossum	5fdeeea	1994-01-02 01:22:07 +0000	[diff] [blame]	51	\var{translate}, if present, must be a 256-character string
				52	indicating how characters (both of the pattern and of the strings to
				53	be matched) are translated before comparing them; the \code{i}-th
				54	element of the string gives the translation for the character with
Guido van Rossum	470be14	1995-03-17 16:07:09 +0000	[diff] [blame]	55	\ASCII{} code \code{i}. This can be used to implement
				56	case-insensitive matching; see the \code{casefold} data item below.
Guido van Rossum	5fdeeea	1994-01-02 01:22:07 +0000	[diff] [blame]	57
				58	The sequence
				59
				60	\bcode\begin{verbatim}
				61	prog = regex.compile(pat)
				62	result = prog.match(str)
				63	\end{verbatim}\ecode
				64
				65	is equivalent to
				66
				67	\bcode\begin{verbatim}
				68	result = regex.match(pat, str)
				69	\end{verbatim}\ecode
				70
				71	but the version using \code{compile()} is more efficient when multiple
				72	regular expressions are used concurrently in a single program. (The
				73	compiled version of the last pattern passed to \code{regex.match()} or
				74	\code{regex.search()} is cached, so programs that use only a single
				75	regular expression at a time needn't worry about compiling regular
				76	expressions.)
				77	\end{funcdesc}
				78
				79	\begin{funcdesc}{set_syntax}{flags}
				80	Set the syntax to be used by future calls to \code{compile},
				81	\code{match} and \code{search}. (Already compiled expression objects
				82	are not affected.) The argument is an integer which is the OR of
				83	several flag bits. The return value is the previous value of
				84	the syntax flags. Names for the flags are defined in the standard
				85	module \code{regex_syntax}; read the file \file{regex_syntax.py} for
				86	more information.
				87	\end{funcdesc}
				88
Guido van Rossum	16d6e71	1994-08-08 12:30:22 +0000	[diff] [blame]	89	\begin{funcdesc}{symcomp}{pattern\optional{\, translate}}
Guido van Rossum	326c0bc	1994-01-03 00:00:31 +0000	[diff] [blame]	90	This is like \code{compile}, but supports symbolic group names: if a
Guido van Rossum	6c4f003	1995-03-07 10:14:09 +0000	[diff] [blame]	91	parenthesis-enclosed group begins with a group name in angle
Guido van Rossum	326c0bc	1994-01-03 00:00:31 +0000	[diff] [blame]	92	brackets, e.g.\ \code{'\e(<id>[a-z][a-z0-9]*\e)'}, the group can
				93	be referenced by its name in arguments to the \code{group} method of
				94	the resulting compiled regular expression object, like this:
Guido van Rossum	7defee7	1995-02-27 17:52:35 +0000	[diff] [blame]	95	\code{p.group('id')}. Group names may contain alphanumeric characters
				96	and \code{'_'} only.
Guido van Rossum	326c0bc	1994-01-03 00:00:31 +0000	[diff] [blame]	97	\end{funcdesc}
				98
Guido van Rossum	5fdeeea	1994-01-02 01:22:07 +0000	[diff] [blame]	99	\begin{excdesc}{error}
				100	Exception raised when a string passed to one of the functions here
				101	is not a valid regular expression (e.g., unmatched parentheses) or
				102	when some other error occurs during compilation or matching. (It is
				103	never an error if a string contains no match for a pattern.)
				104	\end{excdesc}
				105
				106	\begin{datadesc}{casefold}
				107	A string suitable to pass as \var{translate} argument to
				108	\code{compile} to map all uppercase characters to their lowercase
				109	equivalents.
				110	\end{datadesc}
				111
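To illustrate the shape of a \var{translate} argument, here is a minimal sketch that builds a 256-character table by hand. The function name is hypothetical, the sketch assumes a current Python interpreter, and it folds only the \ASCII{} letters; it is not claimed to be identical to the \code{casefold} string supplied by the module.

```python
def make_fold_table():
    """Return a 256-character string whose i-th character is the
    translation of the character with code i."""
    chars = []
    for i in range(256):
        c = chr(i)
        # ASCII uppercase letters map to lowercase; all others map to themselves.
        chars.append(c.lower() if 'A' <= c <= 'Z' else c)
    return ''.join(chars)

table = make_fold_table()
print(len(table))          # one entry per possible character code
print(table[ord('Q')])     # translation of an uppercase letter
```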
				112	\noindent
				113	Compiled regular expression objects support these methods:
				114
				115	\renewcommand{\indexsubitem}{(regex method)}
Guido van Rossum	16d6e71	1994-08-08 12:30:22 +0000	[diff] [blame]	116	\begin{funcdesc}{match}{string\optional{\, pos}}
Guido van Rossum	5fdeeea	1994-01-02 01:22:07 +0000	[diff] [blame]	117	Return how many characters at the beginning of \var{string} match
				118	the compiled regular expression. Return \code{-1} if the string
				119	does not match the pattern (this is different from a zero-length
				120	match!).
				121
				122	The optional second parameter \var{pos} gives an index in the string
				123	where the search is to start; it defaults to \code{0}. This is not
				124	completely equivalent to slicing the string; the \code{'\^'} pattern
				125	character matches at the real beginning of the string and at positions
				126	just after a newline, not necessarily at the index where the search
				127	is to start.
				128	\end{funcdesc}
				129
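The difference between passing \var{pos} and slicing the string can be sketched with the standard \code{re} module, whose compiled patterns accept a comparable start index. Using \code{re} is an assumption made so that the example runs on a current interpreter; it only stands in for the compiled \code{regex} objects described here.

```python
import re

prog = re.compile('^b')

# With a start index, '^' still refers to the real beginning of the string,
# so nothing matches at index 1.
at_pos = prog.match('ab', 1)

# After slicing, the 'b' really is the first character of the string.
sliced = prog.match('ab'[1:])

print(at_pos)
print(sliced is not None)
```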
Guido van Rossum	16d6e71	1994-08-08 12:30:22 +0000	[diff] [blame]	130	\begin{funcdesc}{search}{string\optional{\, pos}}
Guido van Rossum	5fdeeea	1994-01-02 01:22:07 +0000	[diff] [blame]	131	Return the first position in \var{string} that matches the compiled
				132	regular expression. Return \code{-1} if no position in the
				133	string matches the pattern (this is different from a zero-length
				134	match anywhere!).
				135
				136	The optional second parameter has the same meaning as for the
				137	\code{match} method.
				138	\end{funcdesc}
				139
				140	\begin{funcdesc}{group}{index\, index\, ...}
				141	This method is only valid when the last call to the \code{match}
				142	or \code{search} method found a match. It returns one or more
				143	groups of the match. If there is a single \var{index} argument,
				144	the result is a single string; if there are multiple arguments, the
				145	result is a tuple with one item per argument. If the \var{index} is
				146	zero, the corresponding return value is the entire matching string; if
Guido van Rossum	326c0bc	1994-01-03 00:00:31 +0000	[diff] [blame]	147	it is in the inclusive range [1..99], it is the string matching the
Guido van Rossum	5fdeeea	1994-01-02 01:22:07 +0000	[diff] [blame]	148	corresponding parenthesized group (using the default syntax,
				149	groups are parenthesized using \code{\e(} and \code{\e)}). If no
				150	such group exists, the corresponding result is \code{None}.
Guido van Rossum	326c0bc	1994-01-03 00:00:31 +0000	[diff] [blame]	151
				152	If the regular expression was compiled by \code{symcomp} instead of
				153	\code{compile}, the \var{index} arguments may also be strings
				154	identifying groups by their group name.
Guido van Rossum	5fdeeea	1994-01-02 01:22:07 +0000	[diff] [blame]	155	\end{funcdesc}
				156
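The indexing rules of \code{group} can be sketched with the standard \code{re} module, again as an assumed stand-in for a current interpreter. Two differences from the text above should be kept in mind: in \code{re} the \code{group} method belongs to the match object returned by \code{match}, not to the compiled expression, and named groups are written \code{(?P<id>...)} and need no separate compile function.

```python
import re

# (?P<id>...) is the re spelling of a named group; it plays the role that
# a group starting with <id> plays for symcomp in the text above.
prog = re.compile('(?P<id>[a-z][a-z0-9]*) = ([0-9]+)')
m = prog.match('x1 = 42')

whole = m.group(0)         # index 0: the entire matching string
pair = m.group(1, 2)       # several indices: a tuple, one item per index
by_name = m.group('id')    # a string index: lookup by symbolic group name

print(whole)
print(pair)
print(by_name)
print(prog.groupindex['id'])   # numerical index behind the symbolic name
```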
				157	\noindent
				158	Compiled regular expressions support these data attributes:
				159
				160	\renewcommand{\indexsubitem}{(regex attribute)}
Guido van Rossum	326c0bc	1994-01-03 00:00:31 +0000	[diff] [blame]	161
Guido van Rossum	5fdeeea	1994-01-02 01:22:07 +0000	[diff] [blame]	162	\begin{datadesc}{regs}
				163	When the last call to the \code{match} or \code{search} method found a
				164	match, this is a tuple of pairs of indices corresponding to the
				165	beginning and end of all parenthesized groups in the pattern. Indices
				166	are relative to the string argument passed to \code{match} or
				167	\code{search}. The 0-th pair gives the beginning and end of the
				168	whole match. When the last match or search failed, this is
				169	\code{None}.
				170	\end{datadesc}
				171
				172	\begin{datadesc}{last}
				173	When the last call to the \code{match} or \code{search} method found a
				174	match, this is the string argument passed to that method. When the
				175	last match or search failed, this is \code{None}.
				176	\end{datadesc}
				177
				178	\begin{datadesc}{translate}
				179	This is the value of the \var{translate} argument to
				180	\code{regex.compile} that created this regular expression object. If
				181	the \var{translate} argument was omitted in the \code{regex.compile}
				182	call, this is \code{None}.
				183	\end{datadesc}
Guido van Rossum	326c0bc	1994-01-03 00:00:31 +0000	[diff] [blame]	184
				185	\begin{datadesc}{givenpat}
				186	The regular expression pattern as passed to \code{compile} or
				187	\code{symcomp}.
				188	\end{datadesc}
				189
				190	\begin{datadesc}{realpat}
				191	The regular expression after stripping the group names for regular
				192	expressions compiled with \code{symcomp}. Same as \code{givenpat}
				193	otherwise.
				194	\end{datadesc}
				195
				196	\begin{datadesc}{groupindex}
				197	A dictionary giving the mapping from symbolic group names to numerical
				198	group indices for regular expressions compiled with \code{symcomp}.
				199	\code{None} otherwise.
				200	\end{datadesc}