Blame - Doc/libregex.tex - platform/external/python/cpython2

Guido van Rossum	5fdeeea	1994-01-02 01:22:07 +0000	[diff] [blame]	1	\section{Built-in Module \sectcode{regex}}
				2
				3	\bimodindex{regex}
				4	This module provides regular expression matching operations similar to
				5	those found in Emacs. It is always available.
				6
				7	By default the patterns are Emacs-style regular expressions; there is
				8	a way to change the syntax to match that of several well-known
				9	\UNIX{} utilities.
				10
				11	This module is 8-bit clean: both patterns and strings may contain null
				12	bytes and characters whose high bit is set.
				13
				14	\strong{Please note:} There is a little-known fact about Python string literals
				15	which means that you don't usually have to worry about doubling
				16	backslashes, even though they are used to escape special characters in
				17	string literals as well as in regular expressions. This is because
				18	Python doesn't remove backslashes from string literals if they are
				19	followed by an unrecognized escape character. \emph{However}, if you
				20	want to include a literal \dfn{backslash} in a regular expression
				21	represented as a string literal, you have to \emph{quadruple} it. E.g.
				22	to extract LaTeX \samp{\e section\{{\rm \ldots}\}} headers from a document, you can
				23	use this pattern: \code{'\e \e \e\e section\{\e (.*\e )\}'}.
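
As a small illustration of the quadrupling rule, the sketch below checks how
many backslash characters actually end up in the string object. The name
\code{pattern_prefix} is made up for this example, and only the leading part of
the pattern above is used.

```python
# A single backslash character, built without using any escape sequence.
BACKSLASH = chr(92)

# Hypothetical name: the leading part of the pattern shown in the text.
# The literal is written with four backslashes.
pattern_prefix = '\\\\section'

# Python halves them while reading the literal, so the string object holds
# two backslashes; the regular expression compiler then reads those two as
# one literal backslash.
backslashes_in_string = pattern_prefix.count(BACKSLASH)
print(len(pattern_prefix), backslashes_in_string)
```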
				24
				25	The module defines these functions, and an exception:
				26
				27	\renewcommand{\indexsubitem}{(in module regex)}
				28	\begin{funcdesc}{match}{pattern\, string}
				29	Return how many characters at the beginning of \var{string} match
				30	the regular expression \var{pattern}. Return \code{-1} if the
				31	string does not match the pattern (this is different from a
				32	zero-length match!).
				33	\end{funcdesc}
				34
				35	\begin{funcdesc}{search}{pattern\, string}
				36	Return the first position in \var{string} that matches the regular
				37	expression \var{pattern}. Return \code{-1} if no position in the string
				38	matches the pattern (this is different from a zero-length match
				39	anywhere!).
				40	\end{funcdesc}
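
The return conventions of \code{match} and \code{search} can be sketched as
follows. This is only an emulation under stated assumptions: it uses the
standard \code{re} module rather than \code{regex}, and the helper names
\code{match_length} and \code{search_position} are hypothetical. The patterns
are chosen so that they do not depend on the Emacs-style syntax.

```python
import re

def match_length(pattern, string):
    # Hypothetical helper: number of characters matched at the beginning
    # of the string, or -1 when there is no match at all.
    m = re.match(pattern, string)
    return -1 if m is None else m.end()

def search_position(pattern, string):
    # Hypothetical helper: first position where the pattern matches,
    # or -1 when no position matches.
    m = re.search(pattern, string)
    return -1 if m is None else m.start()

print(match_length('a*', 'aaab'))     # three characters match
print(match_length('a*', 'bbb'))      # zero-length match, not a failure
print(match_length('a+', 'bbb'))      # no match
print(search_position('b', 'aaab'))   # found at index 3
print(search_position('x', 'aaab'))   # not found
```

The second and third calls show why a failed match has to be tested with
\code{< 0} and not with a plain truth test: a zero-length match returns
\code{0}, while a failure returns \code{-1}.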
				41
				42	\begin{funcdesc}{compile}{pattern\, translate}
				43	Compile a regular expression pattern into a regular expression
				44	object, which can be used for matching using its \code{match} and
				45	\code{search} methods, described below. The optional
				46	\var{translate}, if present, must be a 256-character string
				47	indicating how characters (both of the pattern and of the strings to
				48	be matched) are translated before comparing them; the \code{i}-th
				49	element of the string gives the translation for the character with
				50	ASCII code \code{i}.
				51
				52	The sequence
				53
				54	\bcode\begin{verbatim}
				55	prog = regex.compile(pat)
				56	result = prog.match(str)
				57	\end{verbatim}\ecode
				58
				59	is equivalent to
				60
				61	\bcode\begin{verbatim}
				62	result = regex.match(pat, str)
				63	\end{verbatim}\ecode
				64
				65	but the version using \code{compile()} is more efficient when multiple
				66	regular expressions are used concurrently in a single program. (The
				67	compiled version of the last pattern passed to \code{regex.match()} or
				68	\code{regex.search()} is cached, so programs that use only a single
				69	regular expression at a time needn't worry about compiling regular
				70	expressions.)
				71	\end{funcdesc}
				72
				73	\begin{funcdesc}{set_syntax}{flags}
				74	Set the syntax to be used by future calls to \code{compile},
				75	\code{match} and \code{search}. (Already compiled expression objects
				76	are not affected.) The argument is an integer which is the bitwise OR of
				77	several flag bits. The return value is the previous value of
				78	the syntax flags. Names for the flags are defined in the standard
				79	module \code{regex_syntax}; read the file \file{regex_syntax.py} for
				80	more information.
				81	\end{funcdesc}
				82
				83	\begin{excdesc}{error}
				84	Exception raised when a string passed to one of the functions here
				85	is not a valid regular expression (e.g., unmatched parentheses) or
				86	when some other error occurs during compilation or matching. (It is
				87	never an error if a string contains no match for a pattern.)
				88	\end{excdesc}
				89
				90	\begin{datadesc}{casefold}
				91	A string suitable to pass as the \var{translate} argument to
				92	\code{compile} to map all uppercase characters to their lowercase
				93	equivalents.
				94	\end{datadesc}
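
To make the shape of a \var{translate} string concrete, here is a minimal
sketch that builds a 256-character table in plain Python. It assumes that only
the ASCII letters \code{A} to \code{Z} are folded; the exact contents of
\code{casefold} are not spelled out here, and the names
\code{make_fold_table} and \code{apply_table} are hypothetical.

```python
def make_fold_table():
    # Build a 256-character string whose i-th element is the translation
    # of the character with code i. Assumption: only ASCII upper case
    # letters are mapped; every other character maps to itself.
    chars = []
    for i in range(256):
        c = chr(i)
        if 'A' <= c <= 'Z':
            c = c.lower()
        chars.append(c)
    return ''.join(chars)

def apply_table(text, table):
    # Translate each character by indexing the table with its code,
    # which is how the text describes the table being used.
    return ''.join(table[ord(c)] for c in text)

table = make_fold_table()
print(len(table))
print(apply_table('Hello, World', table))
```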
				95
				96	\noindent
				97	Compiled regular expression objects support these methods:
				98
				99	\renewcommand{\indexsubitem}{(regex method)}
				100	\begin{funcdesc}{match}{string\, pos}
				101	Return how many characters at the beginning of \var{string} match
				102	the compiled regular expression. Return \code{-1} if the string
				103	does not match the pattern (this is different from a zero-length
				104	match!).
				105
				106	The optional second parameter \var{pos} gives an index in the string
				107	where the search is to start; it defaults to \code{0}. This is not
				108	completely equivalent to slicing the string; the \code{'\^'} pattern
				109	character matches at the real beginning of the string and at positions
				110	just after a newline, not necessarily at the index where the search
				111	is to start.
				112	\end{funcdesc}
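
The difference between passing \var{pos} and slicing the string can be
illustrated with the standard \code{re} module, whose compiled objects accept
a similar starting index. This is an analogy only, not a demonstration of the
\code{regex} module itself: \code{re} uses a different pattern syntax and
return values.

```python
import re

prog = re.compile('^b')
text = 'ab'

# Starting the search at index 1 does not make '^' match there, because
# index 1 is not the real beginning of the string.
with_pos = prog.search(text, 1)

# Slicing creates a new string whose real beginning is the 'b'.
with_slice = prog.search(text[1:])

print(with_pos is None, with_slice is not None)
```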
				113
				114	\begin{funcdesc}{search}{string\, pos}
				115	Return the first position in \var{string} that matches the compiled
				116	regular expression. Return \code{-1} if no position in the
				117	string matches the pattern (this is different from a zero-length
				118	match anywhere!).
				119
				120	The optional second parameter has the same meaning as for the
				121	\code{match} method.
				122	\end{funcdesc}
				123
				124	\begin{funcdesc}{group}{index\, index\, ...}
				125	This method is only valid when the last call to the \code{match}
				126	or \code{search} method found a match. It returns one or more
				127	groups of the match. If there is a single \var{index} argument,
				128	the result is a single string; if there are multiple arguments, the
				129	result is a tuple with one item per argument. If the \var{index} is
				130	zero, the corresponding return value is the entire matching string; if
				131	it is in the inclusive range [1..9], it is the string matching
				132	the corresponding parenthesized group (using the default syntax,
				133	groups are parenthesized using \code{\e (} and \code{\e )}). If no
				134	such group exists, the corresponding result is \code{None}.
				135	\end{funcdesc}
				136
				137	\noindent
				138	Compiled regular expressions support these data attributes:
				139
				140	\renewcommand{\indexsubitem}{(regex attribute)}
				141	\begin{datadesc}{regs}
				142	When the last call to the \code{match} or \code{search} method found a
				143	match, this is a tuple of pairs of indices corresponding to the
				144	beginning and end of all parenthesized groups in the pattern. Indices
				145	are relative to the string argument passed to \code{match} or
				146	\code{search}. The 0-th pair gives the beginning and end of the
				147	whole match. When the last match or search failed, this is
				148	\code{None}.
				149	\end{datadesc}
				150
				151	\begin{datadesc}{last}
				152	When the last call to the \code{match} or \code{search} method found a
				153	match, this is the string argument passed to that method. When the
				154	last match or search failed, this is \code{None}.
				155	\end{datadesc}
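
Since the pairs in \code{regs} are indices into the string kept in
\code{last}, the matched text of each group can be recovered by slicing. The
sketch below uses made-up values for both attributes, as they might look after
a successful match of a pattern with two groups; no \code{regex} object is
created.

```python
# Hypothetical attribute values after a successful match.
last = 'spam and eggs'
regs = ((0, 13), (0, 4), (9, 13))

# Pair 0 covers the whole match; the remaining pairs cover the groups.
pieces = [last[start:end] for start, end in regs]
print(pieces)
```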
				156
				157	\begin{datadesc}{translate}
				158	This is the value of the \var{translate} argument to
				159	\code{regex.compile} that created this regular expression object. If
				160	the \var{translate} argument was omitted in the \code{regex.compile}
				161	call, this is \code{None}.
				162	\end{datadesc}