.TH PCRE2PATTERN 3 "12 January 2022" "PCRE2 10.40"
.SH NAME
				3	PCRE2 - Perl-compatible regular expressions (revised API)
				4	.SH "PCRE2 REGULAR EXPRESSION DETAILS"
				5	.rs
				6	.sp
				7	The syntax and semantics of the regular expressions that are supported by PCRE2
				8	are described in detail below. There is a quick-reference syntax summary in the
				9	.\" HREF
				10	\fBpcre2syntax\fP
				11	.\"
				12	page. PCRE2 tries to match Perl syntax and semantics as closely as it can.
				13	PCRE2 also supports some alternative regular expression syntax (which does not
				14	conflict with the Perl syntax) in order to provide some compatibility with
				15	regular expressions in Python, .NET, and Oniguruma.
				16	.P
				17	Perl's regular expressions are described in its own documentation, and regular
				18	expressions in general are covered in a number of books, some of which have
				19	copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published
				20	by O'Reilly, covers regular expressions in great detail. This description of
				21	PCRE2's regular expressions is intended as reference material.
				22	.P
				23	This document discusses the regular expression patterns that are supported by
				24	PCRE2 when its main matching function, \fBpcre2_match()\fP, is used. PCRE2 also
				25	has an alternative matching function, \fBpcre2_dfa_match()\fP, which matches
				26	using a different algorithm that is not Perl-compatible. Some of the features
				27	discussed below are not available when DFA matching is used. The advantages and
				28	disadvantages of the alternative function, and how it differs from the normal
				29	function, are discussed in the
				30	.\" HREF
				31	\fBpcre2matching\fP
				32	.\"
				33	page.
				34	.
				35	.
				36	.SH "SPECIAL START-OF-PATTERN ITEMS"
				37	.rs
				38	.sp
				39	A number of options that can be passed to \fBpcre2_compile()\fP can also be set
				40	by special items at the start of a pattern. These are not Perl-compatible, but
				41	are provided to make these options accessible to pattern writers who are not
				42	able to change the program that processes the pattern. Any number of these
				43	items may appear, but they must all be together right at the start of the
				44	pattern string, and the letters must be in upper case.
				45	.
				46	.
				47	.SS "UTF support"
				48	.rs
				49	.sp
				50	In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either as
				51	single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
				52	specified for the 32-bit library, in which case it constrains the character
				53	values to valid Unicode code points. To process UTF strings, PCRE2 must be
				54	built to include Unicode support (which is the default). When using UTF strings
				55	you must either call the compiling function with one or both of the PCRE2_UTF
				56	or PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the special
sequence (*UTF), which is equivalent to setting the PCRE2_UTF option. How
				58	setting a UTF mode affects pattern matching is mentioned in several places
				59	below. There is also a summary of features in the
				60	.\" HREF
				61	\fBpcre2unicode\fP
				62	.\"
				63	page.
				64	.P
				65	Some applications that allow their users to supply patterns may wish to
				66	restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
				67	option is passed to \fBpcre2_compile()\fP, (*UTF) is not allowed, and its
				68	appearance in a pattern causes an error.
				69	.
				70	.
				71	.SS "Unicode property support"
				72	.rs
				73	.sp
				74	Another special sequence that may appear at the start of a pattern is (*UCP).
				75	This has the same effect as setting the PCRE2_UCP option: it causes sequences
				76	such as \ed and \ew to use Unicode properties to determine character types,
				77	instead of recognizing only characters with codes less than 256 via a lookup
table. It also causes upper/lower casing operations to use Unicode properties
				79	for characters with code points greater than 127, even when UTF is not set.
				80	.P
				81	Some applications that allow their users to supply patterns may wish to
				82	restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
				83	\fBpcre2_compile()\fP, (*UCP) is not allowed, and its appearance in a pattern
				84	causes an error.
				85	.
				86	.
				87	.SS "Locking out empty string matching"
				88	.rs
				89	.sp
Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect
				91	as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever
				92	matching function is subsequently called to match the pattern. These options
				93	lock out the matching of empty strings, either entirely, or only at the start
				94	of the subject.
				95	.
				96	.
				97	.SS "Disabling auto-possessification"
				98	.rs
				99	.sp
				100	If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting
				101	the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making quantifiers
				102	possessive when what follows cannot match the repeated item. For example, by
				103	default a+b is treated as a++b. For more details, see the
				104	.\" HREF
				105	\fBpcre2api\fP
				106	.\"
				107	documentation.
				108	.
				109	.
				110	.SS "Disabling start-up optimizations"
				111	.rs
				112	.sp
				113	If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
				114	PCRE2_NO_START_OPTIMIZE option. This disables several optimizations for quickly
				115	reaching "no match" results. For more details, see the
				116	.\" HREF
				117	\fBpcre2api\fP
				118	.\"
				119	documentation.
				120	.
				121	.
				122	.SS "Disabling automatic anchoring"
				123	.rs
				124	.sp
				125	If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
				126	setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that
				127	apply to patterns whose top-level branches all start with .* (match any number
				128	of arbitrary characters). For more details, see the
				129	.\" HREF
				130	\fBpcre2api\fP
				131	.\"
				132	documentation.
				133	.
				134	.
				135	.SS "Disabling JIT compilation"
				136	.rs
				137	.sp
				138	If a pattern that starts with (*NO_JIT) is successfully compiled, an attempt by
				139	the application to apply the JIT optimization by calling
				140	\fBpcre2_jit_compile()\fP is ignored.
				141	.
				142	.
				143	.SS "Setting match resource limits"
				144	.rs
				145	.sp
				146	The \fBpcre2_match()\fP function contains a counter that is incremented every
				147	time it goes round its main loop. The caller of \fBpcre2_match()\fP can set a
				148	limit on this counter, which therefore limits the amount of computing resource
				149	used for a match. The maximum depth of nested backtracking can also be limited;
				150	this indirectly restricts the amount of heap memory that is used, but there is
				151	also an explicit memory limit that can be set.
				152	.P
				153	These facilities are provided to catch runaway matches that are provoked by
				154	patterns with huge matching trees. A common example is a pattern with nested
				155	unlimited repeats applied to a long string that does not match. When one of
				156	these limits is reached, \fBpcre2_match()\fP gives an error return. The limits
				157	can also be set by items at the start of the pattern of the form
				158	.sp
				159	(*LIMIT_HEAP=d)
				160	(*LIMIT_MATCH=d)
				161	(*LIMIT_DEPTH=d)
				162	.sp
				163	where d is any number of decimal digits. However, the value of the setting must
				164	be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
				165	for it to have any effect. In other words, the pattern writer can lower the
				166	limits set by the programmer, but not raise them. If there is more than one
				167	setting of one of these limits, the lower value is used. The heap limit is
				168	specified in kibibytes (units of 1024 bytes).
				169	.P
				170	Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
				171	still recognized for backwards compatibility.
				172	.P
				173	The heap limit applies only when the \fBpcre2_match()\fP or
				174	\fBpcre2_dfa_match()\fP interpreters are used for matching. It does not apply
				175	to JIT. The match limit is used (but in a different way) when JIT is being
				176	used, or when \fBpcre2_dfa_match()\fP is called, to limit computing resource
				177	usage by those matching functions. The depth limit is ignored by JIT but is
				178	relevant for DFA matching, which uses function recursion for recursions within
				179	the pattern and for lookaround assertions and atomic groups. In this case, the
				180	depth limit controls the depth of such recursion.
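.P
The rule that a pattern can lower a limit but never raise it can be sketched
as follows. This is an illustration only, not PCRE2's code: the function name
and the dictionary of caller limits are hypothetical, and for brevity the
scan recognizes only the limit items, so any other leading item ends it.

```python
import re

# Hypothetical helper; it mirrors the rule described above, not PCRE2's code.
_LIMIT_ITEM = re.compile(
    r"\(\*(LIMIT_HEAP|LIMIT_MATCH|LIMIT_DEPTH|LIMIT_RECURSION)=(\d+)\)"
)


def effective_limits(pattern, caller_limits):
    """Return the limits in force after applying leading (*LIMIT_...=d) items.

    caller_limits maps "LIMIT_HEAP", "LIMIT_MATCH" and "LIMIT_DEPTH" to the
    values set (or defaulted) by the caller of the matching function.
    """
    limits = dict(caller_limits)
    pos = 0
    while True:
        m = _LIMIT_ITEM.match(pattern, pos)
        if m is None:
            break
        name, value = m.group(1), int(m.group(2))
        if name == "LIMIT_RECURSION":  # old name, still recognized
            name = "LIMIT_DEPTH"
        # A pattern can only lower a limit, so the smallest value wins.
        limits[name] = min(limits[name], value)
        pos = m.end()
    return limits
```

With caller limits of 10000 for the match limit and 250 for the depth limit,
a pattern starting (*LIMIT_MATCH=500)(*LIMIT_DEPTH=9999) ends up with a match
limit of 500 and an unchanged depth limit of 250.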
				181	.
				182	.
				183	.\" HTML <a name="newlines"></a>
				184	.SS "Newline conventions"
				185	.rs
				186	.sp
				187	PCRE2 supports six different conventions for indicating line breaks in
				188	strings: a single CR (carriage return) character, a single LF (linefeed)
				189	character, the two-character sequence CRLF, any of the three preceding, any
				190	Unicode newline sequence, or the NUL character (binary zero). The
				191	.\" HREF
				192	\fBpcre2api\fP
				193	.\"
				194	page has
				195	.\" HTML <a href="pcre2api.html#newlines">
				196	.\" </a>
				197	further discussion
				198	.\"
				199	about newlines, and shows how to set the newline convention when calling
				200	\fBpcre2_compile()\fP.
				201	.P
				202	It is also possible to specify a newline convention by starting a pattern
				203	string with one of the following sequences:
				204	.sp
				205	(*CR) carriage return
				206	(*LF) linefeed
				207	(*CRLF) carriage return, followed by linefeed
				208	(*ANYCRLF) any of the three above
				209	(*ANY) all Unicode newline sequences
				210	(*NUL) the NUL character (binary zero)
				211	.sp
				212	These override the default and the options given to the compiling function. For
				213	example, on a Unix system where LF is the default newline sequence, the pattern
				214	.sp
				215	(*CR)a.b
				216	.sp
				217	changes the convention to CR. That pattern matches "a\enb" because LF is no
				218	longer a newline. If more than one of these settings is present, the last one
				219	is used.
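.P
A minimal sketch of the "last one is used" rule, written with Python's re
module rather than PCRE2. The helper name is hypothetical, and only the
newline items are recognized. The final lines emulate (*CR)a.b by hand: under
the CR convention the dot excludes only CR, which for this one pattern can be
written as a negated class.

```python
import re

_NEWLINE_ITEM = re.compile(r"\(\*(CR|LF|CRLF|ANYCRLF|ANY|NUL)\)")


def split_newline_items(pattern, default="LF"):
    """Strip leading newline items; return (convention, rest of pattern)."""
    convention = default
    pos = 0
    while True:
        m = _NEWLINE_ITEM.match(pattern, pos)
        if m is None:
            break
        convention = m.group(1)  # the last setting is the one used
        pos = m.end()
    return convention, pattern[pos:]


convention, rest = split_newline_items("(*CR)a.b")

# Hand-written equivalent of "a.b" when CR is the only newline character.
emulated = re.compile("a[^\r]b")
```

Here convention is "CR" and rest is "a.b". The emulated pattern accepts a
subject with LF in the middle and rejects one with CR in the middle.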
				220	.P
				221	The newline convention affects where the circumflex and dollar assertions are
				222	true. It also affects the interpretation of the dot metacharacter when
				223	PCRE2_DOTALL is not set, and the behaviour of \eN when not followed by an
				224	opening brace. However, it does not affect what the \eR escape sequence
				225	matches. By default, this is any Unicode newline sequence, for Perl
				226	compatibility. However, this can be changed; see the next section and the
				227	description of \eR in the section entitled
				228	.\" HTML <a href="#newlineseq">
				229	.\" </a>
				230	"Newline sequences"
				231	.\"
				232	below. A change of \eR setting can be combined with a change of newline
				233	convention.
				234	.
				235	.
				236	.SS "Specifying what \eR matches"
				237	.rs
				238	.sp
				239	It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
				240	complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
				241	at compile time. This effect can also be achieved by starting a pattern with
(*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
				243	corresponding to PCRE2_BSR_UNICODE.
				244	.
				245	.
				246	.SH "EBCDIC CHARACTER CODES"
				247	.rs
				248	.sp
				249	PCRE2 can be compiled to run in an environment that uses EBCDIC as its
				250	character code instead of ASCII or Unicode (typically a mainframe system). In
				251	the sections below, character code values are ASCII or Unicode; in an EBCDIC
				252	environment these characters may have different code values, and there are no
				253	code points greater than 255.
				254	.
				255	.
				256	.SH "CHARACTERS AND METACHARACTERS"
				257	.rs
				258	.sp
				259	A regular expression is a pattern that is matched against a subject string from
				260	left to right. Most characters stand for themselves in a pattern, and match the
				261	corresponding characters in the subject. As a trivial example, the pattern
				262	.sp
				263	The quick brown fox
				264	.sp
				265	matches a portion of a subject string that is identical to itself. When
				266	caseless matching is specified (the PCRE2_CASELESS option or (?i) within the
				267	pattern), letters are matched independently of case. Note that there are two
				268	ASCII characters, K and S, that, in addition to their lower case ASCII
				269	equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F
				270	(long S) respectively when either PCRE2_UTF or PCRE2_UCP is set.
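.P
The two extra case equivalences can be observed with Unicode case folding.
The sketch below uses Python's str.casefold(), which is not PCRE2's
implementation but draws on the same Unicode data; the function name is
hypothetical.

```python
KELVIN_SIGN = "\u212a"  # U+212A
LONG_S = "\u017f"       # U+017F


def same_letter_ignoring_case(a, b):
    """Compare two single characters using Unicode case folding."""
    return a.casefold() == b.casefold()
```

Under this comparison the Kelvin sign is equal to both "K" and "k", and the
long S is equal to both "S" and "s".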
				271	.P
				272	The power of regular expressions comes from the ability to include wild cards,
				273	character classes, alternatives, and repetitions in the pattern. These are
				274	encoded in the pattern by the use of \fImetacharacters\fP, which do not stand
				275	for themselves but instead are interpreted in some special way.
				276	.P
				277	There are two different sets of metacharacters: those that are recognized
				278	anywhere in the pattern except within square brackets, and those that are
				279	recognized within square brackets. Outside square brackets, the metacharacters
				280	are as follows:
				281	.sp
				282	\e general escape character with several uses
				283	^ assert start of string (or line, in multiline mode)
				284	$ assert end of string (or line, in multiline mode)
				285	. match any character except newline (by default)
				286	[ start character class definition
				287	\| start of alternative branch
				288	( start group or control verb
				289	) end group or control verb
				290	* 0 or more quantifier
				291	+ 1 or more quantifier; also "possessive quantifier"
				292	? 0 or 1 quantifier; also quantifier minimizer
				293	{ start min/max quantifier
				294	.sp
				295	Part of a pattern that is in square brackets is called a "character class". In
				296	a character class the only metacharacters are:
				297	.sp
				298	\e general escape character
				299	^ negate the class, but only if the first character
				300	- indicates character range
				301	[ POSIX character class (if followed by POSIX syntax)
				302	] terminates the character class
				303	.sp
				304	If a pattern is compiled with the PCRE2_EXTENDED option, most white space in
				305	the pattern, other than in a character class, and characters between a #
				306	outside a character class and the next newline, inclusive, are ignored. An
				307	escaping backslash can be used to include a white space or a # character as
				308	part of the pattern. If the PCRE2_EXTENDED_MORE option is set, the same
				309	applies, but in addition unescaped space and horizontal tab characters are
				310	ignored inside a character class. Note: only these two characters are ignored,
				311	not the full set of pattern white space characters that are ignored outside a
				312	character class. Option settings can be changed within a pattern; see the
				313	section entitled
				314	.\" HTML <a href="#internaloptions">
				315	.\" </a>
				316	"Internal Option Setting"
				317	.\"
				318	below.
				319	.P
				320	The following sections describe the use of each of the metacharacters.
				321	.
				322	.
				323	.SH BACKSLASH
				324	.rs
				325	.sp
				326	The backslash character has several uses. Firstly, if it is followed by a
				327	character that is not a digit or a letter, it takes away any special meaning
				328	that character may have. This use of backslash as an escape character applies
				329	both inside and outside character classes.
				330	.P
				331	For example, if you want to match a * character, you must write \e* in the
				332	pattern. This escaping action applies whether or not the following character
				333	would otherwise be interpreted as a metacharacter, so it is always safe to
				334	precede a non-alphanumeric with backslash to specify that it stands for itself.
				335	In particular, if you want to match a backslash, you write \e\e.
				336	.P
				337	Only ASCII digits and letters have any special meaning after a backslash. All
				338	other characters (in particular, those whose code points are greater than 127)
				339	are treated as literals.
				340	.P
				341	If you want to treat all characters in a sequence as literals, you can do so by
				342	putting them between \eQ and \eE. This is different from Perl in that $ and @
				343	are handled as literals in \eQ...\eE sequences in PCRE2, whereas in Perl, $ and
				344	@ cause variable interpolation. Also, Perl does "double-quotish backslash
				345	interpolation" on any backslashes between \eQ and \eE which, its documentation
				346	says, "may lead to confusing results". PCRE2 treats a backslash between \eQ and
				347	\eE just like any other character. Note the following examples:
				348	.sp
  Pattern            PCRE2 matches   Perl matches
.sp
.\" JOIN
  \eQabc$xyz\eE        abc$xyz         abc followed by the
                                     contents of $xyz
  \eQabc\e$xyz\eE       abc\e$xyz        abc\e$xyz
  \eQabc\eE\e$\eQxyz\eE   abc$xyz         abc$xyz
  \eQA\eB\eE            A\eB             A\eB
  \eQ\e\eE              \e               \e\eE
				358	.sp
				359	The \eQ...\eE sequence is recognized both inside and outside character classes.
				360	An isolated \eE that is not preceded by \eQ is ignored. If \eQ is not followed
				361	by \eE later in the pattern, the literal interpretation continues to the end of
				362	the pattern (that is, \eE is assumed at the end). If the isolated \eQ is inside
				363	a character class, this causes an error, because the character class is not
				364	terminated by a closing square bracket.
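.P
The PCRE2 column of the table can be reproduced with a small preprocessor
that rewrites \eQ...\eE spans into escaped literals for Python's re module.
This is a sketch under stated assumptions: the function name is hypothetical,
re.escape() stands in for literal treatment, and character classes are not
modelled.

```python
import re


def expand_quoted(pattern):
    """Rewrite \\Q...\\E spans as escaped literals usable by Python's re."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("\\Q", i):
            end = pattern.find("\\E", i + 2)
            if end == -1:
                # No closing \E: the literal runs to the end of the pattern.
                literal, i = pattern[i + 2:], len(pattern)
            else:
                literal, i = pattern[i + 2:end], end + 2
            out.append(re.escape(literal))
        elif pattern.startswith("\\E", i):
            i += 2  # an isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # keep other escapes intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```

For example, expand_quoted(r"\eQabc$xyz\eE") yields a pattern that matches the
seven characters abc$xyz, with the dollar sign treated as a literal.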
				365	.
				366	.
				367	.\" HTML <a name="digitsafterbackslash"></a>
				368	.SS "Non-printing characters"
				369	.rs
				370	.sp
				371	A second use of backslash provides a way of encoding non-printing characters
				372	in patterns in a visible manner. There is no restriction on the appearance of
				373	non-printing characters in a pattern, but when a pattern is being prepared by
				374	text editing, it is often easier to use one of the following escape sequences
				375	instead of the binary character it represents. In an ASCII or Unicode
				376	environment, these escapes are as follows:
				377	.sp
				378	\ea alarm, that is, the BEL character (hex 07)
				379	\ecx "control-x", where x is any printable ASCII character
				380	\ee escape (hex 1B)
				381	\ef form feed (hex 0C)
				382	\en linefeed (hex 0A)
				383	\er carriage return (hex 0D) (but see below)
				384	\et tab (hex 09)
				385	\e0dd character with octal code 0dd
				386	\eddd character with octal code ddd, or backreference
				387	\eo{ddd..} character with octal code ddd..
				388	\exhh character with hex code hh
				389	\ex{hhh..} character with hex code hhh..
				390	\eN{U+hhh..} character with Unicode hex code point hhh..
				391	.sp
				392	By default, after \ex that is not followed by {, from zero to two hexadecimal
				393	digits are read (letters can be in upper or lower case). Any number of
				394	hexadecimal digits may appear between \ex{ and }. If a character other than a
				395	hexadecimal digit appears between \ex{ and }, or if there is no terminating },
				396	an error occurs.
				397	.P
				398	Characters whose code points are less than 256 can be defined by either of the
				399	two syntaxes for \ex or by an octal sequence. There is no difference in the way
				400	they are handled. For example, \exdc is exactly the same as \ex{dc} or \e334.
				401	However, using the braced versions does make such sequences easier to read.
				402	.P
				403	Support is available for some ECMAScript (aka JavaScript) escape sequences via
				404	two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \ex followed
				405	by { is not recognized. Only if \ex is followed by two hexadecimal digits is it
				406	recognized as a character escape. Otherwise it is interpreted as a literal "x"
character. In this mode, support for code points greater than 255 is provided
				408	by \eu, which must be followed by four hexadecimal digits; otherwise it is
				409	interpreted as a literal "u" character.
				410	.P
				411	PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition,
				412	\eu{hhh..} is recognized as the character specified by hexadecimal code point.
				413	There may be any number of hexadecimal digits. This syntax is from ECMAScript
				414	6.
				415	.P
				416	The \eN{U+hhh..} escape sequence is recognized only when PCRE2 is operating in
				417	UTF mode. Perl also uses \eN{name} to specify characters by Unicode name; PCRE2
				418	does not support this. Note that when \eN is not followed by an opening brace
				419	(curly bracket) it has an entirely different meaning, matching any character
				420	that is not a newline.
				421	.P
				422	There are some legacy applications where the escape sequence \er is expected to
				423	match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \er in a
				424	pattern is converted to \en so that it matches a LF (linefeed) instead of a CR
				425	(carriage return) character.
				426	.P
				427	The precise effect of \ecx on ASCII characters is as follows: if x is a lower
				428	case letter, it is converted to upper case. Then bit 6 of the character (hex
				429	40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
				430	but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
				431	code unit following \ec has a value less than 32 or greater than 126, a
				432	compile-time error occurs.
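.P
The \ecx computation for an ASCII environment can be written out directly.
The function name is hypothetical; the steps are the ones given above (fold
lower case to upper case, then invert bit 6).

```python
def control_code(x):
    """Return the code generated by \\cx in an ASCII environment."""
    code = ord(x)
    if code < 32 or code > 126:
        raise ValueError("\\c must be followed by a printable ASCII character")
    if "a" <= x <= "z":
        code = ord(x.upper())
    return code ^ 0x40  # invert bit 6 (hex 40)
```

So control_code("A") and control_code("a") both give hex 01, control_code("{")
gives hex 3B, and control_code(";") gives hex 7B.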
				433	.P
				434	When PCRE2 is compiled in EBCDIC mode, \eN{U+hhh..} is not supported. \ea, \ee,
				435	\ef, \en, \er, and \et generate the appropriate EBCDIC code values. The \ec
				436	escape is processed as specified for Perl in the \fBperlebcdic\fP document. The
				437	only characters that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ],
				438	^, _, or ?. Any other character provokes a compile-time error. The sequence
				439	\ec@ encodes character code 0; after \ec the letters (in either case) encode
				440	characters 1-26 (hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31
				441	(hex 1B to hex 1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F).
				442	.P
				443	Thus, apart from \ec?, these escapes generate the same character code values as
				444	they do in an ASCII environment, though the meanings of the values mostly
				445	differ. For example, \ecG always generates code value 7, which is BEL in ASCII
				446	but DEL in EBCDIC.
				447	.P
				448	The sequence \ec? generates DEL (127, hex 7F) in an ASCII environment, but
				449	because 127 is not a control character in EBCDIC, Perl makes it generate the
				450	APC character. Unfortunately, there are several variants of EBCDIC. In most of
				451	them the APC character has the value 255 (hex FF), but in the one Perl calls
				452	POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
				453	values, PCRE2 makes \ec? generate 95; otherwise it generates 255.
				454	.P
				455	After \e0 up to two further octal digits are read. If there are fewer than two
				456	digits, just those that are present are used. Thus the sequence \e0\ex\e015
				457	specifies two binary zeros followed by a CR character (code value 13). Make
				458	sure you supply two digits after the initial zero if the pattern character that
				459	follows is itself an octal digit.
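.P
The digit-counting rules for \e0 and for \ex without braces can be checked
with a toy decoder. It is a sketch only: the function name is hypothetical,
it handles just these two escapes, and every other character (including the
parts of any other escape) is passed through as its own code value.

```python
OCTAL_DIGITS = "01234567"
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_zero_and_hex_escapes(pattern):
    """Decode only \\0dd and unbraced \\xhh escapes; return code values."""
    codes = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("\\0", i):
            j = i + 2
            # Up to two further octal digits are read after the initial zero.
            while j < len(pattern) and j < i + 4 and pattern[j] in OCTAL_DIGITS:
                j += 1
            codes.append(int(pattern[i + 1:j], 8))
            i = j
        elif pattern.startswith("\\x", i) and not pattern.startswith("\\x{", i):
            j = i + 2
            # From zero to two hexadecimal digits are read.
            while j < len(pattern) and j < i + 4 and pattern[j] in HEX_DIGITS:
                j += 1
            digits = pattern[i + 2:j]
            codes.append(int(digits, 16) if digits else 0)
            i = j
        else:
            codes.append(ord(pattern[i]))
            i += 1
    return codes
```

Applied to the sequence \e0\ex\e015 this returns the values 0, 0 and 13, and
applied to \e0113 it returns 9 (tab) followed by the code for the character "3".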
				460	.P
				461	The escape \eo must be followed by a sequence of octal digits, enclosed in
				462	braces. An error occurs if this is not the case. This escape is a recent
addition to Perl; it provides a way of specifying character code points as octal
				464	numbers greater than 0777, and it also allows octal numbers and backreferences
				465	to be unambiguously specified.
				466	.P
				467	For greater clarity and unambiguity, it is best to avoid following \e by a
				468	digit greater than zero. Instead, use \eo{} or \ex{} to specify numerical
				469	character code points, and \eg{} to specify backreferences. The following
				470	paragraphs describe the old, ambiguous syntax.
				471	.P
				472	The handling of a backslash followed by a digit other than 0 is complicated,
				473	and Perl has changed over time, causing PCRE2 also to change.
				474	.P
				475	Outside a character class, PCRE2 reads the digit and any following digits as a
				476	decimal number. If the number is less than 10, begins with the digit 8 or 9, or
				477	if there are at least that many previous capture groups in the expression, the
				478	entire sequence is taken as a \fIbackreference\fP. A description of how this
				479	works is given
				480	.\" HTML <a href="#backreferences">
				481	.\" </a>
				482	later,
				483	.\"
				484	following the discussion of
				485	.\" HTML <a href="#group">
				486	.\" </a>
				487	parenthesized groups.
				488	.\"
				489	Otherwise, up to three octal digits are read to form a character code.
				490	.P
				491	Inside a character class, PCRE2 handles \e8 and \e9 as the literal characters
				492	"8" and "9", and otherwise reads up to three octal digits following the
				493	backslash, using them to generate a data character. Any subsequent digits stand
				494	for themselves. For example, outside a character class:
				495	.sp
  \e040   is another way of writing an ASCII space
.\" JOIN
  \e40    is the same, provided there are fewer than 40
            previous capture groups
  \e7     is always a backreference
.\" JOIN
  \e11    might be a backreference, or another way of
            writing a tab
  \e011   is always a tab
  \e0113  is a tab followed by the character "3"
.\" JOIN
  \e113   might be a backreference, otherwise the
            character with octal code 113
.\" JOIN
  \e377   might be a backreference, otherwise
            the value 255 (decimal)
  \e81    is always a backreference
				513	.sp
				514	Note that octal values of 100 or greater that are specified using this syntax
				515	must not be introduced by a leading zero, because no more than three octal
				516	digits are ever read.
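.P
The decision made outside a character class for a backslash followed by a
digit other than 0 can be sketched as below. The function name and return
shape are hypothetical, and the limits on character values for each library
width are not modelled.

```python
def classify_backslash_digits(digits, group_count):
    """Classify \\<digits> outside a character class (first digit 1-9).

    Returns ("backreference", number, "") or ("octal", code, leftover), where
    leftover holds trailing digits that stand for themselves.
    """
    if not (digits.isascii() and digits.isdigit()) or digits[0] == "0":
        raise ValueError("expected decimal digits not starting with 0")
    number = int(digits)
    if number < 10 or digits[0] in "89" or number <= group_count:
        return ("backreference", number, "")
    used = 0
    # Otherwise up to three octal digits form a character code.
    while used < 3 and used < len(digits) and digits[used] in "01234567":
        used += 1
    return ("octal", int(digits[:used], 8), digits[used:])
```

With no capture groups, classify_backslash_digits("11", 0) reports octal code
9 (a tab), whereas with eleven groups the same digits are a backreference to
group 11.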
				517	.
				518	.
				519	.SS "Constraints on character values"
				520	.rs
				521	.sp
				522	Characters that are specified using octal or hexadecimal numbers are
				523	limited to certain values, as follows:
				524	.sp
				525	8-bit non-UTF mode no greater than 0xff
				526	16-bit non-UTF mode no greater than 0xffff
				527	32-bit non-UTF mode no greater than 0xffffffff
				528	All UTF modes no greater than 0x10ffff and a valid code point
				529	.sp
				530	Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the
				531	so-called "surrogate" code points). The check for these can be disabled by the
				532	caller of \fBpcre2_compile()\fP by setting the option
				533	PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in UTF-8
				534	and UTF-32 modes, because these values are not representable in UTF-16.
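.P
The table and the surrogate rule can be restated as a single check. This is a
sketch with hypothetical names: width is the code unit width of the library
in use, and allow_surrogates stands in for the
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option.

```python
def escape_value_allowed(value, width, utf, allow_surrogates=False):
    """Check a numeric escape value against the limits tabulated above."""
    if width not in (8, 16, 32):
        raise ValueError("width must be 8, 16 or 32")
    if value < 0:
        return False
    if not utf:
        # Non-UTF modes: limited only by the size of a code unit.
        return value <= (1 << width) - 1
    if value > 0x10FFFF:
        return False
    if 0xD800 <= value <= 0xDFFF:
        # Surrogates can be permitted only in UTF-8 and UTF-32 modes.
        return allow_surrogates and width != 16
    return True
```

For instance, 0x100 is rejected in 8-bit non-UTF mode, and 0xd800 is accepted
in UTF-8 mode only when allow_surrogates is true.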
				535	.
				536	.
				537	.SS "Escape sequences in character classes"
				538	.rs
				539	.sp
				540	All the sequences that define a single character value can be used both inside
				541	and outside character classes. In addition, inside a character class, \eb is
				542	interpreted as the backspace character (hex 08).
				543	.P
				544	When not followed by an opening brace, \eN is not allowed in a character class.
				545	\eB, \eR, and \eX are not special inside a character class. Like other
				546	unrecognized alphabetic escape sequences, they cause an error. Outside a
				547	character class, these sequences have different meanings.
				548	.
				549	.
				550	.SS "Unsupported escape sequences"
				551	.rs
				552	.sp
				553	In Perl, the sequences \eF, \el, \eL, \eu, and \eU are recognized by its string
				554	handler and used to modify the case of following characters. By default, PCRE2
				555	does not support these escape sequences in patterns. However, if either of the
				556	PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \eU matches a "U"
				557	character, and \eu can be used to define a character by code point, as
				558	described above.
				559	.
				560	.
				561	.SS "Absolute and relative backreferences"
				562	.rs
				563	.sp
				564	The sequence \eg followed by a signed or unsigned number, optionally enclosed
				565	in braces, is an absolute or relative backreference. A named backreference
				566	can be coded as \eg{name}. Backreferences are discussed
				567	.\" HTML <a href="#backreferences">
				568	.\" </a>
				569	later,
				570	.\"
				571	following the discussion of
				572	.\" HTML <a href="#group">
				573	.\" </a>
				574	parenthesized groups.
				575	.\"
				576	.
				577	.
				578	.SS "Absolute and relative subroutine calls"
				579	.rs
				580	.sp
				581	For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
				582	a number enclosed either in angle brackets or single quotes, is an alternative
				583	syntax for referencing a capture group as a subroutine. Details are discussed
				584	.\" HTML <a href="#onigurumasubroutines">
				585	.\" </a>
				586	later.
				587	.\"
				588	Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
				589	synonymous. The former is a backreference; the latter is a
				590	.\" HTML <a href="#groupsassubroutines">
				591	.\" </a>
				592	subroutine
				593	.\"
				594	call.
				595	.
				596	.
				597	.\" HTML <a name="genericchartypes"></a>
				598	.SS "Generic character types"
				599	.rs
				600	.sp
				601	Another use of backslash is for specifying generic character types:
				602	.sp
				603	\ed any decimal digit
				604	\eD any character that is not a decimal digit
				605	\eh any horizontal white space character
				606	\eH any character that is not a horizontal white space character
				607	\eN any character that is not a newline
				608	\es any white space character
				609	\eS any character that is not a white space character
				610	\ev any vertical white space character
				611	\eV any character that is not a vertical white space character
				612	\ew any "word" character
				613	\eW any "non-word" character
				614	.sp
				615	The \eN escape sequence has the same meaning as
				616	.\" HTML <a href="#fullstopdot">
				617	.\" </a>
				618	the "." metacharacter
				619	.\"
				620	when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
				621	meaning of \eN. Note that when \eN is followed by an opening brace it has a
				622	different meaning. See the section entitled
				623	.\" HTML <a href="#digitsafterbackslash">
				624	.\" </a>
				625	"Non-printing characters"
				626	.\"
				627	above for details. Perl also uses \eN{name} to specify characters by Unicode
				628	name; PCRE2 does not support this.
				629	.P
				630	Each pair of lower and upper case escape sequences partitions the complete set
				631	of characters into two disjoint sets. Any given character matches one, and only
				632	one, of each pair. The sequences can appear both inside and outside character
				633	classes. They each match one character of the appropriate type. If the current
				634	matching point is at the end of the subject string, all of them fail, because
				635	there is no character to match.
				636	.P
				637	The default \es characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
				638	space (32), which are defined as white space in the "C" locale. This list may
				639	vary if locale-specific matching is taking place. For example, in some locales
				640	the "non-breaking space" character (\exA0) is recognized as white space, and in
				641	others the VT character is not.
				642	.P
				643	A "word" character is an underscore or any character that is a letter or digit.
				644	By default, the definition of letters and digits is controlled by PCRE2's
				645	low-valued character tables, and may vary if locale-specific matching is taking
				646	place (see
				647	.\" HTML <a href="pcre2api.html#localesupport">
				648	.\" </a>
				649	"Locale support"
				650	.\"
				651	in the
				652	.\" HREF
				653	\fBpcre2api\fP
				654	.\"
				655	page). For example, in a French locale such as "fr_FR" in Unix-like systems,
				656	or "french" in Windows, some character codes greater than 127 are used for
				657	accented letters, and these are then matched by \ew. The use of locales with
				658	Unicode is discouraged.
				659	.P
				660	By default, characters whose code points are greater than 127 never match \ed,
				661	\es, or \ew, and always match \eD, \eS, and \eW, although this may be different
				662	for characters in the range 128-255 when locale-specific matching is happening.
				663	These escape sequences retain their original meanings from before Unicode
				664	support was available, mainly for efficiency reasons. If the PCRE2_UCP option
				665	is set, the behaviour is changed so that Unicode properties are used to
				666	determine character types, as follows:
				667	.sp
				668	\ed any character that matches \ep{Nd} (decimal digit)
				669	\es any character that matches \ep{Z} or \eh or \ev
				670	\ew any character that matches \ep{L} or \ep{N}, plus underscore
				671	.sp
				672	The upper case escapes match the inverse sets of characters. Note that \ed
				673	matches only decimal digits, whereas \ew matches any Unicode digit, as well as
				674	any Unicode letter, and underscore. Note also that PCRE2_UCP affects \eb and
				675	\eB because they are defined in terms of \ew and \eW. Matching these sequences
				676	is noticeably slower when PCRE2_UCP is set.
				677	.P
				678	The sequences \eh, \eH, \ev, and \eV, in contrast to the other sequences, which
				679	match only ASCII characters by default, always match a specific list of code
				680	points, whether or not PCRE2_UCP is set. The horizontal space characters are:
				681	.sp
				682	U+0009 Horizontal tab (HT)
				683	U+0020 Space
				684	U+00A0 Non-break space
				685	U+1680 Ogham space mark
				686	U+180E Mongolian vowel separator
				687	U+2000 En quad
				688	U+2001 Em quad
				689	U+2002 En space
				690	U+2003 Em space
				691	U+2004 Three-per-em space
				692	U+2005 Four-per-em space
				693	U+2006 Six-per-em space
				694	U+2007 Figure space
				695	U+2008 Punctuation space
				696	U+2009 Thin space
				697	U+200A Hair space
				698	U+202F Narrow no-break space
				699	U+205F Medium mathematical space
				700	U+3000 Ideographic space
				701	.sp
				702	The vertical space characters are:
				703	.sp
				704	U+000A Linefeed (LF)
				705	U+000B Vertical tab (VT)
				706	U+000C Form feed (FF)
				707	U+000D Carriage return (CR)
				708	U+0085 Next line (NEL)
				709	U+2028 Line separator
				710	U+2029 Paragraph separator
				711	.sp
				712	In 8-bit, non-UTF-8 mode, only the characters with code points less than 256
				713	are relevant.
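.P
As an illustration only, the two lists above can be written down as lookup
sets. The following Python sketch does not call PCRE2; the names HSPACE,
VSPACE, is_hspace, and is_vspace are hypothetical, and the eight_bit_non_utf
flag is an assumed way of modelling the 8-bit, non-UTF-8 restriction.

```python
# Code points copied from the horizontal and vertical space lists above.
HSPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))  # U+2000 through U+200A
    + [0x202F, 0x205F, 0x3000]
)
VSPACE = frozenset([0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029])


def is_hspace(cp, eight_bit_non_utf=False):
    # Models \h; the complement models \H.
    if eight_bit_non_utf and cp > 0xFF:
        return False  # only code points below 256 are relevant
    return cp in HSPACE


def is_vspace(cp, eight_bit_non_utf=False):
    # Models \v; the complement models \V.
    if eight_bit_non_utf and cp > 0xFF:
        return False
    return cp in VSPACE
```

Because the lists are fixed, the result does not depend on PCRE2_UCP, which is
the point made in the text above.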
				714	.
				715	.
				716	.\" HTML <a name="newlineseq"></a>
				717	.SS "Newline sequences"
				718	.rs
				719	.sp
				720	Outside a character class, by default, the escape sequence \eR matches any
				721	Unicode newline sequence. In 8-bit non-UTF-8 mode \eR is equivalent to the
				722	following:
				723	.sp
				724	(?>\er\en|\en|\ex0b|\ef|\er|\ex85)
				725	.sp
				726	This is an example of an "atomic group", details of which are given
				727	.\" HTML <a href="#atomicgroup">
				728	.\" </a>
				729	below.
				730	.\"
				731	This particular group matches either the two-character sequence CR followed by
				732	LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
				733	U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
				734	line, U+0085). Because this is an atomic group, the two-character sequence is
				735	treated as a single unit that cannot be split.
				736	.P
				737	In other modes, two additional characters whose code points are greater than 255
				738	are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
				739	Unicode support is not needed for these characters to be recognized.
				740	.P
				741	It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
				742	complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
				743	at compile time. (BSR is an abbreviation for "backslash R".) This can be made
				744	the default when PCRE2 is built; if this is the case, the other behaviour can
				745	be requested via the PCRE2_BSR_UNICODE option. It is also possible to specify
				746	these settings by starting a pattern string with one of the following
				747	sequences:
				748	.sp
				749	(*BSR_ANYCRLF) CR, LF, or CRLF only
				750	(*BSR_UNICODE) any Unicode newline sequence
				751	.sp
				752	These override the default and the options given to the compiling function.
				753	Note that these special settings, which are not Perl-compatible, are recognized
				754	only at the very start of a pattern, and that they must be in upper case. If
				755	more than one of them is present, the last one is used. They can be combined
				756	with a change of newline convention; for example, a pattern can start with:
				757	.sp
				758	(*ANY)(*BSR_ANYCRLF)
				759	.sp
				760	They can also be combined with the (*UTF) or (*UCP) special sequences. Inside a
				761	character class, \eR is treated as an unrecognized escape sequence, and causes
				762	an error.
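.P
The behaviour of \eR outside a character class can be modelled without PCRE2
itself. The Python sketch below is a rough model under stated assumptions: the
function name bsr_length and its keyword arguments are hypothetical, the
anycrlf flag stands in for PCRE2_BSR_ANYCRLF, and the subject is treated as a
string of code points.

```python
# Single characters that \R accepts when all Unicode newlines are allowed.
UNICODE_NEWLINES = "\n\x0b\x0c\r\x85\u2028\u2029"


def bsr_length(subject, pos, anycrlf=False, eight_bit_non_utf=False):
    # Returns the number of characters \R would consume at pos, or 0.
    # CRLF is tried first and never split, mirroring the atomic group.
    if subject.startswith("\r\n", pos):
        return 2
    if pos >= len(subject):
        return 0
    ch = subject[pos]
    if anycrlf:
        return 1 if ch in "\r\n" else 0
    if eight_bit_non_utf and ord(ch) > 0xFF:
        return 0  # LS and PS are not available in 8-bit non-UTF-8 mode
    return 1 if ch in UNICODE_NEWLINES else 0
```

A return value of 2 for CRLF with no fallback to 1 is what "cannot be split"
means in practice: a following pattern item never sees the LF on its own.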
				763	.
				764	.
				765	.\" HTML <a name="uniextseq"></a>
				766	.SS Unicode character properties
				767	.rs
				768	.sp
				769	When PCRE2 is built with Unicode support (the default), three additional escape
				770	sequences that match characters with specific properties are available. They
				771	can be used in any mode, though in 8-bit and 16-bit non-UTF modes these
				772	sequences are of course limited to testing characters whose code points are
				773	less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, code points
				774	greater than 0x10ffff (the Unicode limit) may be encountered. These are all
Elliott Hughes	4e19c8e	2022-04-15 15:11:02 -0700	[diff] [blame]	775	treated as being in the Unknown script and with an unassigned type.
				776	.P
				777	Matching characters by Unicode property is not fast, because PCRE2 has to do a
				778	multistage table lookup in order to find a character's property. That is why
				779	the traditional escape sequences such as \ed and \ew do not use Unicode
				780	properties in PCRE2 by default, though you can make them do so by setting the
				781	PCRE2_UCP option or by starting the pattern with (*UCP).
				782	.P
				783	The extra escape sequences that provide property support are:
Elliott Hughes	5b80804	2021-10-01 10:56:10 -0700	[diff] [blame]	784	.sp
				785	\ep{\fIxx\fP} a character with the \fIxx\fP property
				786	\eP{\fIxx\fP} a character without the \fIxx\fP property
				787	\eX a Unicode extended grapheme cluster
				788	.sp
Elliott Hughes	4e19c8e	2022-04-15 15:11:02 -0700	[diff] [blame]	789	The property names represented by \fIxx\fP above are not case-sensitive, and in
				790	accordance with Unicode's "loose matching" rules, spaces, hyphens, and
				791	underscores are ignored. There is support for Unicode script names, Unicode
				792	general category properties, "Any", which matches any character (including
				793	newline), Bidi_Class, a number of binary (yes/no) properties, and some special
				794	PCRE2 properties (described
Elliott Hughes	5b80804	2021-10-01 10:56:10 -0700	[diff] [blame]	795	.\" HTML <a href="#extraprops">
				796	.\" </a>
Elliott Hughes	4e19c8e	2022-04-15 15:11:02 -0700	[diff] [blame]	797	below).
Elliott Hughes	5b80804	2021-10-01 10:56:10 -0700	[diff] [blame]	798	.\"
Elliott Hughes	4e19c8e	2022-04-15 15:11:02 -0700	[diff] [blame]	799	Certain other Perl properties such as "InMusicalSymbols" are not supported by
				800	PCRE2. Note that \eP{Any} does not match any characters, so always causes a
				801	match failure.
				802	.
				803	.
				804	.
				805	.SS "Script properties for \ep and \eP"
				806	.rs
				807	.sp
				808	There are three different syntax forms for matching a script. Each Unicode
				809	character has a basic script and, optionally, a list of other scripts ("Script
				810	Extensions") with which it is commonly used. Using the Adlam script as an
				811	example, \ep{sc:Adlam} matches characters whose basic script is Adlam, whereas
				812	\ep{scx:Adlam} matches, in addition, characters that have Adlam in their
				813	extensions list. The full names "script" and "script extensions" for the
				814	property types are recognized, and an equals sign is an alternative to the
				815	colon. If a script name is given without a property type, for example,
				816	\ep{Adlam}, it is treated as \ep{scx:Adlam}. Perl changed to this
				817	interpretation at release 5.26 and PCRE2 changed at release 10.40.
Elliott Hughes	5b80804	2021-10-01 10:56:10 -0700	[diff] [blame]	818	.P
Elliott Hughes	5b80804	2021-10-01 10:56:10 -0700	[diff] [blame]	819	Unassigned characters (and in non-UTF 32-bit mode, characters with code points
				820	greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
				821	part of an identified script are lumped together as "Common". The current list
Elliott Hughes	4e19c8e	2022-04-15 15:11:02 -0700	[diff] [blame]	822	of recognized script names and their 4-character abbreviations can be obtained
				823	by running this command:
				824	.sp
				825	pcre2test -LS
				826	.sp
				827	.
				828	.
				829	.
				830	.SS "The general category property for \ep and \eP"
				831	.rs
				832	.sp
Elliott Hughes	5b80804	2021-10-01 10:56:10 -0700	[diff] [blame]	833	Each character has exactly one Unicode general category property, specified by
				834	a two-letter abbreviation. For compatibility with Perl, negation can be
				835	specified by including a circumflex between the opening brace and the property
				836	name. For example, \ep{^Lu} is the same as \eP{Lu}.
				837	.P
				838	If only one letter is specified with \ep or \eP, it includes all the general
				839	category properties that start with that letter. In this case, in the absence
				840	of negation, the curly brackets in the escape sequence are optional; these two
				841	examples have the same effect:
				842	.sp
				843	\ep{L}
				844	\epL
				845	.sp
				846	The following general category property codes are supported:
				847	.sp
				848	C Other
				849	Cc Control
				850	Cf Format
				851	Cn Unassigned
				852	Co Private use
				853	Cs Surrogate
				854	.sp
				855	L Letter
				856	Ll Lower case letter
				857	Lm Modifier letter
				858	Lo Other letter
				859	Lt Title case letter
				860	Lu Upper case letter
				861	.sp
				862	M Mark
				863	Mc Spacing mark
				864	Me Enclosing mark
				865	Mn Non-spacing mark
				866	.sp
				867	N Number
				868	Nd Decimal number
				869	Nl Letter number
				870	No Other number
				871	.sp
				872	P Punctuation
				873	Pc Connector punctuation
				874	Pd Dash punctuation
				875	Pe Close punctuation
				876	Pf Final punctuation
				877	Pi Initial punctuation
				878	Po Other punctuation
				879	Ps Open punctuation
				880	.sp
				881	S Symbol
				882	Sc Currency symbol
				883	Sk Modifier symbol
				884	Sm Mathematical symbol
				885	So Other symbol
				886	.sp
				887	Z Separator
				888	Zl Line separator
				889	Zp Paragraph separator
				890	Zs Space separator
				891	.sp
Elliott Hughes	4e19c8e	2022-04-15 15:11:02 -0700	[diff] [blame]	892	The special property LC, which has the synonym L&, is also supported: it
				893	matches a character that has the Lu, Ll, or Lt property, in other words, a
				894	letter that is not classified as a modifier or "other".
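.P
The one-letter, two-letter, negated, and LC forms can be illustrated with
Python's unicodedata module, which reports the two-letter general category of
a character. This is a sketch, not PCRE2's implementation: the function name
has_general_category is hypothetical, property names are compared exactly (the
loose matching of names is not modelled), and the Unicode version of the
Python installation may differ from the one PCRE2 was built with.

```python
import unicodedata


def has_general_category(ch, prop):
    # A leading circumflex negates the test, as in \p{^Lu}.
    negate = prop.startswith("^")
    if negate:
        prop = prop[1:]
    cat = unicodedata.category(ch)  # two-letter code such as "Lu"
    if prop in ("LC", "L&"):
        result = cat in ("Lu", "Ll", "Lt")
    elif len(prop) == 1:
        result = cat.startswith(prop)  # "L" covers Ll, Lm, Lo, Lt, Lu
    else:
        result = cat == prop
    return result != negate
```

For example, a modifier letter satisfies "L" but not "LC", which is the
distinction drawn in the paragraph above.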
Elliott Hughes	5b80804	2021-10-01 10:56:10 -0700	[diff] [blame]	895	.P
				896	The Cs (Surrogate) property applies only to characters whose code points are in
				897	the range U+D800 to U+DFFF. These characters are no different to any other
				898	character when PCRE2 is not in UTF mode (using the 16-bit or 32-bit library).
				899	However, they are not valid in Unicode strings and so cannot be tested by PCRE2
				900	in UTF mode, unless UTF validity checking has been turned off (see the
				901	discussion of PCRE2_NO_UTF_CHECK in the
				902	.\" HREF
				903	\fBpcre2api\fP
				904	.\"
				905	page).
				906	.P
				907	The long synonyms for property names that Perl supports (such as \ep{Letter})
				908	are not supported by PCRE2, nor is it permitted to prefix any of these
				909	properties with "Is".
				910	.P
				911	No character that is in the Unicode table has the Cn (unassigned) property.
				912	Instead, this property is assumed for any code point that is not in the
				913	Unicode table.
				914	.P
				915	Specifying caseless matching does not affect these escape sequences. For
				916	example, \ep{Lu} always matches only upper case letters. This is different from
				917	the behaviour of current versions of Perl.
Elliott Hughes	4e19c8e	2022-04-15 15:11:02 -0700	[diff] [blame]	918	.
				919	.
				920	.SS "Binary (yes/no) properties for \ep and \eP"
				921	.rs
				922	.sp
				923	Unicode defines a number of binary properties, that is, properties whose only
				924	values are true or false. You can obtain a list of those that are recognized by
				925	\ep and \eP, along with their abbreviations, by running this command:
				926	.sp
				927	pcre2test -LP
				928	.sp
				929	.
				930	.
				931	.SS "The Bidi_Class property for \ep and \eP"
				932	.rs
				933	.sp
				934	\ep{Bidi_Class:<class>} matches a character with the given class
				935	\ep{BC:<class>} matches a character with the given class
				936	.sp
				937	The recognized classes are:
				938	.sp
				939	AL Arabic letter
				940	AN Arabic number
				941	B paragraph separator
				942	BN boundary neutral
				943	CS common separator
				944	EN European number
				945	ES European separator
				946	ET European terminator
				947	FSI first strong isolate
				948	L left-to-right
				949	LRE left-to-right embedding
				950	LRI left-to-right isolate
				951	LRO left-to-right override
				952	NSM non-spacing mark
				953	ON other neutral
				954	PDF pop directional format
				955	PDI pop directional isolate
				956	R right-to-left
				957	RLE right-to-left embedding
				958	RLI right-to-left isolate
				959	RLO right-to-left override
				960	S segment separator
				961	WS white space
				962	.sp
				963	An equals sign may be used instead of a colon. The class names are
				964	case-insensitive; only the short names listed above are recognized.
Elliott Hughes	5b80804	2021-10-01 10:56:10 -0700	[diff] [blame]	965	.
				966	.
				967	.SS Extended grapheme clusters
				968	.rs
				969	.sp
				970	The \eX escape matches any number of Unicode characters that form an "extended
				971	grapheme cluster", and treats the sequence as an atomic group
				972	.\" HTML <a href="#atomicgroup">
				973	.\" </a>
				974	(see below).
				975	.\"
				976	Unicode supports various kinds of composite character by giving each character
				977	a grapheme breaking property, and having rules that use these properties to
				978	define the boundaries of extended grapheme clusters. The rules are defined in
				979	Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0
				980	abandoned the use of some previous properties that had been used for emojis.
				981	Instead it introduced various emoji-specific properties. PCRE2 uses only the
				982	Extended Pictographic property.
				983	.P
				984	\eX always matches at least one character. Then it decides whether to add
				985	additional characters according to the following rules for ending a cluster:
				986	.P
				987	1. End at the end of the subject string.
				988	.P
				989	2. Do not end between CR and LF; otherwise end after any control character.
				990	.P
				991	3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters
				992	are of five types: L, V, T, LV, and LVT. An L character may be followed by an
				993	L, V, LV, or LVT character; an LV or V character may be followed by a V or T
				994	character; an LVT or T character may be followed only by a T character.
				995	.P
				996	4. Do not end before extending characters or spacing marks or the "zero-width
				997	joiner" character. Characters with the "mark" property always have the
				998	"extend" grapheme breaking property.
				999	.P
				1000	5. Do not end after prepend characters.
				1001	.P
				1002	6. Do not break within emoji modifier sequences or emoji zwj sequences. That
				1003	is, do not break between characters with the Extended_Pictographic property.
				1004	Extend and ZWJ characters are allowed between the characters.
				1005	.P
				1006	7. Do not break within emoji flag sequences. That is, do not break between
				1007	regional indicator (RI) characters if there are an odd number of RI characters
				1008	before the break point.
				1009	.P
				1010	8. Otherwise, end the cluster.
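.P
Rules 3 and 7 above are simple enough to express as table lookups. The Python
sketch below covers only those two rules and assumes the caller already knows
each character's grapheme breaking property; the names HANGUL_MAY_FOLLOW,
hangul_no_break, and ri_no_break are hypothetical.

```python
# Rule 3: which Hangul syllable types may follow which without a break.
HANGUL_MAY_FOLLOW = {
    "L": {"L", "V", "LV", "LVT"},
    "V": {"V", "T"},
    "LV": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}


def hangul_no_break(prev_type, next_type):
    # True when the cluster must not end between the two characters.
    return next_type in HANGUL_MAY_FOLLOW.get(prev_type, set())


def ri_no_break(ri_count_before):
    # Rule 7: between two regional indicators, do not break when an odd
    # number of RI characters precedes the candidate break point.
    return ri_count_before % 2 == 1
```

With ri_no_break, a run of four regional indicators splits into two flag
sequences: there is no break after the first or third, but there is one after
the second.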
				1011	.
				1012	.
				1013	.\" HTML <a name="extraprops"></a>
				1014	.SS PCRE2's additional properties
				1015	.rs
				1016	.sp
				1017	As well as the standard Unicode properties described above, PCRE2 supports four
				1018	more that make it possible to convert traditional escape sequences such as \ew
				1019	and \es to use Unicode properties. PCRE2 uses these non-standard, non-Perl
				1020	properties internally when PCRE2_UCP is set. However, they may also be used
				1021	explicitly. These properties are:
				1022	.sp
				1023	Xan Any alphanumeric character
				1024	Xps Any POSIX space character
				1025	Xsp Any Perl space character
				1026	Xwd Any Perl "word" character
				1027	.sp
				1028	Xan matches characters that have either the L (letter) or the N (number)
				1029	property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
				1030	carriage return, and any other character that has the Z (separator) property.
				1031	Xsp is the same as Xps; in PCRE1 it used to exclude vertical tab, for Perl
				1032	compatibility, but Perl changed. Xwd matches the same characters as Xan, plus
				1033	underscore.
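.P
The definitions of Xan, Xps, Xsp, and Xwd given above translate directly into
tests on the general category. The following Python sketch uses the
unicodedata module as a stand-in for PCRE2's property tables; the function
names are hypothetical, and results for recently added characters depend on
the Unicode version of the Python installation.

```python
import unicodedata


def is_xan(ch):
    # L (letter) or N (number) general category.
    return unicodedata.category(ch)[0] in ("L", "N")


def is_xps(ch):
    # Tab, linefeed, vertical tab, form feed, carriage return, or any
    # character with the Z (separator) property.
    return ch in ("\t", "\n", "\x0b", "\x0c", "\r") or (
        unicodedata.category(ch)[0] == "Z"
    )


# Xsp is the same as Xps.
is_xsp = is_xps


def is_xwd(ch):
    # Xan plus underscore.
    return ch == "_" or is_xan(ch)
```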
				1034	.P
				1035	There is another non-standard property, Xuc, which matches any character that
				1036	can be represented by a Universal Character Name in C++ and other programming
				1037	languages. These are the characters $, @, ` (grave accent), and all characters
				1038	with Unicode code points greater than or equal to U+00A0, except for the
				1039	surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are
				1040	excluded. (Universal Character Names are of the form \euHHHH or \eUHHHHHHHH
				1041	where H is a hexadecimal digit. Note that the Xuc property does not match these
				1042	sequences but the characters that they represent.)
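.P
The Xuc definition above is a pure range test on the code point. A minimal
Python sketch, with the hypothetical function name is_xuc:

```python
def is_xuc(cp):
    # The three ASCII exceptions: $, @, and grave accent.
    if cp in (0x24, 0x40, 0x60):
        return True
    # Everything from U+00A0 upwards except the surrogates.
    return cp >= 0xA0 and not (0xD800 <= cp <= 0xDFFF)
```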
				1043	.
				1044	.
				1045	.\" HTML <a name="resetmatchstart"></a>
				1046	.SS "Resetting the match start"
				1047	.rs
				1048	.sp
				1049	In normal use, the escape sequence \eK causes any previously matched characters
				1050	not to be included in the final matched sequence that is returned. For example,
				1051	the pattern:
				1052	.sp
				1053	foo\eKbar
				1054	.sp
				1055	matches "foobar", but reports that it has matched "bar". \eK does not interact
				1056	with anchoring in any way. The pattern:
				1057	.sp
				1058	^foo\eKbar
				1059	.sp
				1060	matches only when the subject begins with "foobar" (in single line mode),
				1061	though it again reports the matched string as "bar". This feature is similar to
				1062	a lookbehind assertion
				1063	.\" HTML <a href="#lookbehind">
				1064	.\" </a>
				1065	(described below).
				1066	.\"
				1067	However, in this case, the part of the subject before the real match does not
				1068	have to be of fixed length, as lookbehind assertions do. The use of \eK does
				1069	not interfere with the setting of
				1070	.\" HTML <a href="#group">
				1071	.\" </a>
				1072	captured substrings.
				1073	.\"
				1074	For example, when the pattern
				1075	.sp
				1076	(foo)\eKbar
				1077	.sp
				1078	matches "foobar", the first substring is still set to "foo".
				1079	.P
				1080	From version 5.32.0 Perl forbids the use of \eK in lookaround assertions. From
				1081	release 10.38 PCRE2 also forbids this by default. However, the
				1082	PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option can be used when calling
				1083	\fBpcre2_compile()\fP to re-enable the previous behaviour. When this option is
				1084	set, \eK is acted upon when it occurs inside positive assertions, but is
				1085	ignored in negative assertions. Note that when a pattern such as (?=ab\eK)
				1086	matches, the reported start of the match can be greater than the end of the
				1087	match. Using \eK in a lookbehind assertion at the start of a pattern can also
				1088	lead to odd effects. For example, consider this pattern:
				1089	.sp
				1090	(?<=\eKfoo)bar
				1091	.sp
				1092	If the subject is "foobar", a call to \fBpcre2_match()\fP with a starting
				1093	offset of 3 succeeds and reports the matching string as "foobar", that is, the
				1094	start of the reported match is earlier than where the match started.
				1095	.
				1096	.
				1097	.\" HTML <a name="smallassertions"></a>
				1098	.SS "Simple assertions"
				1099	.rs
				1100	.sp
				1101	The final use of backslash is for certain simple assertions. An assertion
				1102	specifies a condition that has to be met at a particular point in a match,
				1103	without consuming any characters from the subject string. The use of
				1104	groups for more complicated assertions is described
				1105	.\" HTML <a href="#bigassertions">
				1106	.\" </a>
				1107	below.
				1108	.\"
				1109	The backslashed assertions are:
				1110	.sp
				1111	\eb matches at a word boundary
				1112	\eB matches when not at a word boundary
				1113	\eA matches at the start of the subject
				1114	\eZ matches at the end of the subject
				1115	also matches before a newline at the end of the subject
				1116	\ez matches only at the end of the subject
				1117	\eG matches at the first matching position in the subject
				1118	.sp
				1119	Inside a character class, \eb has a different meaning; it matches the backspace
				1120	character. If any other of these assertions appears in a character class, an
				1121	"invalid escape sequence" error is generated.
				1122	.P
				1123	A word boundary is a position in the subject string where the current character
				1124	and the previous character do not both match \ew or \eW (i.e. one matches
				1125	\ew and the other matches \eW), or the start or end of the string if the
				1126	first or last character matches \ew, respectively. When PCRE2 is built with
				1127	Unicode support, the meanings of \ew and \eW can be changed by setting the
				1128	PCRE2_UCP option. When this is done, it also affects \eb and \eB. Neither PCRE2
				1129	nor Perl has a separate "start of word" or "end of word" metasequence. However,
				1130	whatever follows \eb normally determines which it is. For example, the fragment
				1131	\eba matches "a" at the start of a word.
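.P
The word boundary definition above can be restated as a comparison of the
characters on either side of a position. The Python sketch below assumes the
default (non-UCP, "C" locale) meaning of \ew, namely ASCII letters, ASCII
digits, and underscore; the function names is_word_char and at_word_boundary
are hypothetical.

```python
def is_word_char(ch):
    # Default \w in this sketch: ASCII letter, ASCII digit, or underscore.
    return ch == "_" or (ch.isascii() and ch.isalnum())


def at_word_boundary(subject, pos):
    # A missing neighbour (start or end of string) counts as non-word.
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after
```

\eb is true where at_word_boundary returns True, and \eB is true everywhere
else. In an empty subject there is no word character on either side, so \eb
cannot match.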
				1132	.P
				1133	The \eA, \eZ, and \ez assertions differ from the traditional circumflex and
				1134	dollar (described in the next section) in that they only ever match at the very
				1135	start and end of the subject string, whatever options are set. Thus, they are
				1136	independent of multiline mode. These three assertions are not affected by the
				1137	PCRE2_NOTBOL or PCRE2_NOTEOL options, which affect only the behaviour of the
				1138	circumflex and dollar metacharacters. However, if the \fIstartoffset\fP
				1139	argument of \fBpcre2_match()\fP is non-zero, indicating that matching is to
				1140	start at a point other than the beginning of the subject, \eA can never match.
				1141	The difference between \eZ and \ez is that \eZ matches before a newline at the
				1142	end of the string as well as at the very end, whereas \ez matches only at the
				1143	end.
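.P
The difference between \eZ and \ez comes down to one extra position. The
Python sketch below models the three assertions as position tests, assuming a
single-character LF newline convention; the function names are hypothetical.
Note that Python's own \eZ behaves like PCRE2's \ez, so the sketch avoids the
re module altogether.

```python
def start_assertion_matches(pos):
    # \A: only the very start of the subject. With a non-zero startoffset
    # the matching position is never 0, so \A cannot match.
    return pos == 0


def lower_z_matches(subject, pos):
    # \z: only the very end of the subject.
    return pos == len(subject)


def upper_z_matches(subject, pos, newline="\n"):
    # \Z: the very end, or just before a newline that ends the subject.
    if pos == len(subject):
        return True
    return subject.endswith(newline) and pos == len(subject) - len(newline)
```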
				1144	.P
				1145	The \eG assertion is true only when the current matching position is at the
				1146	start point of the matching process, as specified by the \fIstartoffset\fP
				1147	argument of \fBpcre2_match()\fP. It differs from \eA when the value of
				1148	\fIstartoffset\fP is non-zero. By calling \fBpcre2_match()\fP multiple times
				1149	with appropriate arguments, you can mimic Perl's /g option, and it is in this
				1150	kind of implementation where \eG can be useful.
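.P
The repeated-call technique can be sketched in Python, with the caveat that
Python's re module is not PCRE2 and has no \eG. In its place, the sketch relies
on the fact that a compiled pattern's match() method, given a pos argument,
must match exactly at that position, which plays the role of a pattern that
starts with \eG combined with a non-zero \fIstartoffset\fP. The function name
scan_contiguous is hypothetical, and an empty match simply stops the loop
rather than being handled as Perl's /g would.

```python
import re


def scan_contiguous(pattern, subject):
    compiled = re.compile(pattern)
    pos = 0
    found = []
    while pos <= len(subject):
        # Anchored at pos: the analogue of \G with startoffset == pos.
        m = compiled.match(subject, pos)
        if m is None or m.end() == pos:
            break  # no match here, or an empty match: stop
        found.append(m.group())
        pos = m.end()  # the next call starts where this match ended
    return found
```

Each iteration corresponds to one call of \fBpcre2_match()\fP whose
\fIstartoffset\fP is the end of the previous match. Scanning stops at the
first position where the pattern does not match, so later occurrences that are
not contiguous with the earlier ones are not found.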
				1151	.P
				1152	Note, however, that PCRE2's implementation of \eG, being true at the starting
				1153	character of the matching process, is subtly different from Perl's, which
				1154	defines it as true at the end of the previous match. In Perl, these can be
				1155	different when the previously matched string was empty. Because PCRE2 does just
				1156	one match at a time, it cannot reproduce this behaviour.
				1157	.P
				1158	If all the alternatives of a pattern begin with \eG, the expression is anchored
				1159	to the starting match position, and the "anchored" flag is set in the compiled
				1160	regular expression.
				1161	.
				1162	.
				1163	.SH "CIRCUMFLEX AND DOLLAR"
				1164	.rs
				1165	.sp
				1166	The circumflex and dollar metacharacters are zero-width assertions. That is,
				1167	they test for a particular condition being true without consuming any
				1168	characters from the subject string. These two metacharacters are concerned with
				1169	matching the starts and ends of lines. If the newline convention is set so that
				1170	only the two-character sequence CRLF is recognized as a newline, isolated CR
				1171	and LF characters are treated as ordinary data characters, and are not
				1172	recognized as newlines.
				1173	.P
				1174	Outside a character class, in the default matching mode, the circumflex
				1175	character is an assertion that is true only if the current matching point is at
				1176	the start of the subject string. If the \fIstartoffset\fP argument of
				1177	\fBpcre2_match()\fP is non-zero, or if PCRE2_NOTBOL is set, circumflex can
				1178	never match if the PCRE2_MULTILINE option is unset. Inside a character class,
				1179	circumflex has an entirely different meaning
				1180	.\" HTML <a href="#characterclass">
				1181	.\" </a>
				1182	(see below).
				1183	.\"
				1184	.P
				1185	Circumflex need not be the first character of the pattern if a number of
				1186	alternatives are involved, but it should be the first thing in each alternative
				1187	in which it appears if the pattern is ever to match that branch. If all
				1188	possible alternatives start with a circumflex, that is, if the pattern is
				1189	constrained to match only at the start of the subject, it is said to be an
				1190	"anchored" pattern. (There are also other constructs that can cause a pattern
				1191	to be anchored.)
				1192	.P
				1193	The dollar character is an assertion that is true only if the current matching
				1194	point is at the end of the subject string, or immediately before a newline at
				1195	the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however,
				1196	that it does not actually match the newline. Dollar need not be the last
				1197	character of the pattern if a number of alternatives are involved, but it
				1198	should be the last item in any branch in which it appears. Dollar has no
				1199	special meaning in a character class.
				1200	.P
				1201	The meaning of dollar can be changed so that it matches only at the very end of
				1202	the string, by setting the PCRE2_DOLLAR_ENDONLY option at compile time. This
				1203	does not affect the \eZ assertion.
				1204	.P
				1205	The meanings of the circumflex and dollar metacharacters are changed if the
				1206	PCRE2_MULTILINE option is set. When this is the case, a dollar character
				1207	matches before any newlines in the string, as well as at the very end, and a
				1208	circumflex matches immediately after internal newlines as well as at the start
				1209	of the subject string. It does not match after a newline that ends the string,
				1210	for compatibility with Perl. However, this can be changed by setting the
				1211	PCRE2_ALT_CIRCUMFLEX option.
				1212	.P
				1213	For example, the pattern /^abc$/ matches the subject string "def\enabc" (where
				1214	\en represents a newline) in multiline mode, but not otherwise. Consequently,
				1215	patterns that are anchored in single line mode because all branches start with
				1216	^ are not anchored in multiline mode, and a match for circumflex is possible
				1217	when the \fIstartoffset\fP argument of \fBpcre2_match()\fP is non-zero. The
				1218	PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
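.P
The rules for circumflex and dollar given so far can be collected into two
position tests. The Python sketch below is a model, not PCRE2 code: it assumes
that LF is the only newline, ignores PCRE2_NOTBOL, PCRE2_NOTEOL, and
\fIstartoffset\fP, and uses hypothetical function and parameter names in which
multiline, alt_circumflex, and dollar_endonly stand for PCRE2_MULTILINE,
PCRE2_ALT_CIRCUMFLEX, and PCRE2_DOLLAR_ENDONLY.

```python
def circumflex_matches(subject, pos, multiline=False, alt_circumflex=False):
    if pos == 0:
        return True
    if not multiline or subject[pos - 1] != "\n":
        return False
    # Just after a newline. The position after a newline that ends the
    # subject counts only when alt_circumflex is set.
    return pos < len(subject) or alt_circumflex


def dollar_matches(subject, pos, multiline=False, dollar_endonly=False):
    if pos == len(subject):
        return True
    if multiline:
        # Before any newline; dollar_endonly is ignored in this mode.
        return subject[pos] == "\n"
    if dollar_endonly:
        return False
    # Default: also just before a newline that ends the subject.
    return pos == len(subject) - 1 and subject[pos] == "\n"
```

For the subject "def\enabc" in the example above, circumflex_matches is true
at offset 4 only when multiline is set, which is why /^abc$/ needs multiline
mode to match there.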
				1219	.P
				1220	When the newline convention (see
				1221	.\" HTML <a href="#newlines">
				1222	.\" </a>
				1223	"Newline conventions"
				1224	.\"
				1225	below) recognizes the two-character sequence CRLF as a newline, this is
				1226	preferred, even if the single characters CR and LF are also recognized as
				1227	newlines. For example, if the newline convention is "any", a multiline mode
				1228	circumflex matches before "xyz" in the string "abc\er\enxyz" rather than after
				1229	CR, even though CR on its own is a valid newline. (It also matches at the very
				1230	start of the string, of course.)
				1231	.P
				1232	Note that the sequences \eA, \eZ, and \ez can be used to match the start and
				1233	end of the subject in both modes, and if all branches of a pattern start with
				1234	\eA it is always anchored, whether or not PCRE2_MULTILINE is set.
				1235	.
				1236	.
				1237	.\" HTML <a name="fullstopdot"></a>
				1238	.SH "FULL STOP (PERIOD, DOT) AND \eN"
				1239	.rs
				1240	.sp
				1241	Outside a character class, a dot in the pattern matches any one character in
				1242	the subject string except (by default) a character that signifies the end of a
Elliott Hughes	4e19c8e	2022-04-15 15:11:02 -0700	[diff] [blame]	1243	line. One or more characters may be specified as line terminators (see
				1244	.\" HTML <a href="#newlines">
				1245	.\" </a>
				1246	"Newline conventions"
				1247	.\"
				1248	above).
Elliott Hughes	5b80804	2021-10-01 10:56:10 -0700	[diff] [blame]	1249	.P
Elliott Hughes	4e19c8e	2022-04-15 15:11:02 -0700	[diff] [blame]	1250	Dot never matches a single line-ending character. When the two-character
				1251	sequence CRLF is the only line ending, dot does not match CR if it is
				1252	immediately followed by LF, but otherwise it matches all characters (including
				1253	isolated CRs and LFs). When ANYCRLF is selected for line endings, no occurrences
				1254	of CR or LF match dot. When all Unicode line endings are being recognized, dot
				1255	does not match CR or LF or any of the other line ending characters.
Elliott Hughes	5b80804	2021-10-01 10:56:10 -0700	[diff] [blame]	1256	.P
				1257	The behaviour of dot with regard to newlines can be changed. If the
				1258	PCRE2_DOTALL option is set, a dot matches any one character, without exception.
				1259	If the two-character sequence CRLF is present in the subject string, it takes
				1260	two dots to match it.
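.P
How dot treats a character under each newline convention can be summarized as
a small decision function. The Python sketch below is a model only; the
function name dot_matches and the convention labels passed as strings are
hypothetical, and the set used for "ANY" is assumed to be the same set of
single characters that \eR recognizes.

```python
# Single line-ending characters assumed for the "any Unicode newline" case.
ANY_NEWLINES = frozenset("\n\x0b\x0c\r\x85\u2028\u2029")


def dot_matches(subject, pos, newline="LF", dotall=False):
    if pos >= len(subject):
        return False  # no character left to match
    if dotall:
        return True  # PCRE2_DOTALL: any one character, without exception
    ch = subject[pos]
    if newline == "LF":
        return ch != "\n"
    if newline == "CR":
        return ch != "\r"
    if newline == "CRLF":
        # Only a CR that is immediately followed by LF is refused.
        return not (ch == "\r" and subject[pos + 1:pos + 2] == "\n")
    if newline == "ANYCRLF":
        return ch not in ("\r", "\n")
    if newline == "ANY":
        return ch not in ANY_NEWLINES
    raise ValueError("unknown newline convention: " + newline)
```

Under DOTALL each call consumes exactly one character, so matching a CRLF pair
takes two dots, as stated above.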
				1261	.P
				1262	The handling of dot is entirely independent of the handling of circumflex and
				1263	dollar, the only relationship being that they both involve newlines. Dot has no
				1264	special meaning in a character class.
				1265	.P
				1266	The escape sequence \eN when not followed by an opening brace behaves like a
				1267	dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
				1268	it matches any character except one that signifies the end of a line.
				1269	.P
				1270	When \eN is followed by an opening brace it has a different meaning. See the
				1271	section entitled
				1272	.\" HTML <a href="#digitsafterbackslash">
				1273	.\" </a>
				1274	"Non-printing characters"
				1275	.\"
				1276	above for details. Perl also uses \eN{name} to specify characters by Unicode
				1277	name; PCRE2 does not support this.
				1278	.
				1279	.
				1280	.SH "MATCHING A SINGLE CODE UNIT"
				1281	.rs
				1282	.sp
				1283	Outside a character class, the escape sequence \eC matches any one code unit,
				1284	whether or not a UTF mode is set. In the 8-bit library, one code unit is one
				1285	byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
				1286	32-bit unit. Unlike a dot, \eC always matches line-ending characters. The
				1287	feature is provided in Perl in order to match individual bytes in UTF-8 mode,
				1288	but it is unclear how it can usefully be used.
				1289	.P
				1290	Because \eC breaks up characters into individual code units, matching one unit
				1291	with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
				1292	with a malformed UTF character. This has undefined results, because PCRE2
				1293	assumes that it is matching character by character in a valid UTF string (by
				1294	default it checks the subject string's validity at the start of processing
				1295	unless the PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).
				1296	.P
				1297	An application can lock out the use of \eC by setting the
				1298	PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
				1299	build PCRE2 with the use of \eC permanently disabled.
				1300	.P
				1301	PCRE2 does not allow \eC to appear in lookbehind assertions
				1302	.\" HTML <a href="#lookbehind">
				1303	.\" </a>
				1304	(described below)
				1305	.\"
				1306	in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
				1307	the length of the lookbehind. Neither the alternative matching function
				1308	\fBpcre2_dfa_match()\fP nor the JIT optimizer supports \eC in these UTF modes.
				1309	The former gives a match-time error; the latter fails to optimize and so the
				1310	match is always run using the interpreter.
				1311	.P
				1312	In the 32-bit library, however, \eC is always supported (when not explicitly
				1313	locked out) because it always matches a single code unit, whether or not UTF-32
				1314	is specified.
				1315	.P
				1316	In general, the \eC escape sequence is best avoided. However, one way of using
				1317	it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
				1318	lookahead to check the length of the next character, as in this pattern, which
				1319	could be used with a UTF-8 string (ignore white space and line breaks):
				1320	.sp
				1321	  (?| (?=[\ex00-\ex7f])(\eC) |
				1322	      (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
				1323	      (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
				1324	      (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
				1325	.sp
				1326	In this example, a group that starts with (?| resets the capturing parentheses
				1327	numbers in each alternative (see
				1328	.\" HTML <a href="#dupgroupnumber">
				1329	.\" </a>
				1330	"Duplicate Group Numbers"
				1331	.\"
				1332	below). The assertions at the start of each branch check the next UTF-8
				1333	character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
				1334	character's individual bytes are then captured by the appropriate number of
				1335	\eC groups.
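The code point ranges tested by the four lookaheads are the boundaries at which the UTF-8 encoding grows by one byte. The following Python sketch restates that mapping and checks it against Python's own encoder. The function name utf8_length is hypothetical and is not part of PCRE2. The sample characters avoid the surrogate range, which cannot be encoded.

```python
def utf8_length(code_point: int) -> int:
    """Return the number of bytes UTF-8 uses for one code point."""
    if code_point <= 0x7F:
        return 1          # first branch: one \C
    if code_point <= 0x7FF:
        return 2          # second branch: two \C items
    if code_point <= 0xFFFF:
        return 3          # third branch: three \C items
    return 4              # fourth branch: four \C items


# One sample character from each range.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [utf8_length(ord(ch)) for ch in samples]
encoded_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

Both lists hold the same values, one entry per branch of the pattern.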
				1336	.
				1337	.
				1338	.\" HTML <a name="characterclass"></a>
				1339	.SH "SQUARE BRACKETS AND CHARACTER CLASSES"
				1340	.rs
				1341	.sp
				1342	An opening square bracket introduces a character class, terminated by a closing
				1343	square bracket. A closing square bracket on its own is not special by default.
				1344	If a closing square bracket is required as a member of the class, it should be
				1345	the first data character in the class (after an initial circumflex, if present)
				1346	or escaped with a backslash. This means that, by default, an empty class cannot
				1347	be defined. However, if the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing
				1348	square bracket at the start does end the (empty) class.
				1349	.P
				1350	A character class matches a single character in the subject. A matched
				1351	character must be in the set of characters defined by the class, unless the
				1352	first character in the class definition is a circumflex, in which case the
				1353	subject character must not be in the set defined by the class. If a circumflex
				1354	is actually required as a member of the class, ensure it is not the first
				1355	character, or escape it with a backslash.
				1356	.P
				1357	For example, the character class [aeiou] matches any lower case vowel, while
				1358	[^aeiou] matches any character that is not a lower case vowel. Note that a
				1359	circumflex is just a convenient notation for specifying the characters that
				1360	are in the class by enumerating those that are not. A class that starts with a
				1361	circumflex is not an assertion; it still consumes a character from the subject
				1362	string, and therefore it fails if the current pointer is at the end of the
				1363	string.
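The point that a negated class still consumes a character can be checked with a short sketch. It uses Python's re module as a stand-in, which is an assumption made for illustration; for these simple classes Python behaves in the way described here.

```python
import re

vowel = re.compile(r"[aeiou]")
not_vowel = re.compile(r"[^aeiou]")

matches_vowel = vowel.fullmatch("e") is not None
rejects_consonant = vowel.fullmatch("x") is None
negated_matches_consonant = not_vowel.fullmatch("x") is not None

# A negated class is not an assertion: it must consume one character,
# so it fails when nothing is left in the subject.
negated_at_end = re.fullmatch(r"x[^aeiou]", "x")
```

The last result is None because, after "x" has been matched, the current position is at the end of the string.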
				1364	.P
				1365	Characters in a class may be specified by their code points using \eo, \ex, or
				1366	\eN{U+hh..} in the usual way. When caseless matching is set, any letters in a
				1367	class represent both their upper case and lower case versions, so for example,
				1368	a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
				1369	match "A", whereas a caseful version would. Note that there are two ASCII
				1370	characters, K and S, that, in addition to their lower case ASCII equivalents,
				1371	are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F (long S)
				1372	respectively when either PCRE2_UTF or PCRE2_UCP is set.
				1373	.P
				1374	Characters that might indicate line breaks are never treated in any special way
				1375	when matching character classes, whatever line-ending sequence is in use, and
				1376	whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
				1377	class such as [^a] always matches one of these characters.
				1378	.P
				1379	The generic character type escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es,
				1380	\eS, \ev, \eV, \ew, and \eW may appear in a character class, and add the
				1381	characters that they match to the class. For example, [\edABCDEF] matches any
				1382	hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
				1383	\ed, \es, \ew and their upper case partners, just as it does when they appear
				1384	outside a character class, as described in the section entitled
				1385	.\" HTML <a href="#genericchartypes">
				1386	.\" </a>
				1387	"Generic character types"
				1388	.\"
				1389	above. The escape sequence \eb has a different meaning inside a character
				1390	class; it matches the backspace character. The sequences \eB, \eR, and \eX are
				1391	not special inside a character class. Like any other unrecognized escape
				1392	sequences, they cause an error. The same is true for \eN when not followed by
				1393	an opening brace.
				1394	.P
				1395	The minus (hyphen) character can be used to specify a range of characters in a
				1396	character class. For example, [d-m] matches any letter between d and m,
				1397	inclusive. If a minus character is required in a class, it must be escaped with
				1398	a backslash or appear in a position where it cannot be interpreted as
				1399	indicating a range, typically as the first or last character in the class,
				1400	or immediately after a range. For example, [b-d-z] matches letters in the range
				1401	b to d, a hyphen character, or z.
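A quick way to see how the class [b-d-z] is read is to test single characters against it. This is a sketch in Python's re module, used here only as a stand-in engine; it parses this particular class in the same way.

```python
import re

cls = re.compile(r"[b-d-z]")

# Keep the characters that the class accepts, in the order tested.
accepted = [ch for ch in "abcde-yz" if cls.fullmatch(ch)]
```

The hyphen that follows the range b-d is taken literally, so "y" is rejected even though it lies between d and z.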
				1402	.P
				1403	Perl treats a hyphen as a literal if it appears before or after a POSIX class
				1404	(see below) or before or after a character type escape such as \ed or \eH.
				1405	However, unless the hyphen is the last character in the class, Perl outputs a
				1406	warning in its warning mode, as this is most likely a user error. As PCRE2 has
				1407	no facility for warning, an error is given in these cases.
				1408	.P
				1409	It is not possible to have the literal character "]" as the end character of a
				1410	range. A pattern such as [W-]46] is interpreted as a class of two characters
				1411	("W" and "-") followed by a literal string "46]", so it would match "W46]" or
				1412	"-46]". However, if the "]" is escaped with a backslash it is interpreted as
				1413	the end of range, so [W-\e]46] is interpreted as a class containing a range
				1414	followed by two other characters. The octal or hexadecimal representation of
				1415	"]" can also be used to end a range.
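The two readings can be compared side by side. The sketch below uses Python's re module as a stand-in, which is an assumption made for illustration; it interprets both of these patterns in the way described here.

```python
import re

# Unescaped "]" ends the class: two characters, then the literal "46]".
two_char_class = re.compile(r"[W-]46]")
matches_w = two_char_class.fullmatch("W46]") is not None
matches_hyphen = two_char_class.fullmatch("-46]") is not None
rejects_x = two_char_class.fullmatch("X46]") is None

# Escaped "]" ends a range: W through ], plus "4" and "6".
range_class = re.compile(r"[W-\]46]")
in_range = [ch for ch in "VWXYZ[]^46-" if range_class.fullmatch(ch)]
```

In the second pattern the hyphen is a range operator, so a literal "-" is no longer a member of the class.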
				1416	.P
				1417	Ranges normally include all code points between the start and end characters,
				1418	inclusive. They can also be used for code points specified numerically, for
				1419	example [\e000-\e037]. Ranges can include any characters that are valid for the
				1420	current mode. In any UTF mode, the so-called "surrogate" characters (those
				1421	whose code points lie between 0xd800 and 0xdfff inclusive) may not be specified
				1422	explicitly by default (the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables
				1423	this check). However, ranges such as [\ex{d7ff}-\ex{e000}], which include the
				1424	surrogates, are always permitted.
				1425	.P
				1426	There is a special case in EBCDIC environments for ranges whose end points are
				1427	both specified as literal letters in the same case. For compatibility with
				1428	Perl, EBCDIC code points within the range that are not letters are omitted. For
				1429	example, [h-k] matches only four characters, even though the codes for h and k
				1430	are 0x88 and 0x92, a range of 11 code points. However, if the range is
				1431	specified numerically, for example, [\ex88-\ex92] or [h-\ex92], all code points
				1432	are included.
				1433	.P
				1434	If a range that includes letters is used when caseless matching is set, it
				1435	matches the letters in either case. For example, [W-c] is equivalent to
				1436	[][\e\e^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
				1437	tables for a French locale are in use, [\exc8-\excb] matches accented E
				1438	characters in both cases.
				1439	.P
				1440	A circumflex can conveniently be used with the upper case character types to
				1441	specify a more restricted set of characters than the matching lower case type.
				1442	For example, the class [^\eW_] matches any letter or digit, but not underscore,
				1443	whereas [\ew] includes underscore. A positive character class should be read as
				1444	"something OR something OR ..." and a negative class as "NOT something AND NOT
				1445	something AND NOT ...".
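The "NOT something AND NOT something" reading of [^\eW_] can be checked directly. This sketch uses Python's re module as a stand-in engine, which is an assumption made for illustration; the backslashes appear as plain "\" in the Python source.

```python
import re

# NOT a non-word character AND NOT an underscore: letters and digits.
alnum_only = re.compile(r"[^\W_]")
accepted = [ch for ch in "a9_- " if alnum_only.fullmatch(ch)]

# The positive class [\w] keeps the underscore.
word_accepted = [ch for ch in "a9_- " if re.fullmatch(r"[\w]", ch)]
```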
				1446	.P
				1447	The only metacharacters that are recognized in character classes are backslash,
				1448	hyphen (only where it can be interpreted as specifying a range), circumflex
				1449	(only at the start), opening square bracket (only when it can be interpreted as
				1450	introducing a POSIX class name, or for a special compatibility feature - see
				1451	the next two sections), and the terminating closing square bracket. However,
				1452	escaping other non-alphanumeric characters does no harm.
				1453	.
				1454	.
				1455	.SH "POSIX CHARACTER CLASSES"
				1456	.rs
				1457	.sp
				1458	Perl supports the POSIX notation for character classes. This uses names
				1459	enclosed by [: and :] within the enclosing square brackets. PCRE2 also supports
				1460	this notation. For example,
				1461	.sp
				1462	[01[:alpha:]%]
				1463	.sp
				1464	matches "0", "1", any alphabetic character, or "%". The supported class names
				1465	are:
				1466	.sp
				1467	  alnum    letters and digits
				1468	  alpha    letters
				1469	  ascii    character codes 0 - 127
				1470	  blank    space or tab only
				1471	  cntrl    control characters
				1472	  digit    decimal digits (same as \ed)
				1473	  graph    printing characters, excluding space
				1474	  lower    lower case letters
				1475	  print    printing characters, including space
				1476	  punct    printing characters, excluding letters and digits and space
				1477	  space    white space (the same as \es from PCRE 8.34)
				1478	  upper    upper case letters
				1479	  word     "word" characters (same as \ew)
				1480	  xdigit   hexadecimal digits
				1481	.sp
				1482	The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
				1483	and space (32). If locale-specific matching is taking place, the list of space
				1484	characters may be different; there may be fewer or more of them. "Space" and
				1485	\es match the same set of characters.
				1486	.P
				1487	The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
				1488	5.8. Another Perl extension is negation, which is indicated by a ^ character
				1489	after the colon. For example,
				1490	.sp
				1491	[12[:^digit:]]
				1492	.sp
				1493	matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
				1494	syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
				1495	supported, and an error is given if they are encountered.
				1496	.P
				1497	By default, characters with values greater than 127 do not match any of the
				1498	POSIX character classes, although this may be different for characters in the
				1499	range 128-255 when locale-specific matching is happening. However, if the
				1500	PCRE2_UCP option is passed to \fBpcre2_compile()\fP, some of the classes are
				1501	changed so that Unicode character properties are used. This is achieved by
				1502	replacing certain POSIX classes with other sequences, as follows:
				1503	.sp
				1504	[:alnum:] becomes \ep{Xan}
				1505	[:alpha:] becomes \ep{L}
				1506	[:blank:] becomes \eh
				1507	[:cntrl:] becomes \ep{Cc}
				1508	[:digit:] becomes \ep{Nd}
				1509	[:lower:] becomes \ep{Ll}
				1510	[:space:] becomes \ep{Xps}
				1511	[:upper:] becomes \ep{Lu}
				1512	[:word:] becomes \ep{Xwd}
				1513	.sp
				1514	Negated versions, such as [:^alpha:] use \eP instead of \ep. Three other POSIX
				1515	classes are handled specially in UCP mode:
				1516	.TP 10
				1517	[:graph:]
				1518	This matches characters that have glyphs that mark the page when printed. In
				1519	Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf
				1520	properties, except for:
				1521	.sp
				1522	U+061C Arabic Letter Mark
				1523	U+180E Mongolian Vowel Separator
				1524	U+2066 - U+2069 Various "isolate"s
				1525	.sp
				1526	.TP 10
				1527	[:print:]
				1528	This matches the same characters as [:graph:] plus space characters that are
				1529	not controls, that is, characters with the Zs property.
				1530	.TP 10
				1531	[:punct:]
				1532	This matches all characters that have the Unicode P (punctuation) property,
				1533	plus those characters with code points less than 256 that have the S (Symbol)
				1534	property.
				1535	.P
				1536	The other POSIX classes are unchanged, and match only characters with code
				1537	points less than 256.
				1538	.
				1539	.
				1540	.SH "COMPATIBILITY FEATURE FOR WORD BOUNDARIES"
				1541	.rs
				1542	.sp
				1543	In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly
				1544	syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of
				1545	word". PCRE2 treats these items as follows:
				1546	.sp
				1547	[[:<:]] is converted to \eb(?=\ew)
				1548	[[:>:]] is converted to \eb(?<=\ew)
				1549	.sp
				1550	Only these exact character sequences are recognized. A sequence such as
				1551	[a[:<:]b] provokes an error for an unrecognized POSIX class name. This support is
				1552	not compatible with Perl. It is provided to help migrations from other
				1553	environments, and is best not used in any new patterns. Note that \eb matches
				1554	at the start and the end of a word (see
				1555	.\" HTML <a href="#smallassertions">
				1556	.\" </a>
				1557	"Simple assertions"
				1558	.\"
				1559	above), and in a Perl-style pattern the preceding or following character
				1560	normally shows which is wanted, without the need for the assertions that are
				1561	used above in order to give exactly the POSIX behaviour.
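The two replacement sequences can be exercised on their own. Python's re module does not recognize [[:<:]] or [[:>:]], so this sketch applies the converted forms directly; using Python as a stand-in engine is an assumption made for illustration.

```python
import re

subject = "hi there"

# \b(?=\w): a boundary followed by a word character (start of word).
word_starts = [m.start() for m in re.finditer(r"\b(?=\w)", subject)]

# \b(?<=\w): a boundary preceded by a word character (end of word).
word_ends = [m.start() for m in re.finditer(r"\b(?<=\w)", subject)]

# Plain \b matches at both kinds of position.
plain_boundaries = [m.start() for m in re.finditer(r"\b", subject)]
```

The start-of-word offsets are 0 and 3, the end-of-word offsets are 2 and 8, and plain \eb reports all four.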
				1562	.
				1563	.
				1564	.SH "VERTICAL BAR"
				1565	.rs
				1566	.sp
				1567	Vertical bar characters are used to separate alternative patterns. For example,
				1568	the pattern
				1569	.sp
				1570	gilbert|sullivan
				1571	.sp
				1572	matches either "gilbert" or "sullivan". Any number of alternatives may appear,
				1573	and an empty alternative is permitted (matching the empty string). The matching
				1574	process tries each alternative in turn, from left to right, and the first one
				1575	that succeeds is used. If the alternatives are within a group
				1576	.\" HTML <a href="#group">
				1577	.\" </a>
				1578	(defined below),
				1579	.\"
				1580	"succeeds" means matching the rest of the main pattern as well as the
				1581	alternative in the group.
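The left-to-right rule, and the wider meaning of "succeeds" inside a group, can be seen in a small sketch. It uses Python's re module as a stand-in backtracking engine, which is an assumption made for illustration; the short patterns are not taken from the text above.

```python
import re

either = re.compile(r"gilbert|sullivan")
names = ("gilbert", "sullivan", "gilivan")
both_names = [name for name in names if either.fullmatch(name)]

# The first alternative that succeeds is used, even though the
# second one would have matched a longer string.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a group, an alternative only "succeeds" if the rest of the
# pattern matches too, so "a" is abandoned in favour of "ab".
with_rest = re.fullmatch(r"(a|ab)c", "abc").group(1)
```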
				1582	.
				1583	.
				1584	.\" HTML <a name="internaloptions"></a>
				1585	.SH "INTERNAL OPTION SETTING"
				1586	.rs
				1587	.sp
				1588	The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
				1589	PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be
				1590	changed from within the pattern by a sequence of letters enclosed between "(?"
				1591	and ")". These options are Perl-compatible, and are described in detail in the
				1592	.\" HREF
				1593	\fBpcre2api\fP
				1594	.\"
				1595	documentation. The option letters are:
				1596	.sp
				1597	i for PCRE2_CASELESS
				1598	m for PCRE2_MULTILINE
				1599	n for PCRE2_NO_AUTO_CAPTURE
				1600	s for PCRE2_DOTALL
				1601	x for PCRE2_EXTENDED
				1602	xx for PCRE2_EXTENDED_MORE
				1603	.sp
				1604	For example, (?im) sets caseless, multiline matching. It is also possible to
				1605	unset these options by preceding the relevant letters with a hyphen, for
				1606	example (?-im). The two "extended" options are not independent; unsetting either
				1607	one cancels the effects of both of them.
				1608	.P
				1609	A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
				1610	and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
				1611	permitted. Only one hyphen may appear in the options string. If a letter
				1612	appears both before and after the hyphen, the option is unset. An empty options
				1613	setting "(?)" is allowed. Needless to say, it has no effect.
				1614	.P
				1615	If the first character following (? is a circumflex, it causes all of the above
				1616	options to be unset. Thus, (?^) is equivalent to (?-imnsx). Letters may follow
				1617	the circumflex to cause some options to be re-instated, but a hyphen may not
				1618	appear.
				1619	.P
				1620	The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in
				1621	the same way as the Perl-compatible options by using the characters J and U
				1622	respectively. However, these are not unset by (?^).
				1623	.P
				1624	When one of these option changes occurs at top level (that is, not inside
				1625	group parentheses), the change applies to the remainder of the pattern
				1626	that follows. An option change within a group (see below for a description
				1627	of groups) affects only that part of the group that follows it, so
				1628	.sp
				1629	(a(?i)b)c
				1630	.sp
				1631	matches abc and aBc and no other strings (assuming PCRE2_CASELESS is not used).
				1632	By this means, options can be made to have different settings in different
				1633	parts of the pattern. Any changes made in one alternative do carry on
				1634	into subsequent branches within the same group. For example,
				1635	.sp
				1636	(a(?i)b|c)
				1637	.sp
				1638	matches "ab", "aB", "c", and "C", even though when matching "C" the first
				1639	branch is abandoned before the option setting. This is because the effects of
				1640	option settings happen at compile time. There would be some very weird
				1641	behaviour otherwise.
				1642	.P
				1643	As a convenient shorthand, if any option settings are required at the start of
				1644	a non-capturing group (see the next section), the option letters may
				1645	appear between the "?" and the ":". Thus the two patterns
				1646	.sp
				1647	(?i:saturday|sunday)
				1648	(?:(?i)saturday|sunday)
				1649	.sp
				1650	match exactly the same set of strings.
				1651	.P
				1652	\fBNote:\fP There are other PCRE2-specific options, applying to the whole
				1653	pattern, which can be set by the application when the compiling function is
				1654	called. In addition, the pattern can contain special leading sequences such as
				1655	(*CRLF) to override what the application has set or what has been defaulted.
				1656	Details are given in the section entitled
				1657	.\" HTML <a href="#newlineseq">
				1658	.\" </a>
				1659	"Newline sequences"
				1660	.\"
				1661	above. There are also the (*UTF) and (*UCP) leading sequences that can be used
				1662	to set UTF and Unicode property modes; they are equivalent to setting the
				1663	PCRE2_UTF and PCRE2_UCP options, respectively. However, the application can set
				1664	the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the use of the
				1665	(*UTF) and (*UCP) sequences.
				1666	.
				1667	.
				1668	.\" HTML <a name="group"></a>
				1669	.SH GROUPS
				1670	.rs
				1671	.sp
				1672	Groups are delimited by parentheses (round brackets), which can be nested.
				1673	Turning part of a pattern into a group does two things:
				1674	.sp
				1675	1. It localizes a set of alternatives. For example, the pattern
				1676	.sp
				1677	cat(aract|erpillar|)
				1678	.sp
				1679	matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
				1680	match "cataract", "erpillar" or an empty string.
				1681	.sp
				1682	2. It creates a "capture group". This means that, when the whole pattern
				1683	matches, the portion of the subject string that matched the group is passed
				1684	back to the caller, separately from the portion that matched the whole pattern.
				1685	(This applies only to the traditional matching function; the DFA matching
				1686	function does not support capturing.)
				1687	.P
				1688	Opening parentheses are counted from left to right (starting from 1) to obtain
				1689	numbers for capture groups. For example, if the string "the red king" is
				1690	matched against the pattern
				1691	.sp
				1692	the ((red|white) (king|queen))
				1693	.sp
				1694	the captured substrings are "red king", "red", and "king", and are numbered 1,
				1695	2, and 3, respectively.
				1696	.P
				1697	The fact that plain parentheses fulfil two functions is not always helpful.
				1698	There are often times when grouping is required without capturing. If an
				1699	opening parenthesis is followed by a question mark and a colon, the group
				1700	does not do any capturing, and is not counted when computing the number of any
				1701	subsequent capture groups. For example, if the string "the white queen"
				1702	is matched against the pattern
				1703	.sp
				1704	the ((?:red|white) (king|queen))
				1705	.sp
				1706	the captured substrings are "white queen" and "queen", and are numbered 1 and
				1707	2. The maximum number of capture groups is 65535.
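The numbering of the two example patterns can be confirmed with a short sketch. It uses Python's re module as a stand-in, which is an assumption made for illustration; Python numbers capture groups by opening parenthesis in the same way.

```python
import re

# Three capturing groups, numbered by their opening parentheses.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
captured = capturing.groups()

# (?: ... ) groups without capturing and is skipped when numbering.
non_capturing = re.match(
    r"the ((?:red|white) (king|queen))", "the white queen"
)
captured_nc = non_capturing.groups()
```

Here groups() returns the substrings for group 1 onwards, so the position in each tuple is the group number minus one.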
				1708	.P
				1709	As a convenient shorthand, if any option settings are required at the start of
				1710	a non-capturing group, the option letters may appear between the "?" and the
				1711	":". Thus the two patterns
				1712	.sp
				1713	(?i:saturday|sunday)
				1714	(?:(?i)saturday|sunday)
				1715	.sp
				1716	match exactly the same set of strings. Because alternative branches are tried
				1717	from left to right, and options are not reset until the end of the group is
				1718	reached, an option setting in one branch does affect subsequent branches, so
				1719	the above patterns match "SUNDAY" as well as "Saturday".
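The shorthand form can be tried in Python's re module, used here as a stand-in engine; that choice is an assumption made for illustration. Only the first of the two patterns is shown, because recent Python versions reject an option setting such as (?i) that is not at the very start of the pattern.

```python
import re

weekend = re.compile(r"(?i:saturday|sunday)")
candidates = ("Saturday", "SUNDAY", "sunday", "monday")
accepted = [s for s in candidates if weekend.fullmatch(s)]

# The option is confined to the group that carries it: "urday" lies
# outside the group and is still matched casefully.
scoped = re.compile(r"(?i:sat)urday")
scoped_ok = scoped.fullmatch("SATurday") is not None
scoped_rejected = scoped.fullmatch("SATURDAY") is None
```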
				1720	.
				1721	.
				1722	.\" HTML <a name="dupgroupnumber"></a>
				1723	.SH "DUPLICATE GROUP NUMBERS"
				1724	.rs
				1725	.sp
				1726	Perl 5.10 introduced a feature whereby each alternative in a group uses the
				1727	same numbers for its capturing parentheses. Such a group starts with (?| and is
				1728	itself a non-capturing group. For example, consider this pattern:
				1729	.sp
				1730	(?|(Sat)ur|(Sun))day
				1731	.sp
				1732	Because the two alternatives are inside a (?| group, both sets of capturing
				1733	parentheses are numbered one. Thus, when the pattern matches, you can look
				1734	at captured substring number one, whichever alternative matched. This construct
				1735	is useful when you want to capture part, but not all, of one of a number of
				1736	alternatives. Inside a (?| group, parentheses are numbered as usual, but the
				1737	number is reset at the start of each branch. The numbers of any capturing
				1738	parentheses that follow the whole group start after the highest number used in
				1739	any branch. The following example is taken from the Perl documentation. The
				1740	numbers underneath show in which buffer the captured content will be stored.
				1741	.sp
				1742	  # before  ---------------branch-reset----------- after
				1743	  / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
				1744	  # 1            2         2  3        2     3     4
				1745	.sp
				1746	A backreference to a capture group uses the most recent value that is set for
				1747	the group. The following pattern matches "abcabc" or "defdef":
				1748	.sp
				1749	/(?|(abc)|(def))\e1/
				1750	.sp
				1751	In contrast, a subroutine call to a capture group always refers to the
				1752	first one in the pattern with the given number. The following pattern matches
				1753	"abcabc" or "defabc":
				1754	.sp
				1755	/(?|(abc)|(def))(?1)/
				1756	.sp
				1757	A relative reference such as (?-1) is no different: it is just a convenient way
				1758	of computing an absolute group number.
				1759	.P
				1760	If a
				1761	.\" HTML <a href="#conditions">
				1762	.\" </a>
				1763	condition test
				1764	.\"
				1765	for a group's having matched refers to a non-unique number, the test is
				1766	true if any group with that number has matched.
				1767	.P
				1768	An alternative approach to using this "branch reset" feature is to use
				1769	duplicate named groups, as described in the next section.
				1770	.
				1771	.
				1772	.SH "NAMED CAPTURE GROUPS"
				1773	.rs
				1774	.sp
				1775	Identifying capture groups by number is simple, but it can be very hard to keep
				1776	track of the numbers in complicated patterns. Furthermore, if an expression is
				1777	modified, the numbers may change. To help with this difficulty, PCRE2 supports
				1778	the naming of capture groups. This feature was not added to Perl until release
				1779	5.10. Python had the feature earlier, and PCRE1 introduced it at release 4.0,
				1780	using the Python syntax. PCRE2 supports both the Perl and the Python syntax.
				1781	.P
				1782	In PCRE2, a capture group can be named in one of three ways: (?<name>...) or
				1783	(?'name'...) as in Perl, or (?P<name>...) as in Python. Names may be up to 32
				1784	code units long. When PCRE2_UTF is not set, they may contain only ASCII
				1785	alphanumeric characters and underscores, but must start with a non-digit. When
				1786	PCRE2_UTF is set, the syntax of group names is extended to allow any Unicode
				1787	letter or Unicode decimal digit. In other words, group names must match one of
				1788	these patterns:
				1789	.sp
				1790	^[_A-Za-z][_A-Za-z0-9]*\ez when PCRE2_UTF is not set
				1791	^[_\ep{L}][_\ep{L}\ep{Nd}]*\ez when PCRE2_UTF is set
				1792	.sp
				1793	References to capture groups from other parts of the pattern, such as
				1794	.\" HTML <a href="#backreferences">
				1795	.\" </a>
				1796	backreferences,
				1797	.\"
				1798	.\" HTML <a href="#recursion">
				1799	.\" </a>
				1800	recursion,
				1801	.\"
				1802	and
				1803	.\" HTML <a href="#conditions">
				1804	.\" </a>
				1805	conditions,
				1806	.\"
				1807	can all be made by name as well as by number.
				1808	.P
				1809	Named capture groups are allocated numbers as well as names, exactly as
				1810	if the names were not present. In both PCRE2 and Perl, capture groups
				1811	are primarily identified by numbers; any names are just aliases for these
				1812	numbers. The PCRE2 API provides function calls for extracting the complete
				1813	name-to-number translation table from a compiled pattern, as well as
				1814	convenience functions for extracting captured substrings by name.
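The idea that names are aliases for numbers can be illustrated with the Python syntax that PCRE2 also accepts. The sketch below uses Python's re module, not the PCRE2 API; its groupindex attribute is Python's own counterpart of a name-to-number table, and the date pattern is a made-up example.

```python
import re

# (?P<name>...) is the Python naming syntax, also accepted by PCRE2.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Named groups are numbered exactly as if the names were absent.
name_to_number = dict(date.groupindex)

m = date.fullmatch("2022-01")
by_name = m.group("year")
by_number = m.group(1)
```

Extracting by name and by number yields the same substring, because the name only stands for the number.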
				1815	.P
				1816	\fBWarning:\fP When more than one capture group has the same number, as
				1817	described in the previous section, a name given to one of them applies to all
				1818	of them. Perl allows identically numbered groups to have different names.
				1819	Consider this pattern, where there are two capture groups, both numbered 1:
				1820	.sp
				1821	(?|(?<AA>aa)|(?<BB>bb))
				1822	.sp
				1823	Perl allows this, with both names AA and BB as aliases of group 1. Thus, after
				1824	a successful match, both names yield the same value (either "aa" or "bb").
				1825	.P
				1826	In an attempt to reduce confusion, PCRE2 does not allow the same group number
				1827	to be associated with more than one name. The example above provokes a
				1828	compile-time error. However, there is still scope for confusion. Consider this
				1829	pattern:
				1830	.sp
				1831	(?|(?<AA>aa)|(bb))
				1832	.sp
				1833	Although the second group number 1 is not explicitly named, the name AA is
				1834	still an alias for any group 1. Whether the pattern matches "aa" or "bb", a
				1835	reference by name to group AA yields the matched string.
				1836	.P
				1837	By default, a name must be unique within a pattern, except that duplicate names
				1838	are permitted for groups with the same number, for example:
				1839	.sp
				1840	(?|(?<AA>aa)|(?<AA>bb))
				1841	.sp
				1842	The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES
				1843	option at compile time, or by the use of (?J) within the pattern, as described
				1844	in the section entitled
				1845	.\" HTML <a href="#internaloptions">
				1846	.\" </a>
				1847	"Internal Option Setting"
				1848	.\"
				1849	above.
				1850	.P
				1851	Duplicate names can be useful for patterns where only one instance of the named
				1852	capture group can match. Suppose you want to match the name of a weekday,
				1853	either as a 3-letter abbreviation or as the full name, and in both cases you
				1854	want to extract the abbreviation. This pattern (ignoring the line breaks) does
				1855	the job:
				1856	.sp
				1857	(?J)
				1858	(?<DN>Mon|Fri|Sun)(?:day)?|
				1859	(?<DN>Tue)(?:sday)?|
				1860	(?<DN>Wed)(?:nesday)?|
				1861	(?<DN>Thu)(?:rsday)?|
				1862	(?<DN>Sat)(?:urday)?
				1863	.sp
				1864	There are five capture groups, but only one is ever set after a match. The
				1865	convenience functions for extracting the data by name return the substring for
				1866	the first (and in this example, the only) group of that name that matched. This
				1867	saves searching to find which numbered group it was. (An alternative way of
				1868	solving this problem is to use a "branch reset" group, as described in the
				1869	previous section.)
				1870	.P
				1871	If you make a backreference to a non-unique named group from elsewhere in the
				1872	pattern, the groups to which the name refers are checked in the order in which
				1873	they appear in the overall pattern. The first one that is set is used for the
				1874	reference. For example, this pattern matches both "foofoo" and "barbar" but not
				1875	"foobar" or "barfoo":
				1876	.sp
				1877	(?J)(?:(?<n>foo)|(?<n>bar))\ek<n>
				1878	.sp
				1879	.P
				1880	If you make a subroutine call to a non-unique named group, the one that
				1881	corresponds to the first occurrence of the name is used. In the absence of
				1882	duplicate numbers this is the one with the lowest number.
				1883	.P
				1884	If you use a named reference in a condition
				1885	test (see the
				1886	.\"
				1887	.\" HTML <a href="#conditions">
				1888	.\" </a>
				1889	section about conditions
				1890	.\"
				1891	below), either to check whether a capture group has matched, or to check for
				1892	recursion, all groups with the same name are tested. If the condition is true
				1893	for any one of them, the overall condition is true. This is the same behaviour
				1894	as testing by number. For further details of the interfaces for handling named
				1895	capture groups, see the
				1896	.\" HREF
				1897	\fBpcre2api\fP
				1898	.\"
				1899	documentation.
				1900	.
				1901	.
				1902	.SH REPETITION
				1903	.rs
				1904	.sp
				1905	Repetition is specified by quantifiers, which can follow any of the following
				1906	items:
				1907	.sp
				1908	a literal data character
				1909	the dot metacharacter
				1910	the \eC escape sequence
				1911	the \eR escape sequence
				1912	the \eX escape sequence
				1913	an escape such as \ed or \epL that matches a single character
				1914	a character class
				1915	a backreference
				1916	a parenthesized group (including lookaround assertions)
				1917	a subroutine call (recursive or otherwise)
				1918	.sp
				1919	The general repetition quantifier specifies a minimum and maximum number of
				1920	permitted matches, by giving the two numbers in curly brackets (braces),
				1921	separated by a comma. The numbers must be less than 65536, and the first must
				1922	be less than or equal to the second. For example,
				1923	.sp
				1924	z{2,4}
				1925	.sp
				1926	matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
				1927	character. If the second number is omitted, but the comma is present, there is
				1928	no upper limit; if the second number and the comma are both omitted, the
				1929	quantifier specifies an exact number of required matches. Thus
				1930	.sp
				1931	[aeiou]{3,}
				1932	.sp
				1933	matches at least 3 successive vowels, but may match many more, whereas
				1934	.sp
				1935	\ed{8}
				1936	.sp
				1937	matches exactly 8 digits. An opening curly bracket that appears in a position
				1938	where a quantifier is not allowed, or one that does not match the syntax of a
				1939	quantifier, is taken as a literal character. For example, {,6} is not a
				1940	quantifier, but a literal string of four characters.
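.P
As an informal illustration, the three quantifier forms above can be tried with
Python's standard re module. This is only a sketch: Python's engine is not
PCRE2, and it is used here on the assumption that it treats these particular
quantifiers in the same way. The helper name full_match is hypothetical.

```python
import re


def full_match(pattern: str, subject: str) -> bool:
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None


# {min,max}: between two and four repetitions of "z"
candidates = ("z", "zz", "zzz", "zzzz", "zzzzz")
bounded = [s for s in candidates if full_match(r"z{2,4}", s)]

# {min,}: at least three vowels, with no upper limit
open_ended = full_match(r"[aeiou]{3,}", "aeiouaeiou")

# {n}: exactly eight digits
exact = (full_match(r"\d{8}", "12345678"), full_match(r"\d{8}", "1234567"))
```

The {,6} case is deliberately left out of the sketch, because Python treats it
as a quantifier, unlike the behaviour described here.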
				1941	.P
				1942	In UTF modes, quantifiers apply to characters rather than to individual code
				1943	units. Thus, for example, \ex{100}{2} matches two characters, each of
				1944	which is represented by a two-byte sequence in a UTF-8 string. Similarly,
				1945	\eX{3} matches three Unicode extended grapheme clusters, each of which may be
				1946	several code units long (and they may be of different lengths).
				1947	.P
				1948	The quantifier {0} is permitted, causing the expression to behave as if the
				1949	previous item and the quantifier were not present. This may be useful for
				1950	capture groups that are referenced as
				1951	.\" HTML <a href="#groupsassubroutines">
				1952	.\" </a>
				1953	subroutines
				1954	.\"
				1955	from elsewhere in the pattern (but see also the section entitled
				1956	.\" HTML <a href="#subdefine">
				1957	.\" </a>
				1958	"Defining capture groups for use by reference only"
				1959	.\"
				1960	below). Except for parenthesized groups, items that have a {0} quantifier are
				1961	omitted from the compiled pattern.
				1962	.P
				1963	For convenience, the three most common quantifiers have single-character
				1964	abbreviations:
				1965	.sp
				1966	* is equivalent to {0,}
				1967	+ is equivalent to {1,}
				1968	? is equivalent to {0,1}
				1969	.sp
				1970	It is possible to construct infinite loops by following a group that can match
				1971	no characters with a quantifier that has no upper limit, for example:
				1972	.sp
				1973	(a?)*
				1974	.sp
				1975	Earlier versions of Perl and PCRE1 used to give an error at compile time for
				1976	such patterns. However, because there are cases where this can be useful, such
				1977	patterns are now accepted, but whenever an iteration of such a group matches no
				1978	characters, matching moves on to the next item in the pattern instead of
				1979	repeatedly matching an empty string. This does not prevent backtracking into
				1980	any of the iterations if a subsequent item fails to match.
				1981	.P
				1982	By default, quantifiers are "greedy", that is, they match as much as possible
				1983	(up to the maximum number of permitted times), without causing the rest of the
				1984	pattern to fail. The classic example of where this gives problems is in trying
				1985	to match comments in C programs. These appear between /* and */ and within the
				1986	comment, individual * and / characters may appear. An attempt to match C
				1987	comments by applying the pattern
				1988	.sp
				1989	/\e*.*\e*/
				1990	.sp
				1991	to the string
				1992	.sp
				1993	/* first comment */  not comment  /* second comment */
				1994	.sp
				1995	fails, because it matches the entire string owing to the greediness of the .*
				1996	item. However, if a quantifier is followed by a question mark, it ceases to be
				1997	greedy, and instead matches the minimum number of times possible, so the
				1998	pattern
				1999	.sp
				2000	/\e*.*?\e*/
				2001	.sp
				2002	does the right thing with the C comments. The meaning of the various
				2003	quantifiers is not otherwise changed, just the preferred number of matches.
				2004	Do not confuse this use of question mark with its use as a quantifier in its
				2005	own right. Because it has two uses, it can sometimes appear doubled, as in
				2006	.sp
				2007	\ed??\ed
				2008	.sp
				2009	which matches one digit by preference, but can match two if that is the only
				2010	way the rest of the pattern matches.
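.P
The difference between greedy and minimizing repeats can be observed with a
short experiment. The sketch below uses Python's standard re module rather than
PCRE2 itself, on the assumption that the two engines agree for these simple
patterns. The subject is a string with two C comments, and the variable names
are illustrative only.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last "*/", so the match spans the whole subject.
greedy = re.search(r"/\*.*\*/", subject).group(0)

# Minimizing .*? stops at the first "*/" that lets the pattern succeed.
lazy = re.search(r"/\*.*?\*/", subject).group(0)

# \d??\d prefers a single digit, but takes two when that is the only way
# for the whole subject to match.
one_digit = re.match(r"\d??\d", "12").group(0)
two_digits = re.fullmatch(r"\d??\d", "12").group(0)
```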
				2011	.P
				2012	If the PCRE2_UNGREEDY option is set (an option that is not available in Perl),
				2013	the quantifiers are not greedy by default, but individual ones can be made
				2014	greedy by following them with a question mark. In other words, it inverts the
				2015	default behaviour.
				2016	.P
				2017	When a parenthesized group is quantified with a minimum repeat count that
				2018	is greater than 1 or with a limited maximum, more memory is required for the
				2019	compiled pattern, in proportion to the size of the minimum or maximum.
				2020	.P
				2021	If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option (equivalent
				2022	to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is
				2023	implicitly anchored, because whatever follows will be tried against every
				2024	character position in the subject string, so there is no point in retrying the
				2025	overall match at any position after the first. PCRE2 normally treats such a
				2026	pattern as though it were preceded by \eA.
				2027	.P
				2028	In cases where it is known that the subject string contains no newlines, it is
				2029	worth setting PCRE2_DOTALL in order to obtain this optimization, or
				2030	alternatively, using ^ to indicate anchoring explicitly.
				2031	.P
				2032	However, there are some cases where the optimization cannot be used. When .*
				2033	is inside capturing parentheses that are the subject of a backreference
				2034	elsewhere in the pattern, a match at the start may fail where a later one
				2035	succeeds. Consider, for example:
				2036	.sp
				2037	(.*)abc\e1
				2038	.sp
				2039	If the subject is "xyz123abc123" the match point is the fourth character. For
				2040	this reason, such a pattern is not implicitly anchored.
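.P
The match point in this example can be checked with a small sketch. It uses
Python's standard re module as a stand-in for PCRE2 (an assumption; the engines
differ in other respects), and reports the zero-based offset at which the
unanchored search succeeds.

```python
import re

# A match attempted at offset 0 fails, because "xyz123" does not recur
# after "abc". The search succeeds only when it starts at "123".
m = re.search(r"(.*)abc\1", "xyz123abc123")

start = m.start()      # zero-based, so 3 means the fourth character
matched = m.group(0)
captured = m.group(1)
```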
				2041	.P
				2042	Another case where implicit anchoring is not applied is when the leading .* is
				2043	inside an atomic group. Once again, a match at the start may fail where a later
				2044	one succeeds. Consider this pattern:
				2045	.sp
				2046	(?>.*?a)b
				2047	.sp
				2048	It matches "ab" in the subject "aab". The use of the backtracking control verbs
				2049	(*PRUNE) and (*SKIP) also disables this optimization, and there is an option,
				2050	PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
				2051	.P
				2052	When a capture group is repeated, the value captured is the substring that
				2053	matched the final iteration. For example, after
				2054	.sp
				2055	(tweedle[dume]{3}\es*)+
				2056	.sp
				2057	has matched "tweedledum tweedledee" the value of the captured substring is
				2058	"tweedledee". However, if there are nested capture groups, the corresponding
				2059	captured values may have been set in previous iterations. For example, after
				2060	.sp
				2061	(a|(b))+
				2062	.sp
				2063	matches "aba" the value of the second captured substring is "b".
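.P
Both capturing rules can be seen in a short sketch. It uses Python's standard
re module, which is assumed here to behave like PCRE2 for repeated and nested
capture groups. It is an illustration, not a statement about the PCRE2 API.

```python
import re

# A repeated group reports only what its final iteration matched.
m1 = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# The nested group (b) takes no part in the final iteration, which matches
# "a", so it keeps the value set in the previous iteration.
m2 = re.match(r"(a|(b))+", "aba")
outer, inner = m2.group(1), m2.group(2)
```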
				2064	.
				2065	.
				2066	.\" HTML <a name="atomicgroup"></a>
				2067	.SH "ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS"
				2068	.rs
				2069	.sp
				2070	With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
				2071	repetition, failure of what follows normally causes the repeated item to be
				2072	re-evaluated to see if a different number of repeats allows the rest of the
				2073	pattern to match. Sometimes it is useful to prevent this, either to change the
				2074	nature of the match, or to cause it to fail earlier than it otherwise might, when
				2075	the author of the pattern knows there is no point in carrying on.
				2076	.P
				2077	Consider, for example, the pattern \ed+foo when applied to the subject line
				2078	.sp
				2079	123456bar
				2080	.sp
				2081	After matching all 6 digits and then failing to match "foo", the normal
				2082	action of the matcher is to try again with only 5 digits matching the \ed+
				2083	item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
				2084	(a term taken from Jeffrey Friedl's book) provides the means for specifying
				2085	that once a group has matched, it is not to be re-evaluated in this way.
				2086	.P
				2087	If we use atomic grouping for the previous example, the matcher gives up
				2088	immediately on failing to match "foo" the first time. The notation is a kind of
				2089	special parenthesis, starting with (?> as in this example:
				2090	.sp
				2091	(?>\ed+)foo
				2092	.sp
				2093	Perl 5.28 introduced an experimental alphabetic form starting with (* which may
				2094	be easier to remember:
				2095	.sp
				2096	(*atomic:\ed+)foo
				2097	.sp
Elliott Hughes	4e19c8e	2022-04-15 15:11:02 -0700	[diff] [blame]	2098	This kind of parenthesized group "locks up" the part of the pattern it contains
				2099	once it has matched, and a failure further into the pattern is prevented from
				2100	backtracking into it. Backtracking past it to previous items, however, works as
				2101	normal.
Elliott Hughes	5b80804	2021-10-01 10:56:10 -0700	[diff] [blame]	2102	.P
				2103	An alternative description is that a group of this type matches exactly the
				2104	string of characters that an identical standalone pattern would match, if
				2105	anchored at the current point in the subject string.
				2106	.P
				2107	Atomic groups are not capture groups. Simple cases such as the above example
				2108	can be thought of as a maximizing repeat that must swallow everything it can.
				2109	So, while both \ed+ and \ed+? are prepared to adjust the number of digits they
				2110	match in order to make the rest of the pattern match, (?>\ed+) can only match
				2111	an entire sequence of digits.
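.P
The effect of "swallowing everything" can be sketched without the (?> syntax,
which older versions of Python's re module do not accept. The example below
therefore emulates an atomic group with a capturing lookahead followed by a
backreference, relying on the fact that a lookahead is not re-entered after it
has succeeded. This emulation and the group name "digits" are assumptions of
the sketch, not PCRE2 notation for atomic groups.

```python
import re

# Ordinary greedy repeat: \d+ gives back the final "5" so that the literal
# "5" in the pattern can match.
backtracking = re.match(r"\d+5", "12345")

# Atomic-like repeat: the lookahead captures the longest run of digits and
# the backreference consumes exactly that run, leaving no "5" to match.
atomic_like = re.match(r"(?=(?P<digits>\d+))(?P=digits)5", "12345")

# When the next item cannot start with a digit, both forms agree.
both_match = (
    re.match(r"\d+foo", "123foo") is not None,
    re.match(r"(?=(?P<digits>\d+))(?P=digits)foo", "123foo") is not None,
)
```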
				2112	.P
				2113	Atomic groups in general can of course contain arbitrarily complicated
				2114	expressions, and can be nested. However, when an atomic group contains just
				2115	a single repeated item, as in the example above, a simpler
				2116	notation, called a "possessive quantifier", can be used. This consists of an
				2117	additional + character following a quantifier. Using this notation, the
				2118	previous example can be rewritten as
				2119	.sp
				2120	\ed++foo
				2121	.sp
				2122	Note that a possessive quantifier can be used with an entire group, for
				2123	example:
				2124	.sp
				2125	(abc|xyz){2,3}+
				2126	.sp
				2127	Possessive quantifiers are always greedy; the setting of the PCRE2_UNGREEDY
				2128	option is ignored. They are a convenient notation for the simpler forms of
				2129	atomic group. However, there is no difference in the meaning of a possessive
				2130	quantifier and the equivalent atomic group, though there may be a performance
				2131	difference; possessive quantifiers should be slightly faster.
				2132	.P
				2133	The possessive quantifier syntax is an extension to the Perl 5.8 syntax.
				2134	Jeffrey Friedl originated the idea (and the name) in the first edition of his
				2135	book. Mike McCloskey liked it, so implemented it when he built Sun's Java
				2136	package, and PCRE1 copied it from there. It found its way into Perl at release
				2137	5.10.
				2138	.P
				2139	PCRE2 has an optimization that automatically "possessifies" certain simple
				2140	pattern constructs. For example, the sequence A+B is treated as A++B because
				2141	there is no point in backtracking into a sequence of A's when B must follow.
				2142	This feature can be disabled by the PCRE2_NO_AUTO_POSSESS option, or by starting
				2143	the pattern with (*NO_AUTO_POSSESS).
				2144	.P
				2145	When a pattern contains an unlimited repeat inside a group that can itself be
				2146	repeated an unlimited number of times, the use of an atomic group is the only
				2147	way to avoid some failing matches taking a very long time indeed. The pattern
				2148	.sp
				2149	(\eD+|<\ed+>)*[!?]
				2150	.sp
				2151	matches an unlimited number of substrings that either consist of non-digits, or
				2152	digits enclosed in <>, followed by either ! or ?. When it matches, it runs
				2153	quickly. However, if it is applied to
				2154	.sp
				2155	aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
				2156	.sp
				2157	it takes a long time before reporting failure. This is because the string can
				2158	be divided between the internal \eD+ repeat and the external * repeat in a
				2159	large number of ways, and all have to be tried. (The example uses [!?] rather
				2160	than a single character at the end, because both PCRE2 and Perl have an
				2161	optimization that allows for fast failure when a single character is used. They
				2162	remember the last single character that is required for a match, and fail early
				2163	if it is not present in the string.) If the pattern is changed so that it uses
				2164	an atomic group, like this:
				2165	.sp
				2166	((?>\eD+)|<\ed+>)*[!?]
				2167	.sp
				2168	sequences of non-digits cannot be broken, and failure happens quickly.
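.P
The cost of the failing match can be observed with a rough timing sketch. It
uses Python's standard re module, not PCRE2, and emulates the atomic group with
a capturing lookahead plus a backreference (an assumption made so that the
sketch runs on Python versions without the (?> syntax). The subject is kept
short on purpose, and the timings are printed rather than asserted because
they vary between machines.

```python
import re
import time

# Only 16 characters: the work for the plain pattern grows roughly
# twofold with each extra character.
subject = "a" * 16


def timed_match(pattern: str) -> tuple:
    """Return the match result and the elapsed time in seconds."""
    begin = time.perf_counter()
    result = re.match(pattern, subject)
    return result, time.perf_counter() - begin


# Plain nested repeats: every division of the string is tried.
plain_result, plain_seconds = timed_match(r"(\D+|<\d+>)*[!?]")

# Atomic-like inner repeat: the run of non-digits cannot be broken up.
atomic_result, atomic_seconds = timed_match(
    r"((?=(?P<run>\D+))(?P=run)|<\d+>)*[!?]"
)

print(f"plain: {plain_seconds:.4f}s  atomic-like: {atomic_seconds:.4f}s")
```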
				2169	.
				2170	.
				2171	.\" HTML <a name="backreferences"></a>
				2172	.SH "BACKREFERENCES"
				2173	.rs
				2174	.sp
				2175	Outside a character class, a backslash followed by a digit greater than 0 (and
				2176	possibly further digits) is a backreference to a capture group earlier (that
				2177	is, to its left) in the pattern, provided there have been that many previous
				2178	capture groups.
				2179	.P
				2180	However, if the decimal number following the backslash is less than 8, it is
				2181	always taken as a backreference, and causes an error only if there are not that
				2182	many capture groups in the entire pattern. In other words, the group that is
				2183	referenced need not be to the left of the reference for numbers less than 8. A
				2184	"forward backreference" of this type can make sense when a repetition is
				2185	involved and the group to the right has participated in an earlier iteration.
				2186	.P
				2187	It is not possible to have a numerical "forward backreference" to a group whose
				2188	number is 8 or more using this syntax because a sequence such as \e50 is
				2189	interpreted as a character defined in octal. See the subsection entitled
				2190	"Non-printing characters"
				2191	.\" HTML <a href="#digitsafterbackslash">
				2192	.\" </a>
				2193	above
				2194	.\"
				2195	for further details of the handling of digits following a backslash. Other
				2196	forms of backreferencing do not suffer from this restriction. In particular,
				2197	there is no problem when named capture groups are used (see below).
				2198	.P
				2199	Another way of avoiding the ambiguity inherent in the use of digits following a
				2200	backslash is to use the \eg escape sequence. This escape must be followed by a
				2201	signed or unsigned number, optionally enclosed in braces. These examples are
				2202	all identical:
				2203	.sp
				2204	(ring), \e1
				2205	(ring), \eg1
				2206	(ring), \eg{1}
				2207	.sp
				2208	An unsigned number specifies an absolute reference without the ambiguity that
				2209	is present in the older syntax. It is also useful when literal digits follow
				2210	the reference. A signed number is a relative reference. Consider this example:
				2211	.sp
				2212	(abc(def)ghi)\eg{-1}
				2213	.sp
				2214	The sequence \eg{-1} is a reference to the most recently started capture group
				2215	before \eg, that is, it is equivalent to \e2 in this example. Similarly,
				2216	\eg{-2} would be equivalent to \e1. The use of relative references can be
				2217	helpful in long patterns, and also in patterns that are created by joining
				2218	together fragments that contain references within themselves.
				2219	.P
				2220	The sequence \eg{+1} is a reference to the next capture group. This kind of
				2221	forward reference can be useful in patterns that repeat. Perl does not support
				2222	the use of + in this way.
				2223	.P
				2224	A backreference matches whatever actually most recently matched the capture
				2225	group in the current subject string, rather than anything at all that matches
				2226	the group (see
				2227	.\" HTML <a href="#groupsassubroutines">
				2228	.\" </a>
				2229	"Groups as subroutines"
				2230	.\"
				2231	below for a way of doing that). So the pattern
				2232	.sp
				2233	(sens|respons)e and \e1ibility
				2234	.sp
				2235	matches "sense and sensibility" and "response and responsibility", but not
				2236	"sense and responsibility". If caseful matching is in force at the time of the
				2237	backreference, the case of letters is relevant. For example,
				2238	.sp
				2239	((?i)rah)\es+\e1
				2240	.sp
				2241	matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
				2242	capture group is matched caselessly.
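.P
These two examples can be reproduced in a short sketch using Python's standard
re module, which is assumed to follow the same rules for backreferences. Recent
Python versions reject an option setting such as (?i) in the middle of a
pattern, so the sketch uses the scoped form (?i:rah) instead. That substitution
is a choice made for the sketch.

```python
import re

# The backreference repeats the text that was matched, not the group's pattern.
pair = re.compile(r"(sens|respons)e and \1ibility")
same_word = (
    pair.fullmatch("sense and sensibility") is not None,
    pair.fullmatch("response and responsibility") is not None,
)
mixed_words = pair.fullmatch("sense and responsibility") is not None

# The group is matched caselessly, but the backreference lies outside the
# scoped option and is therefore compared casefully.
rah = re.compile(r"((?i:rah))\s+\1")
case_results = [rah.fullmatch(s) is not None
                for s in ("rah rah", "RAH RAH", "RAH rah")]
```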
				2243	.P
				2244	There are several different ways of writing backreferences to named capture
				2245	groups. The .NET syntax \ek{name} and the Perl syntax \ek<name> or \ek'name'
				2246	are supported, as is the Python syntax (?P=name). Perl 5.10's unified
				2247	backreference syntax, in which \eg can be used for both numeric and named
				2248	references, is also supported. We could rewrite the above example in any of the
				2249	following ways:
				2250	.sp
				2251	(?<p1>(?i)rah)\es+\ek<p1>
				2252	(?'p1'(?i)rah)\es+\ek{p1}
				2253	(?P<p1>(?i)rah)\es+(?P=p1)
				2254	(?<p1>(?i)rah)\es+\eg{p1}
				2255	.sp
				2256	A capture group that is referenced by name may appear in the pattern before or
				2257	after the reference.
				2258	.P
				2259	There may be more than one backreference to the same group. If a group has not
				2260	actually been used in a particular match, backreferences to it always fail by
				2261	default. For example, the pattern
				2262	.sp
				2263	(a|(bc))\e2
				2264	.sp
				2265	always fails if it starts to match "a" rather than "bc". However, if the
				2266	PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backreference to an
				2267	unset value matches an empty string.
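.P
The default behaviour can be checked with a brief sketch. It uses Python's
standard re module, which is assumed to fail a backreference to an unset group
in the same way. Python has no counterpart of PCRE2_MATCH_UNSET_BACKREF, so
only the default case is shown.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 is never set when the first alternative matches "a", so the
# backreference fails and there is no match.
unset_case = pattern.match("a")

# When the second alternative matches, group 2 is set and \2 repeats it.
set_case = pattern.match("bcbc")
```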
				2268	.P
				2269	Because there may be many capture groups in a pattern, all digits following a
				2270	backslash are taken as part of a potential backreference number. If the pattern
				2271	continues with a digit character, some delimiter must be used to terminate the
				2272	backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, this
				2273	can be white space. Otherwise, the \eg{} syntax or an empty comment (see
				2274	.\" HTML <a href="#comments">
				2275	.\" </a>
				2276	"Comments"
				2277	.\"
				2278	below) can be used.
				2279	.
				2280	.
				2281	.SS "Recursive backreferences"
				2282	.rs
				2283	.sp
				2284	A backreference that occurs inside the group to which it refers fails when the
				2285	group is first used, so, for example, (a\e1) never matches. However, such
				2286	references can be useful inside repeated groups. For example, the pattern
				2287	.sp
				2288	(a|b\e1)+
				2289	.sp
				2290	matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
				2291	the group, the backreference matches the character string corresponding to the
				2292	previous iteration. In order for this to work, the pattern must be such that
				2293	the first iteration does not need to match the backreference. This can be done
				2294	using alternation, as in the example above, or by a quantifier with a minimum
				2295	of zero.
				2296	.P
				2297	In versions of PCRE2 earlier than 10.25, backreferences of this type caused
				2298	the group that they reference to be treated as an
				2299	.\" HTML <a href="#atomicgroup">
				2300	.\" </a>
				2301	atomic group.
				2302	.\"
				2303	This restriction no longer applies, and backtracking into such groups can occur
				2304	as normal.
				2305	.
				2306	.
				2307	.\" HTML <a name="bigassertions"></a>
				2308	.SH ASSERTIONS
				2309	.rs
				2310	.sp
				2311	An assertion is a test on the characters following or preceding the current
				2312	matching point that does not consume any characters. The simple assertions
				2313	coded as \eb, \eB, \eA, \eG, \eZ, \ez, ^ and $ are described
				2314	.\" HTML <a href="#smallassertions">
				2315	.\" </a>
				2316	above.
				2317	.\"
				2318	.P
				2319	More complicated assertions are coded as parenthesized groups. There are two
				2320	kinds: those that look ahead of the current position in the subject string, and
				2321	those that look behind it, and in each case an assertion may be positive (must
				2322	match for the assertion to be true) or negative (must not match for the
				2323	assertion to be true). An assertion group is matched in the normal way,
				2324	and if it is true, matching continues after it, but with the matching position
				2325	in the subject string reset to what it was before the assertion was processed.
				2326	.P
				2327	The Perl-compatible lookaround assertions are atomic. If an assertion is true,
				2328	but there is a subsequent matching failure, there is no backtracking into the
				2329	assertion. However, there are some cases where non-atomic assertions can be
				2330	useful. PCRE2 has some support for these, described in the section entitled
				2331	.\" HTML <a href="#nonatomicassertions">
				2332	.\" </a>
				2333	"Non-atomic assertions"
				2334	.\"
				2335	below, but they are not Perl-compatible.
				2336	.P
				2337	A lookaround assertion may appear as the condition in a
				2338	.\" HTML <a href="#conditions">
				2339	.\" </a>
				2340	conditional group
				2341	.\"
				2342	(see below). In this case, the result of matching the assertion determines
				2343	which branch of the condition is followed.
				2344	.P
				2345	Assertion groups are not capture groups. If an assertion contains capture
				2346	groups within it, these are counted for the purposes of numbering the capture
				2347	groups in the whole pattern. Within each branch of an assertion, locally
				2348	captured substrings may be referenced in the usual way. For example, a sequence
				2349	such as (.)\eg{-1} can be used to check that two adjacent characters are the
				2350	same.
				2351	.P
				2352	When a branch within an assertion fails to match, any substrings that were
				2353	captured are discarded (as happens with any pattern branch that fails to
				2354	match). A negative assertion is true only when all its branches fail to match;
				2355	this means that no captured substrings are ever retained after a successful
				2356	negative assertion. When an assertion contains a matching branch, what happens
				2357	depends on the type of assertion.
				2358	.P
				2359	For a positive assertion, internally captured substrings in the successful
				2360	branch are retained, and matching continues with the next pattern item after
				2361	the assertion. For a negative assertion, a matching branch means that the
				2362	assertion is not true. If such an assertion is being used as a condition in a
				2363	.\" HTML <a href="#conditions">
				2364	.\" </a>
				2365	conditional group
				2366	.\"
				2367	(see below), captured substrings are retained, because matching continues with
				2368	the "no" branch of the condition. For other failing negative assertions,
				2369	control passes to the previous backtracking point, thus discarding any captured
				2370	strings within the assertion.
				2371	.P
				2372	Most assertion groups may be repeated; though it makes no sense to assert the
				2373	same thing several times, the side effect of capturing in positive assertions
				2374	may occasionally be useful. However, an assertion that forms the condition for
				2375	a conditional group may not be quantified. PCRE2 used to restrict the
				2376	repetition of assertions, but from release 10.35 the only restriction is that
				2377	an unlimited maximum repetition is changed to be one more than the minimum. For
				2378	example, {3,} is treated as {3,4}.
				2379	.
				2380	.
				2381	.SS "Alphabetic assertion names"
				2382	.rs
				2383	.sp
				2384	Traditionally, symbolic sequences such as (?= and (?<= have been used to
				2385	specify lookaround assertions. Perl 5.28 introduced some experimental
				2386	alphabetic alternatives which might be easier to remember. They all start with
				2387	(* instead of (? and must be written using lower case letters. PCRE2 supports
				2388	the following synonyms:
				2389	.sp
				2390	(*positive_lookahead: or (*pla: is the same as (?=
				2391	(*negative_lookahead: or (*nla: is the same as (?!
				2392	(*positive_lookbehind: or (*plb: is the same as (?<=
				2393	(*negative_lookbehind: or (*nlb: is the same as (?<!
				2394	.sp
				2395	For example, (*pla:foo) is the same assertion as (?=foo). In the following
				2396	sections, the various assertions are described using the original symbolic
				2397	forms.
				2398	.
				2399	.
				2400	.SS "Lookahead assertions"
				2401	.rs
				2402	.sp
				2403	Lookahead assertions start with (?= for positive assertions and (?! for
				2404	negative assertions. For example,
				2405	.sp
				2406	\ew+(?=;)
				2407	.sp
				2408	matches a word followed by a semicolon, but does not include the semicolon in
				2409	the match, and
				2410	.sp
				2411	foo(?!bar)
				2412	.sp
				2413	matches any occurrence of "foo" that is not followed by "bar". Note that the
				2414	apparently similar pattern
				2415	.sp
				2416	(?!foo)bar
				2417	.sp
				2418	does not find an occurrence of "bar" that is preceded by something other than
				2419	"foo"; it finds any occurrence of "bar" whatsoever, because the assertion
				2420	(?!foo) is always true when the next three characters are "bar". A
				2421	lookbehind assertion is needed to achieve the other effect.
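.P
The three lookahead examples can be tried in a short sketch. It uses Python's
standard re module, which is assumed to share the symbolic (?= and (?! syntax
and its meaning with PCRE2. The subject strings are made up for illustration.

```python
import re

# The semicolon must follow the word, but it is not part of the match.
word = re.search(r"\w+(?=;)", "total = count;").group(0)

# Only the second "foo" is not followed by "bar".
foo_offsets = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested at the position where "bar" starts, where it is
# always true, so the preceding "foo" does not prevent the match.
bar_offset = re.search(r"(?!foo)bar", "foobar").start()
```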
				2422	.P
				2423	If you want to force a matching failure at some point in a pattern, the most
				2424	convenient way to do it is with (?!) because an empty string always matches, so
				2425	an assertion that requires there not to be an empty string must always fail.
				2426	The backtracking control verb (*FAIL) or (*F) is a synonym for (?!).
				2427	.
				2428	.
				2429	.\" HTML <a name="lookbehind"></a>
				2430	.SS "Lookbehind assertions"
				2431	.rs
				2432	.sp
				2433	Lookbehind assertions start with (?<= for positive assertions and (?<! for
				2434	negative assertions. For example,
				2435	.sp
				2436	(?<!foo)bar
				2437	.sp
				2438	does find an occurrence of "bar" that is not preceded by "foo". The contents of
				2439	a lookbehind assertion are restricted such that all the strings it matches must
				2440	have a fixed length. However, if there are several top-level alternatives, they
				2441	do not all have to have the same fixed length. Thus
				2442	.sp
				2443	(?<=bullock|donkey)
				2444	.sp
				2445	is permitted, but
				2446	.sp
				2447	(?<!dogs?|cats?)
				2448	.sp
				2449	causes an error at compile time. Branches that match different length strings
				2450	are permitted only at the top level of a lookbehind assertion. This is an
				2451	extension compared with Perl, which requires all branches to match the same
				2452	length of string. An assertion such as
				2453	.sp
				2454	(?<=ab(c|de))
				2455	.sp
				2456	is not permitted, because its single top-level branch can match two different
				2457	lengths, but it is acceptable to PCRE2 if rewritten to use two top-level
				2458	branches:
				2459	.sp
				2460	(?<=abc|abde)
				2461	.sp
				2462	In some cases, the escape sequence \eK
				2463	.\" HTML <a href="#resetmatchstart">
				2464	.\" </a>
				2465	(see above)
				2466	.\"
				2467	can be used instead of a lookbehind assertion to get round the fixed-length
				2468	restriction.
				2469	.P
				2470	The implementation of lookbehind assertions is, for each alternative, to
				2471	temporarily move the current position back by the fixed length and then try to
				2472	match. If there are insufficient characters before the current position, the
				2473	assertion fails.
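.P
The effect of having too few preceding characters can be seen in a small
sketch. It uses Python's standard re module, which is assumed to treat simple
fixed-length lookbehinds as PCRE2 does (Python is stricter about alternatives
of different lengths, so none are used). When the subject is too short, the
test inside the lookbehind cannot match: a positive lookbehind is then false
and a negative one is true.

```python
import re

# "bar" preceded by "foo": the negative lookbehind rejects it.
preceded = re.search(r"(?<!foo)bar", "foobar")

# "bar" preceded by something else: accepted at offset 3.
not_preceded = re.search(r"(?<!foo)bar", "bazbar").start()

# Fewer than three characters precede "bar" at offset 0.
short_negative = re.search(r"(?<!foo)bar", "bar")
short_positive = re.search(r"(?<=foo)bar", "bar")
```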
				2474	.P
				2475	In UTF-8 and UTF-16 modes, PCRE2 does not allow the \eC escape (which matches a
				2476	single code unit even in a UTF mode) to appear in lookbehind assertions,
				2477	because it makes it impossible to calculate the length of the lookbehind. The
				2478	\eX and \eR escapes, which can match different numbers of code units, are never
				2479	permitted in lookbehinds.
				2480	.P
				2481	.\" HTML <a href="#groupsassubroutines">
				2482	.\" </a>
				2483	"Subroutine"
				2484	.\"
				2485	calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
				2486	as the called capture group matches a fixed-length string. However,
				2487	.\" HTML <a href="#recursion">
				2488	.\" </a>
				2489	recursion,
				2490	.\"
				2491	that is, a "subroutine" call into a group that is already active,
				2492	is not supported.
				2493	.P
				2494	Perl does not support backreferences in lookbehinds. PCRE2 does support them,
				2495	but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
				2496	must not be set, there must be no use of (?| in the pattern (it creates
				2497	duplicate group numbers), and if the backreference is by name, the name
				2498	must be unique. Of course, the referenced group must itself match a fixed
				2499	length substring. The following pattern matches words containing at least two
				2500	characters that begin and end with the same character:
				2501	.sp
				2502	\eb(\ew)\ew++(?<=\e1)
				2503	.P
				2504	Possessive quantifiers can be used in conjunction with lookbehind assertions to
				2505	specify efficient matching of fixed-length strings at the end of subject
				2506	strings. Consider a simple pattern such as
				2507	.sp
				2508	abcd$
				2509	.sp
				2510	when applied to a long string that does not match. Because matching proceeds
				2511	from left to right, PCRE2 will look for each "a" in the subject and then see if
				2512	what follows matches the rest of the pattern. If the pattern is specified as
				2513	.sp
				2514	^.*abcd$
				2515	.sp
				2516	the initial .* matches the entire string at first, but when this fails (because
				2517	there is no following "a"), it backtracks to match all but the last character,
				2518	then all but the last two characters, and so on. Once again the search for "a"
				2519	covers the entire string, from right to left, so we are no better off. However,
				2520	if the pattern is written as
				2521	.sp
				2522	^.*+(?<=abcd)
				2523	.sp
				2524	there can be no backtracking for the .*+ item because of the possessive
				2525	quantifier; it can match only the entire string. The subsequent lookbehind
				2526	assertion does a single test on the last four characters. If it fails, the
				2527	match fails immediately. For long strings, this approach makes a significant
				2528	difference to the processing time.
				2529	.
				2530	.
				2531	.SS "Using multiple assertions"
				2532	.rs
				2533	.sp
				2534	Several assertions (of any sort) may occur in succession. For example,
				2535	.sp
				2536	(?<=\ed{3})(?<!999)foo
				2537	.sp
				2538	matches "foo" preceded by three digits that are not "999". Notice that each of
				2539	the assertions is applied independently at the same point in the subject
				2540	string. First there is a check that the previous three characters are all
				2541	digits, and then there is a check that the same three characters are not "999".
				2542	This pattern does \fInot\fP match "foo" preceded by six characters, the first three
				2543	of which are digits and the last three of which are not "999". For example, it
				2544	doesn't match "123abcfoo". A pattern to do that is
				2545	.sp
				2546	(?<=\ed{3}...)(?<!999)foo
				2547	.sp
				2548	This time the first assertion looks at the preceding six characters, checking
				2549	that the first three are digits, and then the second assertion checks that the
				2550	preceding three characters are not "999".
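The difference between the two patterns can be checked with a short Python sketch. Python's re module is used here only as a stand-in, on the assumption that it treats these particular fixed-length lookbehinds the same way as PCRE2; the variable names are illustrative.

```python
import re

# Both assertions are applied at the same point: the three characters
# before "foo" must be digits and must not be "999".
three_before = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now looks back six characters (three digits plus
# any three characters); the second still checks only the last three.
six_before = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def found(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None
```

Under these assumptions, found(three_before, "123abcfoo") is false, whereas found(six_before, "123abcfoo") is true.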
				2551	.P
				2552	Assertions can be nested in any combination. For example,
				2553	.sp
				2554	(?<=(?<!foo)bar)baz
				2555	.sp
				2556	matches an occurrence of "baz" that is preceded by "bar" which in turn is not
				2557	preceded by "foo", while
				2558	.sp
				2559	(?<=\ed{3}(?!999)...)foo
				2560	.sp
				2561	is another pattern that matches "foo" preceded by three digits and any three
				2562	characters that are not "999".
				2563	.
				2564	.
				2565	.\" HTML <a name="nonatomicassertions"></a>
				2566	.SH "NON-ATOMIC ASSERTIONS"
				2567	.rs
				2568	.sp
				2569	The traditional Perl-compatible lookaround assertions are atomic. That is, if
				2570	an assertion is true, but there is a subsequent matching failure, there is no
				2571	backtracking into the assertion. However, there are some cases where non-atomic
				2572	positive assertions can be useful. PCRE2 provides these using the following
				2573	syntax:
				2574	.sp
				2575	(*non_atomic_positive_lookahead: or (*napla: or (?*
				2576	(*non_atomic_positive_lookbehind: or (*naplb: or (?<*
				2577	.sp
				2578	Consider the problem of finding the right-most word in a string that also
				2579	appears earlier in the string, that is, it must appear at least twice in total.
				2580	This pattern returns the required result as captured substring 1:
				2581	.sp
				2582	^(?x)(*napla: .* \eb(\ew++)) (?> .*? \eb\e1\eb ){2}
				2583	.sp
				2584	For a subject such as "word1 word2 word3 word2 word3 word4" the result is
				2585	"word3". How does it work? At the start, ^(?x) anchors the pattern and sets the
				2586	"x" option, which causes white space (introduced for readability) to be
				2587	ignored. Inside the assertion, the greedy .* at first consumes the entire
				2588	string, but then has to backtrack until the rest of the assertion can match a
				2589	word, which is captured by group 1. In other words, when the assertion first
				2590	succeeds, it captures the right-most word in the string.
				2591	.P
				2592	The current matching point is then reset to the start of the subject, and the
				2593	rest of the pattern match checks for two occurrences of the captured word,
				2594	using an ungreedy .*? to scan from the left. If this succeeds, we are done, but
				2595	if the last word in the string does not occur twice, this part of the pattern
				2596	fails. If a traditional atomic lookahead (?= or (*pla: had been used, the
				2597	assertion could not be re-entered, and the whole match would fail. The pattern
				2598	would succeed only if the very last word in the subject was found twice.
				2599	.P
				2600	Using a non-atomic lookahead, however, means that when the last word does not
				2601	occur twice in the string, the lookahead can backtrack and find the second-last
				2602	word, and so on, until either the match succeeds, or all words have been
				2603	tested.
				2604	.P
				2605	Two conditions must be met for a non-atomic assertion to be useful: the
				2606	contents of one or more capturing groups must change after a backtrack into the
				2607	assertion, and there must be a backreference to a changed group later in the
				2608	pattern. If this is not the case, the rest of the pattern match fails exactly
				2609	as before because nothing has changed, so using a non-atomic assertion just
				2610	wastes resources.
				2611	.P
				2612	There is one exception to backtracking into a non-atomic assertion. If an
				2613	(*ACCEPT) control verb is triggered, the assertion succeeds atomically. That
				2614	is, a subsequent match failure cannot backtrack into the assertion.
				2615	.P
				2616	Non-atomic assertions are not supported by the alternative matching function
				2617	\fBpcre2_dfa_match()\fP. They are supported by JIT, but only if they do not
				2618	contain any control verbs such as (*ACCEPT). (This may change in the future.) Note
				2619	that assertions that appear as conditions for
				2620	.\" HTML <a href="#conditions">
				2621	.\" </a>
				2622	conditional groups
				2623	.\"
				2624	(see below) must be atomic.
				2625	.
				2626	.
				2627	.SH "SCRIPT RUNS"
				2628	.rs
				2629	.sp
				2630	In concept, a script run is a sequence of characters that are all from the same
				2631	Unicode script such as Latin or Greek. However, because some scripts are
				2632	commonly used together, and because some diacritical and other marks are used
				2633	with multiple scripts, it is not that simple. There is a full description of
				2634	the rules that PCRE2 uses in the section entitled
				2635	.\" HTML <a href="pcre2unicode.html#scriptruns">
				2636	.\" </a>
				2637	"Script Runs"
				2638	.\"
				2639	in the
				2640	.\" HREF
				2641	\fBpcre2unicode\fP
				2642	.\"
				2643	documentation.
				2644	.P
				2645	If part of a pattern is enclosed between (*script_run: or (*sr: and a closing
				2646	parenthesis, it fails if the sequence of characters that it matches is not a
				2647	script run. After a failure, normal backtracking occurs. Script runs can be
				2648	used to detect spoofing attacks using characters that look the same, but are
				2649	from different scripts. The string "paypal.com" is an infamous example, where
				2650	the letters could be a mixture of Latin and Cyrillic. This pattern ensures that
				2651	the matched characters in a sequence of non-spaces that follow white space are
				2652	a script run:
				2653	.sp
				2654	\es+(*sr:\eS+)
				2655	.sp
				2656	To be sure that they are all from the Latin script (for example), a lookahead
				2657	can be used:
				2658	.sp
				2659	\es+(?=\ep{Latin})(*sr:\eS+)
				2660	.sp
				2661	This works as long as the first character is expected to be a character in that
				2662	script, and not (for example) punctuation, which is allowed with any script. If
				2663	this is not the case, a more creative lookahead is needed. For example, if
				2664	digits, underscore, and dots are permitted at the start:
				2665	.sp
				2666	\es+(?=[0-9_.]*\ep{Latin})(*sr:\eS+)
				2667	.sp
				2668	.P
				2669	In many cases, backtracking into a script run pattern fragment is not
				2670	desirable. The script run can employ an atomic group to prevent this. Because
				2671	this is a common requirement, a shorthand notation is provided by
				2672	(*atomic_script_run: or (*asr:
				2673	.sp
				2674	(*asr:...) is the same as (*sr:(?>...))
				2675	.sp
				2676	Note that the atomic group is inside the script run. Putting it outside would
				2677	not prevent backtracking into the script run pattern.
				2678	.P
				2679	Support for script runs is not available if PCRE2 is compiled without Unicode
				2680	support. A compile-time error is given if any of the above constructs is
				2681	encountered. Script runs are not supported by the alternate matching function,
				2682	\fBpcre2_dfa_match()\fP because they use the same mechanism as capturing
				2683	parentheses.
				2684	.P
				2685	\fBWarning:\fP The (*ACCEPT) control verb
				2686	.\" HTML <a href="#acceptverb">
				2687	.\" </a>
				2688	(see below)
				2689	.\"
				2690	should not be used within a script run group, because it causes an immediate
				2691	exit from the group, bypassing the script run checking.
				2692	.
				2693	.
				2694	.\" HTML <a name="conditions"></a>
				2695	.SH "CONDITIONAL GROUPS"
				2696	.rs
				2697	.sp
				2698	It is possible to cause the matching process to obey a pattern fragment
				2699	conditionally or to choose between two alternative fragments, depending on
				2700	the result of an assertion, or whether a specific capture group has
				2701	already been matched. The two possible forms of conditional group are:
				2702	.sp
				2703	(?(condition)yes-pattern)
				2704	(?(condition)yes-pattern\|no-pattern)
				2705	.sp
				2706	If the condition is satisfied, the yes-pattern is used; otherwise the
				2707	no-pattern (if present) is used. An absent no-pattern is equivalent to an empty
				2708	string (it always matches). If there are more than two alternatives in the
				2709	group, a compile-time error occurs. Each of the two alternatives may itself
				2710	contain nested groups of any form, including conditional groups; the
				2711	restriction to two alternatives applies only at the level of the condition
				2712	itself. This pattern fragment is an example where the alternatives are complex:
				2713	.sp
				2714	(?(1) (A\|B\|C) \| (D \| (?(2)E\|F) \| E) )
				2715	.sp
				2716	.P
				2717	There are five kinds of condition: references to capture groups, references to
				2718	recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
				2719	.
				2720	.
				2721	.SS "Checking for a used capture group by number"
				2722	.rs
				2723	.sp
				2724	If the text between the parentheses consists of a sequence of digits, the
				2725	condition is true if a capture group of that number has previously matched. If
				2726	there is more than one capture group with the same number (see the earlier
				2727	.\"
				2728	.\" HTML <a href="#dupgroupnumber">
				2729	.\" </a>
				2730	section about duplicate group numbers),
				2731	.\"
				2732	the condition is true if any of them have matched. An alternative notation is
				2733	to precede the digits with a plus or minus sign. In this case, the group number
				2734	is relative rather than absolute. The most recently opened capture group can be
				2735	referenced by (?(-1), the next most recent by (?(-2), and so on. Inside loops
				2736	it can also make sense to refer to subsequent groups. The next capture group
				2737	can be referenced as (?(+1), and so on. (The value zero in any of these forms
				2738	is not used; it provokes a compile-time error.)
				2739	.P
				2740	Consider the following pattern, which contains non-significant white space to
				2741	make it more readable (assume the PCRE2_EXTENDED option) and to divide it into
				2742	three parts for ease of discussion:
				2743	.sp
				2744	( \e( )? [^()]+ (?(1) \e) )
				2745	.sp
				2746	The first part matches an optional opening parenthesis, and if that
				2747	character is present, sets it as the first captured substring. The second part
				2748	matches one or more characters that are not parentheses. The third part is a
				2749	conditional group that tests whether or not the first capture group
				2750	matched. If it did, that is, if the subject started with an opening parenthesis,
				2751	the condition is true, and so the yes-pattern is executed and a closing
				2752	parenthesis is required. Otherwise, since no-pattern is not present, the
				2753	conditional group matches nothing. In other words, this pattern matches a
				2754	sequence of non-parentheses, optionally enclosed in parentheses.
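The behaviour described above can be tried out with a short Python sketch. Python's re module also accepts the (?(1)...) form, and re.VERBOSE plays the role of PCRE2_EXTENDED here; that the two engines agree on this pattern is an assumption of the sketch, and the names BRACKETED and is_optionally_parenthesized are illustrative only.

```python
import re

# Optional "(", a run of non-parentheses, and ")" only if group 1 was set.
BRACKETED = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def is_optionally_parenthesized(text):
    """Return True if the whole text matches the conditional pattern."""
    return BRACKETED.fullmatch(text) is not None
```

Both "abc" and "(abc)" are accepted, while "(abc" is rejected because the condition is true and the closing parenthesis is then required.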
				2755	.P
				2756	If you were embedding this pattern in a larger one, you could use a relative
				2757	reference:
				2758	.sp
				2759	...other stuff... ( \e( )? [^()]+ (?(-1) \e) ) ...
				2760	.sp
				2761	This makes the fragment independent of the parentheses in the larger pattern.
				2762	.
				2763	.
				2764	.SS "Checking for a used capture group by name"
				2765	.rs
				2766	.sp
				2767	Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
				2768	capture group by name. For compatibility with earlier versions of PCRE1, which
				2769	had this facility before Perl, the syntax (?(name)...) is also recognized.
				2770	Note, however, that undelimited names consisting of the letter R followed by
				2771	digits are ambiguous (see the following section). Rewriting the above example
				2772	to use a named group gives this:
				2773	.sp
				2774	(?<OPEN> \e( )? [^()]+ (?(<OPEN>) \e) )
				2775	.sp
				2776	If the name used in a condition of this kind is a duplicate, the test is
				2777	applied to all groups of the same name, and is true if any one of them has
				2778	matched.
				2779	.
				2780	.
				2781	.SS "Checking for pattern recursion"
				2782	.rs
				2783	.sp
				2784	"Recursion" in this sense refers to any subroutine-like call from one part of
				2785	the pattern to another, whether or not it is actually recursive. See the
				2786	sections entitled
				2787	.\" HTML <a href="#recursion">
				2788	.\" </a>
				2789	"Recursive patterns"
				2790	.\"
				2791	and
				2792	.\" HTML <a href="#groupsassubroutines">
				2793	.\" </a>
				2794	"Groups as subroutines"
				2795	.\"
				2796	below for details of recursion and subroutine calls.
				2797	.P
				2798	If a condition is the string (R), and there is no capture group with the name
				2799	R, the condition is true if matching is currently in a recursion or subroutine
				2800	call to the whole pattern or any capture group. If digits follow the letter R,
				2801	and there is no group with that name, the condition is true if the most recent
				2802	call is into a group with the given number, which must exist somewhere in the
				2803	overall pattern. This is a contrived example that is equivalent to a+b:
				2804	.sp
				2805	((?(R1)a+\|(?1)b))
				2806	.sp
				2807	However, in both cases, if there is a capture group with a matching name, the
				2808	condition tests for its being set, as described in the section above, instead
				2809	of testing for recursion. For example, creating a group with the name R1 by
				2810	adding (?<R1>) to the above pattern completely changes its meaning.
				2811	.P
				2812	If a name preceded by ampersand follows the letter R, for example:
				2813	.sp
				2814	(?(R&name)...)
				2815	.sp
				2816	the condition is true if the most recent recursion is into a group of that name
				2817	(which must exist within the pattern).
				2818	.P
				2819	This condition does not check the entire recursion stack. It tests only the
				2820	current level. If the name used in a condition of this kind is a duplicate, the
				2821	test is applied to all groups of the same name, and is true if any one of
				2822	them is the most recent recursion.
				2823	.P
				2824	At "top level", all these recursion test conditions are false.
				2825	.
				2826	.
				2827	.\" HTML <a name="subdefine"></a>
				2828	.SS "Defining capture groups for use by reference only"
				2829	.rs
				2830	.sp
				2831	If the condition is the string (DEFINE), the condition is always false, even if
				2832	there is a group with the name DEFINE. In this case, there may be only one
				2833	alternative in the rest of the conditional group. It is always skipped if
				2834	control reaches this point in the pattern; the idea of DEFINE is that it can be
				2835	used to define subroutines that can be referenced from elsewhere. (The use of
				2836	.\" HTML <a href="#groupsassubroutines">
				2837	.\" </a>
				2838	subroutines
				2839	.\"
				2840	is described below.) For example, a pattern to match an IPv4 address such as
				2841	"192.168.23.245" could be written like this (ignore white space and line
				2842	breaks):
				2843	.sp
				2844	(?(DEFINE) (?<byte> 2[0-4]\ed \| 25[0-5] \| 1\ed\ed \| [1-9]?\ed) )
				2845	\eb (?&byte) (\e.(?&byte)){3} \eb
				2846	.sp
Elliott Hughes	16619d6	2021-10-29 12:10:38 -0700	[diff] [blame]	2847	The first part of the pattern is a DEFINE group inside which another group
Elliott Hughes	5b80804	2021-10-01 10:56:10 -0700	[diff] [blame]	2848	named "byte" is defined. This matches an individual component of an IPv4
				2849	address (a number less than 256). When matching takes place, this part of the
				2850	pattern is skipped because DEFINE acts like a false condition. The rest of the
				2851	pattern uses references to the named group to match the four dot-separated
				2852	components of an IPv4 address, insisting on a word boundary at each end.
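For comparison, the same "define once, reference several times" idea can be approximated in an engine that has no DEFINE groups. The Python sketch below splices the fragment in textually, which is an emulation and not how PCRE2 handles (?(DEFINE)...) or (?&byte); the names BYTE, IPV4 and is_ipv4 are hypothetical, and a non-capturing group replaces the capturing one in the pattern above.

```python
import re

# One component of an IPv4 address (0 to 255), written once.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Python's re has neither (?(DEFINE)...) nor (?&name), so the fragment
# is inserted into the pattern text four times instead of being called.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def is_ipv4(text):
    """Return True if the whole text is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None
```

The alternatives inside BYTE are copied unchanged from the pattern above, so the accepted range for each component is the same.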
				2853	.
				2854	.
				2855	.SS "Checking the PCRE2 version"
				2856	.rs
				2857	.sp
				2858	Programs that link with a PCRE2 library can check the version by calling
				2859	\fBpcre2_config()\fP with appropriate arguments. Users of applications that do
				2860	not have access to the underlying code cannot do this. A special "condition"
				2861	called VERSION exists to allow such users to discover which version of PCRE2
				2862	they are dealing with by using this condition to match a string such as
				2863	"yesno". VERSION must be followed either by "=" or ">=" and a version number.
				2864	For example:
				2865	.sp
				2866	(?(VERSION>=10.4)yes\|no)
				2867	.sp
				2868	This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
				2869	"no" otherwise. The fractional part of the version number may not contain more
				2870	than two digits.
				2871	.
				2872	.
				2873	.SS "Assertion conditions"
				2874	.rs
				2875	.sp
				2876	If the condition is not in any of the above formats, it must be a parenthesized
				2877	assertion. This may be a positive or negative lookahead or lookbehind
				2878	assertion. However, it must be a traditional atomic assertion, not one of the
				2879	PCRE2-specific
				2880	.\" HTML <a href="#nonatomicassertions">
				2881	.\" </a>
				2882	non-atomic assertions.
				2883	.\"
				2884	.P
				2885	Consider this pattern, again containing non-significant white space, and with
				2886	the two alternatives on the second line:
				2887	.sp
				2888	(?(?=[^a-z]*[a-z])
				2889	\ed{2}-[a-z]{3}-\ed{2} \| \ed{2}-\ed{2}-\ed{2} )
				2890	.sp
				2891	The condition is a positive lookahead assertion that matches an optional
				2892	sequence of non-letters followed by a letter. In other words, it tests for the
				2893	presence of at least one letter in the subject. If a letter is found, the
				2894	subject is matched against the first alternative; otherwise it is matched
				2895	against the second. This pattern matches strings in one of the two forms
				2896	dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
				2897	.P
				2898	When an assertion that is a condition contains capture groups, any
				2899	capturing that occurs in a matching branch is retained afterwards, for both
				2900	positive and negative assertions, because matching always continues after the
				2901	assertion, whether it succeeds or fails. (Compare non-conditional assertions,
				2902	for which captures are retained only for positive assertions that succeed.)
				2903	.
				2904	.
				2905	.\" HTML <a name="comments"></a>
				2906	.SH COMMENTS
				2907	.rs
				2908	.sp
				2909	There are two ways of including comments in patterns that are processed by
				2910	PCRE2. In both cases, the start of the comment must not be in a character
				2911	class, nor in the middle of any other sequence of related characters such as
				2912	(?: or a group name or number. The characters that make up a comment play
				2913	no part in the pattern matching.
				2914	.P
				2915	The sequence (?# marks the start of a comment that continues up to the next
				2916	closing parenthesis. Nested parentheses are not permitted. If the
				2917	PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
				2918	also introduces a comment, which in this case continues to immediately after
				2919	the next newline character or character sequence in the pattern. Which
				2920	characters are interpreted as newlines is controlled by an option passed to the
				2921	compiling function or by a special sequence at the start of the pattern, as
				2922	described in the section entitled
				2923	.\" HTML <a href="#newlines">
				2924	.\" </a>
				2925	"Newline conventions"
				2926	.\"
				2927	above. Note that the end of this type of comment is a literal newline sequence
				2928	in the pattern; escape sequences that happen to represent a newline do not
				2929	count. For example, consider this pattern when PCRE2_EXTENDED is set, and the
				2930	default newline convention (a single linefeed character) is in force:
				2931	.sp
				2932	abc #comment \en still comment
				2933	.sp
				2934	On encountering the # character, \fBpcre2_compile()\fP skips along, looking for
				2935	a newline in the pattern. The sequence \en is still literal at this stage, so
				2936	it does not terminate the comment. Only an actual character with the code value
				2937	0x0a (the default newline) does so.
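Python's re module follows the same rule in its verbose mode, so the distinction can be sketched there. This is an analogy and not a PCRE2 call: re.VERBOSE stands in for PCRE2_EXTENDED, and the variable names are illustrative.

```python
import re

# Raw string: the two characters backslash and "n" stay inside the
# comment, so the effective pattern is just "abc".
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real linefeed (0x0a), which ends the
# comment, so "still" becomes part of the pattern.
literal = re.compile("abc #comment \n still", re.VERBOSE)
```

Here escaped matches exactly "abc", while literal matches "abcstill".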
				2938	.
				2939	.
				2940	.\" HTML <a name="recursion"></a>
				2941	.SH "RECURSIVE PATTERNS"
				2942	.rs
				2943	.sp
				2944	Consider the problem of matching a string in parentheses, allowing for
				2945	unlimited nested parentheses. Without the use of recursion, the best that can
				2946	be done is to use a pattern that matches up to some fixed depth of nesting. It
				2947	is not possible to handle an arbitrary nesting depth.
				2948	.P
				2949	For some time, Perl has provided a facility that allows regular expressions to
				2950	recurse (amongst other things). It does this by interpolating Perl code in the
				2951	expression at run time, and the code can refer to the expression itself. A Perl
				2952	pattern using code interpolation to solve the parentheses problem can be
				2953	created like this:
				2954	.sp
				2955	$re = qr{\e( (?: (?>[^()]+) \| (?p{$re}) )* \e)}x;
				2956	.sp
				2957	The (?p{...}) item interpolates Perl code at run time, and in this case refers
				2958	recursively to the pattern in which it appears.
				2959	.P
				2960	Obviously, PCRE2 cannot support the interpolation of Perl code. Instead, it
				2961	supports special syntax for recursion of the entire pattern, and also for
				2962	individual capture group recursion. After its introduction in PCRE1 and Python,
				2963	this kind of recursion was subsequently introduced into Perl at release 5.10.
				2964	.P
				2965	A special item that consists of (? followed by a number greater than zero and a
				2966	closing parenthesis is a recursive subroutine call of the capture group of the
				2967	given number, provided that it occurs inside that group. (If not, it is a
				2968	.\" HTML <a href="#groupsassubroutines">
				2969	.\" </a>
				2970	non-recursive subroutine
				2971	.\"
				2972	call, which is described in the next section.) The special item (?R) or (?0) is
				2973	a recursive call of the entire regular expression.
				2974	.P
				2975	This PCRE2 pattern solves the nested parentheses problem (assume the
				2976	PCRE2_EXTENDED option is set so that white space is ignored):
				2977	.sp
				2978	\e( ( [^()]++ \| (?R) )* \e)
				2979	.sp
				2980	First it matches an opening parenthesis. Then it matches any number of
				2981	substrings which can either be a sequence of non-parentheses, or a recursive
				2982	match of the pattern itself (that is, a correctly parenthesized substring).
				2983	Finally there is a closing parenthesis. Note the use of a possessive quantifier
				2984	to avoid backtracking into sequences of non-parentheses.
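The three steps just described correspond to a small recursive-descent routine. The Python sketch below is an illustration of what the pattern accepts, not of how PCRE2 executes it; the function name match_balanced is hypothetical.

```python
def match_balanced(subject, start=0):
    """Return the index just past a balanced group at start, else None."""
    # First, an opening parenthesis.
    if start >= len(subject) or subject[start] != "(":
        return None
    pos = start + 1
    # Then any number of non-parenthesis characters or nested groups.
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            # The counterpart of the recursive call (?R).
            end = match_balanced(subject, pos)
            if end is None:
                return None
            pos = end
        else:
            pos += 1
    # Finally, a closing parenthesis.
    if pos < len(subject):
        return pos + 1
    return None
```

The routine never revisits a run of non-parentheses once it has been consumed, which mirrors the effect of the possessive quantifier in the pattern.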
				2985	.P
				2986	If this were part of a larger pattern, you would not want to recurse the entire
				2987	pattern, so instead you could use this:
				2988	.sp
				2989	( \e( ( [^()]++ \| (?1) )* \e) )
				2990	.sp
				2991	We have put the pattern into parentheses, and caused the recursion to refer to
				2992	them instead of the whole pattern.
				2993	.P
				2994	In a larger pattern, keeping track of parenthesis numbers can be tricky. This
				2995	is made easier by the use of relative references. Instead of (?1) in the
				2996	pattern above you can write (?-2) to refer to the second most recently opened
				2997	parentheses preceding the recursion. In other words, a negative number counts
				2998	capturing parentheses leftwards from the point at which it is encountered.
				2999	.P
				3000	Be aware, however, that if
				3001	.\" HTML <a href="#dupgroupnumber">
				3002	.\" </a>
				3003	duplicate capture group numbers
				3004	.\"
				3005	are in use, relative references refer to the earliest group with the
				3006	appropriate number. Consider, for example:
				3007	.sp
				3008	(?\|(a)\|(b)) (c) (?-2)
				3009	.sp
				3010	The first two capture groups (a) and (b) are both numbered 1, and group (c)
				3011	is number 2. When the reference (?-2) is encountered, the second most recently
				3012	opened parentheses has the number 1, but it is the first such group (the (a)
				3013	group) to which the recursion refers. This would be the same if an absolute
				3014	reference (?1) was used. In other words, relative references are just a
				3015	shorthand for computing a group number.
				3016	.P
				3017	It is also possible to refer to subsequent capture groups, by writing
				3018	references such as (?+2). However, these cannot be recursive because the
				3019	reference is not inside the parentheses that are referenced. They are always
				3020	.\" HTML <a href="#groupsassubroutines">
				3021	.\" </a>
				3022	non-recursive subroutine
				3023	.\"
				3024	calls, as described in the next section.
				3025	.P
				3026	An alternative approach is to use named parentheses. The Perl syntax for this
				3027	is (?&name); PCRE1's earlier syntax (?P>name) is also supported. We could
				3028	rewrite the above example as follows:
				3029	.sp
				3030	(?<pn> \e( ( [^()]++ \| (?&pn) )* \e) )
				3031	.sp
				3032	If there is more than one group with the same name, the earliest one is
				3033	used.
				3034	.P
				3035	The example pattern that we have been looking at contains nested unlimited
				3036	repeats, and so the use of a possessive quantifier for matching strings of
				3037	non-parentheses is important when applying the pattern to strings that do not
				3038	match. For example, when this pattern is applied to
				3039	.sp
				3040	(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
				3041	.sp
				3042	it yields "no match" quickly. However, if a possessive quantifier is not used,
				3043	the match runs for a very long time indeed because there are so many different
				3044	ways the + and * repeats can carve up the subject, and all have to be tested
				3045	before failure can be reported.
				3046	.P
				3047	At the end of a match, the values of capturing parentheses are those from
				3048	the outermost level. If you want to obtain intermediate values, a callout
				3049	function can be used (see below and the
				3050	.\" HREF
				3051	\fBpcre2callout\fP
				3052	.\"
				3053	documentation). If the pattern above is matched against
				3054	.sp
				3055	(ab(cd)ef)
				3056	.sp
				3057	the value for the inner capturing parentheses (numbered 2) is "ef", which is
				3058	the last value taken on at the top level. If a capture group is not matched at
				3059	the top level, its final captured value is unset, even if it was (temporarily)
				3060	set at a deeper level during the matching process.
				3061	.P
				3062	Do not confuse the (?R) item with the condition (R), which tests for recursion.
				3063	Consider this pattern, which matches text in angle brackets, allowing for
				3064	arbitrary nesting. Only digits are allowed in nested brackets (that is, when
				3065	recursing), whereas any characters are permitted at the outer level.
				3066	.sp
				3067	< (?: (?(R) \ed++ \| [^<>]*+) \| (?R)) * >
				3068	.sp
				3069	In this pattern, (?(R) is the start of a conditional group, with two different
				3070	alternatives for the recursive and non-recursive cases. The (?R) item is the
				3071	actual recursive call.
				3072	.
				3073	.
				3074	.\" HTML <a name="recursiondifference"></a>
				3075	.SS "Differences in recursion processing between PCRE2 and Perl"
				3076	.rs
				3077	.sp
				3078	Some former differences between PCRE2 and Perl no longer exist.
				3079	.P
				3080	Before release 10.30, recursion processing in PCRE2 differed from Perl in that
				3081	a recursive subroutine call was always treated as an atomic group. That is,
				3082	once it had matched some of the subject string, it was never re-entered, even
				3083	if it contained untried alternatives and there was a subsequent matching
				3084	failure. (Historical note: PCRE implemented recursion before Perl did.)
				3085	.P
				3086	Starting with release 10.30, recursive subroutine calls are no longer treated
				3087	as atomic. That is, they can be re-entered to try unused alternatives if there
				3088	is a matching failure later in the pattern. This is now compatible with the way
				3089	Perl works. If you want a subroutine call to be atomic, you must explicitly
				3090	enclose it in an atomic group.
				3091	.P
				3092	Supporting backtracking into recursions simplifies certain types of recursive
				3093	pattern. For example, this pattern matches palindromic strings:
				3094	.sp
				3095	^((.)(?1)\e2\|.?)$
				3096	.sp
				3097	The second branch in the group matches a single central character in the
				3098	palindrome when there are an odd number of characters, or nothing when there
				3099	are an even number of characters, but in order to work it has to be able to try
				3100	the second case when the rest of the pattern match fails. If you want to match
				3101	typical palindromic phrases, the pattern has to ignore all non-word characters,
				3102	which can be done like this:
				3103	.sp
				3104	^\eW*+((.)\eW*+(?1)\eW*+\e2\|\eW*+.?)\eW*+$
				3105	.sp
				3106	If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
				3107	man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
				3108	avoid backtracking into sequences of non-word characters. Without this, PCRE2
				3109	takes a great deal longer (ten times or more) to match typical phrases, and
				3110	Perl takes so long that you think it has gone into a loop.
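.P
The role of backtracking in the first palindrome pattern above can be modelled
outside PCRE2. The following Python sketch is only an illustration, not PCRE2
code: the names group_ends and is_palindrome are invented here, and the
"atomic" flag is a simplified stand-in for the pre-10.30 behaviour, in which a
recursive call kept only the first way it matched.
.sp
.nf
```python
from itertools import islice

def group_ends(s, i, atomic):
    """Yield each offset at which group 1 can end when it starts at offset i."""
    if i < len(s):
        # First alternative: one character, a recursive call, the same character.
        inner = group_ends(s, i + 1, atomic)
        if atomic:
            # Pre-10.30 model: a call keeps only the first way it matched.
            inner = islice(inner, 1)
        for j in inner:
            if j < len(s) and s[j] == s[i]:
                yield j + 1
        # Second alternative, greedy optional character: try one character...
        yield i + 1
    # ...then try matching nothing.
    yield i

def is_palindrome(s, atomic=False):
    """Model of the anchored pattern: group 1 must end at the end of s."""
    return any(end == len(s) for end in group_ends(s, 0, atomic))
```
.fi
.P
In this model is_palindrome("abba") is true, but is_palindrome("abba",
atomic=True) is false, because the innermost calls are never re-entered to try
the empty match.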
				3111	.P
				3112	Another way in which PCRE2 and Perl used to differ in their recursion
				3113	processing is in the handling of captured values. Formerly in Perl, when a
				3114	group was called recursively or as a subroutine (see the next section), it
				3115	had no access to any values that were captured outside the recursion, whereas
				3116	in PCRE2 these values can be referenced. Consider this pattern:
				3117	.sp
				3118	^(.)(\e1|a(?2))
				3119	.sp
				3120	This pattern matches "bab". The first capturing parentheses match "b", then in
				3121	the second group, when the backreference \e1 fails to match "b", the second
				3122	alternative matches "a" and then recurses. In the recursion, \e1 does now match
				3123	"b" and so the whole match succeeds. This match used to fail in Perl, but in
				3124	later versions (I tried 5.024) it now works.
				3125	.
				3126	.
				3127	.\" HTML <a name="groupsassubroutines"></a>
				3128	.SH "GROUPS AS SUBROUTINES"
				3129	.rs
				3130	.sp
				3131	If the syntax for a recursive group call (either by number or by name) is used
				3132	outside the parentheses to which it refers, it operates a bit like a subroutine
				3133	in a programming language. More accurately, PCRE2 treats the referenced group
				3134	as an independent subpattern which it tries to match at the current matching
				3135	position. The called group may be defined before or after the reference. A
				3136	numbered reference can be absolute or relative, as in these examples:
				3137	.sp
				3138	(...(absolute)...)...(?2)...
				3139	(...(relative)...)...(?-1)...
				3140	(...(?+1)...(relative)...
				3141	.sp
				3142	An earlier example pointed out that the pattern
				3143	.sp
				3144	(sens|respons)e and \e1ibility
				3145	.sp
				3146	matches "sense and sensibility" and "response and responsibility", but not
				3147	"sense and responsibility". If instead the pattern
				3148	.sp
				3149	(sens|respons)e and (?1)ibility
				3150	.sp
				3151	is used, it does match "sense and responsibility" as well as the other two
				3152	strings. Another example is given in the discussion of DEFINE above.
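.P
The difference between the two patterns can be tried out in Python. This is
only a sketch: Python's standard re module has no (?1) syntax, so the
subroutine call is approximated by repeating the text of the group, and the
names STEM, backref, subcall, and accepts are invented for this illustration.
.sp
.nf
```python
import re

STEM = "sens|respons"  # body of capture group 1 in the patterns above

# Backreference: the text captured by the group must appear again.
backref = re.compile("(?P<stem>" + STEM + ")e and (?P=stem)ibility")

# Stand-in for the subroutine call (?1): the group's pattern is matched afresh.
subcall = re.compile("(?P<stem>" + STEM + ")e and (?:" + STEM + ")ibility")

def accepts(regex, text):
    """Return True if regex matches the whole of text."""
    return regex.fullmatch(text) is not None
```
.fi
.P
Here accepts(backref, "sense and responsibility") is false, whereas
accepts(subcall, "sense and responsibility") is true. Repeating the text is
only an approximation of a real call, which keeps the processing options that
were in force where the group was defined.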
				3153	.P
				3154	Like recursions, subroutine calls used to be treated as atomic, but this
				3155	changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
				3156	occur. However, any capturing parentheses that are set during the subroutine
				3157	call revert to their previous values afterwards.
				3158	.P
				3159	Processing options such as case-independence are fixed when a group is
				3160	defined, so if it is used as a subroutine, such options cannot be changed for
				3161	different calls. For example, consider this pattern:
				3162	.sp
				3163	(abc)(?i:(?-1))
				3164	.sp
				3165	It matches "abcabc". It does not match "abcABC" because the change of
				3166	processing option does not affect the called group.
				3167	.P
				3168	The behaviour of
				3169	.\" HTML <a href="#backtrackcontrol">
				3170	.\" </a>
				3171	backtracking control verbs
				3172	.\"
				3173	in groups when called as subroutines is described in the section entitled
				3174	.\" HTML <a href="#btsub">
				3175	.\" </a>
				3176	"Backtracking verbs in subroutines"
				3177	.\"
				3178	below.
				3179	.
				3180	.
				3181	.\" HTML <a name="onigurumasubroutines"></a>
				3182	.SH "ONIGURUMA SUBROUTINE SYNTAX"
				3183	.rs
				3184	.sp
				3185	For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
				3186	a number enclosed either in angle brackets or single quotes, is an alternative
				3187	syntax for calling a group as a subroutine, possibly recursively. Here are two
				3188	of the examples used above, rewritten using this syntax:
				3189	.sp
				3190	(?<pn> \e( ( (?>[^()]+) | \eg<pn> )* \e) )
				3191	(sens|respons)e and \eg'1'ibility
				3192	.sp
				3193	PCRE2 supports an extension to Oniguruma: if a number is preceded by a
				3194	plus or a minus sign it is taken as a relative reference. For example:
				3195	.sp
				3196	(abc)(?i:\eg<-1>)
				3197	.sp
				3198	Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
				3199	synonymous. The former is a backreference; the latter is a subroutine call.
				3200	.
				3201	.
				3202	.SH CALLOUTS
				3203	.rs
				3204	.sp
				3205	Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
				3206	code to be obeyed in the middle of matching a regular expression. This makes it
				3207	possible, amongst other things, to extract different substrings that match the
				3208	same pair of parentheses when there is a repetition.
				3209	.P
				3210	PCRE2 provides a similar feature, but of course it cannot obey arbitrary Perl
				3211	code. The feature is called "callout". The caller of PCRE2 provides an external
				3212	function by putting its entry point in a match context using the function
				3213	\fBpcre2_set_callout()\fP, and then passing that context to \fBpcre2_match()\fP
				3214	or \fBpcre2_dfa_match()\fP. If no match context is passed, or if the callout
				3215	entry point is set to NULL, callouts are disabled.
				3216	.P
				3217	Within a regular expression, (?C<arg>) indicates a point at which the external
				3218	function is to be called. There are two kinds of callout: those with a
				3219	numerical argument and those with a string argument. (?C) on its own with no
				3220	argument is treated as (?C0). A numerical argument allows the application to
				3221	distinguish between different callouts. String arguments were added for release
				3222	10.20 to make it possible for script languages that use PCRE2 to embed short
				3223	scripts within patterns in a similar way to Perl.
				3224	.P
				3225	During matching, when PCRE2 reaches a callout point, the external function is
				3226	called. It is provided with the number or string argument of the callout, the
				3227	position in the pattern, and one item of data that is also set in the match
				3228	block. The callout function may cause matching to proceed, to backtrack, or to
				3229	fail.
				3230	.P
				3231	By default, PCRE2 implements a number of optimizations at matching time, and
				3232	one side-effect is that sometimes callouts are skipped. If you need all
				3233	possible callouts to happen, you need to set options that disable the relevant
				3234	optimizations. More details, including a complete description of the
				3235	programming interface to the callout function, are given in the
				3236	.\" HREF
				3237	\fBpcre2callout\fP
				3238	.\"
				3239	documentation.
				3240	.
				3241	.
				3242	.SS "Callouts with numerical arguments"
				3243	.rs
				3244	.sp
				3245	If you just want to have a means of identifying different callout points, put a
				3246	number less than 256 after the letter C. For example, this pattern has two
				3247	callout points:
				3248	.sp
				3249	(?C1)abc(?C2)def
				3250	.sp
				3251	If the PCRE2_AUTO_CALLOUT flag is passed to \fBpcre2_compile()\fP, numerical
				3252	callouts are automatically installed before each item in the pattern. They are
				3253	all numbered 255. If there is a conditional group in the pattern whose
				3254	condition is an assertion, an additional callout is inserted just before the
				3255	condition. An explicit callout may also be set at this position, as in this
				3256	example:
				3257	.sp
				3258	(?(?C9)(?=a)abc|def)
				3259	.sp
				3260	Note that this applies only to assertion conditions, not to other types of
				3261	condition.
				3262	.
				3263	.
				3264	.SS "Callouts with string arguments"
				3265	.rs
				3266	.sp
				3267	A delimited string may be used instead of a number as a callout argument. The
				3268	starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is
				3269	the same as the start, except for {, where the ending delimiter is }. If the
				3270	ending delimiter is needed within the string, it must be doubled. For
				3271	example:
				3272	.sp
				3273	(?C'ab ''c'' d')xyz(?C{any text})pqr
				3274	.sp
				3275	The doubling is removed before the string is passed to the callout function.
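.P
The delimiter rule can be modelled in a few lines of Python. This is a sketch
of the rule as described above, not PCRE2's own parser; the names ENDING and
parse_callout_string are invented for the illustration.
.sp
.nf
```python
# Each permitted starting delimiter mapped to its ending delimiter.
ENDING = {
    "`": "`", "'": "'", '"': '"', "^": "^",
    "%": "%", "#": "#", "$": "$", "{": "}",
}

def parse_callout_string(text):
    """Return the string argument encoded by a delimited callout string."""
    if not text or text[0] not in ENDING:
        raise ValueError("unknown starting delimiter")
    end = ENDING[text[0]]
    out = []
    i = 1
    while i < len(text):
        if text[i] == end:
            if i + 1 < len(text) and text[i + 1] == end:
                out.append(end)  # a doubled ending delimiter is one literal
                i += 2
                continue
            if i != len(text) - 1:
                raise ValueError("text after ending delimiter")
            return "".join(out)
        out.append(text[i])
        i += 1
    raise ValueError("missing ending delimiter")
```
.fi
.P
Applied to the two arguments in the example above, this model yields the
strings "ab 'c' d" and "any text".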
				3276	.
				3277	.
				3278	.\" HTML <a name="backtrackcontrol"></a>
				3279	.SH "BACKTRACKING CONTROL"
				3280	.rs
				3281	.sp
				3282	There are a number of special "Backtracking Control Verbs" (to use Perl's
				3283	terminology) that modify the behaviour of backtracking during matching. They
				3284	are generally of the form (VERB) or (VERB:NAME). Some verbs take either form,
				3285	and may behave differently depending on whether or not a name argument is
				3286	present. The names are not required to be unique within the pattern.
				3287	.P
				3288	By default, for compatibility with Perl, a name is any sequence of characters
				3289	that does not include a closing parenthesis. The name is not processed in
				3290	any way, and it is not possible to include a closing parenthesis in the name.
				3291	This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
				3292	is no longer Perl-compatible.
				3293	.P
				3294	When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
				3295	and only an unescaped closing parenthesis terminates the name. However, the
				3296	only backslash items that are permitted are \eQ, \eE, and sequences such as
				3297	\ex{100} that define character code points. Character type escapes such as \ed
				3298	are faulted.
				3299	.P
				3300	A closing parenthesis can be included in a name either as \e) or between \eQ
				3301	and \eE. In addition to backslash processing, if the PCRE2_EXTENDED or
				3302	PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
				3303	skipped, and #-comments are recognized, exactly as in the rest of the pattern.
				3304	PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
				3305	PCRE2_ALT_VERBNAMES is also set.
				3306	.P
				3307	The maximum length of a name is 255 in the 8-bit library and 65535 in the
				3308	16-bit and 32-bit libraries. If the name is empty, that is, if the closing
				3309	parenthesis immediately follows the colon, the effect is as if the colon were
				3310	not there. Any number of these verbs may occur in a pattern. Except for
				3311	(*ACCEPT), they may not be quantified.
				3312	.P
				3313	Since these verbs are specifically related to backtracking, most of them can be
				3314	used only when the pattern is to be matched using the traditional matching
				3315	function, because that uses a backtracking algorithm. With the exception of
				3316	(*FAIL), which behaves like a failing negative assertion, the backtracking
				3317	control verbs cause an error if encountered by the DFA matching function.
				3318	.P
				3319	The behaviour of these verbs in
				3320	.\" HTML <a href="#btrepeat">
				3321	.\" </a>
				3322	repeated groups,
				3323	.\"
				3324	.\" HTML <a href="#btassert">
				3325	.\" </a>
				3326	assertions,
				3327	.\"
				3328	and in
				3329	.\" HTML <a href="#btsub">
				3330	.\" </a>
				3331	capture groups called as subroutines
				3332	.\"
				3333	(whether or not recursively) is documented below.
				3334	.
				3335	.
				3336	.\" HTML <a name="nooptimize"></a>
				3337	.SS "Optimizations that affect backtracking verbs"
				3338	.rs
				3339	.sp
				3340	PCRE2 contains some optimizations that are used to speed up matching by running
				3341	some checks at the start of each match attempt. For example, it may know the
				3342	minimum length of matching subject, or that a particular character must be
				3343	present. When one of these optimizations bypasses the running of a match, any
				3344	included backtracking verbs will not, of course, be processed. You can suppress
				3345	the start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option
				3346	when calling \fBpcre2_compile()\fP, or by starting the pattern with
				3347	(*NO_START_OPT). There is more discussion of this option in the section
				3348	entitled
				3349	.\" HTML <a href="pcre2api.html#compiling">
				3350	.\" </a>
				3351	"Compiling a pattern"
				3352	.\"
				3353	in the
				3354	.\" HREF
				3355	\fBpcre2api\fP
				3356	.\"
				3357	documentation.
				3358	.P
				3359	Experiments with Perl suggest that it too has similar optimizations, and like
				3360	PCRE2, turning them off can change the result of a match.
				3361	.
				3362	.
				3363	.\" HTML <a name="acceptverb"></a>
				3364	.SS "Verbs that act immediately"
				3365	.rs
				3366	.sp
				3367	The following verbs act as soon as they are encountered.
				3368	.sp
				3369	(*ACCEPT) or (*ACCEPT:NAME)
				3370	.sp
				3371	This verb causes the match to end successfully, skipping the remainder of the
				3372	pattern. However, when it is inside a capture group that is called as a
				3373	subroutine, only that group is ended successfully. Matching then continues
				3374	at the outer level. If (*ACCEPT) is triggered in a positive assertion, the
				3375	assertion succeeds; in a negative assertion, the assertion fails.
				3376	.P
				3377	If (*ACCEPT) is inside capturing parentheses, the data so far is captured. For
				3378	example:
				3379	.sp
				3380	A((?:A|B(*ACCEPT)|C)D)
				3381	.sp
				3382	This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
				3383	the outer parentheses.
				3384	.P
				3385	(*ACCEPT) is the only backtracking verb that is allowed to be quantified
				3386	because an ungreedy quantification with a minimum of zero acts only when a
				3387	backtrack happens. Consider, for example,
				3388	.sp
				3389	(A(*ACCEPT)??B)C
				3390	.sp
				3391	where A, B, and C may be complex expressions. After matching "A", the matcher
				3392	processes "BC"; if that fails, causing a backtrack, (*ACCEPT) is triggered and
				3393	the match succeeds. In both cases, all but C is captured. Whereas (*COMMIT)
				3394	(see below) means "fail on backtrack", a repeated (*ACCEPT) of this type means
				3395	"succeed on backtrack".
				3396	.P
				3397	\fBWarning:\fP (*ACCEPT) should not be used within a script run group, because
				3398	it causes an immediate exit from the group, bypassing the script run checking.
				3399	.sp
				3400	(*FAIL) or (*FAIL:NAME)
				3401	.sp
				3402	This verb causes a matching failure, forcing backtracking to occur. It may be
				3403	abbreviated to (*F). It is equivalent to (?!) but easier to read. The Perl
				3404	documentation notes that it is probably useful only when combined with (?{}) or
				3405	(??{}). Those are, of course, Perl features that are not present in PCRE2. The
				3406	nearest equivalent is the callout feature, as for example in this pattern:
				3407	.sp
				3408	a+(?C)(*FAIL)
				3409	.sp
				3410	A match with the string "aaaa" always fails, but the callout is taken before
				3411	each backtrack happens (in this example, 10 times).
				3412	.P
				3413	(*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*ACCEPT) and
				3414	(*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is recorded just before
				3415	the verb acts.
				3416	.
				3417	.
				3418	.SS "Recording which path was taken"
				3419	.rs
				3420	.sp
				3421	There is one verb whose main purpose is to track how a match was arrived at,
				3422	though it also has a secondary use in conjunction with advancing the match
				3423	starting point (see (*SKIP) below).
				3424	.sp
				3425	(*MARK:NAME) or (*:NAME)
				3426	.sp
				3427	A name is always required with this verb. For all the other backtracking
				3428	control verbs, a NAME argument is optional.
				3429	.P
				3430	When a match succeeds, the last-encountered mark name on the
				3431	matching path is passed back to the caller as described in the section entitled
				3432	.\" HTML <a href="pcre2api.html#matchotherdata">
				3433	.\" </a>
				3434	"Other information about the match"
				3435	.\"
				3436	in the
				3437	.\" HREF
				3438	\fBpcre2api\fP
				3439	.\"
				3440	documentation. This applies to all instances of (*MARK) and other verbs,
				3441	including those inside assertions and atomic groups. However, there are
				3442	differences in those cases when (*MARK) is used in conjunction with (*SKIP) as
				3443	described below.
				3444	.P
				3445	The mark name that was last encountered on the matching path is passed back. A
				3446	verb without a NAME argument is ignored for this purpose. Here is an example of
				3447	\fBpcre2test\fP output, where the "mark" modifier requests the retrieval and
				3448	outputting of (*MARK) data:
				3449	.sp
				3450	re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
				3451	data> XY
				3452	0: XY
				3453	MK: A
				3454	XZ
				3455	0: XZ
				3456	MK: B
				3457	.sp
				3458	The (*MARK) name is tagged with "MK:" in this output, and in this example it
				3459	indicates which of the two alternatives matched. This is a more efficient way
				3460	of obtaining this information than putting each alternative in its own
				3461	capturing parentheses.
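.P
For comparison, the capturing-parentheses approach mentioned above can be
sketched in Python. Python's standard re module has no (*MARK), so this shows
only the alternative technique; the names BRANCHES and which_branch are
invented for the illustration.
.sp
.nf
```python
import re

# Each alternative sits in its own named capture group.
BRANCHES = re.compile("(?P<A>XY)|(?P<B>XZ)")

def which_branch(subject):
    """Return the name of the alternative that matched, or None."""
    m = BRANCHES.search(subject)
    return m.lastgroup if m else None
```
.fi
.P
Here which_branch("XY") gives "A" and which_branch("XZ") gives "B", at the
cost of one capture group per alternative.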
				3462	.P
				3463	If a verb with a name is encountered in a positive assertion that is true, the
				3464	name is recorded and passed back if it is the last-encountered. This does not
				3465	happen for negative assertions or failing positive assertions.
				3466	.P
				3467	After a partial match or a failed match, the last encountered name in the
				3468	entire match process is returned. For example:
				3469	.sp
				3470	re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
				3471	data> XP
				3472	No match, mark = B
				3473	.sp
				3474	Note that in this unanchored example the mark is retained from the match
				3475	attempt that started at the letter "X" in the subject. Subsequent match
				3476	attempts starting at "P" and then with an empty string do not get as far as the
				3477	(*MARK) item, but nevertheless do not reset it.
				3478	.P
				3479	If you are interested in (*MARK) values after failed matches, you should
				3480	probably set the PCRE2_NO_START_OPTIMIZE option
				3481	.\" HTML <a href="#nooptimize">
				3482	.\" </a>
				3483	(see above)
				3484	.\"
				3485	to ensure that the match is always attempted.
				3486	.
				3487	.
				3488	.SS "Verbs that act after backtracking"
				3489	.rs
				3490	.sp
				3491	The following verbs do nothing when they are encountered. Matching continues
				3492	with what follows, but if there is a subsequent match failure, causing a
				3493	backtrack to the verb, a failure is forced. That is, backtracking cannot pass
				3494	to the left of the verb. However, when one of these verbs appears inside an
				3495	atomic group or in a lookaround assertion that is true, its effect is confined
				3496	to that group, because once the group has been matched, there is never any
				3497	backtracking into it. Backtracking from beyond an assertion or an atomic group
				3498	ignores the entire group, and seeks a preceding backtracking point.
				3499	.P
				3500	These verbs differ in exactly what kind of failure occurs when backtracking
				3501	reaches them. The behaviour described below is what happens when the verb is
				3502	not in a subroutine or an assertion. Subsequent sections cover these special
				3503	cases.
				3504	.sp
				3505	(*COMMIT) or (*COMMIT:NAME)
				3506	.sp
				3507	This verb causes the whole match to fail outright if there is a later matching
				3508	failure that causes backtracking to reach it. Even if the pattern is
				3509	unanchored, no further attempts to find a match by advancing the starting point
				3510	take place. If (*COMMIT) is the only backtracking verb that is encountered,
				3511	once it has been passed \fBpcre2_match()\fP is committed to finding a match at
				3512	the current starting point, or not at all. For example:
				3513	.sp
				3514	a+(*COMMIT)b
				3515	.sp
				3516	This matches "xxaab" but not "aacaab". It can be thought of as a kind of
				3517	dynamic anchor, or "I've started, so I must finish."
				3518	.P
				3519	The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COMMIT). It is
				3520	like (*MARK:NAME) in that the name is remembered for passing back to the
				3521	caller. However, (*SKIP:NAME) searches only for names that are set with
				3522	(*MARK), ignoring those set by any of the other backtracking verbs.
				3523	.P
				3524	If there is more than one backtracking verb in a pattern, a different one that
				3525	follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a
				3526	match does not always guarantee that a match must be at this starting point.
				3527	.P
				3528	Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
				3529	unless PCRE2's start-of-match optimizations are turned off, as shown in this
				3530	output from \fBpcre2test\fP:
				3531	.sp
				3532	re> /(*COMMIT)abc/
				3533	data> xyzabc
				3534	0: abc
				3535	data>
				3536	re> /(*COMMIT)abc/no_start_optimize
				3537	data> xyzabc
				3538	No match
				3539	.sp
				3540	For the first pattern, PCRE2 knows that any match must start with "a", so the
				3541	optimization skips along the subject to "a" before applying the pattern to the
				3542	first set of data. The match attempt then succeeds. The second pattern disables
				3543	the optimization that skips along to the first character. The pattern is now
				3544	applied starting at "x", and so the (*COMMIT) causes the match to fail without
				3545	trying any other starting points.
				3546	.sp
				3547	(*PRUNE) or (*PRUNE:NAME)
				3548	.sp
				3549	This verb causes the match to fail at the current starting position in the
				3550	subject if there is a later matching failure that causes backtracking to reach
				3551	it. If the pattern is unanchored, the normal "bumpalong" advance to the next
				3552	starting character then happens. Backtracking can occur as usual to the left of
				3553	(*PRUNE), before it is reached, or when matching to the right of (*PRUNE), but
				3554	if there is no match to the right, backtracking cannot cross (*PRUNE). In
				3555	simple cases, the use of (*PRUNE) is just an alternative to an atomic group or
				3556	possessive quantifier, but there are some uses of (*PRUNE) that cannot be
				3557	expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
				3558	as (*COMMIT).
				3559	.P
				3560	The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
				3561	like (*MARK:NAME) in that the name is remembered for passing back to the
				3562	caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
				3563	ignoring those set by other backtracking verbs.
				3564	.sp
				3565	(*SKIP)
				3566	.sp
				3567	This verb, when given without a name, is like (*PRUNE), except that if the
				3568	pattern is unanchored, the "bumpalong" advance is not to the next character,
				3569	but to the position in the subject where (*SKIP) was encountered. (*SKIP)
				3570	signifies that whatever text was matched leading up to it cannot be part of a
				3571	successful match if there is a later mismatch. Consider:
				3572	.sp
				3573	a+(*SKIP)b
				3574	.sp
				3575	If the subject is "aaaac...", after the first match attempt fails (starting at
				3576	the first character in the string), the starting point skips on to start the
				3577	next attempt at "c". Note that a possessive quantifier does not have the same
				3578	effect as this example; although it would suppress backtracking during the
				3579	first match attempt, the second attempt would start at the second character
				3580	instead of skipping on to "c".
				3581	.P
				3582	If (*SKIP) is used to specify a new starting position that is the same as the
				3583	starting position of the current match, or (by being inside a lookbehind)
				3584	earlier, the position specified by (*SKIP) is ignored, and instead the normal
				3585	"bumpalong" occurs.
				3586	.sp
				3587	(*SKIP:NAME)
				3588	.sp
				3589	When (*SKIP) has an associated name, its behaviour is modified. When such a
				3590	(*SKIP) is triggered, the previous path through the pattern is searched for the
				3591	most recent (*MARK) that has the same name. If one is found, the "bumpalong"
				3592	advance is to the subject position that corresponds to that (*MARK) instead of
				3593	to where (*SKIP) was encountered. If no (*MARK) with a matching name is found,
				3594	the (*SKIP) is ignored.
				3595	.P
				3596	The search for a (*MARK) name uses the normal backtracking mechanism, which
				3597	means that it does not see (*MARK) settings that are inside atomic groups or
				3598	assertions, because they are never re-entered by backtracking. Compare the
				3599	following \fBpcre2test\fP examples:
				3600	.sp
				3601	re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
				3602	data: abc
				3603	0: a
				3604	1: a
				3605	data:
				3606	re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
				3607	data: abc
				3608	0: b
				3609	1: b
				3610	.sp
				3611	In the first example, the (*MARK) setting is in an atomic group, so it is not
				3612	seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored. This allows
				3613	the second branch of the pattern to be tried at the first character position.
				3614	In the second example, the (*MARK) setting is not in an atomic group. This
				3615	allows (*SKIP:X) to find the (*MARK) when it backtracks, and this causes a new
				3616	matching attempt to start at the second character. This time, the (*MARK) is
				3617	never seen because "a" does not match "b", so the matcher immediately jumps to
				3618	the second branch of the pattern.
				3619	.P
				3620	Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It ignores
				3621	names that are set by other backtracking verbs.
				3622	.sp
				3623	(*THEN) or (*THEN:NAME)
				3624	.sp
				3625	This verb causes a skip to the next innermost alternative when backtracking
				3626	reaches it. That is, it cancels any further backtracking within the current
				3627	alternative. Its name comes from the observation that it can be used for a
				3628	pattern-based if-then-else block:
				3629	.sp
				3630	( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
				3631	.sp
				3632	If the COND1 pattern matches, FOO is tried (and possibly further items after
				3633	the end of the group if FOO succeeds); on failure, the matcher skips to the
				3634	second alternative and tries COND2, without backtracking into COND1. If that
				3635	succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no
				3636	more alternatives, so there is a backtrack to whatever came before the entire
				3637	group. If (*THEN) is not inside an alternation, it acts like (*PRUNE).
				3638	.P
				3639	The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). It is
				3640	like (*MARK:NAME) in that the name is remembered for passing back to the
				3641	caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
				3642	ignoring those set by other backtracking verbs.
				3643	.P
				3644	A group that does not contain a | character is just a part of the enclosing
				3645	alternative; it is not a nested alternation with only one alternative. The
				3646	effect of (*THEN) extends beyond such a group to the enclosing alternative.
				3647	Consider this pattern, where A, B, etc. are complex pattern fragments that do
				3648	not contain any | characters at this level:
				3649	.sp
				3650	A (B(*THEN)C) | D
				3651	.sp
				3652	If A and B are matched, but there is a failure in C, matching does not
				3653	backtrack into A; instead it moves to the next alternative, that is, D.
				3654	However, if the group containing (*THEN) is given an alternative, it
				3655	behaves differently:
				3656	.sp
				3657	A (B(*THEN)C | (*FAIL)) | D
				3658	.sp
				3659	The effect of (*THEN) is now confined to the inner group. After a failure in C,
				3660	matching moves to (*FAIL), which causes the whole group to fail because there
				3661	are no more alternatives to try. In this case, matching does backtrack into A.
				3662	.P
				3663	Note that a conditional group is not considered as having two alternatives,
				3664	because only one is ever used. In other words, the | character in a conditional
				3665	group has a different meaning. Ignoring white space, consider:
				3666	.sp
				3667	^.*? (?(?=a) a | b(*THEN)c )
				3668	.sp
				3669	If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
				3670	it initially matches zero characters. The condition (?=a) then fails, the
				3671	character "b" is matched, but "c" is not. At this point, matching does not
				3672	backtrack to .*? as might perhaps be expected from the presence of the |
				3673	character. The conditional group is part of the single alternative that
				3674	comprises the whole pattern, and so the match fails. (If there was a backtrack
				3675	into .*?, allowing it to match "b", the match would succeed.)
				3676	.P
				3677	The verbs just described provide four different "strengths" of control when
				3678	subsequent matching fails. (*THEN) is the weakest, carrying on the match at the
				3679	next alternative. (*PRUNE) comes next, failing the match at the current
				3680	starting position, but allowing an advance to the next character (for an
				3681	unanchored pattern). (*SKIP) is similar, except that the advance may be more
				3682	than one character. (*COMMIT) is the strongest, causing the entire match to
				3683	fail.
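.P
The effect of the three stronger verbs on the choice of starting point can be
illustrated with a hand-written Python model of searching for a+(*VERB)b. This
is a sketch only: it hard-codes that one pattern shape, it is not how PCRE2 is
implemented, and the name search_a_plus_verb_b is invented for the
illustration. An attempt at the very end of the subject is omitted because a+
cannot match there.
.sp
.nf
```python
def search_a_plus_verb_b(subject, verb=None):
    """Model a search for a+(*VERB)b; verb is "COMMIT", "PRUNE", "SKIP" or None.

    Returns (match_span_or_None, list_of_starting_offsets_tried).
    """
    starts = []
    start = 0
    while start < len(subject):
        starts.append(start)
        pos = start
        while pos < len(subject) and subject[pos] == "a":
            pos += 1  # greedy a+
        if pos == start:
            # a+ failed before the verb was reached: ordinary bumpalong.
            start += 1
            continue
        if pos < len(subject) and subject[pos] == "b":
            return (start, pos + 1), starts
        # "b" failed, so backtracking reaches the verb, if there is one.
        if verb == "COMMIT":
            return None, starts  # no further starting points are tried
        if verb == "SKIP":
            start = pos          # resume where (*SKIP) was passed
        else:
            # (*PRUNE): advance one character. With no verb, giving back
            # "a" characters cannot help "b" to match, so the result is the same.
            start += 1
    return None, starts
```
.fi
.P
For the subject "aaaac", this model tries starting offsets 0 to 4 with
(*PRUNE), offsets 0 and 4 with (*SKIP), and only offset 0 with (*COMMIT).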
				3684	.
				3685	.
				3686	.SS "More than one backtracking verb"
				3687	.rs
				3688	.sp
				3689	If more than one backtracking verb is present in a pattern, the one that is
				3690	backtracked onto first acts. For example, consider this pattern, where A, B,
				3691	etc. are complex pattern fragments:
				3692	.sp
				3693	(A(*COMMIT)B(*THEN)C|ABD)
				3694	.sp
				3695	If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to
				3696	fail. However, if A and B match, but C fails, the backtrack to (*THEN) causes
				3697	the next alternative (ABD) to be tried. This behaviour is consistent, but is
				3698	not always the same as Perl's. It means that if two or more backtracking verbs
				3699	appear in succession, all but the last of them have no effect. Consider this
				3700	example:
				3701	.sp
				3702	...(*COMMIT)(*PRUNE)...
				3703	.sp
				3704	If there is a matching failure to the right, backtracking onto (*PRUNE) causes
				3705	it to be triggered, and its action is taken. There can never be a backtrack
				3706	onto (*COMMIT).
				3707	.
				3708	.
				3709	.\" HTML <a name="btrepeat"></a>
				3710	.SS "Backtracking verbs in repeated groups"
				3711	.rs
				3712	.sp
				3713	PCRE2 sometimes differs from Perl in its handling of backtracking verbs in
				3714	repeated groups. For example, consider:
				3715	.sp
				3716	/(a(*COMMIT)b)+ac/
				3717	.sp
				3718	If the subject is "abac", Perl matches unless its optimizations are disabled,
				3719	but PCRE2 always fails because the (*COMMIT) in the second repeat of the group
				3720	acts.
				3721	.
				3722	.
				3723	.\" HTML <a name="btassert"></a>
				3724	.SS "Backtracking verbs in assertions"
				3725	.rs
				3726	.sp
				3727	(*FAIL) in any assertion has its normal effect: it forces an immediate
				3728	backtrack. The behaviour of the other backtracking verbs depends on whether
				3729	the assertion is standalone or is acting as the condition in a conditional
				3730	group.
				3731	.P
				3732	(*ACCEPT) in a standalone positive assertion causes the assertion to succeed
				3733	without any further processing; captured strings and a mark name (if set) are
				3734	retained. In a standalone negative assertion, (*ACCEPT) causes the assertion to
				3735	fail without any further processing; captured substrings and any mark name are
				3736	discarded.
				3737	.P
				3738	If the assertion is a condition, (*ACCEPT) causes the condition to be true for
				3739	a positive assertion and false for a negative one; captured substrings are
				3740	retained in both cases.
				3741	.P
				3742	The remaining verbs act only when a later failure causes a backtrack to
				3743	reach them. This means that, for the Perl-compatible assertions, their effect
				3744	is confined to the assertion, because Perl lookaround assertions are atomic. A
				3745	backtrack that occurs after such an assertion is complete does not jump back
				3746	into the assertion. Note in particular that a (*MARK) name that is set in an
				3747	assertion is not "seen" by an instance of (*SKIP:NAME) later in the pattern.
				3748	.P
				3749	PCRE2 now supports non-atomic positive assertions, as described in the section
				3750	entitled
				3751	.\" HTML <a href="#nonatomicassertions">
				3752	.\" </a>
				3753	"Non-atomic assertions"
				3754	.\"
				3755	above. These assertions must be standalone (not used as conditions). They are
				3756	not Perl-compatible. For these assertions, a later backtrack does jump back
				3757	into the assertion, and therefore verbs such as (*COMMIT) can be triggered by
				3758	backtracks from later in the pattern.
				3759	.P
				3760	The effect of (*THEN) is not allowed to escape beyond an assertion. If there
				3761	are no more branches to try, (*THEN) causes a positive assertion to be false,
				3762	and a negative assertion to be true.
				3763	.P
				3764	The other backtracking verbs are not treated specially if they appear in a
				3765	standalone positive assertion. In a conditional positive assertion,
				3766	backtracking (from within the assertion) into (*COMMIT), (*SKIP), or (*PRUNE)
				3767	causes the condition to be false. However, for both standalone and conditional
				3768	negative assertions, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes
				3769	the assertion to be true, without considering any further alternative branches.
				3770	.
				3771	.
				3772	.\" HTML <a name="btsub"></a>
				3773	.SS "Backtracking verbs in subroutines"
				3774	.rs
				3775	.sp
				3776	These behaviours occur whether or not the group is called recursively.
				3777	.P
				3778	(*ACCEPT) in a group called as a subroutine causes the subroutine match to
				3779	succeed without any further processing. Matching then continues after the
				3780	subroutine call. Perl documents this behaviour. Perl's treatment of the other
				3781	verbs in subroutines is different in some cases.
				3782	.P
				3783	(*FAIL) in a group called as a subroutine has its normal effect: it forces
				3784	an immediate backtrack.
				3785	.P
				3786	(*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail when
				3787	triggered by being backtracked to in a group called as a subroutine. There is
				3788	then a backtrack at the outer level.
				3789	.P
				3790	(*THEN), when triggered, skips to the next alternative in the innermost
				3791	enclosing group that has alternatives (its normal behaviour). However, if there
				3792	is no such group within the subroutine's group, the subroutine match fails and
				3793	there is a backtrack at the outer level.
				3794	.
				3795	.
				3796	.SH "SEE ALSO"
				3797	.rs
				3798	.sp
				3799	\fBpcre2api\fP(3), \fBpcre2callout\fP(3), \fBpcre2matching\fP(3),
				3800	\fBpcre2syntax\fP(3), \fBpcre2\fP(3).
				3801	.
				3802	.
				3803	.SH AUTHOR
				3804	.rs
				3805	.sp
				3806	.nf
				3807	Philip Hazel
				3808	Retired from University Computing Service
				3809	Cambridge, England.
				3810	.fi
				3811	.
				3812	.
				3813	.SH REVISION
				3814	.rs
				3815	.sp
				3816	.nf
Elliott Hughes	4e19c8e	2022-04-15 15:11:02 -0700	[diff] [blame]	3817	Last updated: 12 January 2022
				3818	Copyright (c) 1997-2022 University of Cambridge.
Elliott Hughes	5b80804	2021-10-01 10:56:10 -0700	[diff] [blame]	3819	.fi