.TH PCRE2PATTERN 3 "12 January 2022" "PCRE2 10.40"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
.rs
.sp
The syntax and semantics of the regular expressions that are supported by PCRE2
are described in detail below. There is a quick-reference syntax summary in the
.\" HREF
\fBpcre2syntax\fP
.\"
page. PCRE2 tries to match Perl syntax and semantics as closely as it can.
PCRE2 also supports some alternative regular expression syntax (which does not
conflict with the Perl syntax) in order to provide some compatibility with
regular expressions in Python, .NET, and Oniguruma.
.P
Perl's regular expressions are described in its own documentation, and regular
expressions in general are covered in a number of books, some of which have
copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published
by O'Reilly, covers regular expressions in great detail. This description of
PCRE2's regular expressions is intended as reference material.
.P
This document discusses the regular expression patterns that are supported by
PCRE2 when its main matching function, \fBpcre2_match()\fP, is used. PCRE2 also
has an alternative matching function, \fBpcre2_dfa_match()\fP, which matches
using a different algorithm that is not Perl-compatible. Some of the features
discussed below are not available when DFA matching is used. The advantages and
disadvantages of the alternative function, and how it differs from the normal
function, are discussed in the
.\" HREF
\fBpcre2matching\fP
.\"
page.
.
.
.SH "SPECIAL START-OF-PATTERN ITEMS"
.rs
.sp
A number of options that can be passed to \fBpcre2_compile()\fP can also be set
by special items at the start of a pattern. These are not Perl-compatible, but
are provided to make these options accessible to pattern writers who are not
able to change the program that processes the pattern. Any number of these
items may appear, but they must all be together right at the start of the
pattern string, and the letters must be in upper case.
.
.
.SS "UTF support"
.rs
.sp
In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either as
single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
specified for the 32-bit library, in which case it constrains the character
values to valid Unicode code points. To process UTF strings, PCRE2 must be
built to include Unicode support (which is the default). When using UTF strings
you must either call the compiling function with one or both of the PCRE2_UTF
or PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the special
sequence (*UTF), which is equivalent to setting the relevant PCRE2_UTF. How
setting a UTF mode affects pattern matching is mentioned in several places
below. There is also a summary of features in the
.\" HREF
\fBpcre2unicode\fP
.\"
page.
.P
Some applications that allow their users to supply patterns may wish to
restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
option is passed to \fBpcre2_compile()\fP, (*UTF) is not allowed, and its
appearance in a pattern causes an error.
.
.
.SS "Unicode property support"
.rs
.sp
Another special sequence that may appear at the start of a pattern is (*UCP).
This has the same effect as setting the PCRE2_UCP option: it causes sequences
such as \ed and \ew to use Unicode properties to determine character types,
instead of recognizing only characters with codes less than 256 via a lookup
table. It also causes upper/lower casing operations to use Unicode properties
for characters with code points greater than 127, even when UTF is not set.
.P
Some applications that allow their users to supply patterns may wish to
restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
\fBpcre2_compile()\fP, (*UCP) is not allowed, and its appearance in a pattern
causes an error.
.
.
.SS "Locking out empty string matching"
.rs
.sp
Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect
as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever
matching function is subsequently called to match the pattern. These options
lock out the matching of empty strings, either entirely, or only at the start
of the subject.
.
.
.SS "Disabling auto-possessification"
.rs
.sp
If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting
the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making quantifiers
possessive when what follows cannot match the repeated item. For example, by
default a+b is treated as a++b. For more details, see the
.\" HREF
\fBpcre2api\fP
.\"
documentation.
.
.
.SS "Disabling start-up optimizations"
.rs
.sp
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
PCRE2_NO_START_OPTIMIZE option. This disables several optimizations for quickly
reaching "no match" results. For more details, see the
.\" HREF
\fBpcre2api\fP
.\"
documentation.
.
.
.SS "Disabling automatic anchoring"
.rs
.sp
If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that
apply to patterns whose top-level branches all start with .* (match any number
of arbitrary characters). For more details, see the
.\" HREF
\fBpcre2api\fP
.\"
documentation.
.
.
.SS "Disabling JIT compilation"
.rs
.sp
If a pattern that starts with (*NO_JIT) is successfully compiled, an attempt by
the application to apply the JIT optimization by calling
\fBpcre2_jit_compile()\fP is ignored.
.
.
.SS "Setting match resource limits"
.rs
.sp
The \fBpcre2_match()\fP function contains a counter that is incremented every
time it goes round its main loop. The caller of \fBpcre2_match()\fP can set a
limit on this counter, which therefore limits the amount of computing resource
used for a match. The maximum depth of nested backtracking can also be limited;
this indirectly restricts the amount of heap memory that is used, but there is
also an explicit memory limit that can be set.
.P
These facilities are provided to catch runaway matches that are provoked by
patterns with huge matching trees. A common example is a pattern with nested
unlimited repeats applied to a long string that does not match. When one of
these limits is reached, \fBpcre2_match()\fP gives an error return. The limits
can also be set by items at the start of the pattern of the form
.sp
  (*LIMIT_HEAP=d)
  (*LIMIT_MATCH=d)
  (*LIMIT_DEPTH=d)
.sp
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used. The heap limit is
specified in kibibytes (units of 1024 bytes).
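.P
The "lower but never raise" rule can be modelled in a few lines. The following
Python sketch is illustrative only: it does not call PCRE2, and the function
name effective_limit and its parameters are assumptions made for this example.

```python
def effective_limit(caller_limit, pattern_settings):
    """Return the limit in force after applying (*LIMIT_...=d) items.

    caller_limit:      value set (or defaulted) by the calling program
    pattern_settings:  values taken from the pattern, in order of appearance
    """
    limit = caller_limit
    for value in pattern_settings:
        # A pattern setting takes effect only if it lowers the current limit.
        if value < limit:
            limit = value
    return limit


# The pattern writer cannot raise the caller's limit of 1000 ...
print(effective_limit(1000, [5000]))
# ... and with several settings, the lowest one is used.
print(effective_limit(1000, [500, 200, 800]))
```

The same rule applies independently to the heap, match, and depth limits.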
.P
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
still recognized for backwards compatibility.
.P
The heap limit applies only when the \fBpcre2_match()\fP or
\fBpcre2_dfa_match()\fP interpreters are used for matching. It does not apply
to JIT. The match limit is used (but in a different way) when JIT is being
used, or when \fBpcre2_dfa_match()\fP is called, to limit computing resource
usage by those matching functions. The depth limit is ignored by JIT but is
relevant for DFA matching, which uses function recursion for recursions within
the pattern and for lookaround assertions and atomic groups. In this case, the
depth limit controls the depth of such recursion.
.
.
.\" HTML <a name="newlines"></a>
.SS "Newline conventions"
.rs
.sp
PCRE2 supports six different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (linefeed)
character, the two-character sequence CRLF, any of the three preceding, any
Unicode newline sequence, or the NUL character (binary zero). The
.\" HREF
\fBpcre2api\fP
.\"
page has
.\" HTML <a href="pcre2api.html#newlines">
.\" </a>
further discussion
.\"
about newlines, and shows how to set the newline convention when calling
\fBpcre2_compile()\fP.
.P
It is also possible to specify a newline convention by starting a pattern
string with one of the following sequences:
.sp
  (*CR)        carriage return
  (*LF)        linefeed
  (*CRLF)      carriage return, followed by linefeed
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences
  (*NUL)       the NUL character (binary zero)
.sp
These override the default and the options given to the compiling function. For
example, on a Unix system where LF is the default newline sequence, the pattern
.sp
  (*CR)a.b
.sp
changes the convention to CR. That pattern matches "a\enb" because LF is no
longer a newline. If more than one of these settings is present, the last one
is used.
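.P
As a rough illustration of how the leading items are scanned, the Python sketch
below walks the start-of-pattern items named in this document and reports the
newline convention that ends up in force. It is a model of the rules stated
above, not PCRE2's parser; the function name and the simplified item syntax are
assumptions for the example.

```python
import re

NEWLINE_ITEMS = {"CR", "LF", "CRLF", "ANYCRLF", "ANY", "NUL"}
OTHER_ITEMS = {
    "UTF", "UCP", "NOTEMPTY", "NOTEMPTY_ATSTART", "NO_AUTO_POSSESS",
    "NO_START_OPT", "NO_DOTSTAR_ANCHOR", "NO_JIT", "BSR_ANYCRLF",
    "BSR_UNICODE",
}
LIMIT_ITEMS = {"LIMIT_HEAP", "LIMIT_MATCH", "LIMIT_DEPTH", "LIMIT_RECURSION"}

# Items must be in upper case, so lower-case letters are not accepted here.
_ITEM = re.compile(r"\(\*([A-Z_]+)(?:=(\d+))?\)")


def newline_convention(pattern, default="LF"):
    """Return the newline convention selected by the leading items."""
    convention = default
    pos = 0
    while True:
        m = _ITEM.match(pattern, pos)
        if m is None:
            break  # the start-of-pattern area ends at the first non-item
        name, value = m.group(1), m.group(2)
        if name in NEWLINE_ITEMS and value is None:
            convention = name  # a later setting replaces an earlier one
        elif name in LIMIT_ITEMS and value is not None:
            pass  # limit items are recognized but not relevant here
        elif name in OTHER_ITEMS and value is None:
            pass
        else:
            break  # not a start-of-pattern item
        pos = m.end()
    return convention


print(newline_convention("(*CR)a.b"))
print(newline_convention("(*LF)(*UTF)(*ANY)x"))
```

An item that appears after ordinary pattern text, or that is written in lower
case, is not treated as a start-of-pattern setting by this sketch.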
.P
The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when
PCRE2_DOTALL is not set, and the behaviour of \eN when not followed by an
opening brace. However, it does not affect what the \eR escape sequence
matches. By default, this is any Unicode newline sequence, for Perl
compatibility. However, this can be changed; see the next section and the
description of \eR in the section entitled
.\" HTML <a href="#newlineseq">
.\" </a>
"Newline sequences"
.\"
below. A change of \eR setting can be combined with a change of newline
convention.
.
.
.SS "Specifying what \eR matches"
.rs
.sp
It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
at compile time. This effect can also be achieved by starting a pattern with
(*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
corresponding to PCRE2_BSR_UNICODE.
.
.
.SH "EBCDIC CHARACTER CODES"
.rs
.sp
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
character code instead of ASCII or Unicode (typically a mainframe system). In
the sections below, character code values are ASCII or Unicode; in an EBCDIC
environment these characters may have different code values, and there are no
code points greater than 255.
.
.
.SH "CHARACTERS AND METACHARACTERS"
.rs
.sp
A regular expression is a pattern that is matched against a subject string from
left to right. Most characters stand for themselves in a pattern, and match the
corresponding characters in the subject. As a trivial example, the pattern
.sp
  The quick brown fox
.sp
matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE2_CASELESS option or (?i) within the
pattern), letters are matched independently of case. Note that there are two
ASCII characters, K and S, that, in addition to their lower case ASCII
equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F
(long S) respectively when either PCRE2_UTF or PCRE2_UCP is set.
.P
The power of regular expressions comes from the ability to include wild cards,
character classes, alternatives, and repetitions in the pattern. These are
encoded in the pattern by the use of \fImetacharacters\fP, which do not stand
for themselves but instead are interpreted in some special way.
.P
There are two different sets of metacharacters: those that are recognized
anywhere in the pattern except within square brackets, and those that are
recognized within square brackets. Outside square brackets, the metacharacters
are as follows:
.sp
  \e      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start group or control verb
  )      end group or control verb
  *      0 or more quantifier
  +      1 or more quantifier; also "possessive quantifier"
  ?      0 or 1 quantifier; also quantifier minimizer
  {      start min/max quantifier
.sp
Part of a pattern that is in square brackets is called a "character class". In
a character class the only metacharacters are:
.sp
  \e      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (if followed by POSIX syntax)
  ]      terminates the character class
.sp
If a pattern is compiled with the PCRE2_EXTENDED option, most white space in
the pattern, other than in a character class, and characters between a #
outside a character class and the next newline, inclusive, are ignored. An
escaping backslash can be used to include a white space or a # character as
part of the pattern. If the PCRE2_EXTENDED_MORE option is set, the same
applies, but in addition unescaped space and horizontal tab characters are
ignored inside a character class. Note: only these two characters are ignored,
not the full set of pattern white space characters that are ignored outside a
character class. Option settings can be changed within a pattern; see the
section entitled
.\" HTML <a href="#internaloptions">
.\" </a>
"Internal Option Setting"
.\"
below.
.P
The following sections describe the use of each of the metacharacters.
.
.
.SH BACKSLASH
.rs
.sp
The backslash character has several uses. Firstly, if it is followed by a
character that is not a digit or a letter, it takes away any special meaning
that character may have. This use of backslash as an escape character applies
both inside and outside character classes.
.P
For example, if you want to match a * character, you must write \e* in the
pattern. This escaping action applies whether or not the following character
would otherwise be interpreted as a metacharacter, so it is always safe to
precede a non-alphanumeric with backslash to specify that it stands for itself.
In particular, if you want to match a backslash, you write \e\e.
.P
Only ASCII digits and letters have any special meaning after a backslash. All
other characters (in particular, those whose code points are greater than 127)
are treated as literals.
.P
If you want to treat all characters in a sequence as literals, you can do so by
putting them between \eQ and \eE. This is different from Perl in that $ and @
are handled as literals in \eQ...\eE sequences in PCRE2, whereas in Perl, $ and
@ cause variable interpolation. Also, Perl does "double-quotish backslash
interpolation" on any backslashes between \eQ and \eE which, its documentation
says, "may lead to confusing results". PCRE2 treats a backslash between \eQ and
\eE just like any other character. Note the following examples:
.sp
  Pattern            PCRE2 matches   Perl matches
.sp
.\" JOIN
  \eQabc$xyz\eE        abc$xyz        abc followed by the
                                      contents of $xyz
  \eQabc\e$xyz\eE       abc\e$xyz       abc\e$xyz
  \eQabc\eE\e$\eQxyz\eE   abc$xyz        abc$xyz
  \eQA\eB\eE            A\eB            A\eB
  \eQ\e\eE              \e              \e\eE
.sp
The \eQ...\eE sequence is recognized both inside and outside character classes.
An isolated \eE that is not preceded by \eQ is ignored. If \eQ is not followed
by \eE later in the pattern, the literal interpretation continues to the end of
the pattern (that is, \eE is assumed at the end). If the isolated \eQ is inside
a character class, this causes an error, because the character class is not
terminated by a closing square bracket.
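.P
The quoting rules can be sketched as a small rewriting pass that turns each
quoted run into individually escaped literals. This Python model is a
simplification written for illustration: the name expand_quoted is hypothetical,
character classes are not treated specially, and it is not how PCRE2 itself
processes patterns.

```python
def expand_quoted(pattern):
    """Rewrite \\Q...\\E runs as individually escaped literal characters."""
    out = []
    quoting = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        nxt = pattern[i + 1] if i + 1 < len(pattern) else ""
        if quoting:
            if ch == "\\" and nxt == "E":
                quoting = False  # end of the literal run
                i += 2
                continue
            # Inside the run every character, including backslash, $ and @,
            # is a literal. Escaping a non-alphanumeric is always safe.
            out.append(ch if ch.isalnum() else "\\" + ch)
            i += 1
        elif ch == "\\" and nxt == "Q":
            quoting = True
            i += 2
        elif ch == "\\" and nxt == "E":
            i += 2  # an isolated \E is ignored
        elif ch == "\\" and nxt:
            out.append(ch + nxt)  # ordinary escape: copy both characters
            i += 2
        else:
            out.append(ch)
            i += 1
    # If \Q was never closed, the run simply extends to the end.
    return "".join(out)


print(expand_quoted(r"\Qabc$xyz\E"))
print(expand_quoted(r"\Qabc\E\$\Qxyz\E"))
```

Both calls produce the same rewritten pattern, which corresponds to the
"PCRE2 matches" column of the first and third rows of the table above.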
.
.
.\" HTML <a name="digitsafterbackslash"></a>
.SS "Non-printing characters"
.rs
.sp
A second use of backslash provides a way of encoding non-printing characters
in patterns in a visible manner. There is no restriction on the appearance of
non-printing characters in a pattern, but when a pattern is being prepared by
text editing, it is often easier to use one of the following escape sequences
instead of the binary character it represents. In an ASCII or Unicode
environment, these escapes are as follows:
.sp
  \ea          alarm, that is, the BEL character (hex 07)
  \ecx         "control-x", where x is any printable ASCII character
  \ee          escape (hex 1B)
  \ef          form feed (hex 0C)
  \en          linefeed (hex 0A)
  \er          carriage return (hex 0D) (but see below)
  \et          tab (hex 09)
  \e0dd        character with octal code 0dd
  \eddd        character with octal code ddd, or backreference
  \eo{ddd..}   character with octal code ddd..
  \exhh        character with hex code hh
  \ex{hhh..}   character with hex code hhh..
  \eN{U+hhh..} character with Unicode hex code point hhh..
.sp
By default, after \ex that is not followed by {, from zero to two hexadecimal
digits are read (letters can be in upper or lower case). Any number of
hexadecimal digits may appear between \ex{ and }. If a character other than a
hexadecimal digit appears between \ex{ and }, or if there is no terminating },
an error occurs.
.P
Characters whose code points are less than 256 can be defined by either of the
two syntaxes for \ex or by an octal sequence. There is no difference in the way
they are handled. For example, \exdc is exactly the same as \ex{dc} or \e334.
However, using the braced versions does make such sequences easier to read.
.P
Support is available for some ECMAScript (aka JavaScript) escape sequences via
two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \ex followed
by { is not recognized. Only if \ex is followed by two hexadecimal digits is it
recognized as a character escape. Otherwise it is interpreted as a literal "x"
character. In this mode, support for code points greater than 255 is provided
by \eu, which must be followed by four hexadecimal digits; otherwise it is
interpreted as a literal "u" character.
.P
PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition,
\eu{hhh..} is recognized as the character specified by hexadecimal code point.
There may be any number of hexadecimal digits. This syntax is from ECMAScript
6.
.P
The \eN{U+hhh..} escape sequence is recognized only when PCRE2 is operating in
UTF mode. Perl also uses \eN{name} to specify characters by Unicode name; PCRE2
does not support this. Note that when \eN is not followed by an opening brace
(curly bracket) it has an entirely different meaning, matching any character
that is not a newline.
.P
There are some legacy applications where the escape sequence \er is expected to
match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \er in a
pattern is converted to \en so that it matches a LF (linefeed) instead of a CR
(carriage return) character.
.P
The precise effect of \ecx on ASCII characters is as follows: if x is a lower
case letter, it is converted to upper case. Then bit 6 of the character (hex
40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
code unit following \ec has a value less than 32 or greater than 126, a
compile-time error occurs.
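.P
The arithmetic for the ASCII case can be checked with a short Python sketch.
The function name control_escape is an assumption for this example, and a
ValueError stands in for the compile-time error.

```python
def control_escape(x):
    """Return the code generated by "control-x" in an ASCII environment."""
    code = ord(x)
    if code < 32 or code > 126:
        # Only printable ASCII characters may follow the escape.
        raise ValueError("character after \\c must be printable ASCII")
    if "a" <= x <= "z":
        code = ord(x.upper())  # lower case letters are upper-cased first
    return code ^ 0x40  # invert bit 6 (hex 40)


for ch in "AZ{;":
    print(ch, format(control_escape(ch), "02X"))
```

The loop prints the four cases worked through in the paragraph above: A and Z
map to 01 and 1A, while { and ; swap between 3B and 7B.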
.P
When PCRE2 is compiled in EBCDIC mode, \eN{U+hhh..} is not supported. \ea, \ee,
\ef, \en, \er, and \et generate the appropriate EBCDIC code values. The \ec
escape is processed as specified for Perl in the \fBperlebcdic\fP document. The
only characters that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ],
^, _, or ?. Any other character provokes a compile-time error. The sequence
\ec@ encodes character code 0; after \ec the letters (in either case) encode
characters 1-26 (hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31
(hex 1B to hex 1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F).
.P
Thus, apart from \ec?, these escapes generate the same character code values as
they do in an ASCII environment, though the meanings of the values mostly
differ. For example, \ecG always generates code value 7, which is BEL in ASCII
but DEL in EBCDIC.
.P
The sequence \ec? generates DEL (127, hex 7F) in an ASCII environment, but
because 127 is not a control character in EBCDIC, Perl makes it generate the
APC character. Unfortunately, there are several variants of EBCDIC. In most of
them the APC character has the value 255 (hex FF), but in the one Perl calls
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
values, PCRE2 makes \ec? generate 95; otherwise it generates 255.
.P
After \e0 up to two further octal digits are read. If there are fewer than two
digits, just those that are present are used. Thus the sequence \e0\ex\e015
specifies two binary zeros followed by a CR character (code value 13). Make
sure you supply two digits after the initial zero if the pattern character that
follows is itself an octal digit.
.P
The escape \eo must be followed by a sequence of octal digits, enclosed in
braces. An error occurs if this is not the case. This escape is a recent
addition to Perl; it provides a way of specifying character code points as
octal numbers greater than 0777, and it also allows octal numbers and
backreferences to be unambiguously specified.
.P
For greater clarity and unambiguity, it is best to avoid following \e by a
digit greater than zero. Instead, use \eo{} or \ex{} to specify numerical
character code points, and \eg{} to specify backreferences. The following
paragraphs describe the old, ambiguous syntax.
.P
The handling of a backslash followed by a digit other than 0 is complicated,
and Perl has changed over time, causing PCRE2 also to change.
.P
Outside a character class, PCRE2 reads the digit and any following digits as a
decimal number. If the number is less than 10, begins with the digit 8 or 9, or
if there are at least that many previous capture groups in the expression, the
entire sequence is taken as a \fIbackreference\fP. A description of how this
works is given
.\" HTML <a href="#backreferences">
.\" </a>
later,
.\"
following the discussion of
.\" HTML <a href="#group">
.\" </a>
parenthesized groups.
.\"
Otherwise, up to three octal digits are read to form a character code.
.P
Inside a character class, PCRE2 handles \e8 and \e9 as the literal characters
"8" and "9", and otherwise reads up to three octal digits following the
backslash, using them to generate a data character. Any subsequent digits stand
for themselves. For example, outside a character class:
.sp
  \e040   is another way of writing an ASCII space
.\" JOIN
  \e40    is the same, provided there are fewer than 40
            previous capture groups
  \e7     is always a backreference
.\" JOIN
  \e11    might be a backreference, or another way of
            writing a tab
  \e011   is always a tab
  \e0113  is a tab followed by the character "3"
.\" JOIN
  \e113   might be a backreference, otherwise the
            character with octal code 113
.\" JOIN
  \e377   might be a backreference, otherwise
            the value 255 (decimal)
  \e81    is always a backreference
.sp
Note that octal values of 100 or greater that are specified using this syntax
must not be introduced by a leading zero, because no more than three octal
digits are ever read.
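.P
The decision procedure for a backslash followed by digits, outside a character
class, can be written down as a short function. The Python sketch below is a
model of the rules stated above, under the assumption that the caller supplies
a non-empty run of decimal digits and the number of capture groups seen so far;
the function name and the return format are invented for this example.

```python
def classify_backslash_digits(digits, group_count):
    """Interpret the digits after a backslash, outside a character class.

    Returns ("backref", n, rest) or ("char", code, rest), where rest holds
    any trailing digits that stand for themselves.
    """
    if digits[0] != "0":
        number = int(digits)
        # Backreference if below 10, starting with 8 or 9, or if there are
        # at least that many previous capture groups.
        if number < 10 or digits[0] in "89" or number <= group_count:
            return ("backref", number, "")
    # Otherwise read up to three octal digits as a character code.
    n = 0
    while n < min(3, len(digits)) and digits[n] in "01234567":
        n += 1
    return ("char", int(digits[:n], 8), digits[n:])


print(classify_backslash_digits("11", 5))     # too few groups: octal 11
print(classify_backslash_digits("11", 11))    # enough groups: backreference
print(classify_backslash_digits("0113", 0))   # octal 011, then a literal "3"
```

Writing such escapes with braces instead removes the dependence on the number
of capture groups.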
.
.
.SS "Constraints on character values"
.rs
.sp
Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:
.sp
  8-bit non-UTF mode    no greater than 0xff
  16-bit non-UTF mode   no greater than 0xffff
  32-bit non-UTF mode   no greater than 0xffffffff
  All UTF modes         no greater than 0x10ffff and a valid code point
.sp
Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the
so-called "surrogate" code points). The check for these can be disabled by the
caller of \fBpcre2_compile()\fP by setting the option
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in UTF-8
and UTF-32 modes, because these values are not representable in UTF-16.
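.P
These limits amount to a simple range check. The following Python sketch
restates the table and the surrogate rule; the function name and its keyword
parameters are assumptions for illustration, with allow_surrogates standing in
for the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option.

```python
def is_valid_escape_value(value, width, utf=False, allow_surrogates=False):
    """Check a numeric escape value against the limits for a given mode.

    width is the code unit width of the library in use: 8, 16, or 32.
    """
    if utf:
        if value > 0x10FFFF:
            return False
        if 0xD800 <= value <= 0xDFFF:
            # Surrogates can be permitted only in UTF-8 and UTF-32 modes.
            return allow_surrogates and width in (8, 32)
        return True
    return value <= {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}[width]


print(is_valid_escape_value(0x100, 8))
print(is_valid_escape_value(0xD800, 16, utf=True))
print(is_valid_escape_value(0xD800, 8, utf=True, allow_surrogates=True))
```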
.
.
.SS "Escape sequences in character classes"
.rs
.sp
All the sequences that define a single character value can be used both inside
and outside character classes. In addition, inside a character class, \eb is
interpreted as the backspace character (hex 08).
.P
When not followed by an opening brace, \eN is not allowed in a character class.
\eB, \eR, and \eX are not special inside a character class. Like other
unrecognized alphabetic escape sequences, they cause an error. Outside a
character class, these sequences have different meanings.
.
.
.SS "Unsupported escape sequences"
.rs
.sp
In Perl, the sequences \eF, \el, \eL, \eu, and \eU are recognized by its string
handler and used to modify the case of following characters. By default, PCRE2
does not support these escape sequences in patterns. However, if either of the
PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \eU matches a "U"
character, and \eu can be used to define a character by code point, as
described above.
.
.
.SS "Absolute and relative backreferences"
.rs
.sp
The sequence \eg followed by a signed or unsigned number, optionally enclosed
in braces, is an absolute or relative backreference. A named backreference
can be coded as \eg{name}. Backreferences are discussed
.\" HTML <a href="#backreferences">
.\" </a>
later,
.\"
following the discussion of
.\" HTML <a href="#group">
.\" </a>
parenthesized groups.
.\"
.
.
.SS "Absolute and relative subroutine calls"
.rs
.sp
For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
a number enclosed either in angle brackets or single quotes, is an alternative
syntax for referencing a capture group as a subroutine. Details are discussed
.\" HTML <a href="#onigurumasubroutines">
.\" </a>
later.
.\"
Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
synonymous. The former is a backreference; the latter is a
.\" HTML <a href="#groupsassubroutines">
.\" </a>
subroutine
.\"
call.
.
.
.\" HTML <a name="genericchartypes"></a>
.SS "Generic character types"
.rs
.sp
Another use of backslash is for specifying generic character types:
.sp
  \ed     any decimal digit
  \eD     any character that is not a decimal digit
  \eh     any horizontal white space character
  \eH     any character that is not a horizontal white space character
  \eN     any character that is not a newline
  \es     any white space character
  \eS     any character that is not a white space character
  \ev     any vertical white space character
  \eV     any character that is not a vertical white space character
  \ew     any "word" character
  \eW     any "non-word" character
.sp
The \eN escape sequence has the same meaning as
.\" HTML <a href="#fullstopdot">
.\" </a>
the "." metacharacter
.\"
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
meaning of \eN. Note that when \eN is followed by an opening brace it has a
different meaning. See the section entitled
.\" HTML <a href="#digitsafterbackslash">
.\" </a>
"Non-printing characters"
.\"
above for details. Perl also uses \eN{name} to specify characters by Unicode
name; PCRE2 does not support this.
.P
Each pair of lower and upper case escape sequences partitions the complete set
of characters into two disjoint sets. Any given character matches one, and only
one, of each pair. The sequences can appear both inside and outside character
classes. They each match one character of the appropriate type. If the current
matching point is at the end of the subject string, all of them fail, because
there is no character to match.
.P
The default \es characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
space (32), which are defined as white space in the "C" locale. This list may
vary if locale-specific matching is taking place. For example, in some locales
the "non-breaking space" character (\exA0) is recognized as white space, and in
others the VT character is not.
.P
A "word" character is an underscore or any character that is a letter or digit.
By default, the definition of letters and digits is controlled by PCRE2's
low-valued character tables, and may vary if locale-specific matching is taking
place (see
.\" HTML <a href="pcre2api.html#localesupport">
.\" </a>
"Locale support"
.\"
in the
.\" HREF
\fBpcre2api\fP
.\"
page). For example, in a French locale such as "fr_FR" in Unix-like systems,
or "french" in Windows, some character codes greater than 127 are used for
accented letters, and these are then matched by \ew. The use of locales with
Unicode is discouraged.
.P
By default, characters whose code points are greater than 127 never match \ed,
\es, or \ew, and always match \eD, \eS, and \eW, although this may be different
for characters in the range 128-255 when locale-specific matching is happening.
These escape sequences retain their original meanings from before Unicode
support was available, mainly for efficiency reasons. If the PCRE2_UCP option
is set, the behaviour is changed so that Unicode properties are used to
determine character types, as follows:
.sp
  \ed  any character that matches \ep{Nd} (decimal digit)
  \es  any character that matches \ep{Z} or \eh or \ev
  \ew  any character that matches \ep{L} or \ep{N}, plus underscore
.sp
The upper case escapes match the inverse sets of characters. Note that \ed
matches only decimal digits, whereas \ew matches any Unicode digit, as well as
any Unicode letter, and underscore. Note also that PCRE2_UCP affects \eb and
\eB because they are defined in terms of \ew and \eW. Matching these sequences
is noticeably slower when PCRE2_UCP is set.
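.P
The three property-based definitions can be approximated with Python's
unicodedata module, as in the sketch below. This is an illustration only: the
function names are invented, the results depend on the Unicode version shipped
with the Python interpreter rather than the one built into PCRE2, and the
horizontal and vertical space sets are the fixed code point lists given in
this section.

```python
import unicodedata

# Fixed code point lists for horizontal and vertical white space.
H_SPACE = {0x09, 0x20, 0xA0, 0x1680, 0x180E, *range(0x2000, 0x200B),
           0x202F, 0x205F, 0x3000}
V_SPACE = {0x0A, 0x0B, 0x0C, 0x0D, 0x85, 0x2028, 0x2029}


def ucp_digit(ch):
    """Decimal digit: general category Nd."""
    return unicodedata.category(ch) == "Nd"


def ucp_space(ch):
    """White space: any Z category, or horizontal or vertical white space."""
    return unicodedata.category(ch).startswith("Z") or ord(ch) in H_SPACE | V_SPACE


def ucp_word(ch):
    """Word character: any L or N category, plus underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"


# U+00B2 (superscript two) has category No: a word character, not a digit.
print(ucp_digit("\u00b2"), ucp_word("\u00b2"))
# U+0663 (Arabic-Indic digit three) has category Nd, so it is both.
print(ucp_digit("\u0663"), ucp_word("\u0663"))
```

The first case shows the difference noted above between the digit class, which
is limited to decimal digits, and the word class, which accepts any Unicode
number.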
.P
The sequences \eh, \eH, \ev, and \eV, in contrast to the other sequences, which
match only ASCII characters by default, always match a specific list of code
points, whether or not PCRE2_UCP is set. The horizontal space characters are:
.sp
  U+0009     Horizontal tab (HT)
  U+0020     Space
  U+00A0     Non-break space
  U+1680     Ogham space mark
  U+180E     Mongolian vowel separator
  U+2000     En quad
  U+2001     Em quad
  U+2002     En space
  U+2003     Em space
  U+2004     Three-per-em space
  U+2005     Four-per-em space
  U+2006     Six-per-em space
  U+2007     Figure space
  U+2008     Punctuation space
  U+2009     Thin space
  U+200A     Hair space
  U+202F     Narrow no-break space
  U+205F     Medium mathematical space
  U+3000     Ideographic space
.sp
The vertical space characters are:
.sp
  U+000A     Linefeed (LF)
  U+000B     Vertical tab (VT)
  U+000C     Form feed (FF)
  U+000D     Carriage return (CR)
  U+0085     Next line (NEL)
  U+2028     Line separator
  U+2029     Paragraph separator
.sp
In 8-bit, non-UTF-8 mode, only the characters with code points less than 256
are relevant.
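.P
The two fixed lists above can be written down as plain sets of code points. The
following Python sketch does exactly that; the names are hypothetical and it
only models the lists, it does not call PCRE2.
.sp
.nf
```python
# Code points matched by the horizontal and vertical space escapes,
# copied from the two lists above.
HSPACE = {0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
          *range(0x2000, 0x200B),          # U+2000 to U+200A inclusive
          0x202F, 0x205F, 0x3000}
VSPACE = {0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029}

def is_hspace(ch):
    return ord(ch) in HSPACE

def is_vspace(ch):
    return ord(ch) in VSPACE

# In 8-bit, non-UTF-8 mode only code points below 256 can occur.
HSPACE_8BIT = sorted(cp for cp in HSPACE if cp < 256)
VSPACE_8BIT = sorted(cp for cp in VSPACE if cp < 256)
print(HSPACE_8BIT, VSPACE_8BIT)
```
.fi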
.
.
.\" HTML <a name="newlineseq"></a>
.SS "Newline sequences"
.rs
.sp
Outside a character class, by default, the escape sequence \eR matches any
Unicode newline sequence. In 8-bit non-UTF-8 mode \eR is equivalent to the
following:
.sp
  (?>\er\en|\en|\ex0b|\ef|\er|\ex85)
.sp
This is an example of an "atomic group", details of which are given
.\" HTML <a href="#atomicgroup">
.\" </a>
below.
.\"
This particular group matches either the two-character sequence CR followed by
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
line, U+0085). Because this is an atomic group, the two-character sequence is
treated as a single unit that cannot be split.
.P
In other modes, two additional characters whose code points are greater than 255
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
Unicode support is not needed for these characters to be recognized.
.P
It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
at compile time. (BSR is an abbreviation for "backslash R".) This can be made
the default when PCRE2 is built; if this is the case, the other behaviour can
be requested via the PCRE2_BSR_UNICODE option. It is also possible to specify
these settings by starting a pattern string with one of the following
sequences:
.sp
  (*BSR_ANYCRLF)   CR, LF, or CRLF only
  (*BSR_UNICODE)   any Unicode newline sequence
.sp
These override the default and the options given to the compiling function.
Note that these special settings, which are not Perl-compatible, are recognized
only at the very start of a pattern, and that they must be in upper case. If
more than one of them is present, the last one is used. They can be combined
with a change of newline convention; for example, a pattern can start with:
.sp
  (*ANY)(*BSR_ANYCRLF)
.sp
They can also be combined with the (*UTF) or (*UCP) special sequences. Inside a
character class, \eR is treated as an unrecognized escape sequence, and causes
an error.
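.P
The behaviour of \eR described in this section can be sketched as a small
Python function. This is only a model, assuming a mode other than 8-bit
non-UTF-8 (so LS and PS are included); the name \fBbsr_length\fP and its
\fBanycrlf\fP parameter are hypothetical stand-ins for the PCRE2_BSR_ANYCRLF
setting, not part of any PCRE2 API.
.sp
.nf
```python
CR, LF = chr(0x0D), chr(0x0A)

# Single characters accepted in each setting.
SINGLE_UNICODE = {LF, chr(0x0B), chr(0x0C), CR, chr(0x85),
                  chr(0x2028), chr(0x2029)}
SINGLE_ANYCRLF = {CR, LF}

def bsr_length(subject, pos, anycrlf=False):
    # Returns how many characters a newline-sequence match at pos would
    # consume, or 0 if there is no match.
    if subject.startswith(CR + LF, pos):
        return 2          # CRLF is tried first and is never split
    singles = SINGLE_ANYCRLF if anycrlf else SINGLE_UNICODE
    if pos < len(subject) and subject[pos] in singles:
        return 1
    return 0

print(bsr_length("a" + CR + LF + "b", 1))
print(bsr_length(chr(0x85), 0), bsr_length(chr(0x85), 0, anycrlf=True))
```
.fi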
.
.
.\" HTML <a name="uniextseq"></a>
.SS Unicode character properties
.rs
.sp
When PCRE2 is built with Unicode support (the default), three additional escape
sequences that match characters with specific properties are available. They
can be used in any mode, though in 8-bit and 16-bit non-UTF modes these
sequences are of course limited to testing characters whose code points are
less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, code points
greater than 0x10ffff (the Unicode limit) may be encountered. These are all
treated as being in the Unknown script and with an unassigned type.
.P
Matching characters by Unicode property is not fast, because PCRE2 has to do a
multistage table lookup in order to find a character's property. That is why
the traditional escape sequences such as \ed and \ew do not use Unicode
properties in PCRE2 by default, though you can make them do so by setting the
PCRE2_UCP option or by starting the pattern with (*UCP).
.P
The extra escape sequences that provide property support are:
.sp
  \ep{\fIxx\fP}   a character with the \fIxx\fP property
  \eP{\fIxx\fP}   a character without the \fIxx\fP property
  \eX       a Unicode extended grapheme cluster
.sp
The property names represented by \fIxx\fP above are not case-sensitive, and in
accordance with Unicode's "loose matching" rules, spaces, hyphens, and
underscores are ignored. There is support for Unicode script names, Unicode
general category properties, "Any", which matches any character (including
newline), Bidi_Class, a number of binary (yes/no) properties, and some special
PCRE2 properties (described
.\" HTML <a href="#extraprops">
.\" </a>
below).
.\"
Certain other Perl properties such as "InMusicalSymbols" are not supported by
PCRE2. Note that \eP{Any} does not match any characters, so always causes a
match failure.
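.P
The "loose matching" of property names described above amounts to comparing
names after a simple normalization. The following Python sketch shows one way
to express that rule; \fBloose_key\fP is a hypothetical helper, and it models
only the name comparison, not how PCRE2 parses \ep and \eP.
.sp
.nf
```python
def loose_key(name):
    # Ignore case, and drop spaces, hyphens, and underscores.
    return "".join(c for c in name.lower() if c not in " -_")

# These spellings all normalize to the same key.
spellings = ["Bidi_Class", "bidi class", "BIDI-CLASS", "bidiclass"]
print({loose_key(s) for s in spellings})
```
.fi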
.
.
.
.SS "Script properties for \ep and \eP"
.rs
.sp
There are three different syntax forms for matching a script. Each Unicode
character has a basic script and, optionally, a list of other scripts ("Script
Extensions") with which it is commonly used. Using the Adlam script as an
example, \ep{sc:Adlam} matches characters whose basic script is Adlam, whereas
\ep{scx:Adlam} matches, in addition, characters that have Adlam in their
extensions list. The full names "script" and "script extensions" for the
property types are recognized, and an equals sign is an alternative to the
colon. If a script name is given without a property type, for example,
\ep{Adlam}, it is treated as \ep{scx:Adlam}. Perl changed to this
interpretation at release 5.26 and PCRE2 changed at release 10.40.
.P
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
part of an identified script are lumped together as "Common". The current list
of recognized script names and their 4-character abbreviations can be obtained
by running this command:
.sp
  pcre2test -LS
.sp
.
.
.
.SS "The general category property for \ep and \eP"
.rs
.sp
Each character has exactly one Unicode general category property, specified by
a two-letter abbreviation. For compatibility with Perl, negation can be
specified by including a circumflex between the opening brace and the property
name. For example, \ep{^Lu} is the same as \eP{Lu}.
.P
If only one letter is specified with \ep or \eP, it includes all the general
category properties that start with that letter. In this case, in the absence
of negation, the curly brackets in the escape sequence are optional; these two
examples have the same effect:
.sp
  \ep{L}
  \epL
.sp
The following general category property codes are supported:
.sp
  C     Other
  Cc    Control
  Cf    Format
  Cn    Unassigned
  Co    Private use
  Cs    Surrogate
.sp
  L     Letter
  Ll    Lower case letter
  Lm    Modifier letter
  Lo    Other letter
  Lt    Title case letter
  Lu    Upper case letter
.sp
  M     Mark
  Mc    Spacing mark
  Me    Enclosing mark
  Mn    Non-spacing mark
.sp
  N     Number
  Nd    Decimal number
  Nl    Letter number
  No    Other number
.sp
  P     Punctuation
  Pc    Connector punctuation
  Pd    Dash punctuation
  Pe    Close punctuation
  Pf    Final punctuation
  Pi    Initial punctuation
  Po    Other punctuation
  Ps    Open punctuation
.sp
  S     Symbol
  Sc    Currency symbol
  Sk    Modifier symbol
  Sm    Mathematical symbol
  So    Other symbol
.sp
  Z     Separator
  Zl    Line separator
  Zp    Paragraph separator
  Zs    Space separator
.sp
The special property LC, which has the synonym L&, is also supported: it
matches a character that has the Lu, Ll, or Lt property, in other words, a
letter that is not classified as a modifier or "other".
.P
The Cs (Surrogate) property applies only to characters whose code points are in
the range U+D800 to U+DFFF. These characters are no different to any other
character when PCRE2 is not in UTF mode (using the 16-bit or 32-bit library).
However, they are not valid in Unicode strings and so cannot be tested by PCRE2
in UTF mode, unless UTF validity checking has been turned off (see the
discussion of PCRE2_NO_UTF_CHECK in the
.\" HREF
\fBpcre2api\fP
.\"
page).
.P
The long synonyms for property names that Perl supports (such as \ep{Letter})
are not supported by PCRE2, nor is it permitted to prefix any of these
properties with "Is".
.P
No character that is in the Unicode table has the Cn (unassigned) property.
Instead, this property is assumed for any code point that is not in the
Unicode table.
.P
Specifying caseless matching does not affect these escape sequences. For
example, \ep{Lu} always matches only upper case letters. This is different from
the behaviour of current versions of Perl.
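.P
The rules in this section (two-letter codes, one-letter groups, the LC synonym,
and circumflex negation) can be modelled with Python's standard
\fBunicodedata\fP module, which reports the same two-letter general category
codes. This is a sketch only: \fBhas_property\fP is a hypothetical helper, it
treats names case-sensitively for brevity, and Python's Unicode tables may not
be at the same version as PCRE2's.
.sp
.nf
```python
import unicodedata

def has_property(ch, name):
    # name is a one- or two-letter general category code, or "LC" / "L&",
    # optionally preceded by "^" for negation.
    negate = name.startswith("^")
    if negate:
        name = name[1:]
    category = unicodedata.category(ch)
    if name in ("LC", "L&"):
        hit = category in ("Lu", "Ll", "Lt")
    elif len(name) == 1:
        hit = category[0] == name     # one letter covers the whole group
    else:
        hit = category == name
    return hit != negate

TITLECASE_DZ = chr(0x01C5)     # general category Lt
MODIFIER_H = chr(0x02B0)       # general category Lm

print(has_property("A", "Lu"), has_property("a", "^Lu"))
print(has_property(TITLECASE_DZ, "LC"), has_property(MODIFIER_H, "LC"))
```
.fi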
.
.
.SS "Binary (yes/no) properties for \ep and \eP"
.rs
.sp
Unicode defines a number of binary properties, that is, properties whose only
values are true or false. You can obtain a list of those that are recognized by
\ep and \eP, along with their abbreviations, by running this command:
.sp
  pcre2test -LP
.sp
.
.
.SS "The Bidi_Class property for \ep and \eP"
.rs
.sp
  \ep{Bidi_Class:<class>}   matches a character with the given class
  \ep{BC:<class>}           matches a character with the given class
.sp
The recognized classes are:
.sp
  AL   Arabic letter
  AN   Arabic number
  B    paragraph separator
  BN   boundary neutral
  CS   common separator
  EN   European number
  ES   European separator
  ET   European terminator
  FSI  first strong isolate
  L    left-to-right
  LRE  left-to-right embedding
  LRI  left-to-right isolate
  LRO  left-to-right override
  NSM  non-spacing mark
  ON   other neutral
  PDF  pop directional format
  PDI  pop directional isolate
  R    right-to-left
  RLE  right-to-left embedding
  RLI  right-to-left isolate
  RLO  right-to-left override
  S    segment separator
  WS   white space
.sp
An equals sign may be used instead of a colon. The class names are
case-insensitive; only the short names listed above are recognized.
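.P
Python's standard \fBunicodedata\fP module exposes the same short Bidi_Class
names, so the test performed by \ep{BC:<class>} can be sketched as follows.
The helper \fBhas_bidi_class\fP is hypothetical and is not a PCRE2 function;
Python's Unicode tables may also be at a different version from PCRE2's.
.sp
.nf
```python
import unicodedata

def has_bidi_class(ch, bidi_class):
    # Class names are compared case-insensitively, using short names only.
    return unicodedata.bidirectional(ch) == bidi_class.upper()

HEBREW_ALEF = chr(0x05D0)
ARABIC_ALEF = chr(0x0627)

print(has_bidi_class("A", "L"), has_bidi_class(" ", "ws"))
print(has_bidi_class(HEBREW_ALEF, "R"), has_bidi_class(ARABIC_ALEF, "AL"))
```
.fi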
.
.
.SS Extended grapheme clusters
.rs
.sp
The \eX escape matches any number of Unicode characters that form an "extended
grapheme cluster", and treats the sequence as an atomic group
.\" HTML <a href="#atomicgroup">
.\" </a>
(see below).
.\"
Unicode supports various kinds of composite character by giving each character
a grapheme breaking property, and having rules that use these properties to
define the boundaries of extended grapheme clusters. The rules are defined in
Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0
abandoned the use of some previous properties that had been used for emojis.
Instead it introduced various emoji-specific properties. PCRE2 uses only the
Extended Pictographic property.
.P
\eX always matches at least one character. Then it decides whether to add
additional characters according to the following rules for ending a cluster:
.P
1. End at the end of the subject string.
.P
2. Do not end between CR and LF; otherwise end after any control character.
.P
3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters
are of five types: L, V, T, LV, and LVT. An L character may be followed by an
L, V, LV, or LVT character; an LV or V character may be followed by a V or T
character; an LVT or T character may be followed only by a T character.
.P
4. Do not end before extending characters or spacing marks or the "zero-width
joiner" character. Characters with the "mark" property always have the
"extend" grapheme breaking property.
.P
5. Do not end after prepend characters.
.P
6. Do not break within emoji modifier sequences or emoji ZWJ sequences. That
is, do not break between characters with the Extended_Pictographic property.
Extend and ZWJ characters are allowed between the characters.
.P
7. Do not break within emoji flag sequences. That is, do not break between
regional indicator (RI) characters if there are an odd number of RI characters
before the break point.
.P
8. Otherwise, end the cluster.
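.P
Two of the rules above lend themselves to a compact illustration. The Python
sketch below encodes rule 3 as a lookup table and rule 7 as a pairing of
regional indicator characters. It is deliberately partial: all names are
hypothetical, the Hangul types are passed in as labels rather than looked up,
and every character that is not a regional indicator is treated as a cluster
of its own, which is not what \eX does in general.
.sp
.nf
```python
# Rule 3: the Hangul types that may follow each type without a break.
HANGUL_FOLLOWERS = {
    "L":   {"L", "V", "LV", "LVT"},
    "LV":  {"V", "T"},
    "V":   {"V", "T"},
    "LVT": {"T"},
    "T":   {"T"},
}

def hangul_keeps_together(prev_type, next_type):
    return next_type in HANGUL_FOLLOWERS.get(prev_type, set())

def is_regional_indicator(ch):
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def split_regional_indicators(text):
    # Rule 7: there is no break after an odd number of RI characters,
    # so consecutive RIs are grouped in pairs.
    clusters, pending = [], ""
    for ch in text:
        if pending and is_regional_indicator(pending) and is_regional_indicator(ch):
            clusters.append(pending + ch)
            pending = ""
        else:
            if pending:
                clusters.append(pending)
            pending = ch
    if pending:
        clusters.append(pending)
    return clusters

RI_F, RI_R, RI_D = chr(0x1F1EB), chr(0x1F1F7), chr(0x1F1E9)
print([len(c) for c in split_regional_indicators("a" + RI_F + RI_R + RI_D)])
```
.fi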
.
.
.\" HTML <a name="extraprops"></a>
.SS PCRE2's additional properties
.rs
.sp
As well as the standard Unicode properties described above, PCRE2 supports four
more that make it possible to convert traditional escape sequences such as \ew
and \es to use Unicode properties. PCRE2 uses these non-standard, non-Perl
properties internally when PCRE2_UCP is set. However, they may also be used
explicitly. These properties are:
.sp
  Xan   Any alphanumeric character
  Xps   Any POSIX space character
  Xsp   Any Perl space character
  Xwd   Any Perl "word" character
.sp
Xan matches characters that have either the L (letter) or the N (number)
property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
carriage return, and any other character that has the Z (separator) property.
Xsp is the same as Xps; in PCRE1 it used to exclude vertical tab, for Perl
compatibility, but Perl changed. Xwd matches the same characters as Xan, plus
underscore.
.P
There is another non-standard property, Xuc, which matches any character that
can be represented by a Universal Character Name in C++ and other programming
languages. These are the characters $, @, ` (grave accent), and all characters
with Unicode code points greater than or equal to U+00A0, except for the
surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are
excluded. (Universal Character Names are of the form \euHHHH or \eUHHHHHHHH
where H is a hexadecimal digit. Note that the Xuc property does not match these
sequences but the characters that they represent.)
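.P
The definitions of these additional properties are simple enough to restate as
predicates. The following Python sketch does so with the standard
\fBunicodedata\fP module; the function names merely echo the property names and
are not PCRE2 functions, and Python's Unicode tables may differ in version from
PCRE2's.
.sp
.nf
```python
import unicodedata

def xan(ch):
    # Letter (L*) or number (N*).
    return unicodedata.category(ch)[0] in "LN"

def xwd(ch):
    # Xan plus underscore.
    return ch == "_" or xan(ch)

def xps(ch):
    # Tab, LF, VT, FF, CR (U+0009 to U+000D), or any separator (Z*).
    return 0x09 <= ord(ch) <= 0x0D or unicodedata.category(ch)[0] == "Z"

xsp = xps   # Xsp is described as the same as Xps

def xuc(ch):
    # Dollar, at sign, grave accent, or U+00A0 and above except surrogates.
    cp = ord(ch)
    return ch in "$@`" or (cp >= 0xA0 and not 0xD800 <= cp <= 0xDFFF)

print(xan("7"), xwd("_"), xps(" "), xuc("$"), xuc("A"))
```
.fi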
.
.
.\" HTML <a name="resetmatchstart"></a>
.SS "Resetting the match start"
.rs
.sp
In normal use, the escape sequence \eK causes any previously matched characters
not to be included in the final matched sequence that is returned. For example,
the pattern:
.sp
  foo\eKbar
.sp
matches "foobar", but reports that it has matched "bar". \eK does not interact
with anchoring in any way. The pattern:
.sp
  ^foo\eKbar
.sp
matches only when the subject begins with "foobar" (in single line mode),
though it again reports the matched string as "bar". This feature is similar to
a lookbehind assertion
.\" HTML <a href="#lookbehind">
.\" </a>
(described below).
.\"
However, in this case, the part of the subject before the real match does not
have to be of fixed length, as lookbehind assertions do. The use of \eK does
not interfere with the setting of
.\" HTML <a href="#group">
.\" </a>
captured substrings.
.\"
For example, when the pattern
.sp
  (foo)\eKbar
.sp
matches "foobar", the first substring is still set to "foo".
.P
From version 5.32.0 Perl forbids the use of \eK in lookaround assertions. From
release 10.38 PCRE2 also forbids this by default. However, the
PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option can be used when calling
\fBpcre2_compile()\fP to re-enable the previous behaviour. When this option is
set, \eK is acted upon when it occurs inside positive assertions, but is
ignored in negative assertions. Note that when a pattern such as (?=ab\eK)
matches, the reported start of the match can be greater than the end of the
match. Using \eK in a lookbehind assertion at the start of a pattern can also
lead to odd effects. For example, consider this pattern:
.sp
  (?<=\eKfoo)bar
.sp
If the subject is "foobar", a call to \fBpcre2_match()\fP with a starting
offset of 3 succeeds and reports the matching string as "foobar", that is, the
start of the reported match is earlier than where the match started.
.
.
.\" HTML <a name="smallassertions"></a>
.SS "Simple assertions"
.rs
.sp
The final use of backslash is for certain simple assertions. An assertion
specifies a condition that has to be met at a particular point in a match,
without consuming any characters from the subject string. The use of
groups for more complicated assertions is described
.\" HTML <a href="#bigassertions">
.\" </a>
below.
.\"
The backslashed assertions are:
.sp
  \eb     matches at a word boundary
  \eB     matches when not at a word boundary
  \eA     matches at the start of the subject
  \eZ     matches at the end of the subject
          also matches before a newline at the end of the subject
  \ez     matches only at the end of the subject
  \eG     matches at the first matching position in the subject
.sp
Inside a character class, \eb has a different meaning; it matches the backspace
character. If any other of these assertions appears in a character class, an
"invalid escape sequence" error is generated.
.P
A word boundary is a position in the subject string where the current character
and the previous character do not both match \ew or \eW (i.e. one matches
\ew and the other matches \eW), or the start or end of the string if the
first or last character matches \ew, respectively. When PCRE2 is built with
Unicode support, the meanings of \ew and \eW can be changed by setting the
PCRE2_UCP option. When this is done, it also affects \eb and \eB. Neither PCRE2
nor Perl has a separate "start of word" or "end of word" metasequence. However,
whatever follows \eb normally determines which it is. For example, the fragment
\eba matches "a" at the start of a word.
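.P
The word boundary definition above can be restated as a small predicate. The
Python sketch below assumes the default, non-UCP meaning of \ew without any
locale tables (ASCII letters, digits, and underscore); the function names are
hypothetical and nothing here calls PCRE2.
.sp
.nf
```python
import string

ASCII_WORD = set(string.ascii_letters + string.digits + "_")

def is_word(ch):
    return ch in ASCII_WORD

def at_word_boundary(subject, pos):
    # True when exactly one of the characters on either side of pos is a
    # word character; positions outside the string count as non-word.
    before = pos > 0 and is_word(subject[pos - 1])
    after = pos < len(subject) and is_word(subject[pos])
    return before != after

subject = "a cat"
print([pos for pos in range(len(subject) + 1) if at_word_boundary(subject, pos)])
```
.fi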
.P
The \eA, \eZ, and \ez assertions differ from the traditional circumflex and
dollar (described in the next section) in that they only ever match at the very
start and end of the subject string, whatever options are set. Thus, they are
independent of multiline mode. These three assertions are not affected by the
PCRE2_NOTBOL or PCRE2_NOTEOL options, which affect only the behaviour of the
circumflex and dollar metacharacters. However, if the \fIstartoffset\fP
argument of \fBpcre2_match()\fP is non-zero, indicating that matching is to
start at a point other than the beginning of the subject, \eA can never match.
The difference between \eZ and \ez is that \eZ matches before a newline at the
end of the string as well as at the very end, whereas \ez matches only at the
end.
.P
The \eG assertion is true only when the current matching position is at the
start point of the matching process, as specified by the \fIstartoffset\fP
argument of \fBpcre2_match()\fP. It differs from \eA when the value of
\fIstartoffset\fP is non-zero. By calling \fBpcre2_match()\fP multiple times
with appropriate arguments, you can mimic Perl's /g option, and it is in this
kind of implementation where \eG can be useful.
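.P
The repeated-call idiom can be illustrated with Python's standard \fBre\fP
module, used here purely as a stand-in for PCRE2. Its \fBPattern.match()\fP
method takes a starting position and succeeds only if a match begins exactly
there, which is comparable to a pattern that starts with \eG being run with a
non-zero \fIstartoffset\fP. The token pattern and the helper name are
hypothetical.
.sp
.nf
```python
import re

TOKEN = re.compile("[a-z]+|[0-9]+")

def leading_tokens(subject):
    # Collect consecutive tokens, each of which must start exactly where
    # the previous one ended; stop at the first position with no token.
    tokens, pos = [], 0
    while pos < len(subject):
        m = TOKEN.match(subject, pos)
        if m is None:
            break
        tokens.append(m.group())
        pos = m.end()
    return tokens

print(leading_tokens("abc123 def"))
```
.fi
.P
Because every token is at least one character long, the position always
advances, so the loop cannot stall on an empty match.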
.P
Note, however, that PCRE2's implementation of \eG, being true at the starting
character of the matching process, is subtly different from Perl's, which
defines it as true at the end of the previous match. In Perl, these can be
different when the previously matched string was empty. Because PCRE2 does just
one match at a time, it cannot reproduce this behaviour.
.P
If all the alternatives of a pattern begin with \eG, the expression is anchored
to the starting match position, and the "anchored" flag is set in the compiled
regular expression.
.
.
.SH "CIRCUMFLEX AND DOLLAR"
.rs
.sp
The circumflex and dollar metacharacters are zero-width assertions. That is,
they test for a particular condition being true without consuming any
characters from the subject string. These two metacharacters are concerned with
matching the starts and ends of lines. If the newline convention is set so that
only the two-character sequence CRLF is recognized as a newline, isolated CR
and LF characters are treated as ordinary data characters, and are not
recognized as newlines.
.P
Outside a character class, in the default matching mode, the circumflex
character is an assertion that is true only if the current matching point is at
the start of the subject string. If the \fIstartoffset\fP argument of
\fBpcre2_match()\fP is non-zero, or if PCRE2_NOTBOL is set, circumflex can
never match if the PCRE2_MULTILINE option is unset. Inside a character class,
circumflex has an entirely different meaning
.\" HTML <a href="#characterclass">
.\" </a>
(see below).
.\"
.P
Circumflex need not be the first character of the pattern if a number of
alternatives are involved, but it should be the first thing in each alternative
in which it appears if the pattern is ever to match that branch. If all
possible alternatives start with a circumflex, that is, if the pattern is
constrained to match only at the start of the subject, it is said to be an
"anchored" pattern. (There are also other constructs that can cause a pattern
to be anchored.)
.P
The dollar character is an assertion that is true only if the current matching
point is at the end of the subject string, or immediately before a newline at
the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however,
that it does not actually match the newline. Dollar need not be the last
character of the pattern if a number of alternatives are involved, but it
should be the last item in any branch in which it appears. Dollar has no
special meaning in a character class.
.P
The meaning of dollar can be changed so that it matches only at the very end of
the string, by setting the PCRE2_DOLLAR_ENDONLY option at compile time. This
does not affect the \eZ assertion.
.P
The meanings of the circumflex and dollar metacharacters are changed if the
PCRE2_MULTILINE option is set. When this is the case, a dollar character
matches before any newlines in the string, as well as at the very end, and a
circumflex matches immediately after internal newlines as well as at the start
of the subject string. It does not match after a newline that ends the string,
for compatibility with Perl. However, this can be changed by setting the
PCRE2_ALT_CIRCUMFLEX option.
.P
For example, the pattern /^abc$/ matches the subject string "def\enabc" (where
\en represents a newline) in multiline mode, but not otherwise. Consequently,
patterns that are anchored in single line mode because all branches start with
^ are not anchored in multiline mode, and a match for circumflex is possible
when the \fIstartoffset\fP argument of \fBpcre2_match()\fP is non-zero. The
PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
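.P
The /^abc$/ example can be tried out with Python's standard \fBre\fP module,
whose \fBre.MULTILINE\fP flag plays a role similar to PCRE2_MULTILINE. This is
only an analogy under that assumption: Python's engine is not PCRE2, it knows
only LF as a line ending, and its handling of some edge cases differs.
.sp
.nf
```python
import re

LF = chr(0x0A)
subject = "def" + LF + "abc"

# Single line mode: circumflex matches only at the very start.
single = re.search("^abc$", subject)

# Multiline mode: circumflex also matches after the internal newline.
multi = re.search("^abc$", subject, re.MULTILINE)

# By default, dollar also matches just before a newline that ends the string.
trailing = re.search("abc$", "abc" + LF)

print(single, multi.span(), trailing.span())
```
.fi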
.P
When the newline convention (see
.\" HTML <a href="#newlines">
.\" </a>
"Newline conventions"
.\"
below) recognizes the two-character sequence CRLF as a newline, this is
preferred, even if the single characters CR and LF are also recognized as
newlines. For example, if the newline convention is "any", a multiline mode
circumflex matches before "xyz" in the string "abc\er\enxyz" rather than after
CR, even though CR on its own is a valid newline. (It also matches at the very
start of the string, of course.)
.P
Note that the sequences \eA, \eZ, and \ez can be used to match the start and
end of the subject in both modes, and if all branches of a pattern start with
\eA it is always anchored, whether or not PCRE2_MULTILINE is set.
.
.
.\" HTML <a name="fullstopdot"></a>
.SH "FULL STOP (PERIOD, DOT) AND \eN"
.rs
.sp
Outside a character class, a dot in the pattern matches any one character in
the subject string except (by default) a character that signifies the end of a
line. One or more characters may be specified as line terminators (see
.\" HTML <a href="#newlines">
.\" </a>
"Newline conventions"
.\"
above).
.P
Dot never matches a single line-ending character. When the two-character
sequence CRLF is the only line ending, dot does not match CR if it is
immediately followed by LF, but otherwise it matches all characters (including
isolated CRs and LFs). When ANYCRLF is selected for line endings, no occurrences
of CR or LF match dot. When all Unicode line endings are being recognized, dot
does not match CR or LF or any of the other line ending characters.
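.P
The interaction between dot and the newline convention can be summarized in a
short Python model. The convention names passed as strings, the \fBdotall\fP
parameter (standing in for PCRE2_DOTALL), and the function itself are all
hypothetical; the sketch restates the rules in this section and does not call
PCRE2.
.sp
.nf
```python
CR, LF = chr(0x0D), chr(0x0A)
ANY_NEWLINES = {LF, chr(0x0B), chr(0x0C), CR, chr(0x85),
                chr(0x2028), chr(0x2029)}

def dot_matches(subject, pos, newline="lf", dotall=False):
    # Would a dot match the character at pos under the given convention?
    if pos >= len(subject):
        return False
    if dotall:
        return True
    ch = subject[pos]
    if newline == "crlf":
        # Only CR immediately followed by LF is refused.
        return not subject.startswith(CR + LF, pos)
    if newline == "anycrlf":
        return ch not in (CR, LF)
    if newline == "any":
        return ch not in ANY_NEWLINES
    if newline == "cr":
        return ch != CR
    return ch != LF          # "lf"

print(dot_matches(CR + LF, 0, "crlf"), dot_matches(CR + "x", 0, "crlf"))
```
.fi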
.P
The behaviour of dot with regard to newlines can be changed. If the
PCRE2_DOTALL option is set, a dot matches any one character, without exception.
If the two-character sequence CRLF is present in the subject string, it takes
two dots to match it.
.P
The handling of dot is entirely independent of the handling of circumflex and
dollar, the only relationship being that they both involve newlines. Dot has no
special meaning in a character class.
.P
The escape sequence \eN when not followed by an opening brace behaves like a
dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
it matches any character except one that signifies the end of a line.
.P
When \eN is followed by an opening brace it has a different meaning. See the
section entitled
.\" HTML <a href="digitsafterbackslash">
.\" </a>
"Non-printing characters"
.\"
above for details. Perl also uses \eN{name} to specify characters by Unicode
name; PCRE2 does not support this.
.
.
.SH "MATCHING A SINGLE CODE UNIT"
.rs
.sp
Outside a character class, the escape sequence \eC matches any one code unit,
whether or not a UTF mode is set. In the 8-bit library, one code unit is one
byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
32-bit unit. Unlike a dot, \eC always matches line-ending characters. The
feature is provided in Perl in order to match individual bytes in UTF-8 mode,
but it is unclear how it can usefully be used.
.P
Because \eC breaks up characters into individual code units, matching one unit
with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
with a malformed UTF character. This has undefined results, because PCRE2
assumes that it is matching character by character in a valid UTF string (by
default it checks the subject string's validity at the start of processing
unless the PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).
.P
An application can lock out the use of \eC by setting the
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
build PCRE2 with the use of \eC permanently disabled.
.P
PCRE2 does not allow \eC to appear in lookbehind assertions
.\" HTML <a href="#lookbehind">
.\" </a>
(described below)
.\"
in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
the length of the lookbehind. Neither the alternative matching function
\fBpcre2_dfa_match()\fP nor the JIT optimizer supports \eC in these UTF modes.
The former gives a match-time error; the latter fails to optimize and so the
match is always run using the interpreter.
.P
In the 32-bit library, however, \eC is always supported (when not explicitly
locked out) because it always matches a single code unit, whether or not UTF-32
is specified.
.P
In general, the \eC escape sequence is best avoided. However, one way of using
it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
lookahead to check the length of the next character, as in this pattern, which
could be used with a UTF-8 string (ignore white space and line breaks):
.sp
  (?| (?=[\ex00-\ex7f])(\eC) |
      (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
      (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
      (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
.sp
In this example, a group that starts with (?| resets the capturing parentheses
numbers in each alternative (see
.\" HTML <a href="#dupgroupnumber">
.\" </a>
"Duplicate Group Numbers"
.\"
below). The assertions at the start of each branch check the next UTF-8
character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
character's individual bytes are then captured by the appropriate number of
\eC groups.
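.P
The code point ranges tested by the four lookaheads correspond to the number of
bytes that UTF-8 uses for a character. The following Python sketch restates
that mapping and checks it against Python's own UTF-8 encoder; the function
name is hypothetical.
.sp
.nf
```python
def utf8_length(ch):
    # Number of UTF-8 bytes for one character, using the same code point
    # ranges as the lookaheads in the pattern above.
    cp = ord(ch)
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4

samples = ["a", chr(0xE9), chr(0x20AC), chr(0x1F600)]
print([utf8_length(ch) for ch in samples])
```
.fi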
.
.
.\" HTML <a name="characterclass"></a>
.SH "SQUARE BRACKETS AND CHARACTER CLASSES"
.rs
.sp
An opening square bracket introduces a character class, terminated by a closing
square bracket. A closing square bracket on its own is not special by default.
If a closing square bracket is required as a member of the class, it should be
the first data character in the class (after an initial circumflex, if present)
or escaped with a backslash. This means that, by default, an empty class cannot
be defined. However, if the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing
square bracket at the start does end the (empty) class.
.P
A character class matches a single character in the subject. A matched
character must be in the set of characters defined by the class, unless the
first character in the class definition is a circumflex, in which case the
subject character must not be in the set defined by the class. If a circumflex
is actually required as a member of the class, ensure it is not the first
character, or escape it with a backslash.
.P
For example, the character class [aeiou] matches any lower case vowel, while
[^aeiou] matches any character that is not a lower case vowel. Note that a
circumflex is just a convenient notation for specifying the characters that
are in the class by enumerating those that are not. A class that starts with a
circumflex is not an assertion; it still consumes a character from the subject
string, and therefore it fails if the current pointer is at the end of the
string.
.P
Characters in a class may be specified by their code points using \eo, \ex, or
\eN{U+hh..} in the usual way. When caseless matching is set, any letters in a
class represent both their upper case and lower case versions, so for example,
a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. Note that there are two ASCII
characters, K and S, that, in addition to their lower case ASCII equivalents,
are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F (long S)
respectively when either PCRE2_UTF or PCRE2_UCP is set.
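.P
The [aeiou] examples behave the same way in Python's standard \fBre\fP module,
which is used below only as a convenient stand-in; \fBre.IGNORECASE\fP is
assumed to play the role of PCRE2's caseless option, and Python's engine is not
PCRE2.
.sp
.nf
```python
import re

caseful = re.fullmatch("[aeiou]", "A")
caseless = re.fullmatch("[aeiou]", "A", re.IGNORECASE)

negated_caseful = re.fullmatch("[^aeiou]", "A")
negated_caseless = re.fullmatch("[^aeiou]", "A", re.IGNORECASE)

# A negated class is not an assertion: it needs a character to consume,
# so it cannot match at the end of the string.
at_end = re.search("x[^a]", "x")

print(bool(caseful), bool(caseless))
print(bool(negated_caseful), bool(negated_caseless), bool(at_end))
```
.fi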
1373.P
1374Characters that might indicate line breaks are never treated in any special way
1375when matching character classes, whatever line-ending sequence is in use, and
1376whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
1377class such as [^a] always matches one of these characters.
1378.P
1379The generic character type escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es,
1380\eS, \ev, \eV, \ew, and \eW may appear in a character class, and add the
characters that they match to the class. For example, [\edABCDEF] matches any
hexadecimal digit that uses upper case letters. In UTF modes, the PCRE2_UCP option affects the meanings of
1383\ed, \es, \ew and their upper case partners, just as it does when they appear
1384outside a character class, as described in the section entitled
1385.\" HTML <a href="#genericchartypes">
1386.\" </a>
1387"Generic character types"
1388.\"
1389above. The escape sequence \eb has a different meaning inside a character
1390class; it matches the backspace character. The sequences \eB, \eR, and \eX are
1391not special inside a character class. Like any other unrecognized escape
1392sequences, they cause an error. The same is true for \eN when not followed by
1393an opening brace.
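.P
A minimal sketch of a type escape inside a class, and of the backspace meaning
of the b escape inside a class, is given below. It uses Python's re module,
which is assumed to behave like PCRE2 for these two cases; the ASCII flag is
used so that only ASCII digits count.

```python
import re

# A digit escape plus literal letters inside one class.
hex_class = re.compile(r"[\dABCDEF]", re.ASCII)
matched = [c for c in "09AFGaf" if hex_class.fullmatch(c)]

# Inside a class, \b means the backspace character (code 8).
backspace = re.fullmatch(r"[\b]", "\x08")

print(matched)
print(backspace is not None)
```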
1394.P
1395The minus (hyphen) character can be used to specify a range of characters in a
1396character class. For example, [d-m] matches any letter between d and m,
1397inclusive. If a minus character is required in a class, it must be escaped with
1398a backslash or appear in a position where it cannot be interpreted as
1399indicating a range, typically as the first or last character in the class,
1400or immediately after a range. For example, [b-d-z] matches letters in the range
1401b to d, a hyphen character, or z.
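.P
The [b-d-z] example can be checked with a short sketch. Python's re module is
used here in place of PCRE2, on the assumption that both parse a hyphen that
immediately follows a range as a literal.

```python
import re

# Range b-d, then a literal hyphen, then a literal z.
cls = re.compile(r"[b-d-z]")
members = [c for c in "abcd-eyz" if cls.fullmatch(c)]

print(members)
```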
1402.P
1403Perl treats a hyphen as a literal if it appears before or after a POSIX class
(see below) or before or after a character type escape such as \ed or \eH.
1405However, unless the hyphen is the last character in the class, Perl outputs a
1406warning in its warning mode, as this is most likely a user error. As PCRE2 has
1407no facility for warning, an error is given in these cases.
1408.P
1409It is not possible to have the literal character "]" as the end character of a
1410range. A pattern such as [W-]46] is interpreted as a class of two characters
1411("W" and "-") followed by a literal string "46]", so it would match "W46]" or
1412"-46]". However, if the "]" is escaped with a backslash it is interpreted as
1413the end of range, so [W-\e]46] is interpreted as a class containing a range
1414followed by two other characters. The octal or hexadecimal representation of
1415"]" can also be used to end a range.
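.P
The difference between the unescaped and escaped forms can be sketched as
follows, again with Python's re module standing in for PCRE2 (an assumption).
The subject strings are invented for the example.

```python
import re

# Unescaped: the class is just "W" and "-", followed by the literal "46]".
unescaped = re.compile(r"[W-]46]")
whole = [s for s in ("W46]", "-46]", "X46]") if unescaped.fullmatch(s)]

# Escaped: a range from "W" to "]", plus the characters "4" and "6".
escaped = re.compile(r"[W-\]46]")
single = [c for c in "WXYZ[]46-5" if escaped.fullmatch(c)]

print(whole)
print(single)
```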
1416.P
1417Ranges normally include all code points between the start and end characters,
1418inclusive. They can also be used for code points specified numerically, for
1419example [\e000-\e037]. Ranges can include any characters that are valid for the
1420current mode. In any UTF mode, the so-called "surrogate" characters (those
1421whose code points lie between 0xd800 and 0xdfff inclusive) may not be specified
1422explicitly by default (the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables
1423this check). However, ranges such as [\ex{d7ff}-\ex{e000}], which include the
1424surrogates, are always permitted.
1425.P
1426There is a special case in EBCDIC environments for ranges whose end points are
1427both specified as literal letters in the same case. For compatibility with
1428Perl, EBCDIC code points within the range that are not letters are omitted. For
1429example, [h-k] matches only four characters, even though the codes for h and k
1430are 0x88 and 0x92, a range of 11 code points. However, if the range is
1431specified numerically, for example, [\ex88-\ex92] or [h-\ex92], all code points
1432are included.
1433.P
1434If a range that includes letters is used when caseless matching is set, it
1435matches the letters in either case. For example, [W-c] is equivalent to
1436[][\e\e^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
1437tables for a French locale are in use, [\exc8-\excb] matches accented E
1438characters in both cases.
1439.P
1440A circumflex can conveniently be used with the upper case character types to
1441specify a more restricted set of characters than the matching lower case type.
1442For example, the class [^\eW_] matches any letter or digit, but not underscore,
1443whereas [\ew] includes underscore. A positive character class should be read as
1444"something OR something OR ..." and a negative class as "NOT something AND NOT
1445something AND NOT ...".
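.P
A brief sketch of the negated upper case type follows. It relies on Python's
re module with the ASCII flag, assumed here to match PCRE2's default (non-UCP)
definition of word characters.

```python
import re

sample = "aZ5_-"

# NOT non-word AND NOT underscore: letters and digits only.
alnum_only = [c for c in sample if re.fullmatch(r"[^\W_]", c, re.ASCII)]

# The plain word type includes the underscore.
word = [c for c in sample if re.fullmatch(r"[\w]", c, re.ASCII)]

print(alnum_only)
print(word)
```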
1446.P
1447The only metacharacters that are recognized in character classes are backslash,
1448hyphen (only where it can be interpreted as specifying a range), circumflex
1449(only at the start), opening square bracket (only when it can be interpreted as
1450introducing a POSIX class name, or for a special compatibility feature - see
1451the next two sections), and the terminating closing square bracket. However,
1452escaping other non-alphanumeric characters does no harm.
1453.
1454.
1455.SH "POSIX CHARACTER CLASSES"
1456.rs
1457.sp
1458Perl supports the POSIX notation for character classes. This uses names
1459enclosed by [: and :] within the enclosing square brackets. PCRE2 also supports
1460this notation. For example,
1461.sp
1462 [01[:alpha:]%]
1463.sp
1464matches "0", "1", any alphabetic character, or "%". The supported class names
1465are:
1466.sp
  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  blank    space or tab only
  cntrl    control characters
  digit    decimal digits (same as \ed)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits and space
  space    white space (the same as \es from PCRE1 8.34)
  upper    upper case letters
  word     "word" characters (same as \ew)
  xdigit   hexadecimal digits
1481.sp
1482The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
1483and space (32). If locale-specific matching is taking place, the list of space
1484characters may be different; there may be fewer or more of them. "Space" and
1485\es match the same set of characters.
1486.P
1487The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
5.8. Another Perl extension is negation, which is indicated by a ^ character
1489after the colon. For example,
1490.sp
1491 [12[:^digit:]]
1492.sp
1493matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
1494syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
1495supported, and an error is given if they are encountered.
1496.P
1497By default, characters with values greater than 127 do not match any of the
1498POSIX character classes, although this may be different for characters in the
1499range 128-255 when locale-specific matching is happening. However, if the
1500PCRE2_UCP option is passed to \fBpcre2_compile()\fP, some of the classes are
1501changed so that Unicode character properties are used. This is achieved by
1502replacing certain POSIX classes with other sequences, as follows:
1503.sp
1504 [:alnum:] becomes \ep{Xan}
1505 [:alpha:] becomes \ep{L}
1506 [:blank:] becomes \eh
1507 [:cntrl:] becomes \ep{Cc}
1508 [:digit:] becomes \ep{Nd}
1509 [:lower:] becomes \ep{Ll}
1510 [:space:] becomes \ep{Xps}
1511 [:upper:] becomes \ep{Lu}
1512 [:word:] becomes \ep{Xwd}
1513.sp
Negated versions, such as [:^alpha:], use \eP instead of \ep. Three other POSIX
1515classes are handled specially in UCP mode:
1516.TP 10
1517[:graph:]
1518This matches characters that have glyphs that mark the page when printed. In
1519Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf
1520properties, except for:
1521.sp
1522 U+061C Arabic Letter Mark
1523 U+180E Mongolian Vowel Separator
1524 U+2066 - U+2069 Various "isolate"s
1525.sp
1526.TP 10
1527[:print:]
1528This matches the same characters as [:graph:] plus space characters that are
1529not controls, that is, characters with the Zs property.
1530.TP 10
1531[:punct:]
1532This matches all characters that have the Unicode P (punctuation) property,
1533plus those characters with code points less than 256 that have the S (Symbol)
1534property.
1535.P
1536The other POSIX classes are unchanged, and match only characters with code
1537points less than 256.
1538.
1539.
1540.SH "COMPATIBILITY FEATURE FOR WORD BOUNDARIES"
1541.rs
1542.sp
1543In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly
1544syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of
1545word". PCRE2 treats these items as follows:
1546.sp
1547 [[:<:]] is converted to \eb(?=\ew)
1548 [[:>:]] is converted to \eb(?<=\ew)
1549.sp
1550Only these exact character sequences are recognized. A sequence such as
[a[:<:]b] provokes an error for an unrecognized POSIX class name. This support is
1552not compatible with Perl. It is provided to help migrations from other
1553environments, and is best not used in any new patterns. Note that \eb matches
1554at the start and the end of a word (see
1555.\" HTML <a href="#smallassertions">
1556.\" </a>
1557"Simple assertions"
1558.\"
1559above), and in a Perl-style pattern the preceding or following character
1560normally shows which is wanted, without the need for the assertions that are
1561used above in order to give exactly the POSIX behaviour.
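.P
The two replacement expressions can be tried directly. The sketch below uses
Python's re module, which does not recognize the BSD syntax itself but does
support the word boundary and lookaround assertions that the items are
converted to; treating its results as representative of PCRE2 is an assumption.

```python
import re

subject = "hi there"

# Start of word: a boundary that is followed by a word character.
starts = [m.start() for m in re.finditer(r"\b(?=\w)", subject)]

# End of word: a boundary that is preceded by a word character.
ends = [m.start() for m in re.finditer(r"\b(?<=\w)", subject)]

print(starts)
print(ends)
```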
1562.
1563.
1564.SH "VERTICAL BAR"
1565.rs
1566.sp
1567Vertical bar characters are used to separate alternative patterns. For example,
1568the pattern
1569.sp
1570 gilbert|sullivan
1571.sp
1572matches either "gilbert" or "sullivan". Any number of alternatives may appear,
1573and an empty alternative is permitted (matching the empty string). The matching
1574process tries each alternative in turn, from left to right, and the first one
1575that succeeds is used. If the alternatives are within a group
1576.\" HTML <a href="#group">
1577.\" </a>
1578(defined below),
1579.\"
1580"succeeds" means matching the rest of the main pattern as well as the
1581alternative in the group.
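.P
The left-to-right rule, and the meaning of "succeeds" inside a group, can be
sketched as below. Python's re module is used as a stand-in for PCRE2 (an
assumption); the patterns other than the cat example are invented.

```python
import re

# At top level the first alternative that matches is used.
first = re.match(r"gil|gilbert", "gilbert").group()

# Inside a group, an alternative only "succeeds" if the rest matches too,
# so the first branch is abandoned when the end anchor fails after "gil".
in_group = re.match(r"(gil|gilbert)$", "gilbert").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"cat(aract|erpillar|)", "cat")

print(first, in_group, empty_alt is not None)
```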
1582.
1583.
1584.\" HTML <a name="internaloptions"></a>
1585.SH "INTERNAL OPTION SETTING"
1586.rs
1587.sp
1588The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
1589PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be
1590changed from within the pattern by a sequence of letters enclosed between "(?"
1591and ")". These options are Perl-compatible, and are described in detail in the
1592.\" HREF
1593\fBpcre2api\fP
1594.\"
1595documentation. The option letters are:
1596.sp
1597 i for PCRE2_CASELESS
1598 m for PCRE2_MULTILINE
1599 n for PCRE2_NO_AUTO_CAPTURE
1600 s for PCRE2_DOTALL
1601 x for PCRE2_EXTENDED
1602 xx for PCRE2_EXTENDED_MORE
1603.sp
1604For example, (?im) sets caseless, multiline matching. It is also possible to
1605unset these options by preceding the relevant letters with a hyphen, for
1606example (?-im). The two "extended" options are not independent; unsetting either
1607one cancels the effects of both of them.
1608.P
1609A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
1610and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
1611permitted. Only one hyphen may appear in the options string. If a letter
1612appears both before and after the hyphen, the option is unset. An empty options
1613setting "(?)" is allowed. Needless to say, it has no effect.
1614.P
1615If the first character following (? is a circumflex, it causes all of the above
1616options to be unset. Thus, (?^) is equivalent to (?-imnsx). Letters may follow
1617the circumflex to cause some options to be re-instated, but a hyphen may not
1618appear.
1619.P
1620The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in
1621the same way as the Perl-compatible options by using the characters J and U
1622respectively. However, these are not unset by (?^).
1623.P
1624When one of these option changes occurs at top level (that is, not inside
1625group parentheses), the change applies to the remainder of the pattern
1626that follows. An option change within a group (see below for a description
1627of groups) affects only that part of the group that follows it, so
1628.sp
1629 (a(?i)b)c
1630.sp
1631matches abc and aBc and no other strings (assuming PCRE2_CASELESS is not used).
1632By this means, options can be made to have different settings in different
1633parts of the pattern. Any changes made in one alternative do carry on
1634into subsequent branches within the same group. For example,
1635.sp
1636 (a(?i)b|c)
1637.sp
1638matches "ab", "aB", "c", and "C", even though when matching "C" the first
1639branch is abandoned before the option setting. This is because the effects of
1640option settings happen at compile time. There would be some very weird
1641behaviour otherwise.
1642.P
1643As a convenient shorthand, if any option settings are required at the start of
1644a non-capturing group (see the next section), the option letters may
1645appear between the "?" and the ":". Thus the two patterns
1646.sp
1647 (?i:saturday|sunday)
1648 (?:(?i)saturday|sunday)
1649.sp
1650match exactly the same set of strings.
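.P
A hedged sketch of the scoped option form follows. It uses Python's re module,
which accepts the (?i:...) shorthand but, in recent versions, rejects an
unscoped (?i) in the middle of a pattern; for that reason only the scoped form
is shown, and the second pattern is a scoped rewrite of the (a(?i)b)c example
rather than the pattern itself.

```python
import re

# Option letters between "?" and ":" apply to the whole group.
days = re.compile(r"(?i:saturday|sunday)")
day_hits = [s for s in ("Saturday", "SUNDAY", "monday") if days.fullmatch(s)]

# Caseless matching limited to the "b" only.
scoped = re.compile(r"(a(?i:b))c")
scoped_hits = [s for s in ("abc", "aBc", "Abc", "abC") if scoped.fullmatch(s)]

print(day_hits)
print(scoped_hits)
```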
1651.P
1652\fBNote:\fP There are other PCRE2-specific options, applying to the whole
1653pattern, which can be set by the application when the compiling function is
1654called. In addition, the pattern can contain special leading sequences such as
1655(*CRLF) to override what the application has set or what has been defaulted.
1656Details are given in the section entitled
1657.\" HTML <a href="#newlineseq">
1658.\" </a>
1659"Newline sequences"
1660.\"
1661above. There are also the (*UTF) and (*UCP) leading sequences that can be used
1662to set UTF and Unicode property modes; they are equivalent to setting the
1663PCRE2_UTF and PCRE2_UCP options, respectively. However, the application can set
1664the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the use of the
1665(*UTF) and (*UCP) sequences.
1666.
1667.
1668.\" HTML <a name="group"></a>
1669.SH GROUPS
1670.rs
1671.sp
1672Groups are delimited by parentheses (round brackets), which can be nested.
1673Turning part of a pattern into a group does two things:
1674.sp
1. It localizes a set of alternatives. For example, the pattern
1676.sp
1677 cat(aract|erpillar|)
1678.sp
1679matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
1680match "cataract", "erpillar" or an empty string.
1681.sp
2. It creates a "capture group". This means that, when the whole pattern
1683matches, the portion of the subject string that matched the group is passed
1684back to the caller, separately from the portion that matched the whole pattern.
1685(This applies only to the traditional matching function; the DFA matching
1686function does not support capturing.)
1687.P
1688Opening parentheses are counted from left to right (starting from 1) to obtain
1689numbers for capture groups. For example, if the string "the red king" is
1690matched against the pattern
1691.sp
1692 the ((red|white) (king|queen))
1693.sp
1694the captured substrings are "red king", "red", and "king", and are numbered 1,
2, and 3, respectively.
1696.P
1697The fact that plain parentheses fulfil two functions is not always helpful.
1698There are often times when grouping is required without capturing. If an
1699opening parenthesis is followed by a question mark and a colon, the group
1700does not do any capturing, and is not counted when computing the number of any
1701subsequent capture groups. For example, if the string "the white queen"
1702is matched against the pattern
1703.sp
1704 the ((?:red|white) (king|queen))
1705.sp
1706the captured substrings are "white queen" and "queen", and are numbered 1 and
2. The maximum number of capture groups is 65535.
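.P
The numbering of capturing and non-capturing groups described above can be
sketched as follows. Python's re module is used in place of PCRE2, on the
assumption that both count opening parentheses in the same way.

```python
import re

# Three capture groups, numbered by their opening parentheses.
m1 = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# The (?: group is not counted, so only two groups remain.
m2 = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")

print(m1.groups())
print(m2.groups())
```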
1708.P
1709As a convenient shorthand, if any option settings are required at the start of
1710a non-capturing group, the option letters may appear between the "?" and the
1711":". Thus the two patterns
1712.sp
1713 (?i:saturday|sunday)
1714 (?:(?i)saturday|sunday)
1715.sp
1716match exactly the same set of strings. Because alternative branches are tried
1717from left to right, and options are not reset until the end of the group is
1718reached, an option setting in one branch does affect subsequent branches, so
1719the above patterns match "SUNDAY" as well as "Saturday".
1720.
1721.
1722.\" HTML <a name="dupgroupnumber"></a>
1723.SH "DUPLICATE GROUP NUMBERS"
1724.rs
1725.sp
1726Perl 5.10 introduced a feature whereby each alternative in a group uses the
1727same numbers for its capturing parentheses. Such a group starts with (?| and is
1728itself a non-capturing group. For example, consider this pattern:
1729.sp
1730 (?|(Sat)ur|(Sun))day
1731.sp
1732Because the two alternatives are inside a (?| group, both sets of capturing
1733parentheses are numbered one. Thus, when the pattern matches, you can look
1734at captured substring number one, whichever alternative matched. This construct
1735is useful when you want to capture part, but not all, of one of a number of
1736alternatives. Inside a (?| group, parentheses are numbered as usual, but the
1737number is reset at the start of each branch. The numbers of any capturing
1738parentheses that follow the whole group start after the highest number used in
1739any branch. The following example is taken from the Perl documentation. The
1740numbers underneath show in which buffer the captured content will be stored.
1741.sp
  # before  ---------------branch-reset----------- after
  / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
  # 1            2         2  3        2     3     4
1745.sp
1746A backreference to a capture group uses the most recent value that is set for
1747the group. The following pattern matches "abcabc" or "defdef":
1748.sp
1749 /(?|(abc)|(def))\e1/
1750.sp
1751In contrast, a subroutine call to a capture group always refers to the
1752first one in the pattern with the given number. The following pattern matches
1753"abcabc" or "defabc":
1754.sp
1755 /(?|(abc)|(def))(?1)/
1756.sp
1757A relative reference such as (?-1) is no different: it is just a convenient way
1758of computing an absolute group number.
1759.P
1760If a
1761.\" HTML <a href="#conditions">
1762.\" </a>
1763condition test
1764.\"
1765for a group's having matched refers to a non-unique number, the test is
1766true if any group with that number has matched.
1767.P
1768An alternative approach to using this "branch reset" feature is to use
1769duplicate named groups, as described in the next section.
1770.
1771.
1772.SH "NAMED CAPTURE GROUPS"
1773.rs
1774.sp
1775Identifying capture groups by number is simple, but it can be very hard to keep
1776track of the numbers in complicated patterns. Furthermore, if an expression is
1777modified, the numbers may change. To help with this difficulty, PCRE2 supports
1778the naming of capture groups. This feature was not added to Perl until release
5.10. Python had the feature earlier, and PCRE1 introduced it at release 4.0,
1780using the Python syntax. PCRE2 supports both the Perl and the Python syntax.
1781.P
1782In PCRE2, a capture group can be named in one of three ways: (?<name>...) or
1783(?'name'...) as in Perl, or (?P<name>...) as in Python. Names may be up to 32
1784code units long. When PCRE2_UTF is not set, they may contain only ASCII
1785alphanumeric characters and underscores, but must start with a non-digit. When
1786PCRE2_UTF is set, the syntax of group names is extended to allow any Unicode
1787letter or Unicode decimal digit. In other words, group names must match one of
1788these patterns:
1789.sp
1790 ^[_A-Za-z][_A-Za-z0-9]*\ez when PCRE2_UTF is not set
1791 ^[_\ep{L}][_\ep{L}\ep{Nd}]*\ez when PCRE2_UTF is set
1792.sp
1793References to capture groups from other parts of the pattern, such as
1794.\" HTML <a href="#backreferences">
1795.\" </a>
1796backreferences,
1797.\"
1798.\" HTML <a href="#recursion">
1799.\" </a>
1800recursion,
1801.\"
1802and
1803.\" HTML <a href="#conditions">
1804.\" </a>
1805conditions,
1806.\"
1807can all be made by name as well as by number.
1808.P
1809Named capture groups are allocated numbers as well as names, exactly as
1810if the names were not present. In both PCRE2 and Perl, capture groups
1811are primarily identified by numbers; any names are just aliases for these
1812numbers. The PCRE2 API provides function calls for extracting the complete
1813name-to-number translation table from a compiled pattern, as well as
1814convenience functions for extracting captured substrings by name.
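.P
The idea that a name is an alias for a group number can be sketched with the
Python syntax, which PCRE2 also accepts. The code below runs under Python's re
module, not PCRE2; its groupindex attribute is Python's own name-to-number
table and is only an analogue of what the PCRE2 API provides. The group names
and subject are invented.

```python
import re

m = re.fullmatch(r"(?P<year>[0-9]{4})-(?P<month>[0-9]{2})", "2022-01")

# The same substring is reachable by name or by number.
by_name = m.group("year")
by_number = m.group(1)

# Name-to-number table for the compiled pattern.
table = dict(m.re.groupindex)

print(by_name, by_number, table)
```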
1815.P
1816\fBWarning:\fP When more than one capture group has the same number, as
1817described in the previous section, a name given to one of them applies to all
1818of them. Perl allows identically numbered groups to have different names.
1819Consider this pattern, where there are two capture groups, both numbered 1:
1820.sp
1821 (?|(?<AA>aa)|(?<BB>bb))
1822.sp
1823Perl allows this, with both names AA and BB as aliases of group 1. Thus, after
1824a successful match, both names yield the same value (either "aa" or "bb").
1825.P
1826In an attempt to reduce confusion, PCRE2 does not allow the same group number
1827to be associated with more than one name. The example above provokes a
1828compile-time error. However, there is still scope for confusion. Consider this
1829pattern:
1830.sp
1831 (?|(?<AA>aa)|(bb))
1832.sp
1833Although the second group number 1 is not explicitly named, the name AA is
1834still an alias for any group 1. Whether the pattern matches "aa" or "bb", a
1835reference by name to group AA yields the matched string.
1836.P
1837By default, a name must be unique within a pattern, except that duplicate names
1838are permitted for groups with the same number, for example:
1839.sp
1840 (?|(?<AA>aa)|(?<AA>bb))
1841.sp
1842The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES
1843option at compile time, or by the use of (?J) within the pattern, as described
1844in the section entitled
1845.\" HTML <a href="#internaloptions">
1846.\" </a>
1847"Internal Option Setting"
1848.\"
1849above.
1850.P
1851Duplicate names can be useful for patterns where only one instance of the named
1852capture group can match. Suppose you want to match the name of a weekday,
1853either as a 3-letter abbreviation or as the full name, and in both cases you
1854want to extract the abbreviation. This pattern (ignoring the line breaks) does
1855the job:
1856.sp
1857 (?J)
1858 (?<DN>Mon|Fri|Sun)(?:day)?|
1859 (?<DN>Tue)(?:sday)?|
1860 (?<DN>Wed)(?:nesday)?|
1861 (?<DN>Thu)(?:rsday)?|
1862 (?<DN>Sat)(?:urday)?
1863.sp
1864There are five capture groups, but only one is ever set after a match. The
convenience functions for extracting the data by name return the substring for
1866the first (and in this example, the only) group of that name that matched. This
1867saves searching to find which numbered group it was. (An alternative way of
1868solving this problem is to use a "branch reset" group, as described in the
1869previous section.)
1870.P
1871If you make a backreference to a non-unique named group from elsewhere in the
1872pattern, the groups to which the name refers are checked in the order in which
1873they appear in the overall pattern. The first one that is set is used for the
1874reference. For example, this pattern matches both "foofoo" and "barbar" but not
1875"foobar" or "barfoo":
1876.sp
1877 (?J)(?:(?<n>foo)|(?<n>bar))\ek<n>
1878.sp
1879.P
1880If you make a subroutine call to a non-unique named group, the one that
1881corresponds to the first occurrence of the name is used. In the absence of
1882duplicate numbers this is the one with the lowest number.
1883.P
1884If you use a named reference in a condition
1885test (see the
1886.\"
1887.\" HTML <a href="#conditions">
1888.\" </a>
1889section about conditions
1890.\"
1891below), either to check whether a capture group has matched, or to check for
1892recursion, all groups with the same name are tested. If the condition is true
1893for any one of them, the overall condition is true. This is the same behaviour
1894as testing by number. For further details of the interfaces for handling named
1895capture groups, see the
1896.\" HREF
1897\fBpcre2api\fP
1898.\"
1899documentation.
1900.
1901.
1902.SH REPETITION
1903.rs
1904.sp
1905Repetition is specified by quantifiers, which can follow any of the following
1906items:
1907.sp
1908 a literal data character
1909 the dot metacharacter
1910 the \eC escape sequence
1911 the \eR escape sequence
1912 the \eX escape sequence
1913 an escape such as \ed or \epL that matches a single character
1914 a character class
1915 a backreference
1916 a parenthesized group (including lookaround assertions)
1917 a subroutine call (recursive or otherwise)
1918.sp
1919The general repetition quantifier specifies a minimum and maximum number of
1920permitted matches, by giving the two numbers in curly brackets (braces),
1921separated by a comma. The numbers must be less than 65536, and the first must
1922be less than or equal to the second. For example,
1923.sp
1924 z{2,4}
1925.sp
1926matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
1927character. If the second number is omitted, but the comma is present, there is
1928no upper limit; if the second number and the comma are both omitted, the
1929quantifier specifies an exact number of required matches. Thus
1930.sp
1931 [aeiou]{3,}
1932.sp
1933matches at least 3 successive vowels, but may match many more, whereas
1934.sp
1935 \ed{8}
1936.sp
1937matches exactly 8 digits. An opening curly bracket that appears in a position
1938where a quantifier is not allowed, or one that does not match the syntax of a
1939quantifier, is taken as a literal character. For example, {,6} is not a
1940quantifier, but a literal string of four characters.
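.P
The three quantifier examples can be exercised with a short sketch. It uses
Python's re module as a stand-in for PCRE2, which is an assumption; the {,6}
case is deliberately left out because Python treats that form as a quantifier
with a minimum of zero. The subject strings are invented.

```python
import re

# Between two and four repetitions.
z_hits = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz")
          if re.fullmatch(r"z{2,4}", s)]

# At least three vowels, with no upper limit.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# Exactly eight digits.
eight = re.fullmatch(r"\d{8}", "20220112")
seven = re.fullmatch(r"\d{8}", "2022011")

print(z_hits, vowel_run, eight is not None, seven is not None)
```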
1941.P
1942In UTF modes, quantifiers apply to characters rather than to individual code
1943units. Thus, for example, \ex{100}{2} matches two characters, each of
1944which is represented by a two-byte sequence in a UTF-8 string. Similarly,
1945\eX{3} matches three Unicode extended grapheme clusters, each of which may be
1946several code units long (and they may be of different lengths).
1947.P
1948The quantifier {0} is permitted, causing the expression to behave as if the
1949previous item and the quantifier were not present. This may be useful for
1950capture groups that are referenced as
1951.\" HTML <a href="#groupsassubroutines">
1952.\" </a>
1953subroutines
1954.\"
1955from elsewhere in the pattern (but see also the section entitled
1956.\" HTML <a href="#subdefine">
1957.\" </a>
1958"Defining capture groups for use by reference only"
1959.\"
1960below). Except for parenthesized groups, items that have a {0} quantifier are
1961omitted from the compiled pattern.
1962.P
1963For convenience, the three most common quantifiers have single-character
1964abbreviations:
1965.sp
1966 * is equivalent to {0,}
1967 + is equivalent to {1,}
1968 ? is equivalent to {0,1}
1969.sp
1970It is possible to construct infinite loops by following a group that can match
1971no characters with a quantifier that has no upper limit, for example:
1972.sp
1973 (a?)*
1974.sp
1975Earlier versions of Perl and PCRE1 used to give an error at compile time for
1976such patterns. However, because there are cases where this can be useful, such
1977patterns are now accepted, but whenever an iteration of such a group matches no
1978characters, matching moves on to the next item in the pattern instead of
1979repeatedly matching an empty string. This does not prevent backtracking into
1980any of the iterations if a subsequent item fails to match.
1981.P
1982By default, quantifiers are "greedy", that is, they match as much as possible
1983(up to the maximum number of permitted times), without causing the rest of the
1984pattern to fail. The classic example of where this gives problems is in trying
1985to match comments in C programs. These appear between /* and */ and within the
1986comment, individual * and / characters may appear. An attempt to match C
1987comments by applying the pattern
1988.sp
1989 /\e*.*\e*/
1990.sp
1991to the string
1992.sp
1993 /* first comment */ not comment /* second comment */
1994.sp
1995fails, because it matches the entire string owing to the greediness of the .*
1996item. However, if a quantifier is followed by a question mark, it ceases to be
1997greedy, and instead matches the minimum number of times possible, so the
1998pattern
1999.sp
2000 /\e*.*?\e*/
2001.sp
2002does the right thing with the C comments. The meaning of the various
2003quantifiers is not otherwise changed, just the preferred number of matches.
2004Do not confuse this use of question mark with its use as a quantifier in its
2005own right. Because it has two uses, it can sometimes appear doubled, as in
2006.sp
2007 \ed??\ed
2008.sp
2009which matches one digit by preference, but can match two if that is the only
2010way the rest of the pattern matches.
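.P
The greedy and minimizing forms of the C comment pattern can be compared with
the sketch below, which uses Python's re module on the assumption that it
agrees with PCRE2 for these quantifiers.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: runs on to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing: stops at the first "*/".
lazy = re.search(r"/\*.*?\*/", subject).group()

# Doubled question mark: one digit by preference, two when required.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy == subject, lazy, one_digit, two_digits)
```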
2011.P
2012If the PCRE2_UNGREEDY option is set (an option that is not available in Perl),
2013the quantifiers are not greedy by default, but individual ones can be made
2014greedy by following them with a question mark. In other words, it inverts the
2015default behaviour.
2016.P
2017When a parenthesized group is quantified with a minimum repeat count that
2018is greater than 1 or with a limited maximum, more memory is required for the
2019compiled pattern, in proportion to the size of the minimum or maximum.
2020.P
2021If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option (equivalent
2022to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is
2023implicitly anchored, because whatever follows will be tried against every
2024character position in the subject string, so there is no point in retrying the
2025overall match at any position after the first. PCRE2 normally treats such a
2026pattern as though it were preceded by \eA.
2027.P
2028In cases where it is known that the subject string contains no newlines, it is
2029worth setting PCRE2_DOTALL in order to obtain this optimization, or
2030alternatively, using ^ to indicate anchoring explicitly.
2031.P
2032However, there are some cases where the optimization cannot be used. When .*
2033is inside capturing parentheses that are the subject of a backreference
2034elsewhere in the pattern, a match at the start may fail where a later one
2035succeeds. Consider, for example:
2036.sp
2037 (.*)abc\e1
2038.sp
2039If the subject is "xyz123abc123" the match point is the fourth character. For
2040this reason, such a pattern is not implicitly anchored.
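.P
The match point in this example can be confirmed with a small sketch. Python's
re module is used rather than PCRE2, so the result is offered only as an
illustration of why the pattern cannot be anchored.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# Offsets are zero-based, so offset 3 is the fourth character.
print(m.start(), m.group(1))
```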
2041.P
2042Another case where implicit anchoring is not applied is when the leading .* is
2043inside an atomic group. Once again, a match at the start may fail where a later
2044one succeeds. Consider this pattern:
2045.sp
2046 (?>.*?a)b
2047.sp
It matches "ab" in the subject "aab". The use of the backtracking control verbs
(*PRUNE) and (*SKIP) also disables this optimization, and there is an option,
2050PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
2051.P
2052When a capture group is repeated, the value captured is the substring that
2053matched the final iteration. For example, after
2054.sp
2055 (tweedle[dume]{3}\es*)+
2056.sp
2057has matched "tweedledum tweedledee" the value of the captured substring is
2058"tweedledee". However, if there are nested capture groups, the corresponding
2059captured values may have been set in previous iterations. For example, after
2060.sp
2061 (a|(b))+
2062.sp
2063matches "aba" the value of the second captured substring is "b".
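.P
Both capture examples can be checked with the sketch below. It runs under
Python's re module, which is assumed to keep values from earlier iterations in
the same way that PCRE2 does.

```python
import re

# The outer group holds the text of its final iteration.
m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m.group(1)

# The nested group keeps the value set in an earlier iteration.
n = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = n.group(1), n.group(2)

print(last_iteration, outer, inner)
```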
2064.
2065.
2066.\" HTML <a name="atomicgroup"></a>
2067.SH "ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS"
2068.rs
2069.sp
2070With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
2071repetition, failure of what follows normally causes the repeated item to be
2072re-evaluated to see if a different number of repeats allows the rest of the
2073pattern to match. Sometimes it is useful to prevent this, either to change the
nature of the match, or to cause it to fail earlier than it otherwise might, when
2075the author of the pattern knows there is no point in carrying on.
2076.P
2077Consider, for example, the pattern \ed+foo when applied to the subject line
2078.sp
2079 123456bar
2080.sp
2081After matching all 6 digits and then failing to match "foo", the normal
2082action of the matcher is to try again with only 5 digits matching the \ed+
2083item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
2084(a term taken from Jeffrey Friedl's book) provides the means for specifying
2085that once a group has matched, it is not to be re-evaluated in this way.
2086.P
2087If we use atomic grouping for the previous example, the matcher gives up
2088immediately on failing to match "foo" the first time. The notation is a kind of
2089special parenthesis, starting with (?> as in this example:
2090.sp
2091 (?>\ed+)foo
2092.sp
2093Perl 5.28 introduced an experimental alphabetic form starting with (* which may
2094be easier to remember:
2095.sp
2096 (*atomic:\ed+)foo
2097.sp
This kind of parenthesized group "locks up" the part of the pattern it contains
2099once it has matched, and a failure further into the pattern is prevented from
2100backtracking into it. Backtracking past it to previous items, however, works as
2101normal.
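.P
The difference between an ordinary repeat and an atomic group can be modelled
without a regex engine. The function below is a hypothetical toy, not PCRE2
code: it handles only a run of digits followed by a literal string, tried at
the start of the subject, and counts how many run lengths are attempted.
.sp
```python
def digits_then_literal(subject, literal, atomic):
    """Toy model: a digit run followed by a literal, tried at offset 0.

    Returns (matched, attempts). This is an illustration of the idea,
    not a description of how PCRE2 is implemented.
    """
    # Length of the leading digit run: what a greedy repeat takes first.
    run = 0
    while run < len(subject) and subject[run].isdigit():
        run += 1
    if run == 0:
        return False, 0

    # A backtracking repeat retries with run, run-1, ..., 1 digits;
    # an atomic group keeps only its first (longest) match.
    lengths = [run] if atomic else range(run, 0, -1)
    attempts = 0
    for length in lengths:
        attempts += 1
        if subject.startswith(literal, length):
            return True, attempts
    return False, attempts


print(digits_then_literal("123456bar", "foo", atomic=False))
print(digits_then_literal("123456bar", "foo", atomic=True))
```
.sp
For "123456bar" both variants fail, but the atomic one gives up after a single
attempt. The model also shows that atomic grouping can change the result: with
the subject "12345" and the literal "5", the backtracking variant succeeds by
giving back one digit, whereas the atomic variant fails.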
.P
2103An alternative description is that a group of this type matches exactly the
2104string of characters that an identical standalone pattern would match, if
2105anchored at the current point in the subject string.
2106.P
2107Atomic groups are not capture groups. Simple cases such as the above example
2108can be thought of as a maximizing repeat that must swallow everything it can.
2109So, while both \ed+ and \ed+? are prepared to adjust the number of digits they
2110match in order to make the rest of the pattern match, (?>\ed+) can only match
2111an entire sequence of digits.
2112.P
Atomic groups in general can of course contain arbitrarily complicated
expressions, and can be nested. However, when the content of an atomic
group is just a single repeated item, as in the example above, a simpler
notation, called a "possessive quantifier", can be used. This consists of an
2117additional + character following a quantifier. Using this notation, the
2118previous example can be rewritten as
2119.sp
2120 \ed++foo
2121.sp
2122Note that a possessive quantifier can be used with an entire group, for
2123example:
2124.sp
2125 (abc|xyz){2,3}+
2126.sp
Possessive quantifiers are always greedy; the setting of the PCRE2_UNGREEDY
option is ignored. They are a convenient notation for the simpler forms of
atomic group. There is no difference in meaning between a possessive
quantifier and the equivalent atomic group, though there may be a performance
difference; possessive quantifiers should be slightly faster.
2132.P
2133The possessive quantifier syntax is an extension to the Perl 5.8 syntax.
2134Jeffrey Friedl originated the idea (and the name) in the first edition of his
2135book. Mike McCloskey liked it, so implemented it when he built Sun's Java
2136package, and PCRE1 copied it from there. It found its way into Perl at release
21375.10.
2138.P
2139PCRE2 has an optimization that automatically "possessifies" certain simple
2140pattern constructs. For example, the sequence A+B is treated as A++B because
2141there is no point in backtracking into a sequence of A's when B must follow.
This feature can be disabled by the PCRE2_NO_AUTO_POSSESS option, or by starting
the pattern with (*NO_AUTO_POSSESS).
2144.P
2145When a pattern contains an unlimited repeat inside a group that can itself be
2146repeated an unlimited number of times, the use of an atomic group is the only
2147way to avoid some failing matches taking a very long time indeed. The pattern
2148.sp
2149 (\eD+|<\ed+>)*[!?]
2150.sp
2151matches an unlimited number of substrings that either consist of non-digits, or
2152digits enclosed in <>, followed by either ! or ?. When it matches, it runs
2153quickly. However, if it is applied to
2154.sp
2155 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2156.sp
2157it takes a long time before reporting failure. This is because the string can
2158be divided between the internal \eD+ repeat and the external * repeat in a
2159large number of ways, and all have to be tried. (The example uses [!?] rather
2160than a single character at the end, because both PCRE2 and Perl have an
2161optimization that allows for fast failure when a single character is used. They
2162remember the last single character that is required for a match, and fail early
2163if it is not present in the string.) If the pattern is changed so that it uses
2164an atomic group, like this:
2165.sp
2166 ((?>\eD+)|<\ed+>)*[!?]
2167.sp
2168sequences of non-digits cannot be broken, and failure happens quickly.
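.P
To get a feel for the "large number of ways", the following sketch counts how
many ways a run of n characters can be cut into consecutive non-empty pieces,
each piece standing for one iteration of the outer repeat. The helper name is
hypothetical, and the count is a simplified model of the search space, not a
measurement of the steps PCRE2 actually takes.
.sp
```python
from functools import lru_cache


@lru_cache(maxsize=None)
def ways_to_split(n):
    """Count the ways to cut n characters into consecutive non-empty pieces."""
    if n == 0:
        return 1  # nothing left to cut: exactly one way
    # Choose the length of the first piece, then split the remainder.
    return sum(ways_to_split(n - first) for first in range(1, n + 1))


for length in (10, 20, 30):
    print(length, ways_to_split(length))
```
.sp
The count doubles with every extra character (it equals 2 to the power n-1),
which is why even a subject of a few dozen characters can make a failing match
very slow when the non-digit runs are allowed to be broken up.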
2169.
2170.
2171.\" HTML <a name="backreferences"></a>
2172.SH "BACKREFERENCES"
2173.rs
2174.sp
2175Outside a character class, a backslash followed by a digit greater than 0 (and
2176possibly further digits) is a backreference to a capture group earlier (that
2177is, to its left) in the pattern, provided there have been that many previous
2178capture groups.
2179.P
2180However, if the decimal number following the backslash is less than 8, it is
2181always taken as a backreference, and causes an error only if there are not that
2182many capture groups in the entire pattern. In other words, the group that is
2183referenced need not be to the left of the reference for numbers less than 8. A
2184"forward backreference" of this type can make sense when a repetition is
2185involved and the group to the right has participated in an earlier iteration.
2186.P
2187It is not possible to have a numerical "forward backreference" to a group whose
2188number is 8 or more using this syntax because a sequence such as \e50 is
2189interpreted as a character defined in octal. See the subsection entitled
2190"Non-printing characters"
2191.\" HTML <a href="#digitsafterbackslash">
2192.\" </a>
2193above
2194.\"
2195for further details of the handling of digits following a backslash. Other
2196forms of backreferencing do not suffer from this restriction. In particular,
2197there is no problem when named capture groups are used (see below).
2198.P
2199Another way of avoiding the ambiguity inherent in the use of digits following a
2200backslash is to use the \eg escape sequence. This escape must be followed by a
signed or unsigned number, optionally enclosed in braces. These examples are
all equivalent:
2203.sp
2204 (ring), \e1
2205 (ring), \eg1
2206 (ring), \eg{1}
2207.sp
2208An unsigned number specifies an absolute reference without the ambiguity that
2209is present in the older syntax. It is also useful when literal digits follow
2210the reference. A signed number is a relative reference. Consider this example:
2211.sp
2212 (abc(def)ghi)\eg{-1}
2213.sp
2214The sequence \eg{-1} is a reference to the most recently started capture group
before \eg, that is, it is equivalent to \e2 in this example. Similarly,
2216\eg{-2} would be equivalent to \e1. The use of relative references can be
2217helpful in long patterns, and also in patterns that are created by joining
2218together fragments that contain references within themselves.
2219.P
2220The sequence \eg{+1} is a reference to the next capture group. This kind of
2221forward reference can be useful in patterns that repeat. Perl does not support
2222the use of + in this way.
2223.P
2224A backreference matches whatever actually most recently matched the capture
2225group in the current subject string, rather than anything at all that matches
2226the group (see
2227.\" HTML <a href="#groupsassubroutines">
2228.\" </a>
2229"Groups as subroutines"
2230.\"
2231below for a way of doing that). So the pattern
2232.sp
2233 (sens|respons)e and \e1ibility
2234.sp
2235matches "sense and sensibility" and "response and responsibility", but not
2236"sense and responsibility". If caseful matching is in force at the time of the
2237backreference, the case of letters is relevant. For example,
2238.sp
2239 ((?i)rah)\es+\e1
2240.sp
2241matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
2242capture group is matched caselessly.
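.P
These two examples can be tried out as follows. The sketch uses Python's re
module as a stand-in for PCRE2 (an assumption). Python does not accept an
option setting in the middle of a pattern, so the caseless group is written
with its scoped form, which has the same effect here: the group is matched
caselessly, but the backreference is caseful.
.sp
```python
import re

# Assumption: Python's re is used for illustration; it is not PCRE2.
pairing = re.compile(r"(sens|respons)e and \1ibility")
for subject in ("sense and sensibility",
                "response and responsibility",
                "sense and responsibility"):
    print(subject, "->", bool(pairing.fullmatch(subject)))

# (?i:...) is Python's scoped spelling of a caseless group.
cheer = re.compile(r"((?i:rah))\s+\1")
for subject in ("rah rah", "RAH RAH", "RAH rah"):
    print(subject, "->", bool(cheer.fullmatch(subject)))
```
.sp
In each loop only the last subject is rejected, because the text after the
white space is not an exact copy of what the group captured.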
2243.P
2244There are several different ways of writing backreferences to named capture
2245groups. The .NET syntax \ek{name} and the Perl syntax \ek<name> or \ek'name'
2246are supported, as is the Python syntax (?P=name). Perl 5.10's unified
2247backreference syntax, in which \eg can be used for both numeric and named
2248references, is also supported. We could rewrite the above example in any of the
2249following ways:
2250.sp
2251 (?<p1>(?i)rah)\es+\ek<p1>
2252 (?'p1'(?i)rah)\es+\ek{p1}
2253 (?P<p1>(?i)rah)\es+(?P=p1)
2254 (?<p1>(?i)rah)\es+\eg{p1}
2255.sp
2256A capture group that is referenced by name may appear in the pattern before or
2257after the reference.
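.P
Of the forms listed above, the Python syntax can be run directly in Python's
own re module, which gives a quick way to experiment with named
backreferences. This sketch assumes that re and PCRE2 treat this simple
pattern alike; the other three spellings are not accepted by re.
.sp
```python
import re

# The Python-style named group and named backreference from the list above,
# without the caseless option for simplicity.
named = re.compile(r"(?P<p1>rah)\s+(?P=p1)")

hit = named.fullmatch("rah rah")
miss = named.fullmatch("rah hurrah")
print(hit.group("p1"), miss)
```
.sp
The captured text is available under the group's name as well as its number.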
2258.P
2259There may be more than one backreference to the same group. If a group has not
2260actually been used in a particular match, backreferences to it always fail by
2261default. For example, the pattern
2262.sp
2263 (a|(bc))\e2
2264.sp
2265always fails if it starts to match "a" rather than "bc". However, if the
2266PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backreference to an
2267unset value matches an empty string.
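.P
The default behaviour can be illustrated with a short sketch. It uses Python's
re module, which also makes a backreference to an unset group fail; re has no
counterpart of PCRE2_MATCH_UNSET_BACKREF, so only the default case is shown,
and the use of re in place of PCRE2 is an assumption.
.sp
```python
import re

# Assumption: Python's re mirrors PCRE2's default handling of unset groups.
unset = re.compile(r"(a|(bc))\2")

with_group_2 = unset.match("bcbc")   # group 2 is set, so the reference works
without_group_2 = unset.match("aa")  # group 2 is unset, so the reference fails
print(with_group_2.group(0), without_group_2)
```
.sp
When the first alternative is taken, nothing the subject contains can satisfy
the backreference.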
2268.P
2269Because there may be many capture groups in a pattern, all digits following a
2270backslash are taken as part of a potential backreference number. If the pattern
2271continues with a digit character, some delimiter must be used to terminate the
2272backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, this
2273can be white space. Otherwise, the \eg{} syntax or an empty comment (see
2274.\" HTML <a href="#comments">
2275.\" </a>
2276"Comments"
2277.\"
2278below) can be used.
2279.
2280.
2281.SS "Recursive backreferences"
2282.rs
2283.sp
2284A backreference that occurs inside the group to which it refers fails when the
2285group is first used, so, for example, (a\e1) never matches. However, such
2286references can be useful inside repeated groups. For example, the pattern
2287.sp
2288 (a|b\e1)+
2289.sp
2290matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
2291the group, the backreference matches the character string corresponding to the
2292previous iteration. In order for this to work, the pattern must be such that
2293the first iteration does not need to match the backreference. This can be done
2294using alternation, as in the example above, or by a quantifier with a minimum
2295of zero.
2296.P
For versions of PCRE2 earlier than 10.25, backreferences of this type used to
2298cause the group that they reference to be treated as an
2299.\" HTML <a href="#atomicgroup">
2300.\" </a>
2301atomic group.
2302.\"
2303This restriction no longer applies, and backtracking into such groups can occur
2304as normal.
2305.
2306.
2307.\" HTML <a name="bigassertions"></a>
2308.SH ASSERTIONS
2309.rs
2310.sp
2311An assertion is a test on the characters following or preceding the current
2312matching point that does not consume any characters. The simple assertions
2313coded as \eb, \eB, \eA, \eG, \eZ, \ez, ^ and $ are described
2314.\" HTML <a href="#smallassertions">
2315.\" </a>
2316above.
2317.\"
2318.P
2319More complicated assertions are coded as parenthesized groups. There are two
2320kinds: those that look ahead of the current position in the subject string, and
2321those that look behind it, and in each case an assertion may be positive (must
2322match for the assertion to be true) or negative (must not match for the
2323assertion to be true). An assertion group is matched in the normal way,
2324and if it is true, matching continues after it, but with the matching position
2325in the subject string reset to what it was before the assertion was processed.
2326.P
2327The Perl-compatible lookaround assertions are atomic. If an assertion is true,
2328but there is a subsequent matching failure, there is no backtracking into the
2329assertion. However, there are some cases where non-atomic assertions can be
2330useful. PCRE2 has some support for these, described in the section entitled
2331.\" HTML <a href="#nonatomicassertions">
2332.\" </a>
2333"Non-atomic assertions"
2334.\"
2335below, but they are not Perl-compatible.
2336.P
2337A lookaround assertion may appear as the condition in a
2338.\" HTML <a href="#conditions">
2339.\" </a>
2340conditional group
2341.\"
2342(see below). In this case, the result of matching the assertion determines
2343which branch of the condition is followed.
2344.P
2345Assertion groups are not capture groups. If an assertion contains capture
2346groups within it, these are counted for the purposes of numbering the capture
2347groups in the whole pattern. Within each branch of an assertion, locally
2348captured substrings may be referenced in the usual way. For example, a sequence
2349such as (.)\eg{-1} can be used to check that two adjacent characters are the
2350same.
2351.P
2352When a branch within an assertion fails to match, any substrings that were
2353captured are discarded (as happens with any pattern branch that fails to
2354match). A negative assertion is true only when all its branches fail to match;
2355this means that no captured substrings are ever retained after a successful
2356negative assertion. When an assertion contains a matching branch, what happens
2357depends on the type of assertion.
2358.P
2359For a positive assertion, internally captured substrings in the successful
2360branch are retained, and matching continues with the next pattern item after
2361the assertion. For a negative assertion, a matching branch means that the
2362assertion is not true. If such an assertion is being used as a condition in a
2363.\" HTML <a href="#conditions">
2364.\" </a>
2365conditional group
2366.\"
2367(see below), captured substrings are retained, because matching continues with
2368the "no" branch of the condition. For other failing negative assertions,
2369control passes to the previous backtracking point, thus discarding any captured
2370strings within the assertion.
2371.P
2372Most assertion groups may be repeated; though it makes no sense to assert the
2373same thing several times, the side effect of capturing in positive assertions
2374may occasionally be useful. However, an assertion that forms the condition for
2375a conditional group may not be quantified. PCRE2 used to restrict the
2376repetition of assertions, but from release 10.35 the only restriction is that
2377an unlimited maximum repetition is changed to be one more than the minimum. For
2378example, {3,} is treated as {3,4}.
2379.
2380.
2381.SS "Alphabetic assertion names"
2382.rs
2383.sp
2384Traditionally, symbolic sequences such as (?= and (?<= have been used to
2385specify lookaround assertions. Perl 5.28 introduced some experimental
2386alphabetic alternatives which might be easier to remember. They all start with
2387(* instead of (? and must be written using lower case letters. PCRE2 supports
2388the following synonyms:
2389.sp
2390 (*positive_lookahead: or (*pla: is the same as (?=
2391 (*negative_lookahead: or (*nla: is the same as (?!
2392 (*positive_lookbehind: or (*plb: is the same as (?<=
2393 (*negative_lookbehind: or (*nlb: is the same as (?<!
2394.sp
2395For example, (*pla:foo) is the same assertion as (?=foo). In the following
2396sections, the various assertions are described using the original symbolic
2397forms.
2398.
2399.
2400.SS "Lookahead assertions"
2401.rs
2402.sp
2403Lookahead assertions start with (?= for positive assertions and (?! for
2404negative assertions. For example,
2405.sp
2406 \ew+(?=;)
2407.sp
2408matches a word followed by a semicolon, but does not include the semicolon in
2409the match, and
2410.sp
2411 foo(?!bar)
2412.sp
2413matches any occurrence of "foo" that is not followed by "bar". Note that the
2414apparently similar pattern
2415.sp
2416 (?!foo)bar
2417.sp
2418does not find an occurrence of "bar" that is preceded by something other than
2419"foo"; it finds any occurrence of "bar" whatsoever, because the assertion
2420(?!foo) is always true when the next three characters are "bar". A
2421lookbehind assertion is needed to achieve the other effect.
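.P
The three lookahead patterns above behave the same way in Python's re module,
so they can be explored with a sketch like this one (the subjects are made up
for illustration, and re is only a stand-in for PCRE2).
.sp
```python
import re

# A word followed by a semicolon; the semicolon is not part of the match.
word = re.search(r"\w+(?=;)", "total = count;")

# "foo" not followed by "bar": the first "foo" is skipped.
foo = re.search(r"foo(?!bar)", "foobar foobaz")

# The assertion is tested where "bar" starts, so it is always true there.
bar = re.search(r"(?!foo)bar", "foobar")

print(word.group(0), foo.start(), bar.start())
```
.sp
The last search succeeds even though "bar" is preceded by "foo", which is the
pitfall described above.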
2422.P
2423If you want to force a matching failure at some point in a pattern, the most
2424convenient way to do it is with (?!) because an empty string always matches, so
2425an assertion that requires there not to be an empty string must always fail.
2426The backtracking control verb (*FAIL) or (*F) is a synonym for (?!).
2427.
2428.
2429.\" HTML <a name="lookbehind"></a>
2430.SS "Lookbehind assertions"
2431.rs
2432.sp
2433Lookbehind assertions start with (?<= for positive assertions and (?<! for
2434negative assertions. For example,
2435.sp
2436 (?<!foo)bar
2437.sp
2438does find an occurrence of "bar" that is not preceded by "foo". The contents of
2439a lookbehind assertion are restricted such that all the strings it matches must
2440have a fixed length. However, if there are several top-level alternatives, they
2441do not all have to have the same fixed length. Thus
2442.sp
2443 (?<=bullock|donkey)
2444.sp
2445is permitted, but
2446.sp
2447 (?<!dogs?|cats?)
2448.sp
2449causes an error at compile time. Branches that match different length strings
2450are permitted only at the top level of a lookbehind assertion. This is an
2451extension compared with Perl, which requires all branches to match the same
2452length of string. An assertion such as
2453.sp
2454 (?<=ab(c|de))
2455.sp
2456is not permitted, because its single top-level branch can match two different
2457lengths, but it is acceptable to PCRE2 if rewritten to use two top-level
2458branches:
2459.sp
2460 (?<=abc|abde)
2461.sp
2462In some cases, the escape sequence \eK
2463.\" HTML <a href="#resetmatchstart">
2464.\" </a>
2465(see above)
2466.\"
2467can be used instead of a lookbehind assertion to get round the fixed-length
2468restriction.
2469.P
2470The implementation of lookbehind assertions is, for each alternative, to
2471temporarily move the current position back by the fixed length and then try to
2472match. If there are insufficient characters before the current position, the
2473assertion fails.
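.P
The procedure just described can be sketched for the simplest case, where each
alternative is a literal string. The function below is a hypothetical model
written for this explanation; it is not taken from the PCRE2 source.
.sp
```python
def lookbehind_matches(subject, position, alternatives):
    """Model of a positive lookbehind whose alternatives are literal strings."""
    for alternative in alternatives:
        # Move back by this alternative's fixed length.
        start = position - len(alternative)
        if start < 0:
            continue  # not enough characters before the current position
        if subject.startswith(alternative, start):
            return True
    return False


subject = "a donkey cart"
print(lookbehind_matches(subject, 8, ["bullock", "donkey"]))
print(lookbehind_matches(subject, 3, ["bullock", "donkey"]))
```
.sp
Because the step back is computed separately for each alternative, the
alternatives do not need to share a single length, only to have a fixed length
each.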
2474.P
2475In UTF-8 and UTF-16 modes, PCRE2 does not allow the \eC escape (which matches a
2476single code unit even in a UTF mode) to appear in lookbehind assertions,
2477because it makes it impossible to calculate the length of the lookbehind. The
2478\eX and \eR escapes, which can match different numbers of code units, are never
2479permitted in lookbehinds.
2480.P
2481.\" HTML <a href="#groupsassubroutines">
2482.\" </a>
2483"Subroutine"
2484.\"
2485calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
2486as the called capture group matches a fixed-length string. However,
2487.\" HTML <a href="#recursion">
2488.\" </a>
2489recursion,
2490.\"
2491that is, a "subroutine" call into a group that is already active,
2492is not supported.
2493.P
2494Perl does not support backreferences in lookbehinds. PCRE2 does support them,
2495but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
2496must not be set, there must be no use of (?| in the pattern (it creates
2497duplicate group numbers), and if the backreference is by name, the name
2498must be unique. Of course, the referenced group must itself match a fixed
2499length substring. The following pattern matches words containing at least two
2500characters that begin and end with the same character:
2501.sp
2502 \eb(\ew)\ew++(?<=\e1)
2503.P
2504Possessive quantifiers can be used in conjunction with lookbehind assertions to
2505specify efficient matching of fixed-length strings at the end of subject
2506strings. Consider a simple pattern such as
2507.sp
2508 abcd$
2509.sp
2510when applied to a long string that does not match. Because matching proceeds
2511from left to right, PCRE2 will look for each "a" in the subject and then see if
2512what follows matches the rest of the pattern. If the pattern is specified as
2513.sp
2514 ^.*abcd$
2515.sp
2516the initial .* matches the entire string at first, but when this fails (because
2517there is no following "a"), it backtracks to match all but the last character,
2518then all but the last two characters, and so on. Once again the search for "a"
2519covers the entire string, from right to left, so we are no better off. However,
2520if the pattern is written as
2521.sp
2522 ^.*+(?<=abcd)
2523.sp
2524there can be no backtracking for the .*+ item because of the possessive
2525quantifier; it can match only the entire string. The subsequent lookbehind
2526assertion does a single test on the last four characters. If it fails, the
2527match fails immediately. For long strings, this approach makes a significant
2528difference to the processing time.
2529.
2530.
2531.SS "Using multiple assertions"
2532.rs
2533.sp
2534Several assertions (of any sort) may occur in succession. For example,
2535.sp
2536 (?<=\ed{3})(?<!999)foo
2537.sp
2538matches "foo" preceded by three digits that are not "999". Notice that each of
2539the assertions is applied independently at the same point in the subject
2540string. First there is a check that the previous three characters are all
2541digits, and then there is a check that the same three characters are not "999".
This pattern does \fInot\fP match "foo" preceded by six characters, the first
three of which are digits and the last three of which are not "999". For example, it
2544doesn't match "123abcfoo". A pattern to do that is
2545.sp
2546 (?<=\ed{3}...)(?<!999)foo
2547.sp
2548This time the first assertion looks at the preceding six characters, checking
2549that the first three are digits, and then the second assertion checks that the
2550preceding three characters are not "999".
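.P
The difference between the two patterns shows up clearly when they are run
against a few subjects. This sketch uses Python's re module in place of PCRE2
(an assumption); both lookbehinds have a single fixed length, which re also
accepts.
.sp
```python
import re

three_back = re.compile(r"(?<=\d{3})(?<!999)foo")
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject,
          bool(three_back.search(subject)),
          bool(six_back.search(subject)))
```
.sp
Only the second pattern accepts "123abcfoo", and both reject subjects in which
the three characters immediately before "foo" are "999".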
2551.P
2552Assertions can be nested in any combination. For example,
2553.sp
2554 (?<=(?<!foo)bar)baz
2555.sp
2556matches an occurrence of "baz" that is preceded by "bar" which in turn is not
2557preceded by "foo", while
2558.sp
2559 (?<=\ed{3}(?!999)...)foo
2560.sp
2561is another pattern that matches "foo" preceded by three digits and any three
2562characters that are not "999".
2563.
2564.
2565.\" HTML <a name="nonatomicassertions"></a>
2566.SH "NON-ATOMIC ASSERTIONS"
2567.rs
2568.sp
2569The traditional Perl-compatible lookaround assertions are atomic. That is, if
2570an assertion is true, but there is a subsequent matching failure, there is no
2571backtracking into the assertion. However, there are some cases where non-atomic
2572positive assertions can be useful. PCRE2 provides these using the following
2573syntax:
2574.sp
2575 (*non_atomic_positive_lookahead: or (*napla: or (?*
2576 (*non_atomic_positive_lookbehind: or (*naplb: or (?<*
2577.sp
2578Consider the problem of finding the right-most word in a string that also
2579appears earlier in the string, that is, it must appear at least twice in total.
2580This pattern returns the required result as captured substring 1:
2581.sp
2582 ^(?x)(*napla: .* \eb(\ew++)) (?> .*? \eb\e1\eb ){2}
2583.sp
2584For a subject such as "word1 word2 word3 word2 word3 word4" the result is
2585"word3". How does it work? At the start, ^(?x) anchors the pattern and sets the
2586"x" option, which causes white space (introduced for readability) to be
2587ignored. Inside the assertion, the greedy .* at first consumes the entire
2588string, but then has to backtrack until the rest of the assertion can match a
2589word, which is captured by group 1. In other words, when the assertion first
2590succeeds, it captures the right-most word in the string.
2591.P
2592The current matching point is then reset to the start of the subject, and the
2593rest of the pattern match checks for two occurrences of the captured word,
2594using an ungreedy .*? to scan from the left. If this succeeds, we are done, but
2595if the last word in the string does not occur twice, this part of the pattern
fails. If a traditional atomic lookahead (?= or (*pla: had been used, the
2597assertion could not be re-entered, and the whole match would fail. The pattern
2598would succeed only if the very last word in the subject was found twice.
2599.P
2600Using a non-atomic lookahead, however, means that when the last word does not
2601occur twice in the string, the lookahead can backtrack and find the second-last
2602word, and so on, until either the match succeeds, or all words have been
2603tested.
2604.P
2605Two conditions must be met for a non-atomic assertion to be useful: the
2606contents of one or more capturing groups must change after a backtrack into the
2607assertion, and there must be a backreference to a changed group later in the
2608pattern. If this is not the case, the rest of the pattern match fails exactly
2609as before because nothing has changed, so using a non-atomic assertion just
2610wastes resources.
2611.P
2612There is one exception to backtracking into a non-atomic assertion. If an
2613(*ACCEPT) control verb is triggered, the assertion succeeds atomically. That
2614is, a subsequent match failure cannot backtrack into the assertion.
2615.P
2616Non-atomic assertions are not supported by the alternative matching function
2617\fBpcre2_dfa_match()\fP. They are supported by JIT, but only if they do not
contain any control verbs such as (*ACCEPT). (This may change in the future.) Note
2619that assertions that appear as conditions for
2620.\" HTML <a href="#conditions">
2621.\" </a>
2622conditional groups
2623.\"
2624(see below) must be atomic.
2625.
2626.
2627.SH "SCRIPT RUNS"
2628.rs
2629.sp
2630In concept, a script run is a sequence of characters that are all from the same
2631Unicode script such as Latin or Greek. However, because some scripts are
2632commonly used together, and because some diacritical and other marks are used
2633with multiple scripts, it is not that simple. There is a full description of
2634the rules that PCRE2 uses in the section entitled
2635.\" HTML <a href="pcre2unicode.html#scriptruns">
2636.\" </a>
2637"Script Runs"
2638.\"
2639in the
2640.\" HREF
2641\fBpcre2unicode\fP
2642.\"
2643documentation.
2644.P
2645If part of a pattern is enclosed between (*script_run: or (*sr: and a closing
parenthesis, it fails if the sequence of characters that it matches is not a
2647script run. After a failure, normal backtracking occurs. Script runs can be
2648used to detect spoofing attacks using characters that look the same, but are
2649from different scripts. The string "paypal.com" is an infamous example, where
2650the letters could be a mixture of Latin and Cyrillic. This pattern ensures that
2651the matched characters in a sequence of non-spaces that follow white space are
2652a script run:
2653.sp
2654 \es+(*sr:\eS+)
2655.sp
2656To be sure that they are all from the Latin script (for example), a lookahead
2657can be used:
2658.sp
2659 \es+(?=\ep{Latin})(*sr:\eS+)
2660.sp
2661This works as long as the first character is expected to be a character in that
2662script, and not (for example) punctuation, which is allowed with any script. If
2663this is not the case, a more creative lookahead is needed. For example, if
2664digits, underscore, and dots are permitted at the start:
2665.sp
2666 \es+(?=[0-9_.]*\ep{Latin})(*sr:\eS+)
2667.sp
2668.P
2669In many cases, backtracking into a script run pattern fragment is not
2670desirable. The script run can employ an atomic group to prevent this. Because
2671this is a common requirement, a shorthand notation is provided by
2672(*atomic_script_run: or (*asr:
2673.sp
2674 (*asr:...) is the same as (*sr:(?>...))
2675.sp
2676Note that the atomic group is inside the script run. Putting it outside would
2677not prevent backtracking into the script run pattern.
2678.P
2679Support for script runs is not available if PCRE2 is compiled without Unicode
2680support. A compile-time error is given if any of the above constructs is
encountered. Script runs are not supported by the alternative matching function,
2682\fBpcre2_dfa_match()\fP because they use the same mechanism as capturing
2683parentheses.
2684.P
2685\fBWarning:\fP The (*ACCEPT) control verb
2686.\" HTML <a href="#acceptverb">
2687.\" </a>
2688(see below)
2689.\"
2690should not be used within a script run group, because it causes an immediate
2691exit from the group, bypassing the script run checking.
2692.
2693.
2694.\" HTML <a name="conditions"></a>
2695.SH "CONDITIONAL GROUPS"
2696.rs
2697.sp
2698It is possible to cause the matching process to obey a pattern fragment
2699conditionally or to choose between two alternative fragments, depending on
2700the result of an assertion, or whether a specific capture group has
2701already been matched. The two possible forms of conditional group are:
2702.sp
2703 (?(condition)yes-pattern)
2704 (?(condition)yes-pattern|no-pattern)
2705.sp
2706If the condition is satisfied, the yes-pattern is used; otherwise the
2707no-pattern (if present) is used. An absent no-pattern is equivalent to an empty
2708string (it always matches). If there are more than two alternatives in the
2709group, a compile-time error occurs. Each of the two alternatives may itself
2710contain nested groups of any form, including conditional groups; the
2711restriction to two alternatives applies only at the level of the condition
2712itself. This pattern fragment is an example where the alternatives are complex:
2713.sp
2714 (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
2715.sp
2716.P
2717There are five kinds of condition: references to capture groups, references to
2718recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
2719.
2720.
2721.SS "Checking for a used capture group by number"
2722.rs
2723.sp
2724If the text between the parentheses consists of a sequence of digits, the
2725condition is true if a capture group of that number has previously matched. If
2726there is more than one capture group with the same number (see the earlier
2727.\"
2728.\" HTML <a href="#recursion">
2729.\" </a>
2730section about duplicate group numbers),
2731.\"
the condition is true if any of them has matched. An alternative notation is
2733to precede the digits with a plus or minus sign. In this case, the group number
2734is relative rather than absolute. The most recently opened capture group can be
2735referenced by (?(-1), the next most recent by (?(-2), and so on. Inside loops
2736it can also make sense to refer to subsequent groups. The next capture group
2737can be referenced as (?(+1), and so on. (The value zero in any of these forms
2738is not used; it provokes a compile-time error.)
2739.P
2740Consider the following pattern, which contains non-significant white space to
2741make it more readable (assume the PCRE2_EXTENDED option) and to divide it into
2742three parts for ease of discussion:
2743.sp
2744 ( \e( )? [^()]+ (?(1) \e) )
2745.sp
2746The first part matches an optional opening parenthesis, and if that
2747character is present, sets it as the first captured substring. The second part
2748matches one or more characters that are not parentheses. The third part is a
2749conditional group that tests whether or not the first capture group
matched. If it did, that is, if the subject started with an opening parenthesis,
2751the condition is true, and so the yes-pattern is executed and a closing
2752parenthesis is required. Otherwise, since no-pattern is not present, the
2753conditional group matches nothing. In other words, this pattern matches a
2754sequence of non-parentheses, optionally enclosed in parentheses.
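.P
The pattern can be exercised with the sketch below. It uses Python's re
module, which supports the same numbered condition syntax, with its VERBOSE
flag playing the role assumed here for PCRE2_EXTENDED; treating the two as
equivalent for this pattern is an assumption of the example.
.sp
```python
import re

# re.VERBOSE ignores the white space, as PCRE2_EXTENDED does in the text.
optionally_parenthesized = re.compile(
    r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

for subject in ("(abc)", "abc", "(abc", "abc)"):
    print(subject, "->", bool(optionally_parenthesized.fullmatch(subject)))
```
.sp
The whole subject is required to match in this sketch, so an unbalanced
parenthesis at either end causes a failure.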
2755.P
2756If you were embedding this pattern in a larger one, you could use a relative
2757reference:
2758.sp
2759 ...other stuff... ( \e( )? [^()]+ (?(-1) \e) ) ...
2760.sp
2761This makes the fragment independent of the parentheses in the larger pattern.
2762.
2763.
2764.SS "Checking for a used capture group by name"
2765.rs
2766.sp
2767Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
2768capture group by name. For compatibility with earlier versions of PCRE1, which
2769had this facility before Perl, the syntax (?(name)...) is also recognized.
2770Note, however, that undelimited names consisting of the letter R followed by
2771digits are ambiguous (see the following section). Rewriting the above example
2772to use a named group gives this:
2773.sp
2774 (?<OPEN> \e( )? [^()]+ (?(<OPEN>) \e) )
2775.sp
2776If the name used in a condition of this kind is a duplicate, the test is
2777applied to all groups of the same name, and is true if any one of them has
2778matched.
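.P
The named form can be tried in the same way. This sketch again uses Python's
\fBre\fP module as a stand-in for PCRE2, so the group is written with the
Python spelling (?P<OPEN>...) and the condition with the undelimited form
(?(OPEN)...). The function name \fBopened_with_parenthesis\fP is hypothetical.
.sp
```python
import re

# Assumption: Python's re module stands in for PCRE2 in this sketch.
NAMED = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)


def opened_with_parenthesis(subject):
    """Return True or False for whether group OPEN was set, or None if no match."""
    m = NAMED.fullmatch(subject)
    if m is None:
        return None
    return m.group("OPEN") is not None
```
.sp
The condition is true only when the group named OPEN took part in the match,
which is what makes the closing parenthesis mandatory.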
2779.
2780.
2781.SS "Checking for pattern recursion"
2782.rs
2783.sp
2784"Recursion" in this sense refers to any subroutine-like call from one part of
2785the pattern to another, whether or not it is actually recursive. See the
2786sections entitled
2787.\" HTML <a href="#recursion">
2788.\" </a>
2789"Recursive patterns"
2790.\"
2791and
2792.\" HTML <a href="#groupsassubroutines">
2793.\" </a>
2794"Groups as subroutines"
2795.\"
2796below for details of recursion and subroutine calls.
2797.P
2798If a condition is the string (R), and there is no capture group with the name
2799R, the condition is true if matching is currently in a recursion or subroutine
2800call to the whole pattern or any capture group. If digits follow the letter R,
2801and there is no group with that name, the condition is true if the most recent
2802call is into a group with the given number, which must exist somewhere in the
2803overall pattern. This is a contrived example that is equivalent to a+b:
2804.sp
2805 ((?(R1)a+|(?1)b))
2806.sp
2807However, in both cases, if there is a capture group with a matching name, the
2808condition tests for its being set, as described in the section above, instead
2809of testing for recursion. For example, creating a group with the name R1 by
2810adding (?<R1>) to the above pattern completely changes its meaning.
2811.P
2812If a name preceded by ampersand follows the letter R, for example:
2813.sp
2814 (?(R&name)...)
2815.sp
2816the condition is true if the most recent recursion is into a group of that name
2817(which must exist within the pattern).
2818.P
2819This condition does not check the entire recursion stack. It tests only the
2820current level. If the name used in a condition of this kind is a duplicate, the
2821test is applied to all groups of the same name, and is true if any one of
2822them is the most recent recursion.
2823.P
2824At "top level", all these recursion test conditions are false.
2825.
2826.
2827.\" HTML <a name="subdefine"></a>
2828.SS "Defining capture groups for use by reference only"
2829.rs
2830.sp
2831If the condition is the string (DEFINE), the condition is always false, even if
2832there is a group with the name DEFINE. In this case, there may be only one
2833alternative in the rest of the conditional group. It is always skipped if
2834control reaches this point in the pattern; the idea of DEFINE is that it can be
2835used to define subroutines that can be referenced from elsewhere. (The use of
2836.\" HTML <a href="#groupsassubroutines">
2837.\" </a>
2838subroutines
2839.\"
2840is described below.) For example, a pattern to match an IPv4 address such as
2841"192.168.23.245" could be written like this (ignore white space and line
2842breaks):
2843.sp
2844 (?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) )
2845 \eb (?&byte) (\e.(?&byte)){3} \eb
2846.sp
The first part of the pattern is a DEFINE group inside which another group
named "byte" is defined. This matches an individual component of an IPv4
2849address (a number less than 256). When matching takes place, this part of the
2850pattern is skipped because DEFINE acts like a false condition. The rest of the
2851pattern uses references to the named group to match the four dot-separated
2852components of an IPv4 address, insisting on a word boundary at each end.
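.P
In an engine without DEFINE or subroutine calls, a similar effect can be
obtained by pasting the text of the subpattern wherever it is needed. The
following Python sketch does this with string formatting. It is an
approximation, not PCRE2 behaviour: the names \fBBYTE\fP, \fBIPV4\fP, and
\fBis_ipv4\fP are hypothetical, and the repeated group is made non-capturing.
.sp
```python
import re

# Subpattern for one component, copied from the "byte" group in the text.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Textual expansion of:  \b (?&byte) (\.(?&byte)){3} \b
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def is_ipv4(text):
    """Return True if the whole text is a dotted IPv4 address."""
    return IPV4.fullmatch(text) is not None
```
.sp
The DEFINE form keeps the subpattern in one place inside the pattern itself,
whereas this expansion relies on the host language to do the substitution.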
2853.
2854.
2855.SS "Checking the PCRE2 version"
2856.rs
2857.sp
2858Programs that link with a PCRE2 library can check the version by calling
2859\fBpcre2_config()\fP with appropriate arguments. Users of applications that do
2860not have access to the underlying code cannot do this. A special "condition"
2861called VERSION exists to allow such users to discover which version of PCRE2
2862they are dealing with by using this condition to match a string such as
2863"yesno". VERSION must be followed either by "=" or ">=" and a version number.
2864For example:
2865.sp
2866 (?(VERSION>=10.4)yes|no)
2867.sp
This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
2869"no" otherwise. The fractional part of the version number may not contain more
2870than two digits.
2871.
2872.
2873.SS "Assertion conditions"
2874.rs
2875.sp
2876If the condition is not in any of the above formats, it must be a parenthesized
2877assertion. This may be a positive or negative lookahead or lookbehind
2878assertion. However, it must be a traditional atomic assertion, not one of the
2879PCRE2-specific
2880.\" HTML <a href="#nonatomicassertions">
2881.\" </a>
2882non-atomic assertions.
2883.\"
2884.P
2885Consider this pattern, again containing non-significant white space, and with
2886the two alternatives on the second line:
2887.sp
2888 (?(?=[^a-z]*[a-z])
2889 \ed{2}-[a-z]{3}-\ed{2} | \ed{2}-\ed{2}-\ed{2} )
2890.sp
2891The condition is a positive lookahead assertion that matches an optional
2892sequence of non-letters followed by a letter. In other words, it tests for the
2893presence of at least one letter in the subject. If a letter is found, the
2894subject is matched against the first alternative; otherwise it is matched
2895against the second. This pattern matches strings in one of the two forms
2896dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
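.P
The logic of an assertion condition can also be written as two alternatives,
one guarded by the assertion and the other by its negation. The Python sketch
below uses that rewriting, on the assumption that the standard \fBre\fP module
accepts only group numbers and names as conditions. The names
\fBLETTER_AHEAD\fP, \fBDATE\fP, and \fBis_date\fP are hypothetical.
.sp
```python
import re

# The condition from the text: is there a lower case letter ahead?
LETTER_AHEAD = r"[^a-z]*[a-z]"

DATE = re.compile(
    rf"(?={LETTER_AHEAD})\d{{2}}-[a-z]{{3}}-\d{{2}}"  # taken when the condition is true
    rf"|(?!{LETTER_AHEAD})\d{{2}}-\d{{2}}-\d{{2}}"    # taken when the condition is false
)


def is_date(subject):
    """Return True if the whole subject has the form dd-aaa-dd or dd-dd-dd."""
    return DATE.fullmatch(subject) is not None
```
.sp
As with the conditional group, a subject that contains a letter is never tried
against the all-digit form.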
2897.P
2898When an assertion that is a condition contains capture groups, any
2899capturing that occurs in a matching branch is retained afterwards, for both
2900positive and negative assertions, because matching always continues after the
2901assertion, whether it succeeds or fails. (Compare non-conditional assertions,
2902for which captures are retained only for positive assertions that succeed.)
2903.
2904.
2905.\" HTML <a name="comments"></a>
2906.SH COMMENTS
2907.rs
2908.sp
2909There are two ways of including comments in patterns that are processed by
2910PCRE2. In both cases, the start of the comment must not be in a character
2911class, nor in the middle of any other sequence of related characters such as
2912(?: or a group name or number. The characters that make up a comment play
2913no part in the pattern matching.
2914.P
2915The sequence (?# marks the start of a comment that continues up to the next
2916closing parenthesis. Nested parentheses are not permitted. If the
2917PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
2918also introduces a comment, which in this case continues to immediately after
2919the next newline character or character sequence in the pattern. Which
2920characters are interpreted as newlines is controlled by an option passed to the
2921compiling function or by a special sequence at the start of the pattern, as
2922described in the section entitled
2923.\" HTML <a href="#newlines">
2924.\" </a>
2925"Newline conventions"
2926.\"
2927above. Note that the end of this type of comment is a literal newline sequence
2928in the pattern; escape sequences that happen to represent a newline do not
2929count. For example, consider this pattern when PCRE2_EXTENDED is set, and the
2930default newline convention (a single linefeed character) is in force:
2931.sp
2932 abc #comment \en still comment
2933.sp
2934On encountering the # character, \fBpcre2_compile()\fP skips along, looking for
2935a newline in the pattern. The sequence \en is still literal at this stage, so
2936it does not terminate the comment. Only an actual character with the code value
29370x0a (the default newline) does so.
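.P
The difference between the escape sequence \en and a literal newline can be
observed in other engines that have an extended mode. The sketch below uses
Python's \fBre.VERBOSE\fP flag, which is assumed here to follow the same rule
for ending a #-comment. The names \fBESCAPED\fP and \fBLITERAL\fP are
hypothetical.
.sp
```python
import re

# A raw string keeps \n as two characters (backslash, n), so the comment
# runs to the end of the pattern and only "abc" is left to match.
ESCAPED = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Here the string contains a real linefeed (0x0a), which ends the comment,
# so "def" is part of the pattern.
LITERAL = re.compile("abc #comment \n def", re.VERBOSE)

if __name__ == "__main__":
    print(ESCAPED.fullmatch("abc") is not None)
    print(LITERAL.fullmatch("abcdef") is not None)
```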
2938.
2939.
2940.\" HTML <a name="recursion"></a>
2941.SH "RECURSIVE PATTERNS"
2942.rs
2943.sp
2944Consider the problem of matching a string in parentheses, allowing for
2945unlimited nested parentheses. Without the use of recursion, the best that can
2946be done is to use a pattern that matches up to some fixed depth of nesting. It
2947is not possible to handle an arbitrary nesting depth.
2948.P
2949For some time, Perl has provided a facility that allows regular expressions to
2950recurse (amongst other things). It does this by interpolating Perl code in the
2951expression at run time, and the code can refer to the expression itself. A Perl
2952pattern using code interpolation to solve the parentheses problem can be
2953created like this:
2954.sp
2955 $re = qr{\e( (?: (?>[^()]+) | (?p{$re}) )* \e)}x;
2956.sp
2957The (?p{...}) item interpolates Perl code at run time, and in this case refers
2958recursively to the pattern in which it appears.
2959.P
2960Obviously, PCRE2 cannot support the interpolation of Perl code. Instead, it
2961supports special syntax for recursion of the entire pattern, and also for
2962individual capture group recursion. After its introduction in PCRE1 and Python,
2963this kind of recursion was subsequently introduced into Perl at release 5.10.
2964.P
2965A special item that consists of (? followed by a number greater than zero and a
2966closing parenthesis is a recursive subroutine call of the capture group of the
2967given number, provided that it occurs inside that group. (If not, it is a
2968.\" HTML <a href="#groupsassubroutines">
2969.\" </a>
2970non-recursive subroutine
2971.\"
2972call, which is described in the next section.) The special item (?R) or (?0) is
2973a recursive call of the entire regular expression.
2974.P
2975This PCRE2 pattern solves the nested parentheses problem (assume the
2976PCRE2_EXTENDED option is set so that white space is ignored):
2977.sp
2978 \e( ( [^()]++ | (?R) )* \e)
2979.sp
2980First it matches an opening parenthesis. Then it matches any number of
2981substrings which can either be a sequence of non-parentheses, or a recursive
2982match of the pattern itself (that is, a correctly parenthesized substring).
2983Finally there is a closing parenthesis. Note the use of a possessive quantifier
2984to avoid backtracking into sequences of non-parentheses.
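.P
The shape of this pattern can be mirrored by a small hand-written recursive
matcher. The following Python sketch is an illustration only: it is not how
PCRE2 implements recursion, and the function name \fBmatch_balanced\fP is
hypothetical. Each branch corresponds to one part of the pattern.
.sp
```python
def match_balanced(subject, pos=0):
    """Return the index just past a balanced "(...)" starting at pos, or None."""
    if pos >= len(subject) or subject[pos] != "(":
        return None                              # \( must match first
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == "(":
            end = match_balanced(subject, pos)   # the (?R) alternative
            if end is None:
                return None
            pos = end
        elif ch == ")":
            return pos + 1                       # the final \)
        else:
            pos += 1                             # part of a [^()]++ run
    return None                                  # subject ended before \)
```
.sp
Because a run of non-parentheses is consumed without ever being given back, the
function never retries a shorter run, which is the effect of the possessive
quantifier in the pattern.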
2985.P
2986If this were part of a larger pattern, you would not want to recurse the entire
2987pattern, so instead you could use this:
2988.sp
2989 ( \e( ( [^()]++ | (?1) )* \e) )
2990.sp
2991We have put the pattern into parentheses, and caused the recursion to refer to
2992them instead of the whole pattern.
2993.P
2994In a larger pattern, keeping track of parenthesis numbers can be tricky. This
2995is made easier by the use of relative references. Instead of (?1) in the
2996pattern above you can write (?-2) to refer to the second most recently opened
2997parentheses preceding the recursion. In other words, a negative number counts
2998capturing parentheses leftwards from the point at which it is encountered.
2999.P
Be aware, however, that if
3001.\" HTML <a href="#dupgroupnumber">
3002.\" </a>
3003duplicate capture group numbers
3004.\"
3005are in use, relative references refer to the earliest group with the
3006appropriate number. Consider, for example:
3007.sp
3008 (?|(a)|(b)) (c) (?-2)
3009.sp
3010The first two capture groups (a) and (b) are both numbered 1, and group (c)
3011is number 2. When the reference (?-2) is encountered, the second most recently
3012opened parentheses has the number 1, but it is the first such group (the (a)
3013group) to which the recursion refers. This would be the same if an absolute
3014reference (?1) was used. In other words, relative references are just a
3015shorthand for computing a group number.
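.P
The arithmetic behind a relative reference can be sketched as follows. This is
an illustration in Python, not PCRE2 code. The function name
\fBresolve_relative\fP is hypothetical, and \fBgroups_opened\fP is assumed to
be the highest capture group number opened to the left of the reference.
.sp
```python
def resolve_relative(ref, groups_opened):
    """Turn a relative group reference into an absolute group number.

    ref is the signed number in the reference, for example -2 for (?-2).
    groups_opened is the highest group number opened before the reference.
    """
    if ref == 0:
        raise ValueError("zero is not a valid relative reference")
    if ref < 0:
        target = groups_opened + 1 + ref   # -1 is the most recently opened group
    else:
        target = groups_opened + ref       # +1 is the next group to be opened
    if target < 1:
        raise ValueError("reference to a non-existent group")
    return target
```
.sp
For the duplicate-number example above, two group numbers are in use when
(?-2) is reached, so resolve_relative(-2, 2) gives 1, the same group that (?1)
would call.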
3016.P
3017It is also possible to refer to subsequent capture groups, by writing
3018references such as (?+2). However, these cannot be recursive because the
3019reference is not inside the parentheses that are referenced. They are always
3020.\" HTML <a href="#groupsassubroutines">
3021.\" </a>
3022non-recursive subroutine
3023.\"
3024calls, as described in the next section.
3025.P
3026An alternative approach is to use named parentheses. The Perl syntax for this
3027is (?&name); PCRE1's earlier syntax (?P>name) is also supported. We could
3028rewrite the above example as follows:
3029.sp
3030 (?<pn> \e( ( [^()]++ | (?&pn) )* \e) )
3031.sp
3032If there is more than one group with the same name, the earliest one is
3033used.
3034.P
3035The example pattern that we have been looking at contains nested unlimited
3036repeats, and so the use of a possessive quantifier for matching strings of
3037non-parentheses is important when applying the pattern to strings that do not
3038match. For example, when this pattern is applied to
3039.sp
3040 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
3041.sp
3042it yields "no match" quickly. However, if a possessive quantifier is not used,
3043the match runs for a very long time indeed because there are so many different
3044ways the + and * repeats can carve up the subject, and all have to be tested
3045before failure can be reported.
3046.P
3047At the end of a match, the values of capturing parentheses are those from
3048the outermost level. If you want to obtain intermediate values, a callout
3049function can be used (see below and the
3050.\" HREF
3051\fBpcre2callout\fP
3052.\"
3053documentation). If the pattern above is matched against
3054.sp
3055 (ab(cd)ef)
3056.sp
3057the value for the inner capturing parentheses (numbered 2) is "ef", which is
3058the last value taken on at the top level. If a capture group is not matched at
3059the top level, its final captured value is unset, even if it was (temporarily)
3060set at a deeper level during the matching process.
3061.P
3062Do not confuse the (?R) item with the condition (R), which tests for recursion.
3063Consider this pattern, which matches text in angle brackets, allowing for
3064arbitrary nesting. Only digits are allowed in nested brackets (that is, when
3065recursing), whereas any characters are permitted at the outer level.
3066.sp
3067 < (?: (?(R) \ed++ | [^<>]*+) | (?R)) * >
3068.sp
3069In this pattern, (?(R) is the start of a conditional group, with two different
3070alternatives for the recursive and non-recursive cases. The (?R) item is the
3071actual recursive call.
3072.
3073.
3074.\" HTML <a name="recursiondifference"></a>
3075.SS "Differences in recursion processing between PCRE2 and Perl"
3076.rs
3077.sp
3078Some former differences between PCRE2 and Perl no longer exist.
3079.P
3080Before release 10.30, recursion processing in PCRE2 differed from Perl in that
3081a recursive subroutine call was always treated as an atomic group. That is,
3082once it had matched some of the subject string, it was never re-entered, even
3083if it contained untried alternatives and there was a subsequent matching
3084failure. (Historical note: PCRE implemented recursion before Perl did.)
3085.P
3086Starting with release 10.30, recursive subroutine calls are no longer treated
3087as atomic. That is, they can be re-entered to try unused alternatives if there
3088is a matching failure later in the pattern. This is now compatible with the way
3089Perl works. If you want a subroutine call to be atomic, you must explicitly
3090enclose it in an atomic group.
3091.P
3092Supporting backtracking into recursions simplifies certain types of recursive
3093pattern. For example, this pattern matches palindromic strings:
3094.sp
3095 ^((.)(?1)\e2|.?)$
3096.sp
3097The second branch in the group matches a single central character in the
3098palindrome when there are an odd number of characters, or nothing when there
3099are an even number of characters, but in order to work it has to be able to try
3100the second case when the rest of the pattern match fails. If you want to match
3101typical palindromic phrases, the pattern has to ignore all non-word characters,
3102which can be done like this:
3103.sp
3104 ^\eW*+((.)\eW*+(?1)\eW*+\e2|\eW*+.?)\eW*+$
3105.sp
3106If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
3107man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
3108avoid backtracking into sequences of non-word characters. Without this, PCRE2
3109takes a great deal longer (ten times or more) to match typical phrases, and
3110Perl takes so long that you think it has gone into a loop.
3111.P
3112Another way in which PCRE2 and Perl used to differ in their recursion
3113processing is in the handling of captured values. Formerly in Perl, when a
3114group was called recursively or as a subroutine (see the next section), it
3115had no access to any values that were captured outside the recursion, whereas
3116in PCRE2 these values can be referenced. Consider this pattern:
3117.sp
3118 ^(.)(\e1|a(?2))
3119.sp
3120This pattern matches "bab". The first capturing parentheses match "b", then in
3121the second group, when the backreference \e1 fails to match "b", the second
3122alternative matches "a" and then recurses. In the recursion, \e1 does now match
"b" and so the whole match succeeds. This match used to fail in Perl, but it
succeeds in later versions (tested with 5.024).
3125.
3126.
3127.\" HTML <a name="groupsassubroutines"></a>
3128.SH "GROUPS AS SUBROUTINES"
3129.rs
3130.sp
3131If the syntax for a recursive group call (either by number or by name) is used
3132outside the parentheses to which it refers, it operates a bit like a subroutine
3133in a programming language. More accurately, PCRE2 treats the referenced group
3134as an independent subpattern which it tries to match at the current matching
3135position. The called group may be defined before or after the reference. A
3136numbered reference can be absolute or relative, as in these examples:
3137.sp
3138 (...(absolute)...)...(?2)...
3139 (...(relative)...)...(?-1)...
3140 (...(?+1)...(relative)...
3141.sp
3142An earlier example pointed out that the pattern
3143.sp
3144 (sens|respons)e and \e1ibility
3145.sp
3146matches "sense and sensibility" and "response and responsibility", but not
3147"sense and responsibility". If instead the pattern
3148.sp
3149 (sens|respons)e and (?1)ibility
3150.sp
3151is used, it does match "sense and responsibility" as well as the other two
3152strings. Another example is given in the discussion of DEFINE above.
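.P
The distinction is that a backreference repeats the text that was matched,
whereas a subroutine call repeats the subpattern. The Python sketch below makes
this visible. Python's \fBre\fP module is assumed not to support (?1), so the
call is imitated by pasting the subpattern text a second time. The names
\fBGROUP\fP, \fBBACKREF\fP, and \fBSUBROUTINE_LIKE\fP are hypothetical.
.sp
```python
import re

GROUP = r"sens|respons"

# Backreference: \1 must match the same text that group 1 matched.
BACKREF = re.compile(rf"({GROUP})e and \1ibility")

# Imitation of (?1): the subpattern is tried again, independently.
SUBROUTINE_LIKE = re.compile(rf"({GROUP})e and (?:{GROUP})ibility")

if __name__ == "__main__":
    subject = "sense and responsibility"
    print(BACKREF.fullmatch(subject) is not None)
    print(SUBROUTINE_LIKE.fullmatch(subject) is not None)
```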
3153.P
3154Like recursions, subroutine calls used to be treated as atomic, but this
3155changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
3156occur. However, any capturing parentheses that are set during the subroutine
3157call revert to their previous values afterwards.
3158.P
3159Processing options such as case-independence are fixed when a group is
3160defined, so if it is used as a subroutine, such options cannot be changed for
3161different calls. For example, consider this pattern:
3162.sp
3163 (abc)(?i:(?-1))
3164.sp
3165It matches "abcabc". It does not match "abcABC" because the change of
3166processing option does not affect the called group.
3167.P
3168The behaviour of
3169.\" HTML <a href="#backtrackcontrol">
3170.\" </a>
3171backtracking control verbs
3172.\"
3173in groups when called as subroutines is described in the section entitled
3174.\" HTML <a href="#btsub">
3175.\" </a>
3176"Backtracking verbs in subroutines"
3177.\"
3178below.
3179.
3180.
3181.\" HTML <a name="onigurumasubroutines"></a>
3182.SH "ONIGURUMA SUBROUTINE SYNTAX"
3183.rs
3184.sp
For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
a number enclosed either in angle brackets or single quotes is an alternative
3187syntax for calling a group as a subroutine, possibly recursively. Here are two
3188of the examples used above, rewritten using this syntax:
3189.sp
3190 (?<pn> \e( ( (?>[^()]+) | \eg<pn> )* \e) )
3191 (sens|respons)e and \eg'1'ibility
3192.sp
3193PCRE2 supports an extension to Oniguruma: if a number is preceded by a
3194plus or a minus sign it is taken as a relative reference. For example:
3195.sp
3196 (abc)(?i:\eg<-1>)
3197.sp
3198Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
3199synonymous. The former is a backreference; the latter is a subroutine call.
3200.
3201.
3202.SH CALLOUTS
3203.rs
3204.sp
3205Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
3206code to be obeyed in the middle of matching a regular expression. This makes it
3207possible, amongst other things, to extract different substrings that match the
3208same pair of parentheses when there is a repetition.
3209.P
3210PCRE2 provides a similar feature, but of course it cannot obey arbitrary Perl
3211code. The feature is called "callout". The caller of PCRE2 provides an external
3212function by putting its entry point in a match context using the function
3213\fBpcre2_set_callout()\fP, and then passing that context to \fBpcre2_match()\fP
3214or \fBpcre2_dfa_match()\fP. If no match context is passed, or if the callout
3215entry point is set to NULL, callouts are disabled.
3216.P
3217Within a regular expression, (?C<arg>) indicates a point at which the external
3218function is to be called. There are two kinds of callout: those with a
3219numerical argument and those with a string argument. (?C) on its own with no
3220argument is treated as (?C0). A numerical argument allows the application to
3221distinguish between different callouts. String arguments were added for release
322210.20 to make it possible for script languages that use PCRE2 to embed short
3223scripts within patterns in a similar way to Perl.
3224.P
3225During matching, when PCRE2 reaches a callout point, the external function is
3226called. It is provided with the number or string argument of the callout, the
3227position in the pattern, and one item of data that is also set in the match
3228block. The callout function may cause matching to proceed, to backtrack, or to
3229fail.
3230.P
3231By default, PCRE2 implements a number of optimizations at matching time, and
3232one side-effect is that sometimes callouts are skipped. If you need all
3233possible callouts to happen, you need to set options that disable the relevant
3234optimizations. More details, including a complete description of the
3235programming interface to the callout function, are given in the
3236.\" HREF
3237\fBpcre2callout\fP
3238.\"
3239documentation.
3240.
3241.
3242.SS "Callouts with numerical arguments"
3243.rs
3244.sp
3245If you just want to have a means of identifying different callout points, put a
3246number less than 256 after the letter C. For example, this pattern has two
3247callout points:
3248.sp
3249 (?C1)abc(?C2)def
3250.sp
3251If the PCRE2_AUTO_CALLOUT flag is passed to \fBpcre2_compile()\fP, numerical
3252callouts are automatically installed before each item in the pattern. They are
3253all numbered 255. If there is a conditional group in the pattern whose
3254condition is an assertion, an additional callout is inserted just before the
3255condition. An explicit callout may also be set at this position, as in this
3256example:
3257.sp
3258 (?(?C9)(?=a)abc|def)
3259.sp
3260Note that this applies only to assertion conditions, not to other types of
3261condition.
3262.
3263.
3264.SS "Callouts with string arguments"
3265.rs
3266.sp
3267A delimited string may be used instead of a number as a callout argument. The
3268starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is
3269the same as the start, except for {, where the ending delimiter is }. If the
3270ending delimiter is needed within the string, it must be doubled. For
3271example:
3272.sp
3273 (?C'ab ''c'' d')xyz(?C{any text})pqr
3274.sp
3275The doubling is removed before the string is passed to the callout function.
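.P
The delimiter rules can be sketched as a small parser. The Python code below is
an illustration of the rules as stated here, not PCRE2's own parsing code. The
names \fBCLOSERS\fP and \fBparse_callout_string\fP are hypothetical.
.sp
```python
# Each permitted starting delimiter and its ending delimiter.
CLOSERS = {"`": "`", "'": "'", '"': '"', "^": "^",
           "%": "%", "#": "#", "$": "$", "{": "}"}


def parse_callout_string(text, start=0):
    """Parse a delimited callout argument beginning at text[start].

    Returns (argument, index just past the ending delimiter).
    """
    opener = text[start]
    if opener not in CLOSERS:
        raise ValueError("not a valid starting delimiter")
    closer = CLOSERS[opener]
    out = []
    i = start + 1
    while i < len(text):
        if text[i] == closer:
            if i + 1 < len(text) and text[i + 1] == closer:
                out.append(closer)       # a doubled delimiter stands for one
                i += 2
                continue
            return "".join(out), i + 1   # a single delimiter ends the string
        out.append(text[i])
        i += 1
    raise ValueError("missing ending delimiter")
```
.sp
Applied to the first callout in the example above, the argument that results
is the string ab 'c' d, with each doubled quote reduced to one.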
3276.
3277.
3278.\" HTML <a name="backtrackcontrol"></a>
3279.SH "BACKTRACKING CONTROL"
3280.rs
3281.sp
3282There are a number of special "Backtracking Control Verbs" (to use Perl's
3283terminology) that modify the behaviour of backtracking during matching. They
3284are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
3285and may behave differently depending on whether or not a name argument is
3286present. The names are not required to be unique within the pattern.
3287.P
3288By default, for compatibility with Perl, a name is any sequence of characters
3289that does not include a closing parenthesis. The name is not processed in
3290any way, and it is not possible to include a closing parenthesis in the name.
3291This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
3292is no longer Perl-compatible.
3293.P
3294When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
3295and only an unescaped closing parenthesis terminates the name. However, the
3296only backslash items that are permitted are \eQ, \eE, and sequences such as
3297\ex{100} that define character code points. Character type escapes such as \ed
3298are faulted.
3299.P
3300A closing parenthesis can be included in a name either as \e) or between \eQ
3301and \eE. In addition to backslash processing, if the PCRE2_EXTENDED or
3302PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
3303skipped, and #-comments are recognized, exactly as in the rest of the pattern.
3304PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
3305PCRE2_ALT_VERBNAMES is also set.
3306.P
3307The maximum length of a name is 255 in the 8-bit library and 65535 in the
330816-bit and 32-bit libraries. If the name is empty, that is, if the closing
3309parenthesis immediately follows the colon, the effect is as if the colon were
3310not there. Any number of these verbs may occur in a pattern. Except for
3311(*ACCEPT), they may not be quantified.
3312.P
3313Since these verbs are specifically related to backtracking, most of them can be
3314used only when the pattern is to be matched using the traditional matching
3315function, because that uses a backtracking algorithm. With the exception of
3316(*FAIL), which behaves like a failing negative assertion, the backtracking
3317control verbs cause an error if encountered by the DFA matching function.
3318.P
3319The behaviour of these verbs in
3320.\" HTML <a href="#btrepeat">
3321.\" </a>
3322repeated groups,
3323.\"
3324.\" HTML <a href="#btassert">
3325.\" </a>
3326assertions,
3327.\"
3328and in
3329.\" HTML <a href="#btsub">
3330.\" </a>
3331capture groups called as subroutines
3332.\"
3333(whether or not recursively) is documented below.
3334.
3335.
3336.\" HTML <a name="nooptimize"></a>
3337.SS "Optimizations that affect backtracking verbs"
3338.rs
3339.sp
3340PCRE2 contains some optimizations that are used to speed up matching by running
3341some checks at the start of each match attempt. For example, it may know the
3342minimum length of matching subject, or that a particular character must be
3343present. When one of these optimizations bypasses the running of a match, any
3344included backtracking verbs will not, of course, be processed. You can suppress
3345the start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option
3346when calling \fBpcre2_compile()\fP, or by starting the pattern with
3347(*NO_START_OPT). There is more discussion of this option in the section
3348entitled
3349.\" HTML <a href="pcre2api.html#compiling">
3350.\" </a>
3351"Compiling a pattern"
3352.\"
3353in the
3354.\" HREF
3355\fBpcre2api\fP
3356.\"
3357documentation.
3358.P
3359Experiments with Perl suggest that it too has similar optimizations, and like
3360PCRE2, turning them off can change the result of a match.
3361.
3362.
3363.\" HTML <a name="acceptverb"></a>
3364.SS "Verbs that act immediately"
3365.rs
3366.sp
3367The following verbs act as soon as they are encountered.
3368.sp
3369 (*ACCEPT) or (*ACCEPT:NAME)
3370.sp
3371This verb causes the match to end successfully, skipping the remainder of the
3372pattern. However, when it is inside a capture group that is called as a
3373subroutine, only that group is ended successfully. Matching then continues
at the outer level. If (*ACCEPT) is triggered in a positive assertion, the
3375assertion succeeds; in a negative assertion, the assertion fails.
3376.P
3377If (*ACCEPT) is inside capturing parentheses, the data so far is captured. For
3378example:
3379.sp
3380 A((?:A|B(*ACCEPT)|C)D)
3381.sp
3382This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
3383the outer parentheses.
3384.P
3385(*ACCEPT) is the only backtracking verb that is allowed to be quantified
3386because an ungreedy quantification with a minimum of zero acts only when a
3387backtrack happens. Consider, for example,
3388.sp
3389 (A(*ACCEPT)??B)C
3390.sp
3391where A, B, and C may be complex expressions. After matching "A", the matcher
3392processes "BC"; if that fails, causing a backtrack, (*ACCEPT) is triggered and
3393the match succeeds. In both cases, all but C is captured. Whereas (*COMMIT)
3394(see below) means "fail on backtrack", a repeated (*ACCEPT) of this type means
3395"succeed on backtrack".
3396.P
3397\fBWarning:\fP (*ACCEPT) should not be used within a script run group, because
3398it causes an immediate exit from the group, bypassing the script run checking.
3399.sp
3400 (*FAIL) or (*FAIL:NAME)
3401.sp
3402This verb causes a matching failure, forcing backtracking to occur. It may be
3403abbreviated to (*F). It is equivalent to (?!) but easier to read. The Perl
3404documentation notes that it is probably useful only when combined with (?{}) or
3405(??{}). Those are, of course, Perl features that are not present in PCRE2. The
3406nearest equivalent is the callout feature, as for example in this pattern:
3407.sp
3408 a+(?C)(*FAIL)
3409.sp
3410A match with the string "aaaa" always fails, but the callout is taken before
3411each backtrack happens (in this example, 10 times).
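.P
The figure of 10 comes from counting every way in which a+ can match before
(*FAIL) is reached. The Python sketch below reproduces the count, assuming that
a match is attempted at every starting position and that no optimization skips
a callout. The function name \fBcount_callouts\fP is hypothetical.
.sp
```python
def count_callouts(subject):
    """Count how often the callout in a+(?C)(*FAIL) would be reached."""
    total = 0
    for start in range(len(subject)):
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run, then gives back one character at a
        # time, reaching the callout once for each length from run down to 1.
        total += run
    return total
```
.sp
For "aaaa" the four starting positions contribute 4, 3, 2, and 1 callouts.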
3412.P
3413(*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*ACCEPT) and
3414(*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is recorded just before
3415the verb acts.
3416.
3417.
3418.SS "Recording which path was taken"
3419.rs
3420.sp
3421There is one verb whose main purpose is to track how a match was arrived at,
3422though it also has a secondary use in conjunction with advancing the match
3423starting point (see (*SKIP) below).
3424.sp
3425 (*MARK:NAME) or (*:NAME)
3426.sp
3427A name is always required with this verb. For all the other backtracking
3428control verbs, a NAME argument is optional.
3429.P
When a match succeeds, the last-encountered mark name on the
3431matching path is passed back to the caller as described in the section entitled
3432.\" HTML <a href="pcre2api.html#matchotherdata">
3433.\" </a>
3434"Other information about the match"
3435.\"
3436in the
3437.\" HREF
3438\fBpcre2api\fP
3439.\"
3440documentation. This applies to all instances of (*MARK) and other verbs,
3441including those inside assertions and atomic groups. However, there are
3442differences in those cases when (*MARK) is used in conjunction with (*SKIP) as
3443described below.
.P
The mark name that was last encountered on the matching path is passed back. A
verb without a NAME argument is ignored for this purpose. Here is an example of
\fBpcre2test\fP output, where the "mark" modifier requests the retrieval and
outputting of (*MARK) data:
.sp
    re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
  data> XY
   0: XY
  MK: A
  XZ
   0: XZ
  MK: B
.sp
The (*MARK) name is tagged with "MK:" in this output, and in this example it
indicates which of the two alternatives matched. This is a more efficient way
of obtaining this information than putting each alternative in its own
capturing parentheses.
.P
If a verb with a name is encountered in a positive assertion that is true, the
name is recorded and passed back if it is the last-encountered. This does not
happen for negative assertions or failing positive assertions.
.P
After a partial match or a failed match, the last encountered name in the
entire match process is returned. For example:
.sp
    re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
  data> XP
  No match, mark = B
.sp
Note that in this unanchored example the mark is retained from the match
attempt that started at the letter "X" in the subject. Subsequent match
attempts starting at "P" and then with an empty string do not get as far as the
(*MARK) item, but nevertheless do not reset it.
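.P
The rules for which name is reported can be illustrated with a small Python
model. This is only a sketch, not PCRE2 code: the function
\fBmatch_with_mark()\fP and the (prefix, name, suffix) encoding of each
alternative are hypothetical, the pattern is assumed to be unanchored, and
start-of-match optimizations are assumed to be disabled.
.sp
.nf
```python
def match_with_mark(subject, alternatives):
    """Toy model of patterns shaped like X(*MARK:A)Y|X(*MARK:B)Z.

    Each alternative is a (prefix, mark_name, suffix) tuple. Returns
    (matched_text or None, mark_name or None).
    """
    last_mark = None  # last mark met anywhere in the whole match process
    # Unanchored: try every start offset, including the end of the subject.
    for start in range(len(subject) + 1):
        for prefix, name, suffix in alternatives:
            if not subject.startswith(prefix, start):
                continue  # failed before reaching the (*MARK)
            last_mark = name  # the (*MARK) has now been passed
            if subject.startswith(suffix, start + len(prefix)):
                # Success: report the mark on the matching path.
                return prefix + suffix, name
    # Failure: the last mark met in the entire process is retained.
    return None, last_mark


ALTS = [("X", "A", "Y"), ("X", "B", "Z")]
print(match_with_mark("XY", ALTS))
print(match_with_mark("XZ", ALTS))
print(match_with_mark("XP", ALTS))
```
.fi
.sp
In this model the subjects "XY" and "XZ" report the names A and B
respectively, and "XP" reports no match together with the name B, which
corresponds to the "No match, mark = B" output shown earlier.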
.P
If you are interested in (*MARK) values after failed matches, you should
probably set the PCRE2_NO_START_OPTIMIZE option
.\" HTML <a href="#nooptimize">
.\" </a>
(see above)
.\"
to ensure that the match is always attempted.
.
.
.SS "Verbs that act after backtracking"
.rs
.sp
The following verbs do nothing when they are encountered. Matching continues
with what follows, but if there is a subsequent match failure, causing a
backtrack to the verb, a failure is forced. That is, backtracking cannot pass
to the left of the verb. However, when one of these verbs appears inside an
atomic group or in a lookaround assertion that is true, its effect is confined
to that group, because once the group has been matched, there is never any
backtracking into it. Backtracking from beyond an assertion or an atomic group
ignores the entire group, and seeks a preceding backtracking point.
.P
These verbs differ in exactly what kind of failure occurs when backtracking
reaches them. The behaviour described below is what happens when the verb is
not in a subroutine or an assertion. Subsequent sections cover these special
cases.
.sp
  (*COMMIT) or (*COMMIT:NAME)
.sp
This verb causes the whole match to fail outright if there is a later matching
failure that causes backtracking to reach it. Even if the pattern is
unanchored, no further attempts to find a match by advancing the starting point
take place. If (*COMMIT) is the only backtracking verb that is encountered,
once it has been passed \fBpcre2_match()\fP is committed to finding a match at
the current starting point, or not at all. For example:
.sp
  a+(*COMMIT)b
.sp
This matches "xxaab" but not "aacaab". It can be thought of as a kind of
dynamic anchor, or "I've started, so I must finish."
.P
The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COMMIT). It is
like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names that are set with
(*MARK), ignoring those set by any of the other backtracking verbs.
.P
If there is more than one backtracking verb in a pattern, a different one that
follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a
match does not always guarantee that a match must be at this starting point.
.P
Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
unless PCRE2's start-of-match optimizations are turned off, as shown in this
output from \fBpcre2test\fP:
.sp
    re> /(*COMMIT)abc/
  data> xyzabc
   0: abc
  data>
    re> /(*COMMIT)abc/no_start_optimize
  data> xyzabc
  No match
.sp
For the first pattern, PCRE2 knows that any match must start with "a", so the
optimization skips along the subject to "a" before applying the pattern to the
first set of data. The match attempt then succeeds. The second pattern disables
the optimization that skips along to the first character. The pattern is now
applied starting at "x", and so the (*COMMIT) causes the match to fail without
trying any other starting points.
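.P
The effect of the start-of-match optimization on a leading (*COMMIT) can be
modelled in a few lines of Python. This is an illustrative sketch rather than
PCRE2's implementation: \fBcommit_then_literal()\fP is a hypothetical helper,
the pattern is assumed to be (*COMMIT) followed by a literal string, and the
only optimization modelled is the skip to the first occurrence of the
pattern's first character.
.sp
.nf
```python
def commit_then_literal(subject, literal, start_optimize=True):
    """Toy model of (*COMMIT) followed by a literal, unanchored.

    Returns the offset of the match, or None when there is no match.
    """
    start = 0
    if start_optimize:
        # Modelled optimization: move to the first place where the
        # pattern's first character occurs before running the matcher.
        start = subject.find(literal[0])
        if start < 0:
            return None
    # (*COMMIT) is passed at once, so this is the only attempt allowed.
    if subject.startswith(literal, start):
        return start
    return None


print(commit_then_literal("xyzabc", "abc"))
print(commit_then_literal("xyzabc", "abc", start_optimize=False))
```
.fi
.sp
With the subject "xyzabc" and the literal "abc", the model returns offset 3
when the optimization is enabled and None when it is disabled, in line with
the \fBpcre2test\fP output for the two patterns.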
.sp
  (*PRUNE) or (*PRUNE:NAME)
.sp
This verb causes the match to fail at the current starting position in the
subject if there is a later matching failure that causes backtracking to reach
it. If the pattern is unanchored, the normal "bumpalong" advance to the next
starting character then happens. Backtracking can occur as usual to the left of
(*PRUNE), before it is reached, or when matching to the right of (*PRUNE), but
if there is no match to the right, backtracking cannot cross (*PRUNE). In
simple cases, the use of (*PRUNE) is just an alternative to an atomic group or
possessive quantifier, but there are some uses of (*PRUNE) that cannot be
expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
as (*COMMIT).
.P
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
ignoring those set by other backtracking verbs.
.sp
  (*SKIP)
.sp
This verb, when given without a name, is like (*PRUNE), except that if the
pattern is unanchored, the "bumpalong" advance is not to the next character,
but to the position in the subject where (*SKIP) was encountered. (*SKIP)
signifies that whatever text was matched leading up to it cannot be part of a
successful match if there is a later mismatch. Consider:
.sp
  a+(*SKIP)b
.sp
If the subject is "aaaac...", after the first match attempt fails (starting at
the first character in the string), the starting point skips on to start the
next attempt at "c". Note that a possessive quantifier does not have the same
effect as this example; although it would suppress backtracking during the
first match attempt, the second attempt would start at the second character
instead of skipping on to "c".
.P
If (*SKIP) is used to specify a new starting position that is the same as the
starting position of the current match, or (by being inside a lookbehind)
earlier, the position specified by (*SKIP) is ignored, and instead the normal
"bumpalong" occurs.
.sp
  (*SKIP:NAME)
.sp
When (*SKIP) has an associated name, its behaviour is modified. When such a
(*SKIP) is triggered, the previous path through the pattern is searched for the
most recent (*MARK) that has the same name. If one is found, the "bumpalong"
advance is to the subject position that corresponds to that (*MARK) instead of
to where (*SKIP) was encountered. If no (*MARK) with a matching name is found,
the (*SKIP) is ignored.
.P
The search for a (*MARK) name uses the normal backtracking mechanism, which
means that it does not see (*MARK) settings that are inside atomic groups or
assertions, because they are never re-entered by backtracking. Compare the
following \fBpcre2test\fP examples:
.sp
    re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
  data> abc
   0: a
   1: a
  data>
    re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
  data> abc
   0: b
   1: b
.sp
In the first example, the (*MARK) setting is in an atomic group, so it is not
seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored. This allows
the second branch of the pattern to be tried at the first character position.
In the second example, the (*MARK) setting is not in an atomic group. This
allows (*SKIP:X) to find the (*MARK) when it backtracks, and this causes a new
matching attempt to start at the second character. This time, the (*MARK) is
never seen because "a" does not match "b", so the matcher immediately jumps to
the second branch of the pattern.
.P
Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It ignores
names that are set by other backtracking verbs.
.sp
  (*THEN) or (*THEN:NAME)
.sp
This verb causes a skip to the next innermost alternative when backtracking
reaches it. That is, it cancels any further backtracking within the current
alternative. Its name comes from the observation that it can be used for a
pattern-based if-then-else block:
.sp
  ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
.sp
If the COND1 pattern matches, FOO is tried (and possibly further items after
the end of the group if FOO succeeds); on failure, the matcher skips to the
second alternative and tries COND2, without backtracking into COND1. If that
succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no
more alternatives, so there is a backtrack to whatever came before the entire
group. If (*THEN) is not inside an alternation, it acts like (*PRUNE).
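.P
The if-then-else reading of (*THEN) can be sketched in Python. The model below
is an illustration built on stated assumptions, not PCRE2 code: the helpers
\fBgreedy_run()\fP, \fBliteral()\fP, and \fBmatch_group()\fP are hypothetical,
only a match attempt at offset 0 is modelled, and nothing follows the group.
Each condition offers its possible end offsets in backtracking order, and
(*THEN) is modelled by refusing to try any but the first of them.
.sp
.nf
```python
def greedy_run(ch):
    """Condition like ch+ that yields end offsets, longest first."""
    def cond(subject, pos):
        end = pos
        while end < len(subject) and subject[end] == ch:
            end += 1
        # Each shorter end is a backtracking point inside the condition.
        for candidate in range(end, pos, -1):
            yield candidate
    return cond


def literal(text):
    """Body that matches fixed text; returns the end offset or None."""
    def body(subject, pos):
        return pos + len(text) if subject.startswith(text, pos) else None
    return body


def match_group(subject, branches, use_then):
    """Try each (COND, BODY) branch at offset 0.

    Returns (branch_index, end_offset), or None if every branch fails.
    """
    for index, (cond, body) in enumerate(branches):
        for cond_end in cond(subject, 0):
            end = body(subject, cond_end)
            if end is not None:
                return index, end
            if use_then:
                break  # (*THEN): do not backtrack into COND
    return None


# Models (a+(*THEN)ab|a+(*THEN)b) when use_then is True,
# and (a+ab|a+b) when it is False.
BRANCHES = [(greedy_run("a"), literal("ab")), (greedy_run("a"), literal("b"))]
print(match_group("aaab", BRANCHES, use_then=True))
print(match_group("aaab", BRANCHES, use_then=False))
```
.fi
.sp
For the subject "aaab", the model with (*THEN) matches through the second
alternative, whereas the same group without (*THEN) matches through the first,
because a+ is then allowed to give back one character.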
.P
The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). It is
like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
ignoring those set by other backtracking verbs.
.P
A group that does not contain a | character is just a part of the enclosing
alternative; it is not a nested alternation with only one alternative. The
effect of (*THEN) extends beyond such a group to the enclosing alternative.
Consider this pattern, where A, B, etc. are complex pattern fragments that do
not contain any | characters at this level:
.sp
  A (B(*THEN)C) | D
.sp
If A and B are matched, but there is a failure in C, matching does not
backtrack into A; instead it moves to the next alternative, that is, D.
However, if the group containing (*THEN) is given an alternative, it
behaves differently:
.sp
  A (B(*THEN)C | (*FAIL)) | D
.sp
The effect of (*THEN) is now confined to the inner group. After a failure in C,
matching moves to (*FAIL), which causes the whole group to fail because there
are no more alternatives to try. In this case, matching does backtrack into A.
.P
Note that a conditional group is not considered as having two alternatives,
because only one is ever used. In other words, the | character in a conditional
group has a different meaning. Ignoring white space, consider:
.sp
  ^.*? (?(?=a) a | b(*THEN)c )
.sp
If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
it initially matches zero characters. The condition (?=a) then fails, the
character "b" is matched, but "c" is not. At this point, matching does not
backtrack to .*? as might perhaps be expected from the presence of the |
character. The conditional group is part of the single alternative that
comprises the whole pattern, and so the match fails. (If there was a backtrack
into .*?, allowing it to match "b", the match would succeed.)
.P
The verbs just described provide four different "strengths" of control when
subsequent matching fails. (*THEN) is the weakest, carrying on the match at the
next alternative. (*PRUNE) comes next, failing the match at the current
starting position, but allowing an advance to the next character (for an
unanchored pattern). (*SKIP) is similar, except that the advance may be more
than one character. (*COMMIT) is the strongest, causing the entire match to
fail.
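.P
The difference between (*PRUNE), (*SKIP), and (*COMMIT) in where the next
match attempt starts can be modelled in Python for the pattern shape
a+(*VERB)b. This is a sketch under stated assumptions, not PCRE2's matcher:
\fBstart_offsets_tried()\fP is a hypothetical function, the pattern is
unanchored, and start-of-match optimizations are assumed to be disabled, so
every start offset up to and including the end of the subject may be tried.
.sp
.nf
```python
def start_offsets_tried(subject, verb=None):
    """Toy model of a+(*VERB)b.

    verb is None, "PRUNE", "SKIP" or "COMMIT". Returns (offsets, match),
    where offsets lists the start offsets tried and match is the matched
    text or None.
    """
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end == start:
            start += 1  # a+ failed before the verb: normal bumpalong
            continue
        if subject.startswith("b", end):
            return tried, subject[start:end + 1]
        # The "b" failed. Giving back "a"s cannot help for this pattern,
        # so the attempt fails and the verb decides what happens next.
        if verb == "COMMIT":
            break  # no further starting points are tried
        if verb == "SKIP":
            start = end  # resume where (*SKIP) was passed
        else:
            start += 1  # (*PRUNE) or no verb: advance one character
    return tried, None


for name in (None, "PRUNE", "SKIP", "COMMIT"):
    print(name, start_offsets_tried("aaaac", name))
```
.fi
.sp
For the subject "aaaac" the model tries offsets 0 to 5 with no verb or with
(*PRUNE), offsets 0, 4, and 5 with (*SKIP), and offset 0 only with (*COMMIT).
With (*COMMIT) it also matches "xxaab" but not "aacaab", as stated in the
description of that verb.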
.
.
.SS "More than one backtracking verb"
.rs
.sp
If more than one backtracking verb is present in a pattern, the one that is
backtracked onto first acts. For example, consider this pattern, where A, B,
etc. are complex pattern fragments:
.sp
  (A(*COMMIT)B(*THEN)C|ABD)
.sp
If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to
fail. However, if A and B match, but C fails, the backtrack to (*THEN) causes
the next alternative (ABD) to be tried. This behaviour is consistent, but is
not always the same as Perl's. It means that if two or more backtracking verbs
appear in succession, all but the last of them have no effect. Consider this
example:
.sp
  ...(*COMMIT)(*PRUNE)...
.sp
If there is a matching failure to the right, backtracking onto (*PRUNE) causes
it to be triggered, and its action is taken. There can never be a backtrack
onto (*COMMIT).
.
.
.\" HTML <a name="btrepeat"></a>
.SS "Backtracking verbs in repeated groups"
.rs
.sp
PCRE2 sometimes differs from Perl in its handling of backtracking verbs in
repeated groups. For example, consider:
.sp
  /(a(*COMMIT)b)+ac/
.sp
If the subject is "abac", Perl matches unless its optimizations are disabled,
but PCRE2 always fails because the (*COMMIT) in the second repeat of the group
acts.
.
.
.\" HTML <a name="btassert"></a>
.SS "Backtracking verbs in assertions"
.rs
.sp
(*FAIL) in any assertion has its normal effect: it forces an immediate
backtrack. The behaviour of the other backtracking verbs depends on whether
the assertion is standalone or acting as the condition in a conditional
group.
.P
(*ACCEPT) in a standalone positive assertion causes the assertion to succeed
without any further processing; captured strings and a mark name (if set) are
retained. In a standalone negative assertion, (*ACCEPT) causes the assertion to
fail without any further processing; captured substrings and any mark name are
discarded.
.P
If the assertion is a condition, (*ACCEPT) causes the condition to be true for
a positive assertion and false for a negative one; captured substrings are
retained in both cases.
.P
The remaining verbs act only when a later failure causes a backtrack to
reach them. This means that, for the Perl-compatible assertions, their effect
is confined to the assertion, because Perl lookaround assertions are atomic. A
backtrack that occurs after such an assertion is complete does not jump back
into the assertion. Note in particular that a (*MARK) name that is set in an
assertion is not "seen" by an instance of (*SKIP:NAME) later in the pattern.
.P
PCRE2 now supports non-atomic positive assertions, as described in the section
entitled
.\" HTML <a href="#nonatomicassertions">
.\" </a>
"Non-atomic assertions"
.\"
above. These assertions must be standalone (not used as conditions). They are
not Perl-compatible. For these assertions, a later backtrack does jump back
into the assertion, and therefore verbs such as (*COMMIT) can be triggered by
backtracks from later in the pattern.
.P
The effect of (*THEN) is not allowed to escape beyond an assertion. If there
are no more branches to try, (*THEN) causes a positive assertion to be false,
and a negative assertion to be true.
.P
The other backtracking verbs are not treated specially if they appear in a
standalone positive assertion. In a conditional positive assertion,
backtracking (from within the assertion) into (*COMMIT), (*SKIP), or (*PRUNE)
causes the condition to be false. However, for both standalone and conditional
negative assertions, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes
the assertion to be true, without considering any further alternative branches.
.
.
.\" HTML <a name="btsub"></a>
.SS "Backtracking verbs in subroutines"
.rs
.sp
These behaviours occur whether or not the group is called recursively.
.P
(*ACCEPT) in a group called as a subroutine causes the subroutine match to
succeed without any further processing. Matching then continues after the
subroutine call. Perl documents this behaviour. Perl's treatment of the other
verbs in subroutines is different in some cases.
.P
(*FAIL) in a group called as a subroutine has its normal effect: it forces
an immediate backtrack.
.P
(*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail when
triggered by being backtracked to in a group called as a subroutine. There is
then a backtrack at the outer level.
.P
(*THEN), when triggered, skips to the next alternative in the innermost
enclosing group that has alternatives (its normal behaviour). However, if there
is no such group within the subroutine's group, the subroutine match fails and
there is a backtrack at the outer level.
.
.
.SH "SEE ALSO"
.rs
.sp
\fBpcre2api\fP(3), \fBpcre2callout\fP(3), \fBpcre2matching\fP(3),
\fBpcre2syntax\fP(3), \fBpcre2\fP(3).
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
Retired from University Computing Service
Cambridge, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 12 January 2022
Copyright (c) 1997-2022 University of Cambridge.
.fi