Upgrade to pcre2 version 10.31.
Bug: N/A
Test: builds and boots, getprop -Z works
Change-Id: I2fbda9427edc9e5d966333a567b50539e17ed48d
diff --git a/dist2/HACKING b/dist2/HACKING
index 883aa64..d727add 100644
--- a/dist2/HACKING
+++ b/dist2/HACKING
@@ -7,8 +7,8 @@
library is referred to as PCRE1 below. For information about testing PCRE2, see
the pcre2test documentation and the comment at the head of the RunTest file.
-PCRE1 releases were up to 8.3x when PCRE2 was developed. The 8.xx series will
-continue for bugfixes if necessary. PCRE2 releases started at 10.00 to avoid
+PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
+releases remain in the 8.xx series. PCRE2 releases started at 10.00 to avoid
confusion with PCRE1.
@@ -16,19 +16,20 @@
-----------------
Many years ago I implemented some regular expression functions to an algorithm
-suggested by Martin Richards. These were not Unix-like in form, and were quite
-restricted in what they could do by comparison with Perl. The interesting part
-about the algorithm was that the amount of space required to hold the compiled
-form of an expression was known in advance. The code to apply an expression did
-not operate by backtracking, as the original Henry Spencer code and current
-PCRE2 and Perl code does, but instead checked all possibilities simultaneously
-by keeping a list of current states and checking all of them as it advanced
-through the subject string. In the terminology of Jeffrey Friedl's book, it was
-a "DFA algorithm", though it was not a traditional Finite State Machine (FSM).
-When the pattern was all used up, all remaining states were possible matches,
-and the one matching the longest subset of the subject string was chosen. This
-did not necessarily maximize the individual wild portions of the pattern, as is
-expected in Unix and Perl-style regular expressions.
+suggested by Martin Richards. The rather simple patterns were not Unix-like in
+form, and were quite restricted in what they could do by comparison with Perl.
+The interesting part about the algorithm was that the amount of space required
+to hold the compiled form of an expression was known in advance. The code to
+apply an expression did not operate by backtracking, as the original Henry
+Spencer code and current PCRE2 and Perl code do, but instead checked all
+possibilities simultaneously by keeping a list of current states and checking
+all of them as it advanced through the subject string. In the terminology of
+Jeffrey Friedl's book, it was a "DFA algorithm", though it was not a
+traditional Finite State Machine (FSM). When the pattern was all used up, all
+remaining states were possible matches, and the one matching the longest subset
+of the subject string was chosen. This did not necessarily maximize the
+individual wild portions of the pattern, as is expected in Unix and Perl-style
+regular expressions.
Historical note 2
@@ -47,18 +48,20 @@
OK, here's the real stuff
-------------------------
-For the set of functions that formed the original PCRE1 library (which are
-unrelated to those mentioned above), I tried at first to invent an algorithm
-that used an amount of store bounded by a multiple of the number of characters
-in the pattern, to save on compiling time. However, because of the greater
-complexity in Perl regular expressions, I couldn't do this. In any case, a
-first pass through the pattern is helpful for other reasons.
+For the set of functions that formed the original PCRE1 library in 1997 (which
+are unrelated to those mentioned above), I tried at first to invent an
+algorithm that used an amount of store bounded by a multiple of the number of
+characters in the pattern, to save on compiling time. However, because of the
+greater complexity in Perl regular expressions, I couldn't do this, even though
+the then current Perl 5.004 patterns were much simpler than those supported
+nowadays. In any case, a first pass through the pattern is helpful for other
+reasons.
Support for 16-bit and 32-bit data strings
-------------------------------------------
-The library can be compiled in any combination of 8-bit, 16-bit or 32-bit
+The PCRE2 library can be compiled in any combination of 8-bit, 16-bit or 32-bit
modes, creating up to three different libraries. In the description that
follows, the word "short" is used for a 16-bit data quantity, and the phrase
"code unit" is used for a quantity that is a byte in 8-bit mode, a short in
@@ -85,12 +88,12 @@
things I did for 6.8 was to fix Yet Another Bug in the memory computation. Then
I had a flash of inspiration as to how I could run the real compile function in
a "fake" mode that enables it to compute how much memory it would need, while
-actually only ever using a few hundred bytes of working memory, and without too
+in most cases only ever using a small amount of working memory, and without too
many tests of the mode that might slow it down. So I refactored the compiling
-functions to work this way. This got rid of about 600 lines of source. It
-should make future maintenance and development easier. As this was such a major
-change, I never released 6.8, instead upping the number to 7.0 (other quite
-major changes were also present in the 7.0 release).
+functions to work this way. This got rid of about 600 lines of source and made
+further maintenance and development easier. As this was such a major change, I
+never released 6.8, instead upping the number to 7.0 (other quite major changes
+were also present in the 7.0 release).
A side effect of this work was that the previous limit of 200 on the nesting
depth of parentheses was removed. However, there was a downside: compiling ran
@@ -104,20 +107,208 @@
for nested parenthesized groups. This is a safety feature for environments with
small stacks where the patterns are provided by users.
-History repeated itself for release 10.20. A number of bugs relating to named
-subpatterns had been discovered by fuzzers. Most of these were related to the
-handling of forward references when it was not known if the named pattern was
+
+Yet another pattern scan
+------------------------
+
+History repeated itself for PCRE2 release 10.20. A number of bugs relating to
+named subpatterns had been discovered by fuzzers. Most of these were related to
+the handling of forward references when it was not known if the named group was
unique. (References to non-unique names use a different opcode and more
memory.) The use of duplicate group numbers (the (?| facility) also caused
-issues.
+issues.
-To get around these problems I adopted a new approach by adding a third pass,
-really a "pre-pass", over the pattern, which does nothing other than identify
-all the named subpatterns and their corresponding group numbers. This means
-that the actual compile (both pre-pass and real compile) have full knowledge of
-group names and numbers throughout. Several dozen lines of messy code were
-eliminated, though the new pre-pass is not short (skipping over [] classes is
-complicated).
+To get around these problems I adopted a new approach by adding a third pass
+over the pattern (really a "pre-pass"), which did nothing other than identify
+all the named subpatterns and their corresponding group numbers. This means
+that the actual compile (both the memory-computing dummy run and the real
+compile) has full knowledge of group names and numbers throughout. Several
+dozen lines of messy code were eliminated, though the new pre-pass was not
+short. In particular, parsing and skipping over [] classes is complicated.
+
+While working on 10.22 I realized that I could simplify yet again by moving
+more of the parsing into the pre-pass, thus avoiding doing it in two places, so
+after 10.22 was released, the code underwent yet another big refactoring. This
+is how it is from 10.23 onwards:
+
+The function called parse_regex() scans the pattern characters, parsing them
+into literal data and meta characters. It converts escapes such as \x{123}
+into literals, handles \Q...\E, and skips over comments and non-significant
+white space. The result of the scanning is put into a vector of 32-bit unsigned
+integers. Values less than 0x80000000 are literal data. Higher values represent
+meta-characters. The top 16 bits of such values identify the meta-character,
+and these are given names such as META_CAPTURE. The lower 16 bits are available
+for data, for example, the capturing group number. The only situation in which
+literal data values greater than 0x7fffffff can appear is when the 32-bit
+library is running in non-UTF mode. This is handled by having a special
+meta-character that is followed by the 32-bit data value.
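
The element layout described above can be pictured with a small C sketch. All
HYP_ names are hypothetical and are not the library's own macros. The only
value taken from this document is 0x80000000 for META_END; the identifier
chosen for the capture item is invented for illustration.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not the library's own macros. */
#define HYP_META_END      0x80000000u  /* value given in the text for META_END */
#define HYP_META_CAPTURE  0x80010000u  /* invented identifier value */

/* Nonzero if the element is a meta-character rather than literal data. */
static int hyp_is_meta(uint32_t element)
{
  return element >= HYP_META_END;
}

/* Top 16 bits: which meta-character this is. */
static uint32_t hyp_meta_code(uint32_t element)
{
  return element & 0xffff0000u;
}

/* Lower 16 bits: associated data, such as a capturing group number. */
static uint32_t hyp_meta_data(uint32_t element)
{
  return element & 0x0000ffffu;
}
```

Under these assumptions, the element HYP_META_CAPTURE | 3 would stand for the
start of capturing group 3, while the element 'a' is plain literal data.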
+
+The size of the parsed pattern vector, when auto-callouts are not enabled, is
+bounded by the length of the pattern (with one exception). The code is written
+so that each item in the pattern uses no more vector elements than the number
+of code units in the item itself. The exception is the aforementioned large
+32-bit number handling. For this reason, 32-bit non-UTF patterns are scanned in
+advance to check for such values. When auto-callouts are enabled, the generous
+assumption is made that there will be a callout for each pattern code unit
+(which of course is only actually true if all code units are literals) plus one
+at the end. There is a default parsed pattern vector on the system stack, but
+if this is not big enough, heap memory is used.
+
+As before, the actual compiling function is run twice, the first time to
+determine the amount of memory needed for the final compiled pattern. It
+now processes the parsed pattern vector, not the pattern itself, although some
+of the parsed items refer to strings in the pattern - for example, group
+names. As escapes and comments have already been processed, the code is a bit
+simpler than before.
+
+Most errors can be diagnosed during the parsing scan. For those that cannot
+(for example, "lookbehind assertion is not fixed length"), the parsed code
+contains offsets into the pattern so that the actual compiling code can
+report where errors are.
+
+
+The elements of the parsed pattern vector
+-----------------------------------------
+
+The word "offset" below means a code unit offset into the pattern. When
+PCRE2_SIZE (which is usually size_t) is no bigger than uint32_t, an offset is
+stored in a single parsed pattern element. Otherwise (typically on 64-bit
+systems) it occupies two elements. The following meta items occupy just one
+element, with no data:
+
+META_ACCEPT (*ACCEPT)
+META_ASTERISK *
+META_ASTERISK_PLUS *+
+META_ASTERISK_QUERY *?
+META_ATOMIC (?> start of atomic group
+META_CIRCUMFLEX ^ metacharacter
+META_CLASS [ start of non-empty class
+META_CLASS_EMPTY [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
+META_CLASS_EMPTY_NOT [^] negative empty class - ditto
+META_CLASS_END ] end of non-empty class
+META_CLASS_NOT [^ start non-empty negative class
+META_COMMIT (*COMMIT)
+META_COND_ASSERT (?(?assertion)
+META_DOLLAR $ metacharacter
+META_DOT . metacharacter
+META_END End of pattern (this value is 0x80000000)
+META_FAIL (*FAIL)
+META_KET ) closing parenthesis
+META_LOOKAHEAD (?= start of lookahead
+META_LOOKAHEADNOT (?! start of negative lookahead
+META_NOCAPTURE (?: no capture parens
+META_PLUS +
+META_PLUS_PLUS ++
+META_PLUS_QUERY +?
+META_PRUNE (*PRUNE) - no argument
+META_QUERY ?
+META_QUERY_PLUS ?+
+META_QUERY_QUERY ??
+META_RANGE_ESCAPED hyphen in class range with at least one escape
+META_RANGE_LITERAL hyphen in class range defined literally
+META_SKIP (*SKIP) - no argument
+META_THEN (*THEN) - no argument
+
+The two RANGE values occur only in character classes. They are positioned
+between two literals that define the start and end of the range. In an EBCDIC
+environment it is necessary to know whether either of the range values was
+specified as an escape. In an ASCII/Unicode environment the distinction is not
+relevant.
+
+The following have data in the lower 16 bits, and may be followed by other data
+elements:
+
+META_ALT | alternation
+META_BACKREF back reference
+META_CAPTURE start of capturing group
+META_ESCAPE non-literal escape sequence
+META_RECURSE recursion call
+
+If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
+is the length of its branch, for which OP_REVERSE must be generated.
+
+META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
+their data in the lower 16 bits of the element.
+
+META_BACKREF is followed by an offset if the back reference group number is 10
+or more. The offsets of the first occurrences of references to groups whose
+numbers are less than 10 are put in cb->small_ref_offset[] (only the first
+occurrence is useful). On 64-bit systems this avoids using more than two parsed
+pattern elements for items such as \3. The offset is used when an error occurs
+because the reference is to a non-existent group.
+
+META_RECURSE is always followed by an offset, for use in error messages.
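
Several of the items above are followed by an "offset", which takes one parsed
pattern element or two depending on the width of PCRE2_SIZE. A minimal sketch
of that idea follows, using size_t in place of PCRE2_SIZE. The helper names
are hypothetical, and storing the high word first is an assumption of the
sketch, not something this document states.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store a pattern offset at vec, using one element when
   size_t fits in 32 bits and two elements otherwise. The high-word-first
   order is an assumption. Returns the number of elements written. */
static size_t hyp_put_offset(uint32_t *vec, size_t offset)
{
  if (sizeof(size_t) <= sizeof(uint32_t))
    {
    vec[0] = (uint32_t)offset;
    return 1;
    }
  vec[0] = (uint32_t)((uint64_t)offset >> 32);
  vec[1] = (uint32_t)((uint64_t)offset & 0xffffffffu);
  return 2;
}

/* Hypothetical helper: read back an offset written by hyp_put_offset().
   Returns the number of elements consumed. */
static size_t hyp_get_offset(const uint32_t *vec, size_t *offset)
{
  if (sizeof(size_t) <= sizeof(uint32_t))
    {
    *offset = (size_t)vec[0];
    return 1;
    }
  *offset = (size_t)(((uint64_t)vec[0] << 32) | vec[1]);
  return 2;
}
```

Returning the element count lets a caller step over the offset without
knowing which of the two layouts is in use.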
+
+META_ESCAPE has an ESC_xxx value as its data. For ESC_P and ESC_p, the next
+element contains the 16-bit type and data property values, packed together.
+ESC_g and ESC_k are used only for named references - numerical ones are turned
+into META_RECURSE or META_BACKREF as appropriate. ESC_g and ESC_k are followed
+by a length and an offset into the pattern to specify the name.
+
+The following have one data item that follows in the next vector element:
+
+META_BIGVALUE Next is a literal >= META_END
+META_OPTIONS (?i) and friends (data is new option bits)
+META_POSIX POSIX class item (data identifies the class)
+META_POSIX_NEG negative POSIX class item (ditto)
+
+The following are followed by a length element, then a number of character code
+values (which should match with the length):
+
+META_MARK (*MARK:xxxx)
+META_PRUNE_ARG (*PRUNE:xxx)
+META_SKIP_ARG (*SKIP:xxxx)
+META_THEN_ARG (*THEN:xxxx)
+
+The following are followed by a length element, then an offset in the pattern
+that identifies the name:
+
+META_COND_NAME (?(<name>) or (?('name') or (?(name)
+META_COND_RNAME (?(R&name)
+META_COND_RNUMBER (?(Rdigits)
+META_RECURSE_BYNAME (?&name)
+META_BACKREF_BYNAME \k'name'
+
+META_COND_RNUMBER is used for names that start with R and continue with digits,
+because this is an ambiguous case. It could be a back reference to a group with
+that name, or it could be a recursion test on a numbered group.
+
+This one is followed by an offset, for use in error messages, then a number:
+
+META_COND_NUMBER (?([+-]digits)
+
+The following is followed just by an offset, for use in error messages:
+
+META_COND_DEFINE (?(DEFINE)
+
+The following are also followed just by an offset, but also the lower 16 bits
+of the main word contain the length of the first branch of the lookbehind
+group; this is used when generating OP_REVERSE for that branch.
+
+META_LOOKBEHIND (?<=
+META_LOOKBEHINDNOT (?<!
+
+The following are followed by two elements, the minimum and maximum. Repeat
+values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
+represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
+
+META_MINMAX {n,m} repeat
+META_MINMAX_PLUS {n,m}+ repeat
+META_MINMAX_QUERY {n,m}? repeat
+
+This one is followed by three elements. The first is 0 for '>' and 1 for '>=';
+the next two are the major and minor numbers:
+
+META_COND_VERSION (?(VERSION<op>x.y)
+
+Callouts are converted into one of two items:
+
+META_CALLOUT_NUMBER (?C with numerical argument
+META_CALLOUT_STRING (?C with string argument
+
+In both cases, the next two elements contain the offset and length of the next
+item in the pattern. Then there is either one callout number, or a length and
+an offset for the string argument. The length includes both delimiters.
Traditional matching function
@@ -154,9 +345,14 @@
------------------
The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
-some others) may change in the middle of patterns. Their processing is handled
-entirely at compile time by generating different opcodes for the different
-settings. The runtime functions do not need to keep track of an options state.
+others) may be changed in the middle of patterns by items such as (?i). Their
+processing is handled entirely at compile time by generating different opcodes
+for the different settings. The runtime functions do not need to keep track of
+an options state.
+
+PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
+are tracked and processed during the parsing pre-pass. The others are handled
+from META_OPTIONS items during the main compile phase.
Format of compiled patterns
@@ -180,7 +376,7 @@
In this description, we assume the "normal" compilation options. Data values
that are counts (e.g. quantifiers) are always two bytes long in 8-bit mode
-(most significant byte first), or one code unit in 16-bit and 32-bit modes.
+(most significant byte first), and one code unit in 16-bit and 32-bit modes.
Opcodes with no following data
@@ -220,16 +416,16 @@
OP_ACCEPT ) These are Perl 5.10's "backtracking control
OP_COMMIT ) verbs". If OP_ACCEPT is inside capturing
OP_FAIL ) parentheses, it may be preceded by one or more
- OP_PRUNE ) OP_CLOSE, each followed by a count that
+ OP_PRUNE ) OP_CLOSE, each followed by a number that
OP_SKIP ) indicates which parentheses must be closed.
OP_THEN )
OP_ASSERT_ACCEPT is used when (*ACCEPT) is encountered within an assertion.
-This ends the assertion, not the entire pattern match. The assertion (?!) is
+This ends the assertion, not the entire pattern match. The assertion (?!) is
always optimized to OP_FAIL.
OP_ALLANY is used for '.' when PCRE2_DOTALL is set. It is also used for \C in
-non-UTF modes and in UTF-32 mode (since one code unit still equals one
+non-UTF modes and in UTF-32 mode (since one code unit still equals one
character). Another use is for [^] when empty classes are permitted
(PCRE2_ALLOW_EMPTY_CLASS is set).
@@ -248,14 +444,22 @@
---------------------------
The OP_CHAR opcode is followed by a single character that is to be matched
-casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,
-the character may be more than one code unit long. In UTF-32 mode, characters
-are always exactly one code unit long.
+casefully. For caseless matching of characters that have at most two
+case-equivalent code points, OP_CHARI is used. In UTF-8 or UTF-16 modes, the
+character may be more than one code unit long. In UTF-32 mode, characters are
+always exactly one code unit long.
If there is only one character in a character class, OP_CHAR or OP_CHARI is
used for a positive class, and OP_NOT or OP_NOTI for a negative one (that is,
for something like [^a]).
+Caseless matching (positive or negative) of characters that have more than two
+case-equivalent code points (which is possible only in UTF mode) is handled by
+compiling a Unicode property item (see below), with the pseudo-property
+PT_CLIST. The value of this property is an offset in a vector called
+"ucd_caseless_sets" which identifies the start of a short list of equivalent
+characters, terminated by the value NOTACHAR (0xffffffff).
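
To make the PT_CLIST idea concrete, here is a hedged C sketch of a lookup in a
vector of NOTACHAR-terminated lists. The table contents and all HYP_ names are
made up for illustration and do not reproduce the real "ucd_caseless_sets"
vector. The two sample sets (K, k, KELVIN SIGN and S, s, LATIN SMALL LETTER
LONG S) are genuine examples of three case-equivalent code points.

```c
#include <stdint.h>

#define HYP_NOTACHAR 0xffffffffu  /* terminator value given in the text */

/* Illustrative table only: two short lists, each ended by HYP_NOTACHAR. */
static const uint32_t hyp_sets[] = {
  HYP_NOTACHAR,
  0x004b, 0x006b, 0x212a, HYP_NOTACHAR,   /* K, k, KELVIN SIGN; offset 1 */
  0x0053, 0x0073, 0x017f, HYP_NOTACHAR    /* S, s, LONG S; offset 5 */
};

/* Return 1 if c is in the list that starts at the given offset, else 0. */
static int hyp_in_caseless_set(const uint32_t *sets, uint32_t offset,
                               uint32_t c)
{
  for (const uint32_t *p = sets + offset; *p != HYP_NOTACHAR; p++)
    if (*p == c) return 1;
  return 0;
}
```

With this table, a property value of 1 would select the K list and a value of
5 the S list.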
+
Repeating single characters
---------------------------
@@ -331,7 +535,8 @@
and a value. The types are a set of #defines of the form PT_xxx, and the values
are enumerations of the form ucp_xx, defined in the pcre2_ucp.h source file.
The value is relevant only for PT_GC (General Category), PT_PC (Particular
-Category), and PT_SC (Script).
+Category), PT_SC (Script), and the pseudo-property PT_CLIST, which is used to
+identify a list of case-equivalent characters when there are three or more.
Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
three code units: OP_PROP or OP_NOTPROP, and then the desired property type and
@@ -343,7 +548,10 @@
If there is only one character in a class, OP_CHAR or OP_CHARI is used for a
positive class, and OP_NOT or OP_NOTI for a negative one (that is, for
-something like [^a]).
+something like [^a]), except when caselessly matching a character that has more
+than two case-equivalent code points (which can happen only in UTF mode). In
+this case a Unicode property item is used, as described above in "Matching
+literal characters".
A set of repeating opcodes (called OP_NOTSTAR etc.) are used for repeated,
negated, single-character classes. The normal single-character opcodes
@@ -364,8 +572,8 @@
For classes containing characters with values greater than 255 or that contain
\p or \P, OP_XCLASS is used. It optionally uses a bit map if any acceptable
code points are less than 256, followed by a list of pairs (for a range) and/or
-single characters and/or properties. In caseless mode, both cases are
-explicitly listed.
+single characters and/or properties. In caseless mode, all equivalent
+characters are explicitly listed.
OP_XCLASS is followed by a LINK_SIZE value containing the total length of the
opcode and its data. This is followed by a code unit containing flag bits:
@@ -422,8 +630,8 @@
OP_CRMINRANGE
OP_CRPOSRANGE
-All but the last three are single-code-unit items, with no data. The others are
-followed by the minimum and maximum repeat counts.
+All but the last three are single-code-unit items, with no data. The range
+opcodes are followed by the minimum and maximum repeat counts.
Brackets and alternation
@@ -438,16 +646,17 @@
Non-capturing brackets use the opcode OP_BRA, capturing brackets use OP_CBRA. A
bracket opcode is followed by a LINK_SIZE value which gives the offset to the
-next alternative OP_ALT or, if there aren't any branches, to the matching
-OP_KET opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset
-to the next one, or to the OP_KET opcode. For capturing brackets, the bracket
-number is a count that immediately follows the offset.
+next alternative OP_ALT or, if there aren't any branches, to the terminating
+opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset to the
+next one, or to the final opcode. For capturing brackets, the bracket number is
+a count that immediately follows the offset.
-OP_KET is used for subpatterns that do not repeat indefinitely, and OP_KETRMIN
-and OP_KETRMAX are used for indefinite repetitions, minimally or maximally
-respectively (see below for possessive repetitions). All three are followed by
-a LINK_SIZE value giving (as a positive number) the offset back to the matching
-bracket opcode.
+There are several opcodes that mark the end of a subpattern group. OP_KET is
+used for subpatterns that do not repeat indefinitely, OP_KETRMIN and
+OP_KETRMAX are used for indefinite repetitions, minimally or maximally
+respectively, and OP_KETRPOS for possessive repetitions (see below for more
+details). All four are followed by a LINK_SIZE value giving (as a positive
+number) the offset back to the matching bracket opcode.
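
The chain of links through a group can be followed with a few lines of C. This
is only a sketch under stated assumptions: the 8-bit library with LINK_SIZE
set to 2, link values stored most significant byte first, forward offsets
measured from the opcode that carries them, and invented opcode numbers (the
real ones are defined in pcre2_internal.h).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcode numbers, for illustration only. */
enum { HYP_OP_CHAR = 1, HYP_OP_BRA, HYP_OP_ALT, HYP_OP_KET };

/* Read a two-byte link value, most significant byte first. */
static size_t hyp_link(const uint8_t *p)
{
  return ((size_t)p[0] << 8) | p[1];
}

/* Hand-assembled bytes for (?:a|b) under the assumptions above. */
static const uint8_t hyp_sample[] = {
  HYP_OP_BRA, 0, 5,  HYP_OP_CHAR, 'a',
  HYP_OP_ALT, 0, 5,  HYP_OP_CHAR, 'b',
  HYP_OP_KET, 0, 10
};

/* Follow the forward links from the bracket opcode at code[start]. Stores
   the number of branches and returns the index of the terminating opcode. */
static size_t hyp_find_ket(const uint8_t *code, size_t start,
                           unsigned *branches)
{
  size_t i = start + hyp_link(code + start + 1);
  *branches = 1;
  while (code[i] == HYP_OP_ALT)
    {
    (*branches)++;
    i += hyp_link(code + i + 1);
    }
  return i;
}
```

In the sample, the value after the terminating opcode is 10, which is the
distance back to the bracket opcode at index 0.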
If a subpattern is quantified such that it is permitted to match zero times, it
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
@@ -488,17 +697,9 @@
Once-only (atomic) groups
-------------------------
-These are just like other subpatterns, but they start with the opcode
-OP_ONCE or OP_ONCE_NC. The former is used when there are no capturing brackets
-within the atomic group; the latter when there are. The distinction is needed
-for when there is a backtrack to before the group - any captures within the
-group must be reset, so it is necessary to retain backtracking points inside
-the group, even after it is complete, in order to do this. When there are no
-captures in an atomic group, all the backtracking can be discarded when it is
-complete. This is more efficient, and also uses less stack.
-
+These are just like other subpatterns, but they start with the opcode OP_ONCE.
The check for matching an empty string in an unbounded repeat is handled
-entirely at runtime, so there are just these two opcodes for atomic groups.
+entirely at runtime, so there is just this one opcode for atomic groups.
Assertions
@@ -544,14 +745,14 @@
or OP_FALSE.
If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
-must start with an assertion, whose opcode normally immediately follows OP_COND
-or OP_SCOND. However, if automatic callouts are enabled, a callout is inserted
-immediately before the assertion. It is also possible to insert a manual
-callout at this point. Only assertion conditions may have callouts preceding
-the condition.
+must start with a parenthesized assertion, whose opcode normally immediately
+follows OP_COND or OP_SCOND. However, if automatic callouts are enabled, a
+callout is inserted immediately before the assertion. It is also possible to
+insert a manual callout at this point. Only assertion conditions may have
+callouts preceding the condition.
-A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
-parts of the pattern, so this is another opcode that may appear as a condition.
+A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
+parts of the pattern, so this is another opcode that may appear as a condition.
It is treated the same as OP_FALSE.
@@ -561,21 +762,28 @@
Recursion either matches the current pattern, or some subexpression. The opcode
OP_RECURSE is followed by a LINK_SIZE value that is the offset to the starting
bracket from the start of the whole pattern. OP_RECURSE is also used for
-"subroutine" calls, even though they are not strictly a recursion. Repeated
-recursions are automatically wrapped inside OP_ONCE brackets, because otherwise
-some patterns broke them. A non-repeated recursion is not wrapped in OP_ONCE
-brackets, but it is nevertheless still treated as an atomic group.
+"subroutine" calls, even though they are not strictly a recursion. Up till
+release 10.30 recursions were treated as atomic groups, making them
+incompatible with Perl (but PCRE had them well before Perl did). From 10.30,
+backtracking into recursions is supported.
+
+Repeated recursions used to be wrapped inside OP_ONCE brackets, which not only
+forced no backtracking, but also allowed repetition to be handled as for other
+bracketed groups. From 10.30 onwards, repeated recursions are duplicated for
+their minimum repetitions, and then wrapped in non-capturing brackets for the
+remainder. For example, (?1){3} is treated as (?1)(?1)(?1), and (?1){2,4} is
+treated as (?1)(?1)(?:(?1)){0,2}.
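
The rewriting rule for repeated recursions can be mimicked at the level of
pattern text. The helper below is hypothetical and works on strings purely to
illustrate the rule; the compiler itself does not operate this way. It handles
only a finite maximum; an unlimited maximum is outside the scope of the
sketch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the expansion of item{min,max} into buf, as
   min copies of item followed, when max > min, by (?:item){0,max-min}.
   Returns 0 on success, or -1 if the range is invalid or buf is too small. */
static int hyp_expand_repeat(const char *item, unsigned min, unsigned max,
                             char *buf, size_t size)
{
  size_t used = 0;
  size_t len = strlen(item);

  if (max < min || size == 0) return -1;
  buf[0] = '\0';

  /* The minimum repetitions are simply duplicated. */
  for (unsigned i = 0; i < min; i++)
    {
    if (used + len >= size) return -1;
    memcpy(buf + used, item, len + 1);
    used += len;
    }

  /* The remainder is wrapped in non-capturing brackets. */
  if (max > min)
    {
    int n = snprintf(buf + used, size - used, "(?:%s){0,%u}", item, max - min);
    if (n < 0 || (size_t)n >= size - used) return -1;
    }
  return 0;
}
```

Called with "(?1)", 2 and 4, this yields (?1)(?1)(?:(?1)){0,2}, matching the
second example in the text.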
-Callout
--------
+Callouts
+--------
-A callout can nowadays have either a numerical argument or a string argument.
-These use OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are
-followed by two LINK_SIZE values giving the offset in the pattern string to the
-start of the following item, and another count giving the length of this item.
-These values make it possible for pcre2test to output useful tracing
-information using callouts.
+A callout may have either a numerical argument or a string argument. These use
+OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are followed by
+two LINK_SIZE values giving the offset in the pattern string to the start of
+the following item, and another count giving the length of this item. These
+values make it possible for pcre2test to output useful tracing information
+using callouts.
In the case of a numeric callout, after these two values there is a single code
unit containing the callout number, in the range 0-255, with 255 being used for
@@ -593,17 +801,17 @@
the application needs it. In the 8-bit library, the callout in /X(?C'abc')Y/ is
compiled as the following bytes (decimal numbers represent binary values):
- [OP_CALLOUT] [0] [10] [0] [1] [0] [14] [0] [5] ['] [a] [b] [c] [0]
- -------- ------- -------- -------
- | | | |
- ------- LINK_SIZE items ------
+  [OP_CALLOUT_STR]  [0] [10]  [0] [1]  [0] [14]  [0] [5] ['] [a] [b] [c] [0]
+                    --------  -------  --------  -------
+                       |         |        |         |
+                       ------- LINK_SIZE items ------
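
The four LINK_SIZE items in this example can be decoded with a short C sketch.
It assumes the 8-bit library with LINK_SIZE set to 2 and values stored most
significant byte first; the HYP_ and hyp_ names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Read one two-byte value, most significant byte first. */
static unsigned hyp_get2(const uint8_t *p)
{
  return ((unsigned)p[0] << 8) | p[1];
}

/* The bytes that follow the opcode in the /X(?C'abc')Y/ example. */
static const uint8_t hyp_callout_data[] = {
  0, 10,  0, 1,  0, 14,  0, 5,  '\'', 'a', 'b', 'c', 0
};
```

The first pair decodes to 10, the pattern offset of the following item (Y),
and the second to 1, the length of that item. The remaining two pairs decode
to 14 and 5 in the same way.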
Opcode table checking
---------------------
The last opcode that is defined in pcre2_internal.h is OP_TABLE_LENGTH. This is
-not a real opcode, but is used to check that tables indexed by opcode are the
-correct length, in order to catch updating errors.
+not a real opcode, but is used to check at compile time that tables indexed by
+opcode are the correct length, in order to catch updating errors.
Philip Hazel
-June 2016
+21 April 2017