Upgrade to pcre2 version 10.31.

Bug: N/A
Test: builds and boots, getprop -Z works
Change-Id: I2fbda9427edc9e5d966333a567b50539e17ed48d

commit: 9bc971b1e045f368b6d281841bc892804f37b767
author: Elliott Hughes <enh@google.com> Fri Jul 27 13:23:14 2018 -0700
committer: Elliott Hughes <enh@google.com> Fri Jul 27 13:23:14 2018 -0700
tree: 7ec5bf9c5d8c7f3b73939f42ddbb33585e646cf4
parent: 6420c7d130cade76df8fcd0a750710545565b306
diff --git a/dist2/HACKING b/dist2/HACKING
index 883aa64..d727add 100644
--- a/dist2/HACKING
+++ b/dist2/HACKING

@@ -7,8 +7,8 @@
 library is referred to as PCRE1 below. For information about testing PCRE2, see
 the pcre2test documentation and the comment at the head of the RunTest file.
 
-PCRE1 releases were up to 8.3x when PCRE2 was developed. The 8.xx series will
-continue for bugfixes if necessary. PCRE2 releases started at 10.00 to avoid
+PCRE1 releases were up to 8.3x when PCRE2 was developed, and later bug fix
+releases remain in the 8.xx series. PCRE2 releases started at 10.00 to avoid
 confusion with PCRE1.
 
 
@@ -16,19 +16,20 @@
 -----------------
 
 Many years ago I implemented some regular expression functions to an algorithm
-suggested by Martin Richards. These were not Unix-like in form, and were quite
-restricted in what they could do by comparison with Perl. The interesting part
-about the algorithm was that the amount of space required to hold the compiled
-form of an expression was known in advance. The code to apply an expression did
-not operate by backtracking, as the original Henry Spencer code and current
-PCRE2 and Perl code does, but instead checked all possibilities simultaneously
-by keeping a list of current states and checking all of them as it advanced
-through the subject string. In the terminology of Jeffrey Friedl's book, it was
-a "DFA algorithm", though it was not a traditional Finite State Machine (FSM).
-When the pattern was all used up, all remaining states were possible matches,
-and the one matching the longest subset of the subject string was chosen. This
-did not necessarily maximize the individual wild portions of the pattern, as is
-expected in Unix and Perl-style regular expressions.
+suggested by Martin Richards. The rather simple patterns were not Unix-like in
+form, and were quite restricted in what they could do by comparison with Perl.
+The interesting part about the algorithm was that the amount of space required
+to hold the compiled form of an expression was known in advance. The code to
+apply an expression did not operate by backtracking, as the original Henry
+Spencer code and current PCRE2 and Perl code does, but instead checked all
+possibilities simultaneously by keeping a list of current states and checking
+all of them as it advanced through the subject string. In the terminology of
+Jeffrey Friedl's book, it was a "DFA algorithm", though it was not a
+traditional Finite State Machine (FSM). When the pattern was all used up, all
+remaining states were possible matches, and the one matching the longest subset
+of the subject string was chosen. This did not necessarily maximize the
+individual wild portions of the pattern, as is expected in Unix and Perl-style
+regular expressions.
 
 
 Historical note 2
@@ -47,18 +48,20 @@
 OK, here's the real stuff
 -------------------------
 
-For the set of functions that formed the original PCRE1 library (which are
-unrelated to those mentioned above), I tried at first to invent an algorithm
-that used an amount of store bounded by a multiple of the number of characters
-in the pattern, to save on compiling time. However, because of the greater
-complexity in Perl regular expressions, I couldn't do this. In any case, a
-first pass through the pattern is helpful for other reasons.
+For the set of functions that formed the original PCRE1 library in 1997 (which
+are unrelated to those mentioned above), I tried at first to invent an
+algorithm that used an amount of store bounded by a multiple of the number of
+characters in the pattern, to save on compiling time. However, because of the
+greater complexity in Perl regular expressions, I couldn't do this, even though
+the then current Perl 5.004 patterns were much simpler than those supported
+nowadays. In any case, a first pass through the pattern is helpful for other
+reasons.
 
 
 Support for 16-bit and 32-bit data strings
 -------------------------------------------
 
-The library can be compiled in any combination of 8-bit, 16-bit or 32-bit
+The PCRE2 library can be compiled in any combination of 8-bit, 16-bit or 32-bit
 modes, creating up to three different libraries. In the description that
 follows, the word "short" is used for a 16-bit data quantity, and the phrase
 "code unit" is used for a quantity that is a byte in 8-bit mode, a short in
@@ -85,12 +88,12 @@
 things I did for 6.8 was to fix Yet Another Bug in the memory computation. Then
 I had a flash of inspiration as to how I could run the real compile function in
 a "fake" mode that enables it to compute how much memory it would need, while
-actually only ever using a few hundred bytes of working memory, and without too
+in most cases only ever using a small amount of working memory, and without too
 many tests of the mode that might slow it down. So I refactored the compiling
-functions to work this way. This got rid of about 600 lines of source. It
-should make future maintenance and development easier. As this was such a major
-change, I never released 6.8, instead upping the number to 7.0 (other quite
-major changes were also present in the 7.0 release).
+functions to work this way. This got rid of about 600 lines of source and made
+further maintenance and development easier. As this was such a major change, I
+never released 6.8, instead upping the number to 7.0 (other quite major changes
+were also present in the 7.0 release).
 
 A side effect of this work was that the previous limit of 200 on the nesting
 depth of parentheses was removed. However, there was a downside: compiling ran
@@ -104,20 +107,208 @@
 for nested parenthesized groups. This is a safety feature for environments with
 small stacks where the patterns are provided by users.
 
-History repeated itself for release 10.20. A number of bugs relating to named 
-subpatterns had been discovered by fuzzers. Most of these were related to the 
-handling of forward references when it was not known if the named pattern was
+
+Yet another pattern scan
+------------------------
+
+History repeated itself for PCRE2 release 10.20. A number of bugs relating to
+named subpatterns had been discovered by fuzzers. Most of these were related to
+the handling of forward references when it was not known if the named group was
 unique. (References to non-unique names use a different opcode and more
 memory.) The use of duplicate group numbers (the (?| facility) also caused
-issues. 
+issues.
 
-To get around these problems I adopted a new approach by adding a third pass,
-really a "pre-pass", over the pattern, which does nothing other than identify
-all the named subpatterns and their corresponding group numbers. This means 
-that the actual compile (both pre-pass and real compile) have full knowledge of 
-group names and numbers throughout. Several dozen lines of messy code were 
-eliminated, though the new pre-pass is not short (skipping over [] classes is 
-complicated).
+To get around these problems I adopted a new approach by adding a third pass
+over the pattern (really a "pre-pass"), which did nothing other than identify
+all the named subpatterns and their corresponding group numbers. This means
+that the actual compile (both the memory-computing dummy run and the real
+compile) has full knowledge of group names and numbers throughout. Several
+dozen lines of messy code were eliminated, though the new pre-pass was not
+short. In particular, parsing and skipping over [] classes is complicated.
+
+While working on 10.22 I realized that I could simplify yet again by moving
+more of the parsing into the pre-pass, thus avoiding doing it in two places, so
+after 10.22 was released, the code underwent yet another big refactoring. This
+is how it is from 10.23 onwards:
+
+The function called parse_regex() scans the pattern characters, parsing them
+into literal data and meta characters. It converts escapes such as \x{123}
+into literals, handles \Q...\E, and skips over comments and non-significant
+white space. The result of the scanning is put into a vector of 32-bit unsigned
+integers. Values less than 0x80000000 are literal data. Higher values represent
+meta-characters. The top 16-bits of such values identify the meta-character,
+and these are given names such as META_CAPTURE. The lower 16-bits are available
+for data, for example, the capturing group number. The only situation in which
+literal data values greater than 0x7fffffff can appear is when the 32-bit
+library is running in non-UTF mode. This is handled by having a special
+meta-character that is followed by the 32-bit data value.
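The element layout described above can be illustrated with a minimal C sketch.
Only the value of META_END (0x80000000) comes from the text; EX_META_CAPTURE
and the helper function names are hypothetical and chosen just for this
example, so they should not be read as the library's own definitions.

```c
#include <assert.h>
#include <stdint.h>

/* META_END is stated in the text to be 0x80000000. The capture code below
   is an illustrative placeholder, not the library's real constant. */
#define META_END        0x80000000u
#define EX_META_CAPTURE 0x80010000u   /* hypothetical value for this sketch */

/* Nonzero if the element is literal data (a value below 0x80000000). */
static int is_literal(uint32_t element)
{
  return element < META_END;
}

/* Top 16 bits of a meta element: identifies the meta-character. */
static uint32_t meta_code(uint32_t element)
{
  assert(!is_literal(element));
  return element & 0xffff0000u;
}

/* Low 16 bits of a meta element: data such as a capture group number. */
static uint32_t meta_data(uint32_t element)
{
  assert(!is_literal(element));
  return element & 0x0000ffffu;
}
```

With these helpers, an element built as EX_META_CAPTURE | 7 splits back into
the capture code and the group number 7, while a value such as 0x41 is treated
as a literal.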
+
+The size of the parsed pattern vector, when auto-callouts are not enabled, is
+bounded by the length of the pattern (with one exception). The code is written
+so that each item in the pattern uses no more vector elements than the number
+of code units in the item itself. The exception is the aforementioned large
+32-bit number handling. For this reason, 32-bit non-UTF patterns are scanned in
+advance to check for such values. When auto-callouts are enabled, the generous
+assumption is made that there will be a callout for each pattern code unit
+(which of course is only actually true if all code units are literals) plus one
+at the end. There is a default parsed pattern vector on the system stack, but
+if this is not big enough, heap memory is used.
+
+As before, the actual compiling function is run twice, the first time to
+determine the amount of memory needed for the final compiled pattern. It
+now processes the parsed pattern vector, not the pattern itself, although some
+of the parsed items refer to strings in the pattern - for example, group
+names. As escapes and comments have already been processed, the code is a bit
+simpler than before.
+
+Most errors can be diagnosed during the parsing scan. For those that cannot
+(for example, "lookbehind assertion is not fixed length"), the parsed code
+contains offsets into the pattern so that the actual compiling code can
+report where errors are.
+
+
+The elements of the parsed pattern vector
+-----------------------------------------
+
+The word "offset" below means a code unit offset into the pattern. When
+PCRE2_SIZE (which is usually size_t) is no bigger than uint32_t, an offset is
+stored in a single parsed pattern element. Otherwise (typically on 64-bit
+systems) it occupies two elements. The following meta items occupy just one
+element, with no data:
+
+META_ACCEPT           (*ACCEPT)
+META_ASTERISK         *
+META_ASTERISK_PLUS    *+
+META_ASTERISK_QUERY   *?
+META_ATOMIC           (?> start of atomic group
+META_CIRCUMFLEX       ^ metacharacter
+META_CLASS            [ start of non-empty class
+META_CLASS_EMPTY      [] empty class - only with PCRE2_ALLOW_EMPTY_CLASS
+META_CLASS_EMPTY_NOT  [^] negative empty class - ditto
+META_CLASS_END        ] end of non-empty class
+META_CLASS_NOT        [^ start non-empty negative class
+META_COMMIT           (*COMMIT)
+META_COND_ASSERT      (?(?assertion)
+META_DOLLAR           $ metacharacter
+META_DOT              . metacharacter
+META_END              End of pattern (this value is 0x80000000)
+META_FAIL             (*FAIL)
+META_KET              ) closing parenthesis
+META_LOOKAHEAD        (?= start of lookahead
+META_LOOKAHEADNOT     (?! start of negative lookahead
+META_NOCAPTURE        (?: no capture parens
+META_PLUS             +
+META_PLUS_PLUS        ++
+META_PLUS_QUERY       +?
+META_PRUNE            (*PRUNE) - no argument
+META_QUERY            ?
+META_QUERY_PLUS       ?+
+META_QUERY_QUERY      ??
+META_RANGE_ESCAPED    hyphen in class range with at least one escape
+META_RANGE_LITERAL    hyphen in class range defined literally
+META_SKIP             (*SKIP) - no argument
+META_THEN             (*THEN) - no argument
+
+The two RANGE values occur only in character classes. They are positioned
+between two literals that define the start and end of the range. In an EBCDIC
+environment it is necessary to know whether either of the range values was
+specified as an escape. In an ASCII/Unicode environment the distinction is not
+relevant.
+
+The following have data in the lower 16 bits, and may be followed by other data
+elements:
+
+META_ALT              | alternation
+META_BACKREF          back reference
+META_CAPTURE          start of capturing group
+META_ESCAPE           non-literal escape sequence
+META_RECURSE          recursion call
+
+If the data for META_ALT is non-zero, it is inside a lookbehind, and the data
+is the length of its branch, for which OP_REVERSE must be generated.
+
+META_BACKREF, META_CAPTURE, and META_RECURSE have the capture group number as
+their data in the lower 16 bits of the element.
+
+META_BACKREF is followed by an offset if the back reference group number is 10
+or more. The offsets of the first occurrences of references to groups whose
+numbers are less than 10 are put in cb->small_ref_offset[] (only the first
+occurrence is useful). On 64-bit systems this avoids using more than two parsed
+pattern elements for items such as \3. The offset is used when an error occurs
+because the reference is to a non-existent group.
+
+META_RECURSE is always followed by an offset, for use in error messages.
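The rule given at the start of this section, that an offset takes one parsed
pattern element when the size type fits in 32 bits and two elements otherwise,
might be sketched as follows. This assumes size_t stands in for PCRE2_SIZE;
the function names and the choice to store the high half first are
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of 32-bit elements that one offset occupies: 1 or 2. */
#define OFFSET_ELEMENTS (sizeof(size_t) <= sizeof(uint32_t) ? 1 : 2)

/* Store an offset at p and return a pointer just past it. Storing the high
   half first is an assumption of this sketch. */
static uint32_t *put_offset(uint32_t *p, size_t offset)
{
  assert(p != NULL);
  if (OFFSET_ELEMENTS == 2)
    *p++ = (uint32_t)((uint64_t)offset >> 32);
  *p++ = (uint32_t)(offset & 0xffffffffu);
  return p;
}

/* Read an offset back from p and return a pointer just past it. */
static const uint32_t *get_offset(const uint32_t *p, size_t *offset)
{
  uint64_t value = 0;
  if (OFFSET_ELEMENTS == 2)
    value = (uint64_t)*p++ << 32;
  value |= *p++;
  *offset = (size_t)value;
  return p;
}
```

Going through uint64_t keeps the 32-bit shift well defined even on systems
where size_t is itself only 32 bits wide.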
+
+META_ESCAPE has an ESC_xxx value as its data. For ESC_P and ESC_p, the next
+element contains the 16-bit type and data property values, packed together.
+ESC_g and ESC_k are used only for named references - numerical ones are turned
+into META_RECURSE or META_BACKREF as appropriate. ESC_g and ESC_k are followed
+by a length and an offset into the pattern to specify the name.
+
+The following have one data item that follows in the next vector element:
+
+META_BIGVALUE         Next is a literal >= META_END
+META_OPTIONS          (?i) and friends (data is new option bits)
+META_POSIX            POSIX class item (data identifies the class)
+META_POSIX_NEG        negative POSIX class item (ditto)
+
+The following are followed by a length element, then a number of character code
+values (which should match with the length):
+
+META_MARK             (*MARK:xxxx)
+META_PRUNE_ARG        (*PRUNE:xxx)
+META_SKIP_ARG         (*SKIP:xxxx)
+META_THEN_ARG         (*THEN:xxxx)
+
+The following are followed by a length element, then an offset in the pattern
+that identifies the name:
+
+META_COND_NAME        (?(<name>) or (?('name') or (?(name)
+META_COND_RNAME       (?(R&name)
+META_COND_RNUMBER     (?(Rdigits)
+META_RECURSE_BYNAME   (?&name)
+META_BACKREF_BYNAME   \k'name'
+
+META_COND_RNUMBER is used for names that start with R and continue with digits,
+because this is an ambiguous case. It could be a back reference to a group with
+that name, or it could be a recursion test on a numbered group.
+
+This one is followed by an offset, for use in error messages, then a number:
+
+META_COND_NUMBER       (?([+-]digits)
+
+The following is followed just by an offset, for use in error messages:
+
+META_COND_DEFINE      (?(DEFINE)
+
+The following are also followed just by an offset, but also the lower 16 bits
+of the main word contain the length of the first branch of the lookbehind
+group; this is used when generating OP_REVERSE for that branch.
+
+META_LOOKBEHIND       (?<=
+META_LOOKBEHINDNOT    (?<!
+
+The following are followed by two elements, the minimum and maximum. Repeat
+values are limited to 65535 (MAX_REPEAT). A maximum value of "unlimited" is
+represented by UNLIMITED_REPEAT, which is bigger than MAX_REPEAT:
+
+META_MINMAX           {n,m}  repeat
+META_MINMAX_PLUS      {n,m}+ repeat
+META_MINMAX_QUERY     {n,m}? repeat
+
+This one is followed by three elements. The first is 0 for '>' and 1 for '>=';
+the next two are the major and minor numbers:
+
+META_COND_VERSION     (?(VERSION<op>x.y)
+
+Callouts are converted into one of two items:
+
+META_CALLOUT_NUMBER   (?C with numerical argument
+META_CALLOUT_STRING   (?C with string argument
+
+In both cases, the next two elements contain the offset and length of the next
+item in the pattern. Then there is either one callout number, or a length and
+an offset for the string argument. The length includes both delimiters.
 
 
 Traditional matching function
@@ -154,9 +345,14 @@
 ------------------
 
 The /i, /m, or /s options (PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
-some others) may change in the middle of patterns. Their processing is handled
-entirely at compile time by generating different opcodes for the different
-settings. The runtime functions do not need to keep track of an options state.
+others) may be changed in the middle of patterns by items such as (?i). Their
+processing is handled entirely at compile time by generating different opcodes
+for the different settings. The runtime functions do not need to keep track of
+an options state.
+
+PCRE2_DUPNAMES, PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE
+are tracked and processed during the parsing pre-pass. The others are handled
+from META_OPTIONS items during the main compile phase.
 
 
 Format of compiled patterns
@@ -180,7 +376,7 @@
 
 In this description, we assume the "normal" compilation options. Data values
 that are counts (e.g. quantifiers) are always two bytes long in 8-bit mode
-(most significant byte first), or one code unit in 16-bit and 32-bit modes.
+(most significant byte first), and one code unit in 16-bit and 32-bit modes.
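As an illustration of the 8-bit case, a count stored in two bytes with the
most significant byte first could be written and read as below. The function
names are hypothetical and used only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Write a count as two bytes, most significant byte first. */
static void put_count(uint8_t *code, unsigned int value)
{
  assert(value <= 0xffffu);
  code[0] = (uint8_t)(value >> 8);
  code[1] = (uint8_t)(value & 0xffu);
}

/* Read back a two-byte count stored most significant byte first. */
static unsigned int get_count(const uint8_t *code)
{
  return ((unsigned int)code[0] << 8) | code[1];
}
```

For example, the count 300 (0x012c) is stored as the bytes 1 and 44.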
 
 
 Opcodes with no following data
@@ -220,16 +416,16 @@
   OP_ACCEPT              ) These are Perl 5.10's "backtracking control
   OP_COMMIT              ) verbs". If OP_ACCEPT is inside capturing
   OP_FAIL                ) parentheses, it may be preceded by one or more
-  OP_PRUNE               ) OP_CLOSE, each followed by a count that
+  OP_PRUNE               ) OP_CLOSE, each followed by a number that
   OP_SKIP                ) indicates which parentheses must be closed.
   OP_THEN                )
 
 OP_ASSERT_ACCEPT is used when (*ACCEPT) is encountered within an assertion.
-This ends the assertion, not the entire pattern match. The assertion (?!) is 
+This ends the assertion, not the entire pattern match. The assertion (?!) is
 always optimized to OP_FAIL.
 
 OP_ALLANY is used for '.' when PCRE2_DOTALL is set. It is also used for \C in
-non-UTF modes and in UTF-32 mode (since one code unit still equals one 
+non-UTF modes and in UTF-32 mode (since one code unit still equals one
 character). Another use is for [^] when empty classes are permitted
 (PCRE2_ALLOW_EMPTY_CLASS is set).
 
@@ -248,14 +444,22 @@
 ---------------------------
 
 The OP_CHAR opcode is followed by a single character that is to be matched
-casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,
-the character may be more than one code unit long. In UTF-32 mode, characters
-are always exactly one code unit long.
+casefully. For caseless matching of characters that have at most two
+case-equivalent code points, OP_CHARI is used. In UTF-8 or UTF-16 modes, the
+character may be more than one code unit long. In UTF-32 mode, characters are
+always exactly one code unit long.
 
 If there is only one character in a character class, OP_CHAR or OP_CHARI is
 used for a positive class, and OP_NOT or OP_NOTI for a negative one (that is,
 for something like [^a]).
 
+Caseless matching (positive or negative) of characters that have more than two
+case-equivalent code points (which is possible only in UTF mode) is handled by
+compiling a Unicode property item (see below), with the pseudo-property
+PT_CLIST. The value of this property is an offset in a vector called
+"ucd_caseless_sets" which identifies the start of a short list of equivalent
+characters, terminated by the value NOTACHAR (0xffffffff).
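A minimal sketch of how such a NOTACHAR-terminated list might be searched is
shown below. The table here is a small stand-in invented for the example, not
the contents of the real "ucd_caseless_sets" vector, and the function name is
hypothetical. The two sets used (K, k, KELVIN SIGN and S, s, LATIN SMALL
LETTER LONG S) are real cases of three case-equivalent code points.

```c
#include <assert.h>
#include <stdint.h>

#define NOTACHAR 0xffffffffu

/* Stand-in table for this sketch only; each list ends with NOTACHAR. */
static const uint32_t example_caseless_sets[] = {
  NOTACHAR,                              /* offset 0: empty list */
  0x004bu, 0x006bu, 0x212au, NOTACHAR,   /* offset 1: K, k, KELVIN SIGN */
  0x0053u, 0x0073u, 0x017fu, NOTACHAR    /* offset 5: S, s, LONG S */
};

/* Return 1 if c appears in the list that starts at the given offset. */
static int in_caseless_set(const uint32_t *sets, uint32_t offset, uint32_t c)
{
  const uint32_t *p = sets + offset;
  assert(c != NOTACHAR);
  while (*p != NOTACHAR)
    if (*p++ == c) return 1;
  return 0;
}
```

Here the property value would be the offset (1 or 5 in this stand-in table),
and matching a subject character reduces to a short linear scan.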
+
 
 Repeating single characters
 ---------------------------
@@ -331,7 +535,8 @@
 and a value. The types are a set of #defines of the form PT_xxx, and the values
 are enumerations of the form ucp_xx, defined in the pcre2_ucp.h source file.
 The value is relevant only for PT_GC (General Category), PT_PC (Particular
-Category), and PT_SC (Script).
+Category), PT_SC (Script), and the pseudo-property PT_CLIST, which is used to
+identify a list of case-equivalent characters when there are three or more.
 
 Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
 three code units: OP_PROP or OP_NOTPROP, and then the desired property type and
@@ -343,7 +548,10 @@
 
 If there is only one character in a class, OP_CHAR or OP_CHARI is used for a
 positive class, and OP_NOT or OP_NOTI for a negative one (that is, for
-something like [^a]).
+something like [^a]), except when caselessly matching a character that has more
+than two case-equivalent code points (which can happen only in UTF mode). In
+this case a Unicode property item is used, as described above in "Matching
+literal characters".
 
 A set of repeating opcodes (called OP_NOTSTAR etc.) are used for repeated,
 negated, single-character classes. The normal single-character opcodes
@@ -364,8 +572,8 @@
 For classes containing characters with values greater than 255 or that contain
 \p or \P, OP_XCLASS is used. It optionally uses a bit map if any acceptable
 code points are less than 256, followed by a list of pairs (for a range) and/or
-single characters and/or properties. In caseless mode, both cases are
-explicitly listed.
+single characters and/or properties. In caseless mode, all equivalent
+characters are explicitly listed.
 
 OP_XCLASS is followed by a LINK_SIZE value containing the total length of the
 opcode and its data. This is followed by a code unit containing flag bits:
@@ -422,8 +630,8 @@
   OP_CRMINRANGE
   OP_CRPOSRANGE
 
-All but the last three are single-code-unit items, with no data. The others are
-followed by the minimum and maximum repeat counts.
+All but the last three are single-code-unit items, with no data. The range
+opcodes are followed by the minimum and maximum repeat counts.
 
 
 Brackets and alternation
@@ -438,16 +646,17 @@
 
 Non-capturing brackets use the opcode OP_BRA, capturing brackets use OP_CBRA. A
 bracket opcode is followed by a LINK_SIZE value which gives the offset to the
-next alternative OP_ALT or, if there aren't any branches, to the matching
-OP_KET opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset
-to the next one, or to the OP_KET opcode. For capturing brackets, the bracket
-number is a count that immediately follows the offset.
+next alternative OP_ALT or, if there aren't any branches, to the terminating
+opcode. Each OP_ALT is followed by a LINK_SIZE value giving the offset to the
+next one, or to the final opcode. For capturing brackets, the bracket number is
+a count that immediately follows the offset.
 
-OP_KET is used for subpatterns that do not repeat indefinitely, and OP_KETRMIN
-and OP_KETRMAX are used for indefinite repetitions, minimally or maximally
-respectively (see below for possessive repetitions). All three are followed by
-a LINK_SIZE value giving (as a positive number) the offset back to the matching
-bracket opcode.
+There are several opcodes that mark the end of a subpattern group. OP_KET is
+used for subpatterns that do not repeat indefinitely, OP_KETRMIN and
+OP_KETRMAX are used for indefinite repetitions, minimally or maximally
+respectively, and OP_KETRPOS for possessive repetitions (see below for more 
+details). All four are followed by a LINK_SIZE value giving (as a positive
+number) the offset back to the matching bracket opcode.
 
 If a subpattern is quantified such that it is permitted to match zero times, it
 is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
@@ -488,17 +697,9 @@
 Once-only (atomic) groups
 -------------------------
 
-These are just like other subpatterns, but they start with the opcode
-OP_ONCE or OP_ONCE_NC. The former is used when there are no capturing brackets
-within the atomic group; the latter when there are. The distinction is needed
-for when there is a backtrack to before the group - any captures within the
-group must be reset, so it is necessary to retain backtracking points inside
-the group, even after it is complete, in order to do this. When there are no
-captures in an atomic group, all the backtracking can be discarded when it is
-complete. This is more efficient, and also uses less stack.
-
+These are just like other subpatterns, but they start with the opcode OP_ONCE.
 The check for matching an empty string in an unbounded repeat is handled
-entirely at runtime, so there are just these two opcodes for atomic groups.
+entirely at runtime, so there is just this one opcode for atomic groups.
 
 
 Assertions
@@ -544,14 +745,14 @@
 or OP_FALSE.
 
 If a condition is not a back reference, recursion test, DEFINE, or VERSION, it
-must start with an assertion, whose opcode normally immediately follows OP_COND
-or OP_SCOND. However, if automatic callouts are enabled, a callout is inserted
-immediately before the assertion. It is also possible to insert a manual
-callout at this point. Only assertion conditions may have callouts preceding
-the condition.
+must start with a parenthesized assertion, whose opcode normally immediately
+follows OP_COND or OP_SCOND. However, if automatic callouts are enabled, a
+callout is inserted immediately before the assertion. It is also possible to
+insert a manual callout at this point. Only assertion conditions may have
+callouts preceding the condition.
 
-A condition that is the negative assertion (?!) is optimized to OP_FAIL in all 
-parts of the pattern, so this is another opcode that may appear as a condition. 
+A condition that is the negative assertion (?!) is optimized to OP_FAIL in all
+parts of the pattern, so this is another opcode that may appear as a condition.
 It is treated the same as OP_FALSE.
 
 
@@ -561,21 +762,28 @@
 Recursion either matches the current pattern, or some subexpression. The opcode
 OP_RECURSE is followed by a LINK_SIZE value that is the offset to the starting
 bracket from the start of the whole pattern. OP_RECURSE is also used for
-"subroutine" calls, even though they are not strictly a recursion. Repeated
-recursions are automatically wrapped inside OP_ONCE brackets, because otherwise
-some patterns broke them. A non-repeated recursion is not wrapped in OP_ONCE
-brackets, but it is nevertheless still treated as an atomic group.
+"subroutine" calls, even though they are not strictly a recursion. Up till
+release 10.30 recursions were treated as atomic groups, making them
+incompatible with Perl (but PCRE had them well before Perl did). From 10.30,
+backtracking into recursions is supported.
+
+Repeated recursions used to be wrapped inside OP_ONCE brackets, which not only
+forced no backtracking, but also allowed repetition to be handled as for other
+bracketed groups. From 10.30 onwards, repeated recursions are duplicated for
+their minimum repetitions, and then wrapped in non-capturing brackets for the
+remainder. For example, (?1){3} is treated as (?1)(?1)(?1), and (?1){2,4} is
+treated as (?1)(?1)(?:(?1)){0,2}.
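The rewriting of a repeated recursion can be sketched at the level of pattern
text. This is only an illustration of the two examples above: the compiler
works on its internal representation rather than on strings, the function
name is hypothetical, and an unlimited maximum is not handled here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write the expansion of item{min,max} into buf: min copies of item, then
   (?:item){0,max-min} if any optional repeats remain. Returns 0 on success
   or -1 if buf is too small. */
static int expand_repeat(const char *item, unsigned min, unsigned max,
                         char *buf, size_t size)
{
  size_t used = 0;
  size_t len = strlen(item);

  assert(min <= max && size > 0);
  buf[0] = '\0';

  /* Duplicate the item for the mandatory repetitions. */
  for (unsigned i = 0; i < min; i++) {
    if (len >= size - used) return -1;
    memcpy(buf + used, item, len);
    used += len;
    buf[used] = '\0';
  }

  /* Wrap the remainder in a non-capturing group. */
  if (max > min) {
    int n = snprintf(buf + used, size - used, "(?:%s){0,%u}", item, max - min);
    if (n < 0 || (size_t)n >= size - used) return -1;
  }
  return 0;
}
```

Calling expand_repeat("(?1)", 2, 4, ...) produces (?1)(?1)(?:(?1)){0,2}, and
a fixed count such as {3} produces three plain copies with no trailing group.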
 
 
-Callout
--------
+Callouts
+--------
 
-A callout can nowadays have either a numerical argument or a string argument.
-These use OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are
-followed by two LINK_SIZE values giving the offset in the pattern string to the
-start of the following item, and another count giving the length of this item.
-These values make it possible for pcre2test to output useful tracing
-information using callouts.
+A callout may have either a numerical argument or a string argument. These use
+OP_CALLOUT or OP_CALLOUT_STR, respectively. In each case these are followed by
+two LINK_SIZE values giving the offset in the pattern string to the start of
+the following item, and another count giving the length of this item. These
+values make it possible for pcre2test to output useful tracing information
+using callouts.
 
 In the case of a numeric callout, after these two values there is a single code
 unit containing the callout number, in the range 0-255, with 255 being used for
@@ -593,17 +801,17 @@
 the application needs it. In the 8-bit library, the callout in /X(?C'abc')Y/ is
 compiled as the following bytes (decimal numbers represent binary values):
 
-  [OP_CALLOUT]  [0] [10]  [0] [1]  [0] [14]  [0] [5] ['] [a] [b] [c] [0]
-                --------  -------  --------  -------
-                   |         |        |         |
-                   ------- LINK_SIZE items ------
+  [OP_CALLOUT_STR]  [0] [10]  [0] [1]  [0] [14]  [0] [5] ['] [a] [b] [c] [0]
+                    --------  -------  --------  -------
+                       |         |        |         |
+                       ------- LINK_SIZE items ------
 
 Opcode table checking
 ---------------------
 
 The last opcode that is defined in pcre2_internal.h is OP_TABLE_LENGTH. This is
-not a real opcode, but is used to check that tables indexed by opcode are the
-correct length, in order to catch updating errors.
+not a real opcode, but is used to check at compile time that tables indexed by
+opcode are the correct length, in order to catch updating errors.
 
 Philip Hazel
-June 2016
+21 April 2017