.TH PCRE2COMPAT 3 "08 December 2021" "PCRE2 10.40"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "DIFFERENCES BETWEEN PCRE2 AND PERL"
.rs
.sp
This document describes some of the differences in the ways that PCRE2 and Perl
handle regular expressions. The differences described here are with respect to
Perl version 5.34.0, but as both Perl and PCRE2 are continually changing, the
information may at times be out of date.
.P
1. When PCRE2_DOTALL (equivalent to Perl's /s qualifier) is not set, the
behaviour of the '.' metacharacter differs from Perl. In PCRE2, '.' matches the
next character unless it is the start of a newline sequence. This means that,
if the newline setting is CR, CRLF, or NUL, '.' will match the code point LF
(0x0A) in ASCII/Unicode environments, and NL (either 0x15 or 0x25) when using
EBCDIC. In Perl, '.' appears never to match LF, even when 0x0A is not a newline
indicator.
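.P
The rule can be modelled in a few lines. The following is a minimal sketch in
Python, not PCRE2 code: the table NEWLINE_SEQUENCES and the function
dot_matches_at() are hypothetical names invented for illustration, and only
four single-convention newline settings are covered.

```python
# Hypothetical model of the '.' rule described above; this is not PCRE2 code.
# Each newline setting maps to the sequences that count as a newline.
NEWLINE_SEQUENCES = {
    "CR": ("\r",),
    "LF": ("\n",),
    "CRLF": ("\r\n",),
    "NUL": ("\0",),
}


def dot_matches_at(subject, pos, newline):
    """Return True if '.' (DOTALL unset) would match at subject[pos]."""
    if pos >= len(subject):
        return False  # nothing left to match
    # '.' fails only when a newline sequence starts at this position.
    sequences = NEWLINE_SEQUENCES[newline]
    return not any(subject.startswith(seq, pos) for seq in sequences)


if __name__ == "__main__":
    # Position 1 of "a\nb" holds LF; only the LF setting rejects it.
    for setting in NEWLINE_SEQUENCES:
        print(setting, dot_matches_at("a\nb", 1, setting))
```

Under this model an isolated CR is matched when the setting is CRLF, because a
lone CR does not start a CRLF sequence.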
.P
2. PCRE2 has only a subset of Perl's Unicode support. Details of what it does
have are given in the
.\" HREF
\fBpcre2unicode\fP
.\"
page.
.P
3. Like Perl, PCRE2 allows repeat quantifiers on parenthesized assertions, but
they do not mean what you might think. For example, (?!a){3} does not assert
that the next three characters are not "a". It just asserts that the next
character is not "a" three times (in principle; PCRE2 optimizes this to run the
assertion just once). Perl allows some repeat quantifiers on other assertions,
for example, \eb* , but these do not seem to have any use. PCRE2 does not allow
any kind of quantifier on non-lookaround assertions.
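.P
As an illustration only, the sketch below uses Python's re module as a
stand-in (it is not PCRE2, and is assumed to treat these simple lookaheads and
negated classes alike) to contrast a single negative lookahead with one way of
checking all three following characters. The names ONE_AHEAD, THREE_AHEAD, and
holds_at_start() are made up for this example.

```python
import re

# A single negative lookahead: only the next character is inspected.
ONE_AHEAD = re.compile(r"(?!a)")
# Positive lookahead over a negated class: all three following
# characters must exist and none of them may be "a".
THREE_AHEAD = re.compile(r"(?=[^a]{3})")


def holds_at_start(regex, subject):
    """Return True if the assertion succeeds at the start of subject."""
    return regex.match(subject) is not None


if __name__ == "__main__":
    for subject in ("bab", "bcd", "bc"):
        print(subject,
              holds_at_start(ONE_AHEAD, subject),
              holds_at_start(THREE_AHEAD, subject))
```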
.P
4. Capture groups that occur inside negative lookaround assertions are counted,
but their entries in the offsets vector are set only when a negative assertion
is a condition that has a matching branch (that is, the condition is false).
Perl may set such capture groups in other circumstances.
.P
5. The following Perl escape sequences are not supported: \eF, \el, \eL, \eu,
\eU, and \eN when followed by a character name. \eN on its own, matching a
non-newline character, and \eN{U+dd..}, matching a Unicode code point, are
supported. The escapes that modify the case of following letters are
implemented by Perl's general string-handling and are not part of its pattern
matching engine. If any of these are encountered by PCRE2, an error is
generated by default. However, if either of the PCRE2_ALT_BSUX or
PCRE2_EXTRA_ALT_BSUX options is set, \eU and \eu are interpreted as ECMAScript
interprets them.
.P
6. The Perl escape sequences \ep, \eP, and \eX are supported only if PCRE2 is
built with Unicode support (the default). The properties that can be tested
with \ep and \eP are limited to the general category properties such as Lu and
Nd, script names such as Greek or Han, Bidi_Class, Bidi_Control, and the
derived properties Any and LC (synonym L&). Both PCRE2 and Perl support the Cs
(surrogate) property, but in PCRE2 its use is limited. See the
.\" HREF
\fBpcre2pattern\fP
.\"
documentation for details. The long synonyms for property names that Perl
supports (such as \ep{Letter}) are not supported by PCRE2, nor is it permitted
to prefix any of these properties with "Is".
.P
7. PCRE2 supports the \eQ...\eE escape for quoting substrings. Characters
in between are treated as literals. However, this is slightly different from
Perl in that $ and @ are also handled as literals inside the quotes. In Perl,
they cause variable interpolation (PCRE2 does not have variables). Also, Perl
does "double-quotish backslash interpolation" on any backslashes between \eQ
and \eE which, its documentation says, "may lead to confusing results". PCRE2
treats a backslash between \eQ and \eE just like any other character. Note the
following examples:
.sp
  Pattern            PCRE2 matches     Perl matches
.sp
.\" JOIN
  \eQabc$xyz\eE        abc$xyz           abc followed by the
                                         contents of $xyz
  \eQabc\e$xyz\eE       abc\e$xyz          abc\e$xyz
  \eQabc\eE\e$\eQxyz\eE   abc$xyz           abc$xyz
  \eQA\eB\eE            A\eB               A\eB
  \eQ\e\eE              \e                 \e\eE
.sp
The \eQ...\eE sequence is recognized both inside and outside character classes
by both PCRE2 and Perl.
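.P
The "PCRE2 matches" column of the table can be reproduced with a small model.
This is a hedged sketch in Python rather than PCRE2 itself: expand_quoted() is
a hypothetical helper that rewrites each quoted span as an escaped literal for
Python's re module, and its handling of an unterminated \eQ (quote to the end
of the pattern) is an assumption made for the example.

```python
import re


def expand_quoted(pattern):
    # Rewrite every \Q...\E span as an escaped literal. Inside the span
    # nothing is special: $, @ and backslashes are all plain characters.
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith(r"\Q", i):
            end = pattern.find(r"\E", i + 2)
            if end == -1:
                end = len(pattern)  # assumption: unterminated \Q runs to the end
            out.append(re.escape(pattern[i + 2:end]))
            i = end + 2
        elif pattern.startswith(r"\E", i):
            i += 2  # an isolated \E outside a quoted span is dropped
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # keep other escapes such as \$ intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for pat in (r"\Qabc$xyz\E", r"\Qabc\$xyz\E", r"\Qabc\E\$\Qxyz\E",
                r"\QA\B\E", r"\Q\\E"):
        print(pat, "->", expand_quoted(pat))
```

The end of a quoted span is simply the first \eE found after the \eQ, which is
why the last pattern in the table quotes a single backslash.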
.P
8. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
constructions. However, PCRE2 does have a "callout" feature, which allows an
external function to be called during pattern matching. See the
.\" HREF
\fBpcre2callout\fP
.\"
documentation for details.
.P
9. Subroutine calls (whether recursive or not) were treated as atomic groups up
to PCRE2 release 10.23, but from release 10.30 this changed, and backtracking
into subroutine calls is now supported, as in Perl.
.P
10. In PCRE2, if any of the backtracking control verbs are used in a group that
is called as a subroutine (whether or not recursively), their effect is
confined to that group; it does not extend to the surrounding pattern. This is
not always the case in Perl. In particular, if (*THEN) is present in a group
that is called as a subroutine, its action is limited to that group, even if
the group does not contain any | characters. Note that such groups are
processed as anchored at the point where they are tested.
.P
11. If a pattern contains more than one backtracking control verb, the first
one that is backtracked onto acts. For example, in the pattern
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
same as PCRE2, but there are cases where it differs.
.P
12. There are some differences that are concerned with the settings of captured
strings when part of a pattern is repeated. For example, matching "aba" against
the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to
"b".
.P
13. PCRE2's handling of duplicate capture group numbers and names is not as
general as Perl's. This is a consequence of the fact that PCRE2 works
internally just with numbers, using an external table to translate between
numbers and names. In particular, a pattern such as (?|(?<a>A)|(?<b>B)), where
the two capture groups have the same number but different names, is not
supported, and causes an error at compile time. If it were allowed, it would
not be possible to distinguish which group matched, because both names map to
capture group number 1.
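.P
The ambiguity can be pictured with a toy version of such a name table. The
sketch below is Python written purely for illustration; NAME_TO_NUMBER,
substring_by_name(), and the captures dictionary are hypothetical and do not
correspond to PCRE2's internal data structures or API.

```python
# Hypothetical name table for (?|(?<a>A)|(?<b>B)) if it were accepted:
# both names would translate to capture group number 1.
NAME_TO_NUMBER = {"a": 1, "b": 1}


def substring_by_name(name, captures):
    """Look a capture up by name, going through its group number."""
    return captures.get(NAME_TO_NUMBER[name])


# Suppose the second branch matched the subject "B". Only numbered
# captures are recorded, so the name that matched is not retained.
captures = {1: "B"}

if __name__ == "__main__":
    print(substring_by_name("a", captures))
    print(substring_by_name("b", captures))
```

Both lookups return the same text, so a caller could not tell whether group a
or group b took part in the match.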
.P
14. Perl used to recognize comments in some places that PCRE2 does not, for
example, between the ( and ? at the start of a group. If the /x modifier is
set, Perl allowed white space between ( and ? though the latest Perls give an
error (for a while it was just deprecated). There may still be some cases where
Perl behaves differently.
.P
15. Perl, when in warning mode, gives warnings for character classes such as
[A-\ed] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE2 has no
warning features, so it gives an error in these cases because they are almost
certainly user mistakes.
.P
16. In PCRE2, the upper/lower case character properties Lu and Ll are not
affected when case-independent matching is specified. For example, \ep{Lu}
always matches an upper case letter. I think Perl has changed in this respect;
in the release at the time of writing (5.34), \ep{Lu} and \ep{Ll} match all
letters, regardless of case, when case independence is specified.
.P
17. From release 5.32.0, Perl locks out the use of \eK in lookaround
assertions. From release 10.38 PCRE2 does the same by default. However, there
is an option for re-enabling the previous behaviour. When this option is set,
\eK is acted on when it occurs in positive assertions, but is ignored in
negative assertions.
.P
18. PCRE2 provides some extensions to the Perl regular expression facilities.
Perl 5.10 included new features that were not in earlier versions of Perl, some
of which (such as named parentheses) were in PCRE2 for some time before. This
list is with respect to Perl 5.34:
.sp
(a) Although lookbehind assertions in PCRE2 must match fixed length strings,
each alternative toplevel branch of a lookbehind assertion can match a
different length of string. Perl used to require them all to have the same
length, but the latest version has some variable length support.
.sp
(b) From PCRE2 10.23, backreferences to groups of fixed length are supported
in lookbehinds, provided that there is no possibility of referencing a
non-unique number or name. Perl does not support backreferences in lookbehinds.
.sp
(c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the $
meta-character matches only at the very end of the string.
.sp
(d) A backslash followed by a letter with no special meaning is faulted. (Perl
can be made to issue a warning.)
.sp
(e) If PCRE2_UNGREEDY is set, the greediness of the repetition quantifiers is
inverted, that is, by default they are not greedy, but if followed by a
question mark they are.
.sp
(f) PCRE2_ANCHORED can be used at matching time to force a pattern to be tried
only at the first matching position in the subject string.
.sp
(g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY and PCRE2_NOTEMPTY_ATSTART
options have no Perl equivalents.
.sp
(h) The \eR escape sequence can be restricted to match only CR, LF, or CRLF
by the PCRE2_BSR_ANYCRLF option.
.sp
(i) The callout facility is PCRE2-specific. Perl supports code blocks and
variable interpolation, but not general hooks on every match.
.sp
(j) The partial matching facility is PCRE2-specific.
.sp
(k) The alternative matching function (\fBpcre2_dfa_match()\fP) matches in a
different way and is not Perl-compatible.
.sp
(l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT) at
the start of a pattern. These set overall options that cannot be changed within
the pattern.
.sp
(m) PCRE2 supports non-atomic positive lookaround assertions. This is an
extension to the lookaround facilities. The default, Perl-compatible
lookarounds are atomic.
.P
19. The Perl /a modifier restricts \ed numbers to pure ASCII, and the /aa
modifier restricts /i case-insensitive matching to pure ASCII, ignoring Unicode
rules. This separation cannot be represented with PCRE2_UCP.
.P
20. Perl has different limits than PCRE2. See the
.\" HREF
\fBpcre2limit\fP
.\"
documentation for details. With release 5.10, Perl changed from recursion to
iteration, keeping the intermediate matches on the heap; this is about 10%
slower but does not run into any stack-overflow limit. PCRE2 made a similar
change at release 10.30, and also has many build-time and run-time customizable
limits.
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
Retired from University Computing Service
Cambridge, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 08 December 2021
Copyright (c) 1997-2021 University of Cambridge.
.fi