| 1 | PCRE2TEST(1) General Commands Manual PCRE2TEST(1) |
| 2 | |
| 3 | |
| 4 | |
| 5 | NAME |
| 6 | pcre2test - a program for testing Perl-compatible regular expressions. |
| 7 | |
| 8 | SYNOPSIS |
| 9 | |
| 10 | pcre2test [options] [input file [output file]] |
| 11 | |
| 12 | pcre2test is a test program for the PCRE2 regular expression libraries, |
| 13 | but it can also be used for experimenting with regular expressions. |
| 14 | This document describes the features of the test program; for details |
| 15 | of the regular expressions themselves, see the pcre2pattern documenta- |
| 16 | tion. For details of the PCRE2 library function calls and their op- |
| 17 | tions, see the pcre2api documentation. |
| 18 | |
| 19 | The input for pcre2test is a sequence of regular expression patterns |
| 20 | and subject strings to be matched. There are also command lines for |
| 21 | setting defaults and controlling some special actions. The output shows |
| 22 | the result of each match attempt. Modifiers on external or internal |
| 23 | command lines, the patterns, and the subject lines specify PCRE2 func- |
| 24 | tion options, control how the subject is processed, and what output is |
| 25 | produced. |
| 26 | |
| 27 | There are many obscure modifiers, some of which are specifically de- |
| 28 | signed for use in conjunction with the test script and data files that |
| 29 | are distributed as part of PCRE2. All the modifiers are documented |
| 30 | here, some without much justification, but many of them are unlikely to |
| 31 | be of use except when testing the libraries. |
| 32 | |
| 33 | |
| 34 | PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES |
| 35 | |
| 36 | Different versions of the PCRE2 library can be built to support charac- |
| 37 | ter strings that are encoded in 8-bit, 16-bit, or 32-bit code units. |
| 38 | One, two, or all three of these libraries may be simultaneously in- |
| 39 | stalled. The pcre2test program can be used to test all the libraries. |
| 40 | However, its own input and output are always in 8-bit format. When |
| 41 | testing the 16-bit or 32-bit libraries, patterns and subject strings |
| 42 | are converted to 16-bit or 32-bit format before being passed to the li- |
| 43 | brary functions. Results are converted back to 8-bit code units for |
| 44 | output. |
| 45 | |
| 46 | In the rest of this document, the names of library functions and struc- |
| 47 | tures are given in generic form, for example, pcre2_compile(). The ac- |
| 48 | tual names used in the libraries have a suffix _8, _16, or _32, as ap- |
| 49 | propriate. |
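
The naming rule above can be illustrated with a small Python sketch. The helper name actual_name() is hypothetical and exists only for this illustration; it is not part of PCRE2 or pcre2test.

```python
def actual_name(generic, code_unit_width):
    """Return the width-specific name for a generic PCRE2 function name.

    Hypothetical helper: it only models the suffix rule described above.
    """
    if code_unit_width not in (8, 16, 32):
        raise ValueError("code unit width must be 8, 16, or 32")
    return f"{generic}_{code_unit_width}"
```

For example, actual_name("pcre2_compile", 16) gives the name used by the 16-bit library.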
| 50 | |
| 51 | |
| 52 | INPUT ENCODING |
| 53 | |
| 54 | Input to pcre2test is processed line by line, either by calling the C |
| 55 | library's fgets() function, or via the libreadline or libedit library. |
| 56 | In some Windows environments character 26 (hex 1A) causes an immediate |
| 57 | end of file, and no further data is read, so this character should be |
| 58 | avoided unless you really want that action. |
| 59 | |
| 60 | The input is processed using C's string functions, so must not |
| 61 | contain binary zeros, even though in Unix-like environments, fgets() |
| 62 | treats any bytes other than newline as data characters. An error is |
| 63 | generated if a binary zero is encountered. By default subject lines are |
| 64 | processed for backslash escapes, which makes it possible to include any |
| 65 | data value in strings that are passed to the library for matching. For |
| 66 | patterns, there is a facility for specifying some or all of the 8-bit |
| 67 | input characters as hexadecimal pairs, which makes it possible to in- |
| 68 | clude binary zeros. |
| 69 | |
| 70 | Input for the 16-bit and 32-bit libraries |
| 71 | |
| 72 | When testing the 16-bit or 32-bit libraries, there is a need to be able |
| 73 | to generate character code points greater than 255 in the strings that |
| 74 | are passed to the library. For subject lines, backslash escapes can be |
| 75 | used. In addition, when the utf modifier (see "Setting compilation op- |
| 76 | tions" below) is set, the pattern and any following subject lines are |
| 77 | interpreted as UTF-8 strings and translated to UTF-16 or UTF-32 as ap- |
| 78 | propriate. |
| 79 | |
| 80 | For non-UTF testing of wide characters, the utf8_input modifier can be |
| 81 | used. This is mutually exclusive with utf, and is allowed only in |
| 82 | 16-bit or 32-bit mode. It causes the pattern and following subject |
| 83 | lines to be treated as UTF-8 according to the original definition (RFC |
| 84 | 2279), which allows for character values up to 0x7fffffff. Each charac- |
| 85 | ter is placed in one 16-bit or 32-bit code unit (in the 16-bit case, |
| 86 | values greater than 0xffff cause an error to occur). |
| 87 | |
| 88 | UTF-8 (in its original definition) is not capable of encoding values |
| 89 | greater than 0x7fffffff, but such values can be handled by the 32-bit |
| 90 | library. When testing this library in non-UTF mode with utf8_input set, |
| 91 | if any character is preceded by the byte 0xff (which is an invalid byte |
| 92 | in UTF-8) 0x80000000 is added to the character's value. This is the |
| 93 | only way of passing such code points in a pattern string. For subject |
| 94 | strings, using an escape sequence is preferable. |
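
As a rough illustration of the utf8_input behaviour described above, the following Python sketch decodes original-definition (RFC 2279) UTF-8, including the 5- and 6-byte forms, and models the 0xff prefix for the 32-bit library. The function name and its width parameter are assumptions made for this example; this is not pcre2test's own code, and it does not reject overlong encodings.

```python
def decode_utf8_input(data, width=32):
    """Decode RFC 2279 style UTF-8 bytes into a list of code unit values.

    width is 16 or 32 and stands for the library being tested (assumption).
    """
    values = []
    i = 0
    while i < len(data):
        add = 0
        if data[i] == 0xFF:
            # 0xff is never valid UTF-8; here it flags "add 0x80000000".
            if width != 32:
                raise ValueError("0xff prefix is modelled for 32-bit mode only")
            add = 0x80000000
            i += 1
            if i >= len(data):
                raise ValueError("0xff at end of input")
        lead = data[i]
        if lead < 0x80:
            length, value = 1, lead
        elif 0xC0 <= lead < 0xE0:
            length, value = 2, lead & 0x1F
        elif 0xE0 <= lead < 0xF0:
            length, value = 3, lead & 0x0F
        elif 0xF0 <= lead < 0xF8:
            length, value = 4, lead & 0x07
        elif 0xF8 <= lead < 0xFC:
            length, value = 5, lead & 0x03
        elif 0xFC <= lead < 0xFE:
            length, value = 6, lead & 0x01
        else:
            raise ValueError("invalid leading byte")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError("truncated or malformed sequence")
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        value += add
        if width == 16 and value > 0xFFFF:
            raise ValueError("value does not fit in one 16-bit code unit")
        values.append(value)
        i += length
    return values
```

Each decoded value occupies exactly one code unit, which is why the 16-bit case rejects anything above 0xffff instead of producing a surrogate pair.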
| 95 | |
| 96 | |
| 97 | COMMAND LINE OPTIONS |
| 98 | |
| 99 | -8 If the 8-bit library has been built, this option causes it to |
| 100 | be used (this is the default). If the 8-bit library has not |
| 101 | been built, this option causes an error. |
| 102 | |
| 103 | -16 If the 16-bit library has been built, this option causes it |
| 104 | to be used. If only the 16-bit library has been built, this |
| 105 | is the default. If the 16-bit library has not been built, |
| 106 | this option causes an error. |
| 107 | |
| 108 | -32 If the 32-bit library has been built, this option causes it |
| 109 | to be used. If only the 32-bit library has been built, this |
| 110 | is the default. If the 32-bit library has not been built, |
| 111 | this option causes an error. |
| 112 | |
| 113 | -ac Behave as if each pattern has the auto_callout modifier, that |
| 114 | is, insert automatic callouts into every pattern that is com- |
| 115 | piled. |
| 116 | |
| 117 | -AC As for -ac, but in addition behave as if each subject line |
| 118 | has the callout_extra modifier, that is, show additional in- |
| 119 | formation from callouts. |
| 120 | |
| 121 | -b Behave as if each pattern has the fullbincode modifier; the |
| 122 | full internal binary form of the pattern is output after com- |
| 123 | pilation. |
| 124 | |
| 125 | -C Output the version number of the PCRE2 library, and all |
| 126 | available information about the optional features that are |
| 127 | included, and then exit with zero exit code. All other op- |
| 128 | tions are ignored. If both -C and any -Lx options are present, |
| 129 | whichever is first is recognized. |
| 130 | |
| 131 | -C option Output information about a specific build-time option, then |
| 132 | exit. This functionality is intended for use in scripts such |
| 133 | as RunTest. The following options output the value and set |
| 134 | the exit code as indicated: |
| 135 | |
| 136 | ebcdic-nl the code for LF (= NL) in an EBCDIC environment: |
| 137 | 0x15 or 0x25 |
| 138 | 0 if used in an ASCII environment |
| 139 | exit code is always 0 |
| 140 | linksize the configured internal link size (2, 3, or 4) |
| 141 | exit code is set to the link size |
| 142 | newline the default newline setting: |
| 143 | CR, LF, CRLF, ANYCRLF, ANY, or NUL |
| 144 | exit code is always 0 |
| 145 | bsr the default setting for what \R matches: |
| 146 | ANYCRLF or ANY |
| 147 | exit code is always 0 |
| 148 | |
| 149 | The following options output 1 for true or 0 for false, and |
| 150 | set the exit code to the same value: |
| 151 | |
| 152 | backslash-C \C is supported (not locked out) |
| 153 | ebcdic compiled for an EBCDIC environment |
| 154 | jit just-in-time support is available |
| 155 | pcre2-16 the 16-bit library was built |
| 156 | pcre2-32 the 32-bit library was built |
| 157 | pcre2-8 the 8-bit library was built |
| 158 | unicode Unicode support is available |
| 159 | |
| 160 | If an unknown option is given, an error message is output; |
| 161 | the exit code is 0. |
| 162 | |
| 163 | -d Behave as if each pattern has the debug modifier; the inter- |
| 164 | nal form and information about the compiled pattern is output |
| 165 | after compilation; -d is equivalent to -b -i. |
| 166 | |
| 167 | -dfa Behave as if each subject line has the dfa modifier; matching |
| 168 | is done using the pcre2_dfa_match() function instead of the |
| 169 | default pcre2_match(). |
| 170 | |
| 171 | -error number[,number,...] |
| 172 | Call pcre2_get_error_message() for each of the error numbers |
| 173 | in the comma-separated list, display the resulting messages |
| 174 | on the standard output, then exit with zero exit code. The |
| 175 | numbers may be positive or negative. This is a convenience |
| 176 | facility for PCRE2 maintainers. |
| 177 | |
| 178 | -help Output a brief summary of these options and then exit. |
| 179 | |
| 180 | -i Behave as if each pattern has the info modifier; information |
| 181 | about the compiled pattern is given after compilation. |
| 182 | |
| 183 | -jit Behave as if each pattern line has the jit modifier; after |
| 184 | successful compilation, each pattern is passed to the just- |
| 185 | in-time compiler, if available. |
| 186 | |
| 187 | -jitfast Behave as if each pattern line has the jitfast modifier; af- |
| 188 | ter successful compilation, each pattern is passed to the |
| 189 | just-in-time compiler, if available, and each subject line is |
| 190 | passed directly to the JIT matcher via its "fast path". |
| 191 | |
| 192 | -jitverify |
| 193 | Behave as if each pattern line has the jitverify modifier; |
| 194 | after successful compilation, each pattern is passed to the |
| 195 | just-in-time compiler, if available, and the use of JIT for |
| 196 | matching is verified. |
| 197 | |
| 198 | -LM List modifiers: write a list of available pattern and subject |
| 199 | modifiers to the standard output, then exit with zero exit |
| 200 | code. All other options are ignored. If both -C and any -Lx |
| 201 | options are present, whichever is first is recognized. |
| 202 | |
| 203 | -LP List properties: write a list of recognized Unicode proper- |
| 204 | ties to the standard output, then exit with zero exit code. |
| 205 | All other options are ignored. If both -C and any -Lx options |
| 206 | are present, whichever is first is recognized. |
| 207 | |
| 208 | -LS List scripts: write a list of recognized Unicode script names |
| 209 | to the standard output, then exit with zero exit code. All |
| 210 | other options are ignored. If both -C and any -Lx options are |
| 211 | present, whichever is first is recognized. |
| 212 | |
| 213 | -pattern modifier-list |
| 214 | Behave as if each pattern line contains the given modifiers. |
| 215 | |
| 216 | -q Do not output the version number of pcre2test at the start of |
| 217 | execution. |
| 218 | |
| 219 | -S size On Unix-like systems, set the size of the run-time stack to |
| 220 | size mebibytes (units of 1024*1024 bytes). |
| 221 | |
| 222 | -subject modifier-list |
| 223 | Behave as if each subject line contains the given modifiers. |
| 224 | |
| 225 | -t Run each compile and match many times with a timer, and out- |
| 226 | put the resulting times per compile or match. When JIT is |
| 227 | used, separate times are given for the initial compile and |
| 228 | the JIT compile. You can control the number of iterations |
| 229 | that are used for timing by following -t with a number (as a |
| 230 | separate item on the command line). For example, "-t 1000" |
| 231 | iterates 1000 times. The default is to iterate 500,000 times. |
| 232 | |
| 233 | -tm This is like -t except that it times only the matching phase, |
| 234 | not the compile phase. |
| 235 | |
| 236 | -T -TM These behave like -t and -tm, but in addition, at the end of |
| 237 | a run, the total times for all compiles and matches are out- |
| 238 | put. |
| 239 | |
| 240 | -version Output the PCRE2 version number and then exit. |
| 241 | |
| 242 | |
| 243 | DESCRIPTION |
| 244 | |
| 245 | If pcre2test is given two filename arguments, it reads from the first |
| 246 | and writes to the second. If the first name is "-", input is taken from |
| 247 | the standard input. If pcre2test is given only one argument, it reads |
| 248 | from that file and writes to stdout. Otherwise, it reads from stdin and |
| 249 | writes to stdout. |
| 250 | |
| 251 | When pcre2test is built, a configuration option can specify that it |
| 252 | should be linked with the libreadline or libedit library. When this is |
| 253 | done, if the input is from a terminal, it is read using the readline() |
| 254 | function. This provides line-editing and history facilities. The output |
| 255 | from the -help option states whether or not readline() will be used. |
| 256 | |
| 257 | The program handles any number of tests, each of which consists of a |
| 258 | set of input lines. Each set starts with a regular expression pattern, |
| 259 | followed by any number of subject lines to be matched against that pat- |
| 260 | tern. In between sets of test data, command lines that begin with # may |
| 261 | appear. This file format, with some restrictions, can also be processed |
| 262 | by the perltest.sh script that is distributed with PCRE2 as a means of |
| 263 | checking that the behaviour of PCRE2 and Perl is the same. For a speci- |
| 264 | fication of perltest.sh, see the comments near its beginning. See also |
| 265 | the #perltest command below. |
| 266 | |
| 267 | When the input is a terminal, pcre2test prompts for each line of input, |
| 268 | using "re>" to prompt for regular expression patterns, and "data>" to |
| 269 | prompt for subject lines. Command lines starting with # can be entered |
| 270 | only in response to the "re>" prompt. |
| 271 | |
| 272 | Each subject line is matched separately and independently. If you want |
| 273 | to do multi-line matches, you have to use the \n escape sequence (or \r |
| 274 | or \r\n, etc., depending on the newline setting) in a single line of |
| 275 | input to encode the newline sequences. There is no limit on the length |
| 276 | of subject lines; the input buffer is automatically extended if it is |
| 277 | too small. There are replication features that make it possible to |
| 278 | generate long repetitive pattern or subject lines without having to |
| 279 | supply them explicitly. |
| 280 | |
| 281 | An empty line or the end of the file signals the end of the subject |
| 282 | lines for a test, at which point a new pattern or command line is ex- |
| 283 | pected if there is still input to be read. |
| 284 | |
| 285 | |
| 286 | COMMAND LINES |
| 287 | |
| 288 | In between sets of test data, a line that begins with # is interpreted |
| 289 | as a command line. If the first character is followed by white space or |
| 290 | an exclamation mark, the line is treated as a comment, and ignored. |
| 291 | Otherwise, the following commands are recognized: |
| 292 | |
| 293 | #forbid_utf |
| 294 | |
| 295 | Subsequent patterns automatically have the PCRE2_NEVER_UTF and |
| 296 | PCRE2_NEVER_UCP options set, which locks out the use of the PCRE2_UTF |
| 297 | and PCRE2_UCP options and the use of (*UTF) and (*UCP) at the start of |
| 298 | patterns. This command also forces an error if a subsequent pattern |
| 299 | contains any occurrences of \P, \p, or \X, which are still supported |
| 300 | when PCRE2_UTF is not set, but which require Unicode property support |
| 301 | to be included in the library. |
| 302 | |
| 303 | This is a trigger guard that is used in test files to ensure that UTF |
| 304 | or Unicode property tests are not accidentally added to files that are |
| 305 | used when Unicode support is not included in the library. Setting |
| 306 | PCRE2_NEVER_UTF and PCRE2_NEVER_UCP as a default can also be obtained |
| 307 | by the use of #pattern; the difference is that #forbid_utf cannot be |
| 308 | unset, and the automatic options are not displayed in pattern informa- |
| 309 | tion, to avoid cluttering up test output. |
| 310 | |
| 311 | #load <filename> |
| 312 | |
| 313 | This command is used to load a set of precompiled patterns from a file, |
| 314 | as described in the section entitled "Saving and restoring compiled |
| 315 | patterns" below. |
| 316 | |
| 317 | #loadtables <filename> |
| 318 | |
| 319 | This command is used to load a set of binary character tables that can |
| 320 | be accessed by the tables=3 qualifier. Such tables can be created by |
| 321 | the pcre2_dftables program with the -b option. |
| 322 | |
| 323 | #newline_default [<newline-list>] |
| 324 | |
| 325 | When PCRE2 is built, a default newline convention can be specified. |
| 326 | This determines which characters and/or character pairs are recognized |
| 327 | as indicating a newline in a pattern or subject string. The default can |
| 328 | be overridden when a pattern is compiled. The standard test files con- |
| 329 | tain tests of various newline conventions, but the majority of the |
| 330 | tests expect a single linefeed to be recognized as a newline by de- |
| 331 | fault. Without special action the tests would fail when PCRE2 is com- |
| 332 | piled with either CR or CRLF as the default newline. |
| 333 | |
| 334 | The #newline_default command specifies a list of newline types that are |
| 335 | acceptable as the default. The types must be one of CR, LF, CRLF, ANY- |
| 336 | CRLF, ANY, or NUL (in upper or lower case), for example: |
| 337 | |
| 338 | #newline_default LF Any anyCRLF |
| 339 | |
| 340 | If the default newline is in the list, this command has no effect. Oth- |
| 341 | erwise, except when testing the POSIX API, a newline modifier that |
| 342 | specifies the first newline convention in the list (LF in the above ex- |
| 343 | ample) is added to any pattern that does not already have a newline |
| 344 | modifier. If the newline list is empty, the feature is turned off. This |
| 345 | command is present in a number of the standard test input files. |
| 346 | |
| 347 | When the POSIX API is being tested there is no way to override the de- |
| 348 | fault newline convention, though it is possible to set the newline con- |
| 349 | vention from within the pattern. A warning is given if the posix or |
| 350 | posix_nosub modifier is used when #newline_default would set a default |
| 351 | for the non-POSIX API. |
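
The decision made by #newline_default can be summarized in a short Python sketch. The function name, its parameters, and the representation of modifiers as a list of strings are assumptions made for illustration; only the rules themselves come from the description above.

```python
VALID_NEWLINE_TYPES = {"CR", "LF", "CRLF", "ANYCRLF", "ANY", "NUL"}


def newline_modifier_to_add(build_default, acceptable, pattern_modifiers,
                            posix=False):
    """Return the newline modifier to add to a pattern, or None.

    build_default     -- newline convention PCRE2 was built with
    acceptable        -- types listed on the #newline_default line
    pattern_modifiers -- modifiers already present on the pattern
    posix             -- True when the POSIX API is being tested
    """
    types = [t.upper() for t in acceptable]
    for t in types:
        if t not in VALID_NEWLINE_TYPES:
            raise ValueError(f"unknown newline type: {t}")
    if not types:
        return None  # empty list: the feature is turned off
    if build_default.upper() in types:
        return None  # the built-in default is acceptable
    if posix:
        return None  # no override is possible through the POSIX API
    for modifier in pattern_modifiers:
        if modifier.split("=", 1)[0].strip() == "newline":
            return None  # the pattern already chooses a convention
    return "newline=" + types[0].lower()
```

With a library built for CR and the list "LF Any anyCRLF", a pattern that has no newline modifier of its own would gain newline=lf.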
| 352 | |
| 353 | #pattern <modifier-list> |
| 354 | |
| 355 | This command sets a default modifier list that applies to all subse- |
| 356 | quent patterns. Modifiers on a pattern can change these settings. |
| 357 | |
| 358 | #perltest |
| 359 | |
| 360 | This line is used in test files that can also be processed by perl- |
| 361 | test.sh to confirm that Perl gives the same results as PCRE2. Subse- |
| 362 | quent tests are checked for the use of pcre2test features that are in- |
| 363 | compatible with the perltest.sh script. |
| 364 | |
| 365 | Patterns must use '/' as their delimiter, and only certain modifiers |
| 366 | are supported. Comment lines, #pattern commands, and #subject commands |
| 367 | that set or unset "mark" are recognized and acted on. The #perltest, |
| 368 | #forbid_utf, and #newline_default commands, which are needed in the |
| 369 | relevant pcre2test files, are silently ignored. All other command lines |
| 370 | are ignored, but give a warning message. The #perltest command helps |
| 371 | detect tests that are accidentally put in the wrong file or use the |
| 372 | wrong delimiter. For more details of the perltest.sh script see the |
| 373 | comments it contains. |
| 374 | |
| 375 | #pop [<modifiers>] |
| 376 | #popcopy [<modifiers>] |
| 377 | |
| 378 | These commands are used to manipulate the stack of compiled patterns, |
| 379 | as described in the section entitled "Saving and restoring compiled |
| 380 | patterns" below. |
| 381 | |
| 382 | #save <filename> |
| 383 | |
| 384 | This command is used to save a set of compiled patterns to a file, as |
| 385 | described in the section entitled "Saving and restoring compiled pat- |
| 386 | terns" below. |
| 387 | |
| 388 | #subject <modifier-list> |
| 389 | |
| 390 | This command sets a default modifier list that applies to all subse- |
| 391 | quent subject lines. Modifiers on a subject line can change these set- |
| 392 | tings. |
| 393 | |
| 394 | |
| 395 | MODIFIER SYNTAX |
| 396 | |
| 397 | Modifier lists are used with both pattern and subject lines. Items in a |
| 398 | list are separated by commas followed by optional white space. Trailing |
| 399 | whitespace in a modifier list is ignored. Some modifiers may be given |
| 400 | for both patterns and subject lines, whereas others are valid only for |
| 401 | one or the other. Each modifier has a long name, for example "an- |
| 402 | chored", and some of them must be followed by an equals sign and a |
| 403 | value, for example, "offset=12". Values cannot contain comma charac- |
| 404 | ters, but may contain spaces. Modifiers that do not take values may be |
| 405 | preceded by a minus sign to turn off a previous setting. |
| 406 | |
| 407 | A few of the more common modifiers can also be specified as single let- |
| 408 | ters, for example "i" for "caseless". In documentation, following the |
| 409 | Perl convention, these are written with a slash ("the /i modifier") for |
| 410 | clarity. Abbreviated modifiers must all be concatenated in the first |
| 411 | item of a modifier list. If the first item is not recognized as a long |
| 412 | modifier name, it is interpreted as a sequence of these abbreviations. |
| 413 | For example: |
| 414 | |
| 415 | /abc/ig,newline=cr,jit=3 |
| 416 | |
| 417 | This is a pattern line whose modifier list starts with two one-letter |
| 418 | modifiers (/i and /g). The lower-case abbreviated modifiers are the |
| 419 | same as used in Perl. |
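
The modifier list rules above can be sketched in Python as follows. The two lookup tables are a small illustrative subset chosen for this example, and the function name and result shape are assumptions; the real program knows many more modifiers and performs further validation.

```python
# Illustrative subsets only; pcre2test recognizes many more names.
LONG_NAMES = {"anchored", "caseless", "dotall", "global", "jit",
              "newline", "notbol", "notempty", "offset"}
ABBREVIATIONS = {"i": "caseless", "g": "global", "s": "dotall",
                 "x": "extended", "m": "multiline", "n": "no_auto_capture"}


def parse_modifier_list(text):
    """Split a modifier list into (name, value, enabled) tuples."""
    parsed = []
    for index, raw in enumerate(text.rstrip().split(",")):
        item = raw.lstrip()            # white space after a comma is optional
        enabled = not item.startswith("-")
        body = item if enabled else item[1:]
        name, has_value, value = body.partition("=")
        if name in LONG_NAMES:
            if has_value and not enabled:
                raise ValueError(f"cannot negate a modifier with a value: {item!r}")
            parsed.append((name, value if has_value else None, enabled))
        elif (index == 0 and enabled and not has_value and body
              and all(ch in ABBREVIATIONS for ch in body)):
            # First item is not a long name: read it as single letters.
            parsed.extend((ABBREVIATIONS[ch], None, True) for ch in body)
        else:
            raise ValueError(f"unrecognized modifier: {item!r}")
    return parsed
```

Applied to the list from the example pattern line, "ig,newline=cr,jit=3", the sketch yields caseless and global from the abbreviations, followed by newline with value "cr" and jit with value "3".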
| 420 | |
| 421 | |
| 422 | PATTERN SYNTAX |
| 423 | |
| 424 | A pattern line must start with one of the following characters (common |
| 425 | symbols, excluding pattern meta-characters): |
| 426 | |
| 427 | / ! " ' ` - = _ : ; , % & @ ~ |
| 428 | |
| 429 | This is interpreted as the pattern's delimiter. A regular expression |
| 430 | may be continued over several input lines, in which case the newline |
| 431 | characters are included within it. It is possible to include the delim- |
| 432 | iter as a literal within the pattern by escaping it with a backslash, |
| 433 | for example |
| 434 | |
| 435 | /abc\/def/ |
| 436 | |
| 437 | If you do this, the escape and the delimiter form part of the pattern, |
| 438 | but since the delimiters are all non-alphanumeric, the inclusion of the |
| 439 | backslash does not affect the pattern's interpretation. Note, however, |
| 440 | that this trick does not work within \Q...\E literal bracketing because |
| 441 | the backslash will itself be interpreted as a literal. If the terminat- |
| 442 | ing delimiter is immediately followed by a backslash, for example, |
| 443 | |
| 444 | /abc/\ |
| 445 | |
| 446 | then a backslash is added to the end of the pattern. This is done to |
| 447 | provide a way of testing the error condition that arises if a pattern |
| 448 | finishes with a backslash, because |
| 449 | |
| 450 | /abc\/ |
| 451 | |
| 452 | is interpreted as the first line of a pattern that starts with "abc/", |
| 453 | causing pcre2test to read the next line as a continuation of the regu- |
| 454 | lar expression. |
| 455 | |
| 456 | A pattern can be followed by a modifier list (details below). |
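
The delimiter rules in this section can be sketched in Python. The function name and return convention are assumptions made for this example: it returns the pattern and the text after the closing delimiter, or None when no closing delimiter is found and the next input line would be read as a continuation.

```python
DELIMITERS = set("/!\"'`-=_:;,%&@~")


def split_pattern_line(text):
    """Split a pattern line into (pattern, trailing_text), or return None."""
    delimiter = text[0]
    if delimiter not in DELIMITERS:
        raise ValueError(f"invalid pattern delimiter: {delimiter!r}")
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            i += 2  # the backslash and the escaped character both stay
            continue
        if ch == delimiter:
            pattern, rest = text[1:i], text[i + 1:]
            if rest.startswith("\\"):
                # A backslash right after the delimiter is appended to the
                # pattern, to allow testing a pattern that ends in backslash.
                pattern += "\\"
                rest = rest[1:]
            return pattern, rest
        i += 1
    return None  # no closing delimiter yet: the pattern continues
```

The three examples from the text behave as described: /abc\/def/ keeps the escaped slash inside the pattern, /abc/\ yields a pattern ending in a backslash, and /abc\/ returns None because the only other slash is escaped.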
| 457 | |
| 458 | |
| 459 | SUBJECT LINE SYNTAX |
| 460 | |
| 461 | Before each subject line is passed to pcre2_match(), pcre2_dfa_match(), |
| 462 | or pcre2_jit_match(), leading and trailing white space is removed, and |
| 463 | the line is scanned for backslash escapes, unless the subject_literal |
| 464 | modifier was set for the pattern. The following provide a means of en- |
| 465 | coding non-printing characters in a visible way: |
| 466 | |
| 467 | \a alarm (BEL, \x07) |
| 468 | \b backspace (\x08) |
| 469 | \e escape (\x1b) |
| 470 | \f form feed (\x0c) |
| 471 | \n newline (\x0a) |
| 472 | \r carriage return (\x0d) |
| 473 | \t tab (\x09) |
| 474 | \v vertical tab (\x0b) |
| 475 | \nnn octal character (up to 3 octal digits); always |
| 476 | a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode |
| 477 | \o{dd...} octal character (any number of octal digits) |
| 478 | \xhh hexadecimal byte (up to 2 hex digits) |
| 479 | \x{hh...} hexadecimal character (any number of hex digits) |
| 480 | |
| 481 | The use of \x{hh...} is not dependent on the use of the utf modifier on |
| 482 | the pattern. It is recognized always. There may be any number of hexa- |
| 483 | decimal digits inside the braces; invalid values provoke error mes- |
| 484 | sages. |
| 485 | |
| 486 | Note that \xhh specifies one byte rather than one character in UTF-8 |
| 487 | mode; this makes it possible to construct invalid UTF-8 sequences for |
| 488 | testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8 |
| 489 | character in UTF-8 mode, generating more than one byte if the value is |
| 490 | greater than 127. When testing the 8-bit library not in UTF-8 mode, |
| 491 | \x{hh} generates one byte for values less than 256, and causes an error |
| 492 | for greater values. |
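
The difference between \xhh and \x{hh} when testing the 8-bit library can be modelled with a few lines of Python. The function and parameter names are assumptions for this sketch, and the UTF-8 branch relies on Python's own encoder, so it covers only valid Unicode scalar values.

```python
def encode_hex_escape(value, braced, utf):
    """Return the bytes that a hex escape contributes to an 8-bit subject.

    braced -- True for \\x{hh...}, False for \\xhh
    utf    -- True when the pattern was compiled with the utf modifier
    """
    if not braced:
        if value > 0xFF:
            raise ValueError("\\xhh holds at most two hex digits")
        return bytes([value])          # always exactly one byte
    if utf:
        return chr(value).encode("utf-8")   # one character, 1 to 4 bytes
    if value > 0xFF:
        raise ValueError("value too large for non-UTF 8-bit mode")
    return bytes([value])
```

In UTF-8 mode the value 0xe9 therefore becomes the single (invalid on its own) byte e9 when written as \xe9, but the two-byte sequence c3 a9 when written as \x{e9}.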
| 493 | |
| 494 | In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it |
| 495 | possible to construct invalid UTF-16 sequences for testing purposes. |
| 496 | |
| 497 | In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This |
| 498 | makes it possible to construct invalid UTF-32 sequences for testing |
| 499 | purposes. |
| 500 | |
| 501 | There is a special backslash sequence that specifies replication of one |
| 502 | or more characters: |
| 503 | |
| 504 | \[<characters>]{<count>} |
| 505 | |
| 506 | This makes it possible to test long strings without having to provide |
| 507 | them as part of the file. For example: |
| 508 | |
| 509 | \[abc]{4} |
| 510 | |
| 511 | is converted to "abcabcabcabc". This feature does not support nesting. |
| 512 | To include a closing square bracket in the characters, code it as \x5D. |
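
A minimal Python sketch of the replication syntax follows. It uses a regular expression rather than whatever scanning pcre2test itself performs, the function name is an assumption, and other backslash escapes (including \x5D inside the brackets) are left untouched for a later processing step.

```python
import re

# Matches \[<characters>]{<count>} with no nesting.
_REPLICATION = re.compile(r"\\\[([^\]]*)\]\{(\d+)\}")


def expand_replication(subject):
    """Expand every \\[...]{n} item in a subject line."""
    return _REPLICATION.sub(lambda m: m.group(1) * int(m.group(2)), subject)
```

For the example above, expand_replication(r"\[abc]{4}") produces the twelve-character string shown in the text.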
| 513 | |
| 514 | A backslash followed by an equals sign marks the end of the subject |
| 515 | string and the start of a modifier list. For example: |
| 516 | |
| 517 | abc\=notbol,notempty |
| 518 | |
| 519 | If the subject string is empty and \= is followed by whitespace, the |
| 520 | line is treated as a comment line, and is not used for matching. For |
| 521 | example: |
| 522 | |
| 523 | \= This is a comment. |
| 524 | abc\= This is an invalid modifier list. |
| 525 | |
| 526 | A backslash followed by any other non-alphanumeric character just es- |
| 527 | capes that character. A backslash followed by anything else causes an |
| 528 | error. However, if the very last character in the line is a backslash |
| 529 | (and there is no modifier list), it is ignored. This gives a way of |
| 530 | passing an empty line as data, since a real empty line terminates the |
| 531 | data input. |
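
The way a subject line divides into subject text, modifier list, or comment can be sketched in Python. The function name and the three-element result are assumptions made for this example; the returned subject still contains its backslash escapes, which would be processed separately.

```python
def split_subject_line(line):
    """Classify a subject line as ("comment", None, None) or
    ("subject", subject_text, modifier_text)."""
    text = line.strip()                 # leading/trailing white space removed
    i = 0
    while i < len(text):
        if text[i] == "\\":
            if i + 1 == len(text):
                # Lone trailing backslash with no modifier list: ignored.
                return ("subject", text[:i], "")
            if text[i + 1] == "=":
                subject, modifiers = text[:i], text[i + 2:]
                if subject == "" and modifiers[:1].isspace():
                    return ("comment", None, None)
                return ("subject", subject, modifiers)
            i += 2                      # skip the escaped character
            continue
        i += 1
    return ("subject", text, "")
```

A line consisting of a single backslash comes back as an empty subject, which is the documented way of passing an empty string as data.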
| 532 | |
| 533 | If the subject_literal modifier is set for a pattern, all subject lines |
| 534 | that follow are treated as literals, with no special treatment of back- |
| 535 | slashes. No replication is possible, and any subject modifiers must be |
| 536 | set as defaults by a #subject command. |
| 537 | |
| 538 | |
| 539 | PATTERN MODIFIERS |
| 540 | |
| 541 | There are several types of modifier that can appear in pattern lines. |
| 542 | Except where noted below, they may also be used in #pattern commands. A |
| 543 | pattern's modifier list can add to or override default modifiers that |
| 544 | were set by a previous #pattern command. |
| 545 | |
| 546 | Setting compilation options |
| 547 | |
| 548 | The following modifiers set options for pcre2_compile(). Most of them |
| 549 | set bits in the options argument of that function, but those whose |
| 550 | names start with PCRE2_EXTRA are additional options that are set in the |
| 551 | compile context. For the main options, there are some single-letter ab- |
| 552 | breviations that are the same as Perl options. There is special han- |
| 553 | dling for /x: if a second x is present, PCRE2_EXTENDED is converted |
| 554 | into PCRE2_EXTENDED_MORE as in Perl. A third appearance adds PCRE2_EX- |
| 555 | TENDED as well, though this makes no difference to the way pcre2_com- |
| 556 | pile() behaves. See pcre2api for a description of the effects of these |
| 557 | options. |
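
The special handling of repeated /x can be written out as a small Python sketch. Option names are kept as strings instead of bit values, the function name is an assumption, and the behaviour for more than three appearances is simply what this sketch does, since the text above does not specify it.

```python
def extended_options(x_count):
    """Return the set of option names produced by x_count /x letters."""
    options = set()
    for _ in range(x_count):
        if "PCRE2_EXTENDED_MORE" in options:
            # Third appearance: PCRE2_EXTENDED is added alongside.
            options.add("PCRE2_EXTENDED")
        elif "PCRE2_EXTENDED" in options:
            # Second appearance: converted, as in Perl's /xx.
            options.discard("PCRE2_EXTENDED")
            options.add("PCRE2_EXTENDED_MORE")
        else:
            options.add("PCRE2_EXTENDED")
    return options
```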
| 558 | |
| 559 | allow_empty_class set PCRE2_ALLOW_EMPTY_CLASS |
| 560 | allow_lookaround_bsk set PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK |
| 561 | allow_surrogate_escapes set PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES |
| 562 | alt_bsux set PCRE2_ALT_BSUX |
| 563 | alt_circumflex set PCRE2_ALT_CIRCUMFLEX |
| 564 | alt_verbnames set PCRE2_ALT_VERBNAMES |
| 565 | anchored set PCRE2_ANCHORED |
| 566 | auto_callout set PCRE2_AUTO_CALLOUT |
| 567 | bad_escape_is_literal set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL |
| 568 | /i caseless set PCRE2_CASELESS |
| 569 | dollar_endonly set PCRE2_DOLLAR_ENDONLY |
| 570 | /s dotall set PCRE2_DOTALL |
| 571 | dupnames set PCRE2_DUPNAMES |
| 572 | endanchored set PCRE2_ENDANCHORED |
| 573 | escaped_cr_is_lf set PCRE2_EXTRA_ESCAPED_CR_IS_LF |
| 574 | /x extended set PCRE2_EXTENDED |
| 575 | /xx extended_more set PCRE2_EXTENDED_MORE |
| 576 | extra_alt_bsux set PCRE2_EXTRA_ALT_BSUX |
| 577 | firstline set PCRE2_FIRSTLINE |
| 578 | literal set PCRE2_LITERAL |
| 579 | match_line set PCRE2_EXTRA_MATCH_LINE |
| 580 | match_invalid_utf set PCRE2_MATCH_INVALID_UTF |
| 581 | match_unset_backref set PCRE2_MATCH_UNSET_BACKREF |
| 582 | match_word set PCRE2_EXTRA_MATCH_WORD |
| 583 | /m multiline set PCRE2_MULTILINE |
| 584 | never_backslash_c set PCRE2_NEVER_BACKSLASH_C |
| 585 | never_ucp set PCRE2_NEVER_UCP |
| 586 | never_utf set PCRE2_NEVER_UTF |
| 587 | /n no_auto_capture set PCRE2_NO_AUTO_CAPTURE |
| 588 | no_auto_possess set PCRE2_NO_AUTO_POSSESS |
| 589 | no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR |
| 590 | no_start_optimize set PCRE2_NO_START_OPTIMIZE |
| 591 | no_utf_check set PCRE2_NO_UTF_CHECK |
| 592 | ucp set PCRE2_UCP |
| 593 | ungreedy set PCRE2_UNGREEDY |
| 594 | use_offset_limit set PCRE2_USE_OFFSET_LIMIT |
| 595 | utf set PCRE2_UTF |
| 596 | |
| 597 | As well as turning on the PCRE2_UTF option, the utf modifier causes all |
| 598 | non-printing characters in output strings to be printed using the |
| 599 | \x{hh...} notation. Otherwise, those less than 0x100 are output in hex |
| 600 | without the curly brackets. Setting utf in 16-bit or 32-bit mode also |
| 601 | causes pattern and subject strings to be translated to UTF-16 or |
| 602 | UTF-32, respectively, before being passed to library functions. |
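
The output convention for non-printing characters can be approximated in Python as below. Two points are assumptions of this sketch rather than statements about pcre2test: only printable ASCII (0x20 to 0x7e) is treated as printing, and hex digits are written in lower case with at least two digits.

```python
def show_char(value, utf):
    """Format one character value the way the text above describes."""
    if 0x20 <= value <= 0x7E:
        return chr(value)               # printing character, shown as-is
    if utf or value >= 0x100:
        return "\\x{%02x}" % value      # braces in UTF mode or for wide values
    return "\\x%02x" % value            # below 0x100 without utf: no braces
```

Under these assumptions a linefeed is shown as \x0a without the utf modifier and as \x{0a} with it.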
| 603 | |
| 604 | Setting compilation controls |
| 605 | |
| 606 | The following modifiers affect the compilation process or request in- |
| 607 | formation about the pattern. There are single-letter abbreviations for |
| 608 | some that are heavily used in the test files. |
| 609 | |
| 610 | bsr=[anycrlf|unicode] specify \R handling |
| 611 | /B bincode show binary code without lengths |
| 612 | callout_info show callout information |
| 613 | convert=<options> request foreign pattern conversion |
| 614 | convert_glob_escape=c set glob escape character |
| 615 | convert_glob_separator=c set glob separator character |
| 616 | convert_length set convert buffer length |
| 617 | debug same as info,fullbincode |
| 618 | framesize show matching frame size |
| 619 | fullbincode show binary code with lengths |
| 620 | /I info show info about compiled pattern |
| 621 | hex unquoted characters are hexadecimal |
| 622 | jit[=<number>] use JIT |
| 623 | jitfast use JIT fast path |
| 624 | jitverify verify JIT use |
| 625 | locale=<name> use this locale |
| 626 | max_pattern_length=<n> set the maximum pattern length |
| 627 | memory show memory used |
| 628 | newline=<type> set newline type |
| 629 | null_context compile with a NULL context |
| 630 | parens_nest_limit=<n> set maximum parentheses depth |
| 631 | posix use the POSIX API |
| 632 | posix_nosub use the POSIX API with REG_NOSUB |
| 633 | push push compiled pattern onto the stack |
| 634 | pushcopy push a copy onto the stack |
| 635 | stackguard=<number> test the stackguard feature |
| 636 | subject_literal treat all subject lines as literal |
| 637 | tables=[0|1|2|3] select internal tables |
| 638 | use_length do not zero-terminate the pattern |
| 639 | utf8_input treat input as UTF-8 |
| 640 | |
| 641 | The effects of these modifiers are described in the following sections. |
| 642 | |
| 643 | Newline and \R handling |
| 644 | |
| 645 | The bsr modifier specifies what \R in a pattern should match. If it is |
| 646 | set to "anycrlf", \R matches CR, LF, or CRLF only. If it is set to |
| 647 | "unicode", \R matches any Unicode newline sequence. The default can be |
| 648 | specified when PCRE2 is built; if it is not, the default is set to Uni- |
| 649 | code. |
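The difference between the two bsr settings can be illustrated with Python's re module standing in for PCRE2. The two regular expressions below paraphrase the sets named in the text; the function name matches_backslash_r is hypothetical and is not part of pcre2test or the library.

```python
import re

# "anycrlf": CR, LF, or CRLF only.
ANYCRLF = re.compile(r"\r\n|\r|\n")

# "unicode": CRLF, or any single Unicode newline character
# (LF, VT, FF, CR, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR).
UNICODE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def matches_backslash_r(text, mode):
    """Return True if the whole of text is one \\R sequence in this mode."""
    rx = ANYCRLF if mode == "anycrlf" else UNICODE
    return rx.fullmatch(text) is not None


print(matches_backslash_r("\r\n", "anycrlf"))
print(matches_backslash_r("\x0b", "anycrlf"))
print(matches_backslash_r("\x0b", "unicode"))
```

A vertical tab is accepted only under "unicode", while CRLF is a single newline sequence under both settings.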
| 650 | |
| 651 | The newline modifier specifies which characters are to be interpreted |
| 652 | as newlines, both in the pattern and in subject lines. The type must be |
| 653 | one of CR, LF, CRLF, ANYCRLF, ANY, or NUL (in upper or lower case). |
| 654 | |
| 655 | Information about a pattern |
| 656 | |
| 657 | The debug modifier is a shorthand for info,fullbincode, requesting all |
| 658 | available information. |
| 659 | |
| 660 | The bincode modifier causes a representation of the compiled code to be |
| 661 | output after compilation. This information does not contain length and |
| 662 | offset values, which ensures that the same output is generated for dif- |
| 663 | ferent internal link sizes and different code unit widths. By using |
| 664 | bincode, the same regression tests can be used in different environ- |
| 665 | ments. |
| 666 | |
| 667 | The fullbincode modifier, by contrast, does include length and offset |
| 668 | values. This is used in a few special tests that run only for specific |
| 669 | code unit widths and link sizes, and is also useful for one-off tests. |
| 670 | |
| 671 | The info modifier requests information about the compiled pattern |
| 672 | (whether it is anchored, has a fixed first character, and so on). The |
| 673 | information is obtained from the pcre2_pattern_info() function. Here |
| 674 | are some typical examples: |
| 675 | |
| 676 | re> /(?i)(^a|^b)/m,info |
| 677 | Capture group count = 1 |
| 678 | Compile options: multiline |
| 679 | Overall options: caseless multiline |
| 680 | First code unit at start or follows newline |
| 681 | Subject length lower bound = 1 |
| 682 | |
| 683 | re> /(?i)abc/info |
| 684 | Capture group count = 0 |
| 685 | Compile options: <none> |
| 686 | Overall options: caseless |
| 687 | First code unit = 'a' (caseless) |
| 688 | Last code unit = 'c' (caseless) |
| 689 | Subject length lower bound = 3 |
| 690 | |
| 691 | "Compile options" are those specified by modifiers; "overall options" |
| 692 | have added options that are taken or deduced from the pattern. If both |
| 693 | sets of options are the same, just a single "options" line is output; |
| 694 | if there are no options, the line is omitted. "First code unit" is |
| 695 | where any match must start; if there is more than one they are listed |
| 696 | as "starting code units". "Last code unit" is the last literal code |
| 697 | unit that must be present in any match. This is not necessarily the |
| 698 | last character. These lines are omitted if no starting or ending code |
| 699 | units are recorded. The subject length line is omitted when |
| 700 | no_start_optimize is set because the minimum length is not calculated |
| 701 | when it can never be used. |
| 702 | |
| 703 | The framesize modifier shows the size, in bytes, of the storage frames |
| 704 | used by pcre2_match() for handling backtracking. The size depends on |
| 705 | the number of capturing parentheses in the pattern. |
| 706 | |
| 707 | The callout_info modifier requests information about all the callouts |
| 708 | in the pattern. A list of them is output at the end of any other infor- |
| 709 | mation that is requested. For each callout, either its number or string |
| 710 | is given, followed by the item that follows it in the pattern. |
| 711 | |
| 712 | Passing a NULL context |
| 713 | |
| 714 | Normally, pcre2test passes a context block to pcre2_compile(). If the |
| 715 | null_context modifier is set, however, NULL is passed. This is for |
| 716 | testing that pcre2_compile() behaves correctly in this case (it uses |
| 717 | default values). |
| 718 | |
| 719 | Specifying pattern characters in hexadecimal |
| 720 | |
| 721 | The hex modifier specifies that the characters of the pattern, except |
| 722 | for substrings enclosed in single or double quotes, are to be inter- |
| 723 | preted as pairs of hexadecimal digits. This feature is provided as a |
| 724 | way of creating patterns that contain binary zeros and other non-print- |
| 725 | ing characters. White space is permitted between pairs of digits. For |
| 726 | example, this pattern contains three characters: |
| 727 | |
| 728 | /ab 32 59/hex |
| 729 | |
| 730 | Parts of such a pattern are taken literally if quoted. This pattern |
| 731 | contains nine characters, only two of which are specified in hexadeci- |
| 732 | mal: |
| 733 | |
| 734 | /ab "literal" 32/hex |
| 735 | |
| 736 | Either single or double quotes may be used. There is no way of includ- |
| 737 | ing the delimiter within a substring. The hex and expand modifiers are |
| 738 | mutually exclusive. |
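The rules above can be checked with a small Python sketch that converts the text between the pattern delimiters into bytes. The function parse_hex_pattern is a hypothetical helper written for this illustration; it is not how pcre2test is implemented, and it assumes 8-bit code units.

```python
import string


def parse_hex_pattern(text):
    """Convert the body of a /.../hex pattern into bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            # White space is permitted between pairs of digits.
            i += 1
        elif ch in "'\"":
            # A quoted substring is taken literally, up to the same quote.
            end = text.find(ch, i + 1)
            if end < 0:
                raise ValueError("unterminated quoted substring")
            out += text[i + 1:end].encode("latin-1")
            i = end + 1
        else:
            # Anything else must be a pair of hexadecimal digits.
            pair = text[i:i + 2]
            if len(pair) != 2 or any(c not in string.hexdigits for c in pair):
                raise ValueError("expected two hexadecimal digits at %d" % i)
            out.append(int(pair, 16))
            i += 2
    return bytes(out)


print(len(parse_hex_pattern("ab 32 59")))
print(len(parse_hex_pattern('ab "literal" 32')))
```

The two calls reproduce the counts given in the text: three characters for the first pattern and nine for the second.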
| 739 | |
| 740 | Specifying the pattern's length |
| 741 | |
| 742 | By default, patterns are passed to the compiling functions as |
| 743 | zero-terminated strings. The use_length modifier causes a pattern to be |
| 744 | passed by length instead. Passing a length happens automatically |
| 745 | (whether or not use_length is set) when hex is set, because patterns |
| 746 | specified in hexadecimal may contain binary zeros, which would |
| 747 | otherwise end the string early. |
| 748 | |
| 749 | If hex or use_length is used with the POSIX wrapper API (see "Using the |
| 750 | POSIX wrapper API" below), the REG_PEND extension is used to pass the |
| 751 | pattern's length. |
| 752 | |
| 753 | Specifying wide characters in 16-bit and 32-bit modes |
| 754 | |
| 755 | In 16-bit and 32-bit modes, all input is automatically treated as UTF-8 |
| 756 | and translated to UTF-16 or UTF-32 when the utf modifier is set. For |
| 757 | testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input |
| 758 | modifier can be used. It is mutually exclusive with utf. Input lines |
| 759 | are interpreted as UTF-8 as a means of specifying wide characters. More |
| 760 | details are given in "Input encoding" above. |
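As a rough picture of the translation performed when utf is set, the following Python sketch decodes a UTF-8 input line and re-encodes it as 16-bit or 32-bit code units. The function to_code_units is hypothetical, and the sketch covers only the UTF case; it does not model the range checks that apply in non-UTF mode with utf8_input.

```python
def to_code_units(utf8_bytes, width):
    """Translate UTF-8 input into a list of 16-bit or 32-bit code units."""
    text = utf8_bytes.decode("utf-8")
    if width == 16:
        raw = text.encode("utf-16-le")
        # Split the encoded bytes into 16-bit little-endian units.
        return [int.from_bytes(raw[i:i + 2], "little")
                for i in range(0, len(raw), 2)]
    if width == 32:
        # In UTF-32 each code point is exactly one code unit.
        return [ord(ch) for ch in text]
    raise ValueError("width must be 16 or 32")


line = "A\U0001F600".encode("utf-8")
print([hex(u) for u in to_code_units(line, 16)])
print([hex(u) for u in to_code_units(line, 32)])
```

A character outside the Basic Multilingual Plane occupies two code units (a surrogate pair) in UTF-16 but a single code unit in UTF-32.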
| 761 | |
| 762 | Generating long repetitive patterns |
| 763 | |
| 764 | Some tests use long patterns that are very repetitive. Instead of cre- |
| 765 | ating a very long input line for such a pattern, you can use a special |
| 766 | repetition feature, similar to the one described for subject lines |
| 767 | above. If the expand modifier is present on a pattern, parts of the |
| 768 | pattern that have the form |
| 769 | |
| 770 | \[<characters>]{<count>} |
| 771 | |
| 772 | are expanded before the pattern is passed to pcre2_compile(). For |
| 773 | example, \[AB]{6000} is expanded to "AB" repeated 6000 times. This construction |
| 774 | cannot be nested. An initial "\[" sequence is recognized only if "]{" |
| 775 | followed by decimal digits and "}" is found later in the pattern. If |
| 776 | not, the characters remain in the pattern unaltered. The expand and hex |
| 777 | modifiers are mutually exclusive. |
| 778 | |
| 779 | If part of an expanded pattern looks like an expansion, but is really |
| 780 | part of the actual pattern, unwanted expansion can be avoided by giving |
| 781 | two values in the quantifier. For example, \[AB]{6000,6000} is not rec- |
| 782 | ognized as an expansion item. |
| 783 | |
| 784 | If the info modifier is set on an expanded pattern, the result of the |
| 785 | expansion is included in the information that is output. |
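The expansion rule can be approximated in Python with a single regular expression. This is a sketch under stated assumptions: the helper expand_pattern is hypothetical, and the repeated characters are assumed to be non-empty and to contain no "]".

```python
import re

# Backslash, "[", the characters to repeat, "]{", decimal digits, "}".
_EXPANSION = re.compile(r"\\\[([^\]]+)\]\{(\d+)\}")


def expand_pattern(pattern):
    """Replace each \\[chars]{count} item by chars repeated count times."""
    return _EXPANSION.sub(lambda m: m.group(1) * int(m.group(2)), pattern)


print(expand_pattern(r"x\[AB]{3}y"))
print(len(expand_pattern(r"\[AB]{6000}")))
print(expand_pattern(r"\[AB]{6000,6000}"))
```

Because the regular expression requires a single decimal count before the closing brace, the two-value quantifier in the last call is left unaltered, as described above.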
| 786 | |
| 787 | JIT compilation |
| 788 | |
| 789 | Just-in-time (JIT) compiling is a heavyweight optimization that can |
| 790 | greatly speed up pattern matching. See the pcre2jit documentation for |
| 791 | details. JIT compiling happens, optionally, after a pattern has been |
| 792 | successfully compiled into an internal form. The JIT compiler converts |
| 793 | this to optimized machine code. It needs to know whether the match-time |
| 794 | options PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used, |
| 795 | because different code is generated for the different cases. See the |
| 796 | partial modifier in "Subject Modifiers" below for details of how these |
| 797 | options are specified for each match attempt. |
| 798 | |
| 799 | JIT compilation is requested by the jit pattern modifier, which may op- |
| 800 | tionally be followed by an equals sign and a number in the range 0 to |
| 801 | 7. The three bits that make up the number specify which of the three |
| 802 | JIT operating modes are to be compiled: |
| 803 | |
| 804 | 1 compile JIT code for non-partial matching |
| 805 | 2 compile JIT code for soft partial matching |
| 806 | 4 compile JIT code for hard partial matching |
| 807 | |
| 808 | The possible values for the jit modifier are therefore: |
| 809 | |
| 810 | 0 disable JIT |
| 811 | 1 normal matching only |
| 812 | 2 soft partial matching only |
| 813 | 3 normal and soft partial matching |
| 814 | 4 hard partial matching only |
| 815 | 6 soft and hard partial matching only |
| 816 | 7 all three modes |
| 817 | |
| 818 | If no number is given, 7 is assumed. The phrase "partial matching" |
| 819 | means a call to pcre2_match() with either the PCRE2_PARTIAL_SOFT or the |
| 820 | PCRE2_PARTIAL_HARD option set. Note that such a call may return a com- |
| 821 | plete match; the options enable the possibility of a partial match, but |
| 822 | do not require it. Note also that if you request JIT compilation only |
| 823 | for partial matching (for example, jit=2) but do not set the partial |
| 824 | modifier on a subject line, that match will not use JIT code because |
| 825 | none was compiled for non-partial matching. |
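The bit arithmetic behind the jit values can be sketched in Python. The names jit_modes and jit_code_available are hypothetical. The second function models only the rule stated above (whether code was compiled for the kind of match requested) and ignores other reasons why JIT code might not be used.

```python
# Bit values taken from the list of JIT operating modes above.
JIT_MODES = {1: "non-partial", 2: "soft partial", 4: "hard partial"}


def jit_modes(value=7):
    """List the modes compiled for jit=<value>; 7 is the default."""
    if not 0 <= value <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    return [name for bit, name in JIT_MODES.items() if value & bit]


def jit_code_available(value, partial=None):
    """partial is None, "soft", or "hard" for the match being attempted."""
    needed = {None: 1, "soft": 2, "hard": 4}[partial]
    return bool(value & needed)


print(jit_modes(5))
print(jit_code_available(2))
print(jit_code_available(2, "soft"))
```

The value 5 does not appear in the list of possible values above, but by the same arithmetic it combines bits 1 and 4, that is, normal and hard partial matching.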
| 826 | |
| 827 | If JIT compilation is successful, the compiled JIT code will automati- |
| 828 | cally be used when an appropriate type of match is run, except when in- |
| 829 | compatible run-time options are specified. For more details, see the |
| 830 | pcre2jit documentation. See also the jitstack modifier below for a way |
| 831 | of setting the size of the JIT stack. |
| 832 | |
| 833 | If the jitfast modifier is specified, matching is done using the JIT |
| 834 | "fast path" interface, pcre2_jit_match(), which skips some of the san- |
| 835 | ity checks that are done by pcre2_match(), and of course does not work |
| 836 | when JIT is not supported. If jitfast is specified without jit, jit=7 |
| 837 | is assumed. |
| 838 | |
| 839 | If the jitverify modifier is specified, information about the compiled |
| 840 | pattern shows whether JIT compilation was or was not successful. If |
| 841 | jitverify is specified without jit, jit=7 is assumed. If JIT compila- |
| 842 | tion is successful when jitverify is set, the text "(JIT)" is added to |
| 843 | the first output line after a match or non-match when JIT-compiled code |
| 844 | was actually used in the match. |
| 845 | |
| 846 | Setting a locale |
| 847 | |
| 848 | The locale modifier must specify the name of a locale, for example: |
| 849 | |
| 850 | /pattern/locale=fr_FR |
| 851 | |
| 852 | The given locale is set, pcre2_maketables() is called to build a set of |
| 853 | character tables for the locale, and this is then passed to pcre2_com- |
| 854 | pile() when compiling the regular expression. The same tables are used |
| 855 | when matching the following subject lines. The locale modifier applies |
| 856 | only to the pattern on which it appears, but can be given in a #pattern |
| 857 | command if a default is needed. Setting a locale and setting alternate |
| 858 | character tables (the tables modifier) are mutually exclusive. |
| 859 | |
| 860 | Showing pattern memory |
| 861 | |
| 862 | The memory modifier causes the size in bytes of the memory used to hold |
| 863 | the compiled pattern to be output. This does not include the size of |
| 864 | the pcre2_code block; it is just the actual compiled data. If the pat- |
| 865 | tern is subsequently passed to the JIT compiler, the size of the JIT |
| 866 | compiled code is also output. Here is an example: |
| 867 | |
| 868 | re> /a(b)c/jit,memory |
| 869 | Memory allocation (code space): 21 |
| 870 | Memory allocation (JIT code): 1910 |
| 871 | |
| 872 | |
| 873 | Limiting nested parentheses |
| 874 | |
| 875 | The parens_nest_limit modifier sets a limit on the depth of nested |
| 876 | parentheses in a pattern. Breaching the limit causes a compilation er- |
| 877 | ror. The default for the library is set when PCRE2 is built, but |
| 878 | pcre2test sets its own default of 220, which is required for running |
| 879 | the standard test suite. |
| 880 | |
| 881 | Limiting the pattern length |
| 882 | |
| 883 | The max_pattern_length modifier sets a limit, in code units, to the |
| 884 | length of pattern that pcre2_compile() will accept. Breaching the limit |
| 885 | causes a compilation error. The default is the largest number a |
| 886 | PCRE2_SIZE variable can hold (essentially unlimited). |
| 887 | |
| 888 | Using the POSIX wrapper API |
| 889 | |
| 890 | The posix and posix_nosub modifiers cause pcre2test to call PCRE2 via |
| 891 | the POSIX wrapper API rather than its native API. When posix_nosub is |
| 892 | used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX |
| 893 | wrapper supports only the 8-bit library. Note that using it does not |
| 894 | imply POSIX matching semantics; for more detail see the pcre2posix |
| 895 | documentation. The following pattern modifiers set options for the |
| 896 | regcomp() function: |
| 897 | |
| 898 | caseless REG_ICASE |
| 899 | multiline REG_NEWLINE |
| 900 | dotall REG_DOTALL ) |
| 901 | ungreedy REG_UNGREEDY ) These options are not part of |
| 902 | ucp REG_UCP ) the POSIX standard |
| 903 | utf REG_UTF8 ) |
| 904 | |
| 905 | The regerror_buffsize modifier specifies a size for the error buffer |
| 906 | that is passed to regerror() in the event of a compilation error. For |
| 907 | example: |
| 908 | |
| 909 | /abc/posix,regerror_buffsize=20 |
| 910 | |
| 911 | This provides a means of testing the behaviour of regerror() when the |
| 912 | buffer is too small for the error message. If this modifier has not |
| 913 | been set, a large buffer is used. |
| 914 | |
| 915 | The aftertext and allaftertext subject modifiers work as described be- |
| 916 | low. All other modifiers are either ignored, with a warning message, or |
| 917 | cause an error. |
| 918 | |
| 919 | The pattern is passed to regcomp() as a zero-terminated string by de- |
| 920 | fault, but if the use_length or hex modifiers are set, the REG_PEND ex- |
| 921 | tension is used to pass it by length. |
| 922 | |
| 923 | Testing the stack guard feature |
| 924 | |
| 925 | The stackguard modifier is used to test the use of pcre2_set_com- |
| 926 | pile_recursion_guard(), a function that is provided to enable stack |
| 927 | availability to be checked during compilation (see the pcre2api docu- |
| 928 | mentation for details). If the number specified by the modifier is |
| 929 | greater than zero, pcre2_set_compile_recursion_guard() is called to set |
| 930 | up a callback from pcre2_compile() to a local function. The argument it |
| 931 | receives is the current parenthesis nesting depth; if this is greater |
| 932 | than the value given by the modifier, non-zero is returned, causing the |
| 933 | compilation to be aborted. |
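The callback logic can be modelled in Python as follows. This is a toy stand-in, not the library's compiler: the functions make_guard and compile_with_guard are hypothetical, the scan ignores character classes, and the points at which the real pcre2_compile() invokes the guard are assumed rather than taken from its source.

```python
def make_guard(limit):
    """Build a guard like the local function described above."""
    def guard(depth):
        # Non-zero means "abort the compilation".
        return 1 if depth > limit else 0
    return guard


def compile_with_guard(pattern, guard):
    """Scan a pattern, calling guard at each new nesting depth.

    Returns False if the guard aborts, True otherwise.
    """
    depth = 0
    escaped = False
    for ch in pattern:
        if escaped:
            escaped = False      # an escaped character is never a paren
        elif ch == "\\":
            escaped = True
        elif ch == "(":
            depth += 1
            if guard(depth) != 0:
                return False
        elif ch == ")":
            depth -= 1
    return True


print(compile_with_guard("((a))", make_guard(2)))
print(compile_with_guard("(((a)))", make_guard(2)))
```

With a limit of 2, the pattern nested two deep is accepted and the one nested three deep is rejected.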
| 934 | |
| 935 | Using alternative character tables |
| 936 | |
| 937 | The value specified for the tables modifier must be one of the digits |
| 938 | 0, 1, 2, or 3. It causes a specific set of built-in character tables to |
| 939 | be passed to pcre2_compile(). This is used in the PCRE2 tests to check |
| 940 | behaviour with different character tables. The digit specifies the ta- |
| 941 | bles as follows: |
| 942 | |
| 943 | 0 do not pass any special character tables |
| 944 | 1 the default ASCII tables, as distributed in |
| 945 | pcre2_chartables.c.dist |
| 946 | 2 a set of tables defining ISO 8859 characters |
| 947 | 3 a set of tables loaded by the #loadtables command |
| 948 | |
| 949 | In tables 2, some characters whose codes are greater than 128 are iden- |
| 950 | tified as letters, digits, spaces, etc. Tables 3 can be used only after |
| 951 | a #loadtables command has loaded them from a binary file. Setting |
| 952 | alternate character tables and setting a locale are mutually exclusive. |
| 953 | |
| 954 | Setting certain match controls |
| 955 | |
| 956 | The following modifiers are really subject modifiers, and are described |
| 957 | under "Subject Modifiers" below. However, they may be included in a |
| 958 | pattern's modifier list, in which case they are applied to every sub- |
| 959 | ject line that is processed with that pattern. These modifiers do not |
| 960 | affect the compilation process. |
| 961 | |
| 962 | aftertext show text after match |
| 963 | allaftertext show text after captures |
| 964 | allcaptures show all captures |
| 965 | allvector show the entire ovector |
| 966 | allusedtext show all consulted text |
| 967 | altglobal alternative global matching |
| 968 | /g global global matching |
| 969 | jitstack=<n> set size of JIT stack |
| 970 | mark show mark values |
| 971 | replace=<string> specify a replacement string |
| 972 | startchar show starting character when relevant |
| 973 | substitute_callout use substitution callouts |
| 974 | substitute_extended use PCRE2_SUBSTITUTE_EXTENDED |
| 975 | substitute_literal use PCRE2_SUBSTITUTE_LITERAL |
| 976 | substitute_matched use PCRE2_SUBSTITUTE_MATCHED |
| 977 | substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH |
| 978 | substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY |
| 979 | substitute_skip=<n> skip substitution <n> |
| 980 | substitute_stop=<n> skip substitution <n> and following |
| 981 | substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET |
| 982 | substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY |
| 983 | |
| 984 | These modifiers may not appear in a #pattern command. If you want them |
| 985 | as defaults, set them in a #subject command. |
| 986 | |
| 987 | Specifying literal subject lines |
| 988 | |
| 989 | If the subject_literal modifier is present on a pattern, all the sub- |
| 990 | ject lines that it matches are taken as literal strings, with no inter- |
| 991 | pretation of backslashes. It is not possible to set subject modifiers |
| 992 | on such lines, but any that are set as defaults by a #subject command |
| 993 | are recognized. |
| 994 | |
| 995 | Saving a compiled pattern |
| 996 | |
| 997 | When a pattern with the push modifier is successfully compiled, it is |
| 998 | pushed onto a stack of compiled patterns, and pcre2test expects the |
| 999 | next line to contain a new pattern (or a command) instead of a subject |
| 1000 | line. This facility is used when saving compiled patterns to a file, as |
| 1001 | described in the section entitled "Saving and restoring compiled pat- |
| 1002 | terns" below. If pushcopy is used instead of push, a copy of the com- |
| 1003 | piled pattern is stacked, leaving the original as current, ready to |
| 1004 | match the following input lines. This provides a way of testing the |
| 1005 | pcre2_code_copy() function. The push and pushcopy modifiers are |
| 1006 | incompatible with pattern modifiers such as global that act at match |
| 1007 | time. Any that are specified are ignored (for the stacked copy), with a |
| 1008 | warning message, except for replace, which causes an error. Note that |
| 1009 | jitverify, which is allowed, does not carry through to any subsequent |
| 1010 | matching that uses a stacked pattern. |
| 1011 | |
| 1012 | Testing foreign pattern conversion |
| 1013 | |
| 1014 | The experimental foreign pattern conversion functions in PCRE2 can be |
| 1015 | tested by setting the convert modifier. Its argument is a colon-sepa- |
| 1016 | rated list of options, which set the equivalent option for the |
| 1017 | pcre2_pattern_convert() function: |
| 1018 | |
| 1019 | glob PCRE2_CONVERT_GLOB |
| 1020 | glob_no_starstar PCRE2_CONVERT_GLOB_NO_STARSTAR |
| 1021 | glob_no_wild_separator PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR |
| 1022 | posix_basic PCRE2_CONVERT_POSIX_BASIC |
| 1023 | posix_extended PCRE2_CONVERT_POSIX_EXTENDED |
| 1024 | unset Unset all options |
| 1025 | |
| 1026 | The "unset" value is useful for turning off a default that has been set |
| 1027 | by a #pattern command. When one of these options is set, the input pat- |
| 1028 | tern is passed to pcre2_pattern_convert(). If the conversion is suc- |
| 1029 | cessful, the result is reflected in the output and then passed to |
| 1030 | pcre2_compile(). The normal utf and no_utf_check options, if set, cause |
| 1031 | the PCRE2_CONVERT_UTF and PCRE2_CONVERT_NO_UTF_CHECK options to be |
| 1032 | passed to pcre2_pattern_convert(). |
| 1033 | |
| 1034 | By default, the conversion function is allowed to allocate a buffer for |
| 1035 | its output. However, if the convert_length modifier is set to a value |
| 1036 | greater than zero, pcre2test passes a buffer of the given length. This |
| 1037 | makes it possible to test the length check. |
| 1038 | |
| 1039 | The convert_glob_escape and convert_glob_separator modifiers can be |
| 1040 | used to specify the escape and separator characters for glob process- |
| 1041 | ing, overriding the defaults, which are operating-system dependent. |
| 1042 | |
| 1043 | |
| 1044 | SUBJECT MODIFIERS |
| 1045 | |
| 1046 | The modifiers that can appear in subject lines and the #subject command |
| 1047 | are of two types. |
| 1048 | |
| 1049 | Setting match options |
| 1050 | |
| 1051 | The following modifiers set options for pcre2_match() or |
| 1052 | pcre2_dfa_match(). See pcre2api for a description of their effects. |
| 1053 | |
| 1054 | anchored set PCRE2_ANCHORED |
| 1055 | endanchored set PCRE2_ENDANCHORED |
| 1056 | dfa_restart set PCRE2_DFA_RESTART |
| 1057 | dfa_shortest set PCRE2_DFA_SHORTEST |
| 1058 | no_jit set PCRE2_NO_JIT |
| 1059 | no_utf_check set PCRE2_NO_UTF_CHECK |
| 1060 | notbol set PCRE2_NOTBOL |
| 1061 | notempty set PCRE2_NOTEMPTY |
| 1062 | notempty_atstart set PCRE2_NOTEMPTY_ATSTART |
| 1063 | noteol set PCRE2_NOTEOL |
| 1064 | partial_hard (or ph) set PCRE2_PARTIAL_HARD |
| 1065 | partial_soft (or ps) set PCRE2_PARTIAL_SOFT |
| 1066 | |
| 1067 | The partial matching modifiers are provided with abbreviations because |
| 1068 | they appear frequently in tests. |
| 1069 | |
| 1070 | If the posix or posix_nosub modifier was present on the pattern, caus- |
| 1071 | ing the POSIX wrapper API to be used, the only option-setting modifiers |
| 1072 | that have any effect are notbol, notempty, and noteol, causing REG_NOT- |
| 1073 | BOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be passed to |
| 1074 | regexec(). The other modifiers are ignored, with a warning message. |
| 1075 | |
| 1076 | There is one additional modifier that can be used with the POSIX wrap- |
| 1077 | per. It is ignored (with a warning) if used for non-POSIX matching. |
| 1078 | |
| 1079 | posix_startend=<n>[:<m>] |
| 1080 | |
| 1081 | This causes the subject string to be passed to regexec() using the |
| 1082 | REG_STARTEND option, which uses offsets to specify which part of the |
| 1083 | string is searched. If only one number is given, the end offset is |
| 1084 | passed as the end of the subject string. For more detail of REG_STAR- |
| 1085 | TEND, see the pcre2posix documentation. If the subject string contains |
| 1086 | binary zeros (coded as escapes such as \x{00} because pcre2test does |
| 1087 | not support actual binary zeros in its input), you must use posix_star- |
| 1088 | tend to specify its length. |
| 1089 | |
| 1090 | Setting match controls |
| 1091 | |
| 1092 | The following modifiers affect the matching process or request addi- |
| 1093 | tional information. Some of them may also be specified on a pattern |
| 1094 | line (see above), in which case they apply to every subject line that |
| 1095 | is matched against that pattern, but can be overridden by modifiers on |
| 1096 | the subject. |
| 1097 | |
| 1098 | aftertext show text after match |
| 1099 | allaftertext show text after captures |
| 1100 | allcaptures show all captures |
| 1101 | allvector show the entire ovector |
| 1102 | allusedtext show all consulted text (non-JIT only) |
| 1103 | altglobal alternative global matching |
| 1104 | callout_capture show captures at callout time |
| 1105 | callout_data=<n> set a value to pass via callouts |
| 1106 | callout_error=<n>[:<m>] control callout error |
| 1107 | callout_extra show extra callout information |
| 1108 | callout_fail=<n>[:<m>] control callout failure |
| 1109 | callout_no_where do not show position of a callout |
| 1110 | callout_none do not supply a callout function |
| 1111 | copy=<number or name> copy captured substring |
| 1112 | depth_limit=<n> set a depth limit |
| 1113 | dfa use pcre2_dfa_match() |
| 1114 | find_limits find match and depth limits |
| 1115 | get=<number or name> extract captured substring |
| 1116 | getall extract all captured substrings |
| 1117 | /g global global matching |
| 1118 | heap_limit=<n> set a limit on heap memory (Kbytes) |
| 1119 | jitstack=<n> set size of JIT stack |
| 1120 | mark show mark values |
| 1121 | match_limit=<n> set a match limit |
| 1122 | memory show heap memory usage |
| 1123 | null_context match with a NULL context |
| 1124 | null_replacement substitute with NULL replacement |
| 1125 | null_subject match with NULL subject |
| 1126 | offset=<n> set starting offset |
| 1127 | offset_limit=<n> set offset limit |
| 1128 | ovector=<n> set size of output vector |
| 1129 | recursion_limit=<n> obsolete synonym for depth_limit |
| 1130 | replace=<string> specify a replacement string |
| 1131 | startchar show startchar when relevant |
| 1132 | startoffset=<n> same as offset=<n> |
| 1133 | substitute_callout use substitution callouts |
| 1134 | substitute_extended use PCRE2_SUBSTITUTE_EXTENDED |
| 1135 | substitute_literal use PCRE2_SUBSTITUTE_LITERAL |
| 1136 | substitute_matched use PCRE2_SUBSTITUTE_MATCHED |
| 1137 | substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH |
| 1138 | substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY |
| 1139 | substitute_skip=<n> skip substitution number n |
| 1140 | substitute_stop=<n> skip substitution number n and greater |
| 1141 | substitute_unknown_unset use PCRE2_SUBSTITUTE_UNKNOWN_UNSET |
| 1142 | substitute_unset_empty use PCRE2_SUBSTITUTE_UNSET_EMPTY |
| 1143 | zero_terminate pass the subject as zero-terminated |
| 1144 | |
| 1145 | The effects of these modifiers are described in the following sections. |
| 1146 | When matching via the POSIX wrapper API, the aftertext, allaftertext, |
| 1147 | and ovector subject modifiers work as described below. All other modi- |
| 1148 | fiers are either ignored, with a warning message, or cause an error. |
| 1149 | |
| 1150 | Showing more text |
| 1151 | |
| 1152 | The aftertext modifier requests that as well as outputting the part of |
| 1153 | the subject string that matched the entire pattern, pcre2test should in |
| 1154 | addition output the remainder of the subject string. This is useful for |
| 1155 | tests where the subject contains multiple copies of the same substring. |
| 1156 | The allaftertext modifier requests the same action for captured sub- |
| 1157 | strings as well as the main matched substring. In each case the remain- |
| 1158 | der is output on the following line with a plus character following the |
| 1159 | capture number. |
| 1160 | |
| 1161 | The allusedtext modifier requests that all the text that was consulted |
| 1162 | during a successful pattern match by the interpreter should be shown, |
| 1163 | for both full and partial matches. This feature is not supported for |
| 1164 | JIT matching, and if requested with JIT it is ignored (with a warning |
| 1165 | message). Setting this modifier affects the output if there is a look- |
| 1166 | behind at the start of a match, or, for a complete match, a lookahead |
| 1167 | at the end, or if \K is used in the pattern. Characters that precede or |
| 1168 | follow the start and end of the actual match are indicated in the out- |
| 1169 | put by '<' or '>' characters underneath them. Here is an example: |
| 1170 | |
| 1171 | re> /(?<=pqr)abc(?=xyz)/ |
| 1172 | data> 123pqrabcxyz456\=allusedtext |
| 1173 | 0: pqrabcxyz |
| 1174 | <<< >>> |
| 1175 | data> 123pqrabcxy\=ph,allusedtext |
| 1176 | Partial match: pqrabcxy |
| 1177 | <<< |
| 1178 | |
| 1179 | The first, complete match shows that the matched string is "abc", with |
| 1180 | the preceding and following strings "pqr" and "xyz" having been con- |
| 1181 | sulted during the match (when processing the assertions). The partial |
| 1182 | match can indicate only the preceding string. |
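The marker line in the example can be reproduced with a short Python sketch. The function used_text_display is hypothetical. It takes the four offsets as given inputs (leftmost consulted character, match start, match end, and end of consulted text) and does not compute them, since doing so requires the matching engine.

```python
def used_text_display(subject, used_start, match_start, match_end, used_end):
    """Return the consulted text and the marker line that goes under it."""
    text = subject[used_start:used_end]
    markers = ("<" * (match_start - used_start)    # consulted before the match
               + " " * (match_end - match_start)   # the match itself
               + ">" * (used_end - match_end))     # consulted after the match
    return text, markers.rstrip()


# Complete match of "abc" in the first subject line of the example.
print(used_text_display("123pqrabcxyz456", 3, 6, 9, 12))

# Partial match "abcxy" in the second subject line.
print(used_text_display("123pqrabcxy", 3, 6, 11, 11))
```

For the complete match the marker line is "<<<   >>>"; for the partial match it is just "<<<", because nothing after the partial match was consulted.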
| 1183 | |
| 1184 | The startchar modifier requests that the starting character for the |
| 1185 | match be indicated, if it is different to the start of the matched |
| 1186 | string. The only time when this occurs is when \K has been processed as |
| 1187 | part of the match. In this situation, the output for the matched string |
| 1188 | is displayed from the starting character instead of from the match |
| 1189 | point, with circumflex characters under the earlier characters. For ex- |
| 1190 | ample: |
| 1191 | |
| 1192 | re> /abc\Kxyz/ |
| 1193 | data> abcxyz\=startchar |
| 1194 | 0: abcxyz |
| 1195 | ^^^ |
| 1196 | |
| 1197 | Unlike allusedtext, the startchar modifier can be used with JIT. How- |
| 1198 | ever, these two modifiers are mutually exclusive. |
| 1199 | |
| 1200 | Showing the value of all capture groups |
| 1201 | |
| 1202 | The allcaptures modifier requests that the values of all potential cap- |
| 1203 | tured parentheses be output after a match. By default, only those up to |
| 1204 | the highest one actually used in the match are output (corresponding to |
| 1205 | the return code from pcre2_match()). Groups that did not take part in |
| 1206 | the match are output as "<unset>". This modifier is not relevant for |
| 1207 | DFA matching (which does no capturing) and does not apply when replace |
| 1208 | is specified; it is ignored, with a warning message, if present. |
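The difference between the default output and allcaptures can be illustrated with Python's re module standing in for PCRE2. The function show_captures is hypothetical, and the line format only approximates what pcre2test prints.

```python
import re


def show_captures(pattern, subject, allcaptures=False):
    """List capture values, optionally for every group in the pattern."""
    m = re.search(pattern, subject)
    if m is None:
        return ["No match"]
    groups = m.groups()
    # Highest-numbered group that actually took part in the match.
    highest = max((i for i, g in enumerate(groups, 1) if g is not None),
                  default=0)
    last = len(groups) if allcaptures else highest
    lines = ["0: " + m.group(0)]
    for i in range(1, last + 1):
        value = groups[i - 1]
        lines.append("%d: %s" % (i, "<unset>" if value is None else value))
    return lines


print(show_captures(r"(a)|(b)|(c)", "b"))
print(show_captures(r"(a)|(b)|(c)", "b", allcaptures=True))
```

By default the listing stops at group 2, the highest group that was set; with allcaptures, group 3 is also listed and shown as "<unset>".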
| 1209 | |
| 1210 | Showing the entire ovector, for all outcomes |
| 1211 | |
| 1212 | The allvector modifier requests that the entire ovector be shown, what- |
| 1213 | ever the outcome of the match. Compare allcaptures, which shows only up |
| 1214 | to the maximum number of capture groups for the pattern, and then only |
| 1215 | for a successful complete non-DFA match. This modifier, which acts af- |
| 1216 | ter any match result, and also for DFA matching, provides a means of |
| 1217 | checking that there are no unexpected modifications to ovector fields. |
| 1218 | Before each match attempt, the ovector is filled with a special value, |
| 1219 | and if this is found in both elements of a capturing pair, "<un- |
| 1220 | changed>" is output. After a successful match, this applies to all |
| 1221 | groups after the maximum capture group for the pattern. In other cases |
| 1222 | it applies to the entire ovector. After a partial match, the first two |
| 1223 | elements are the only ones that should be set. After a DFA match, the |
| 1224 | amount of ovector that is used depends on the number of matches that |
| 1225 | were found. |
| 1226 | |
| 1227 | Testing pattern callouts |
| 1228 | |
| 1229 | A callout function is supplied when pcre2test calls the library match- |
| 1230 | ing functions, unless callout_none is specified. Its behaviour can be |
| 1231 | controlled by various modifiers listed above whose names begin with |
| 1232 | callout_. Details are given in the section entitled "Callouts" below. |
| 1233 | Testing callouts from pcre2_substitute() is described separately in |
| 1234 | "Testing the substitution function" below. |
| 1235 | |
| 1236 | Finding all matches in a string |
| 1237 | |
| 1238 | Searching for all possible matches within a subject can be requested by |
| 1239 | the global or altglobal modifier. After finding a match, the matching |
| 1240 | function is called again to search the remainder of the subject. The |
| 1241 | difference between global and altglobal is that the former uses the |
| 1242 | start_offset argument to pcre2_match() or pcre2_dfa_match() to start |
| 1243 | searching at a new point within the entire string (which is what Perl |
| 1244 | does), whereas the latter passes over a shortened subject. This makes a |
| 1245 | difference to the matching process if the pattern begins with a lookbe- |
| 1246 | hind assertion (including \b or \B). |
| 1247 | |
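The difference between the two approaches can be illustrated outside PCRE2. The following is a minimal sketch using Python's re module, which is an assumption made purely for illustration; pcre2test itself calls pcre2_match() or pcre2_dfa_match(). The helper names find_all_with_offset and find_all_shortened are hypothetical. Passing a start position keeps the earlier characters visible to \B, whereas slicing the subject hides them.

```python
import re

def find_all_with_offset(pattern, subject):
    """Like 'global': keep the whole subject and move the start offset."""
    results, pos = [], 0
    while True:
        m = pattern.search(subject, pos)
        if m is None:
            return results
        results.append(m.group(0))
        # This sketch assumes non-empty matches; empty ones need extra care.
        pos = m.end()

def find_all_shortened(pattern, subject):
    """Like 'altglobal': pass only the remainder of the subject each time."""
    results = []
    while True:
        m = pattern.search(subject)
        if m is None:
            return results
        results.append(m.group(0))
        # Earlier characters are discarded, so \B can no longer see them.
        subject = subject[m.end():]

pattern = re.compile(r"\Bi(\w\w)")
print(find_all_with_offset(pattern, "Mississippi"))
print(find_all_shortened(pattern, "Mississippi"))
```

With this pattern and subject, the offset-based loop finds "iss", "iss", and "ipp", while the shortened-subject loop misses the second "iss" because \B cannot match at the start of the shortened string.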
| 1248 | If an empty string is matched, the next match is done with the |
| 1249 | PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search |
| 1250 | for another, non-empty, match at the same point in the subject. If this |
| 1251 | match fails, the start offset is advanced, and the normal match is re- |
| 1252 | tried. This imitates the way Perl handles such cases when using the /g |
| 1253 | modifier or the split() function. Normally, the start offset is ad- |
| 1254 | vanced by one character, but if the newline convention recognizes CRLF |
| 1255 | as a newline, and the current character is CR followed by LF, an ad- |
| 1256 | vance of two characters occurs. |
| 1257 | |
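The rule for advancing the start offset after a failed retry can be sketched as follows. This is an illustration only, not pcre2test's code: the function name advance_after_empty_match and its parameters are hypothetical, and the subject is treated as a Python string in which one index is one character.

```python
def advance_after_empty_match(subject, offset, crlf_is_newline):
    """Return the next start offset after the retry at `offset` has failed."""
    if crlf_is_newline and subject[offset:offset + 2] == "\r\n":
        return offset + 2  # step over CR LF as a single newline
    return offset + 1      # otherwise step over one character

print(advance_after_empty_match("a\r\nb", 1, crlf_is_newline=True))
print(advance_after_empty_match("a\r\nb", 1, crlf_is_newline=False))
```

The first call steps over both CR and LF; the second steps over CR only, because CRLF is not recognized as a newline in that case.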
| 1258 | Testing substring extraction functions |
| 1259 | |
| 1260 | The copy and get modifiers can be used to test the pcre2_sub- |
| 1261 | string_copy_xxx() and pcre2_substring_get_xxx() functions. They can be |
| 1262 | given more than once, and each can specify a capture group name or num- |
| 1263 | ber, for example: |
| 1264 | |
| 1265 | abcd\=copy=1,copy=3,get=G1 |
| 1266 | |
| 1267 | If the #subject command is used to set default copy and/or get lists, |
| 1268 | these can be unset by specifying a negative number to cancel all num- |
| 1269 | bered groups and an empty name to cancel all named groups. |
| 1270 | |
| 1271 | The getall modifier tests pcre2_substring_list_get(), which extracts |
| 1272 | all captured substrings. |
| 1273 | |
| 1274 | If the subject line is successfully matched, the substrings extracted |
| 1275 | by the convenience functions are output with C, G, or L after the |
| 1276 | string number instead of a colon. This is in addition to the normal |
| 1277 | full list. The string length (that is, the return from the extraction |
| 1278 | function) is given in parentheses after each substring, followed by the |
| 1279 | name when the extraction was by name. |
| 1280 | |
| 1281 | Testing the substitution function |
| 1282 | |
| 1283 | If the replace modifier is set, the pcre2_substitute() function is |
| 1284 | called instead of one of the matching functions (or after one call of |
| 1285 | pcre2_match() in the case of PCRE2_SUBSTITUTE_MATCHED). Note that re- |
| 1286 | placement strings cannot contain commas, because a comma signifies the |
| 1287 | end of a modifier. This is not thought to be an issue in a test pro- |
| 1288 | gram. |
| 1289 | |
| 1290 | Specifying a completely empty replacement string disables this modi- |
| 1291 | fier. However, it is possible to specify an empty replacement by pro- |
| 1292 | viding a buffer length, as described below, for an otherwise empty re- |
| 1293 | placement. |
| 1294 | |
| 1295 | Unlike subject strings, pcre2test does not process replacement strings |
| 1296 | for escape sequences. In UTF mode, a replacement string is checked to |
| 1297 | see if it is a valid UTF-8 string. If so, it is correctly converted to |
| 1298 | a UTF string of the appropriate code unit width. If it is not a valid |
| 1299 | UTF-8 string, the individual code units are copied directly. This pro- |
| 1300 | vides a means of passing an invalid UTF-8 string for testing purposes. |
| 1301 | |
| 1302 | The following modifiers set options (in addition to the normal match |
| 1303 | options) for pcre2_substitute(): |
| 1304 | |
| 1305 | global PCRE2_SUBSTITUTE_GLOBAL |
| 1306 | substitute_extended PCRE2_SUBSTITUTE_EXTENDED |
| 1307 | substitute_literal PCRE2_SUBSTITUTE_LITERAL |
| 1308 | substitute_matched PCRE2_SUBSTITUTE_MATCHED |
| 1309 | substitute_overflow_length PCRE2_SUBSTITUTE_OVERFLOW_LENGTH |
| 1310 | substitute_replacement_only PCRE2_SUBSTITUTE_REPLACEMENT_ONLY |
| 1311 | substitute_unknown_unset PCRE2_SUBSTITUTE_UNKNOWN_UNSET |
| 1312 | substitute_unset_empty PCRE2_SUBSTITUTE_UNSET_EMPTY |
| 1313 | |
| 1314 | See the pcre2api documentation for details of these options. |
| 1315 | |
| 1316 | After a successful substitution, the modified string is output, pre- |
| 1317 | ceded by the number of replacements. This may be zero if there were no |
| 1318 | matches. Here is a simple example of a substitution test: |
| 1319 | |
| 1320 | /abc/replace=xxx |
| 1321 | =abc=abc= |
| 1322 | 1: =xxx=abc= |
| 1323 | =abc=abc=\=global |
| 1324 | 2: =xxx=xxx= |
| 1325 | |
| 1326 | Subject and replacement strings should be kept relatively short (fewer |
| 1327 | than 256 characters) for substitution tests, as fixed-size buffers are |
| 1328 | used. To make it easy to test for buffer overflow, if the replacement |
| 1329 | string starts with a number in square brackets, that number is passed |
| 1330 | to pcre2_substitute() as the size of the output buffer, with the re- |
| 1331 | placement string starting at the next character. Here is an example |
| 1332 | that tests the edge case: |
| 1333 | |
| 1334 | /abc/ |
| 1335 | 123abc123\=replace=[10]XYZ |
| 1336 | 1: 123XYZ123 |
| 1337 | 123abc123\=replace=[9]XYZ |
| 1338 | Failed: error -47: no more memory |
| 1339 | |
| 1340 | The default action of pcre2_substitute() is to return PCRE2_ER- |
| 1341 | ROR_NOMEMORY when the output buffer is too small. However, if the |
| 1342 | PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the substi- |
| 1343 | tute_overflow_length modifier), pcre2_substitute() continues to go |
| 1344 | through the motions of matching and substituting (but not doing any |
| 1345 | callouts), in order to compute the size of buffer that is required. |
| 1346 | When this happens, pcre2test shows the required buffer length (which |
| 1347 | includes space for the trailing zero) as part of the error message. For |
| 1348 | example: |
| 1349 | |
| 1350 | /abc/substitute_overflow_length |
| 1351 | 123abc123\=replace=[9]XYZ |
| 1352 | Failed: error -47: no more memory: 10 code units are needed |
| 1353 | |
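The arithmetic behind the two examples above can be checked with a short sketch. It uses Python's re.sub as a stand-in for pcre2_substitute() and assumes ASCII text, so that one character is one code unit; the function names required_buffer_length and fits are hypothetical.

```python
import re

def required_buffer_length(pattern, replacement, subject, count=1):
    """Code units in the result, plus one for the trailing zero."""
    result = re.sub(pattern, replacement, subject, count=count)
    return len(result) + 1

def fits(pattern, replacement, subject, buffer_size):
    """True if the substituted string and its terminator fit the buffer."""
    return required_buffer_length(pattern, replacement, subject) <= buffer_size

print(required_buffer_length("abc", "XYZ", "123abc123"))
print(fits("abc", "XYZ", "123abc123", 10))
print(fits("abc", "XYZ", "123abc123", 9))
```

The result "123XYZ123" is 9 code units long, so 10 are needed once the trailing zero is included. This is why a buffer of [10] succeeds and [9] fails.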
| 1354 | A replacement string is ignored with POSIX and DFA matching. Specifying |
| 1355 | partial matching provokes an error return ("bad option value") from |
| 1356 | pcre2_substitute(). |
| 1357 | |
| 1358 | Testing substitute callouts |
| 1359 | |
| 1360 | If the substitute_callout modifier is set, a substitution callout func- |
| 1361 | tion is set up. The null_context modifier must not be set, because the |
| 1362 | address of the callout function is passed in a match context. When the |
| 1363 | callout function is called (after each substitution), details of the |
| 1364 | input and output strings are output. For example: |
| 1365 | |
| 1366 | /abc/g,replace=<$0>,substitute_callout |
| 1367 | abcdefabcpqr |
| 1368 | 1(1) Old 0 3 "abc" New 0 5 "<abc>" |
| 1369 | 2(1) Old 6 9 "abc" New 8 13 "<abc>" |
| 1370 | 2: <abc>def<abc>pqr |
| 1371 | |
| 1372 | The first number on each callout line is the count of matches. The |
| 1373 | parenthesized number is the number of pairs that are set in the ovector |
| 1374 | (that is, one more than the number of capturing groups that were set). |
| 1375 | These are followed by the offsets of the old substring, its contents, |
| 1376 | and the same for the replacement. |
| 1377 | |
| 1378 | By default, the substitution callout function returns zero, which ac- |
| 1379 | cepts the replacement and causes matching to continue if /g was used. |
| 1380 | Two further modifiers can be used to test other return values. If sub- |
| 1381 | stitute_skip is set to a value greater than zero the callout function |
| 1382 | returns +1 for the match of that number, and similarly substitute_stop |
| 1383 | returns -1. These cause the replacement to be rejected, and -1 causes |
| 1384 | no further matching to take place. If either of them is set, |
| 1385 | substitute_callout is assumed. For example: |
| 1386 | |
| 1387 | /abc/g,replace=<$0>,substitute_skip=1 |
| 1388 | abcdefabcpqr |
| 1389 | 1(1) Old 0 3 "abc" New 0 5 "<abc> SKIPPED" |
| 1390 | 2(1) Old 6 9 "abc" New 6 11 "<abc>" |
| 1391 | 2: abcdef<abc>pqr |
| 1392 | abcdefabcpqr\=substitute_stop=1 |
| 1393 | 1(1) Old 0 3 "abc" New 0 5 "<abc> STOPPED" |
| 1394 | 1: abcdefabcpqr |
| 1395 | |
| 1396 | If both are set for the same number, stop takes precedence. Only a sin- |
| 1397 | gle skip or stop is supported, which is sufficient for testing that the |
| 1398 | feature works. |
| 1399 | |
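The effect of the three callout return values can be modelled with a small sketch. It is written in Python with the re module rather than against pcre2_substitute(), and the names substitute_all, make_callout, ACCEPT, SKIP, and STOP are hypothetical. The returned count includes skipped and stopped matches, which mirrors the counts shown in the examples above.

```python
import re

ACCEPT, SKIP, STOP = 0, 1, -1

def make_callout(skip=0, stop=0):
    """Build a callout that skips or stops at the given match number."""
    def callout(number, match, new_text):
        if number == stop:
            return STOP      # stop takes precedence over skip
        if number == skip:
            return SKIP
        return ACCEPT
    return callout

def substitute_all(pattern, subject, replacement_for, callout):
    pieces = []
    copied_to = 0            # subject offset up to which text was emitted
    matches = 0
    for m in re.finditer(pattern, subject):
        matches += 1
        new_text = replacement_for(m)
        verdict = callout(matches, m, new_text)
        if verdict == STOP:
            break            # reject this replacement and stop matching
        if verdict == SKIP:
            continue         # reject this replacement, keep the old text
        pieces.append(subject[copied_to:m.start()])
        pieces.append(new_text)
        copied_to = m.end()
    pieces.append(subject[copied_to:])
    return matches, "".join(pieces)

def wrap(m):
    return "<" + m.group(0) + ">"

print(substitute_all("abc", "abcdefabcpqr", wrap, make_callout()))
print(substitute_all("abc", "abcdefabcpqr", wrap, make_callout(skip=1)))
print(substitute_all("abc", "abcdefabcpqr", wrap, make_callout(stop=1)))
```

The three calls give the same counts and result strings as the pcre2test examples above: 2 and "<abc>def<abc>pqr", 2 and "abcdef<abc>pqr", and 1 and "abcdefabcpqr".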
| 1400 | Setting the JIT stack size |
| 1401 | |
| 1402 | The jitstack modifier provides a way of setting the maximum stack size |
| 1403 | that is used by the just-in-time optimization code. It is ignored if |
| 1404 | JIT optimization is not being used. The value is a number of kibibytes |
| 1405 | (units of 1024 bytes). Setting zero reverts to the default of 32KiB. |
| 1406 | Providing a stack that is larger than the default is necessary only for |
| 1407 | very complicated patterns. If jitstack is set non-zero on a subject |
| 1408 | line it overrides any value that was set on the pattern. |
| 1409 | |
| 1410 | Setting heap, match, and depth limits |
| 1411 | |
| 1412 | The heap_limit, match_limit, and depth_limit modifiers set the appro- |
| 1413 | priate limits in the match context. These values are ignored when the |
| 1414 | find_limits modifier is specified. |
| 1415 | |
| 1416 | Finding minimum limits |
| 1417 | |
| 1418 | If the find_limits modifier is present on a subject line, pcre2test |
| 1419 | calls the relevant matching function several times, setting different |
| 1420 | values in the match context via pcre2_set_heap_limit(), |
| 1421 | pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the |
| 1422 | minimum value for each parameter that allows the match to complete |
| 1423 | without error. If JIT is being used, only the match limit is relevant. |
| 1424 | |
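One way to find such a minimum is a binary search, sketched below. The search strategy that pcre2test actually uses is not described here, so this is an assumption; the function find_minimum_limit and the callback match_succeeds are hypothetical, and the sketch relies on success being monotonic (if a limit works, every larger limit works too).

```python
def find_minimum_limit(match_succeeds, upper):
    """Smallest limit in 0..upper for which match_succeeds(limit) is true.

    Returns None if even the upper bound is not enough.
    """
    if not match_succeeds(upper):
        return None
    low, high = 0, upper
    while low < high:
        mid = (low + high) // 2
        if match_succeeds(mid):
            high = mid       # mid is sufficient; try smaller values
        else:
            low = mid + 1    # mid is too small
    return low

# Stand-in for a real match attempt that needs a limit of at least 7.
print(find_minimum_limit(lambda limit: limit >= 7, 1000))
```

In a real test each call to match_succeeds would set the limit in the match context and rerun the match, which is why a search that needs few attempts is preferable to counting upwards from zero.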
| 1425 | When using this modifier, the pattern should not contain any limit set- |
| 1426 | tings such as (*LIMIT_MATCH=...) within it. If such a setting is |
| 1427 | present and is lower than the minimum matching value, the minimum value |
| 1428 | cannot be found because pcre2_set_match_limit() etc. are only able to |
| 1429 | reduce the value of an in-pattern limit; they cannot increase it. |
| 1430 | |
| 1431 | For non-DFA matching, the minimum depth_limit number is a measure of |
| 1432 | how much nested backtracking happens (that is, how deeply the pattern's |
| 1433 | tree is searched). In the case of DFA matching, depth_limit controls |
| 1434 | the depth of recursive calls of the internal function that is used for |
| 1435 | handling pattern recursion, lookaround assertions, and atomic groups. |
| 1436 | |
| 1437 | For non-DFA matching, the match_limit number is a measure of the amount |
| 1438 | of backtracking that takes place, and learning the minimum value can be |
| 1439 | instructive. For most simple matches, the number is quite small, but |
| 1440 | for patterns with very large numbers of matching possibilities, it can |
| 1441 | become large very quickly with increasing length of subject string. In |
| 1442 | the case of DFA matching, match_limit controls the total number of |
| 1443 | calls, both recursive and non-recursive, to the internal matching func- |
| 1444 | tion, thus controlling the overall amount of computing resource that is |
| 1445 | used. |
| 1446 | |
| 1447 | For both kinds of matching, the heap_limit number, which is in |
| 1448 | kibibytes (units of 1024 bytes), limits the amount of heap memory used |
| 1449 | for matching. A value of zero disables the use of any heap memory; many |
| 1450 | simple pattern matches can be done without using the heap, so zero is |
| 1451 | not an unreasonable setting. |
| 1452 | |
| 1453 | Showing MARK names |
| 1454 | |
| 1456 | The mark modifier causes the names from backtracking control verbs that |
| 1457 | are returned from calls to pcre2_match() to be displayed. If a mark is |
| 1458 | returned for a match, non-match, or partial match, pcre2test shows it. |
| 1459 | For a match, it is on a line by itself, tagged with "MK:". Otherwise, |
| 1460 | it is added to the non-match message. |
| 1461 | |
| 1462 | Showing memory usage |
| 1463 | |
| 1464 | The memory modifier causes pcre2test to log the sizes of all heap mem- |
| 1465 | ory allocation and freeing calls that occur during a call to |
| 1466 | pcre2_match() or pcre2_dfa_match(). These occur only when a match re- |
| 1467 | quires a bigger vector than the default for remembering backtracking |
| 1468 | points (pcre2_match()) or for internal workspace (pcre2_dfa_match()). |
| 1469 | In many cases there will be no heap memory used and therefore no addi- |
| 1470 | tional output. No heap memory is allocated during matching with JIT, so |
| 1471 | in that case the memory modifier never has any effect. For this modi- |
| 1472 | fier to work, the null_context modifier must not be set on both the |
| 1473 | pattern and the subject, though it can be set on one or the other. |
| 1474 | |
| 1475 | Setting a starting offset |
| 1476 | |
| 1477 | The offset modifier sets an offset in the subject string at which |
| 1478 | matching starts. Its value is a number of code units, not characters. |
| 1479 | |
| 1480 | Setting an offset limit |
| 1481 | |
| 1482 | The offset_limit modifier sets a limit for unanchored matches. If a |
| 1483 | match cannot be found starting at or before this offset in the subject, |
| 1484 | a "no match" return is given. The data value is a number of code units, |
| 1485 | not characters. When this modifier is used, the use_offset_limit modi- |
| 1486 | fier must have been set for the pattern; if not, an error is generated. |
| 1487 | |
| 1488 | Setting the size of the output vector |
| 1489 | |
| 1490 | The ovector modifier applies only to the subject line in which it ap- |
| 1491 | pears, though of course it can also be used to set a default in a #sub- |
| 1492 | ject command. It specifies the number of pairs of offsets that are |
| 1493 | available for storing matching information. The default is 15. |
| 1494 | |
| 1495 | A value of zero is useful when testing the POSIX API because it causes |
| 1496 | regexec() to be called with a NULL capture vector. When not testing the |
| 1497 | POSIX API, a value of zero is used to cause pcre2_match_data_cre- |
| 1498 | ate_from_pattern() to be called, in order to create a match block of |
| 1499 | exactly the right size for the pattern. (It is not possible to create a |
| 1500 | match block with a zero-length ovector; there is always at least one |
| 1501 | pair of offsets.) |
| 1502 | |
| 1503 | Passing the subject as zero-terminated |
| 1504 | |
| 1505 | By default, the subject string is passed to a native API matching func- |
| 1506 | tion with its correct length. In order to test the facility for passing |
| 1507 | a zero-terminated string, the zero_terminate modifier is provided. It |
| 1508 | causes the length to be passed as PCRE2_ZERO_TERMINATED. When matching |
| 1509 | via the POSIX interface, this modifier is ignored, with a warning. |
| 1510 | |
| 1511 | When testing pcre2_substitute(), this modifier also has the effect of |
| 1512 | passing the replacement string as zero-terminated. |
| 1513 | |
| 1514 | Passing a NULL context, subject, or replacement |
| 1515 | |
| 1516 | Normally, pcre2test passes a context block to pcre2_match(), |
| 1517 | pcre2_dfa_match(), pcre2_jit_match() or pcre2_substitute(). If the |
| 1518 | null_context modifier is set, however, NULL is passed. This is for |
| 1519 | testing that the matching and substitution functions behave correctly |
| 1520 | in this case (they use default values). This modifier cannot be used |
| 1521 | with the find_limits or substitute_callout modifiers. |
| 1522 | |
| 1523 | Similarly, for testing purposes, if the null_subject or |
| 1524 | null_replacement modifier is set, the subject or replacement string |
| 1525 | pointers are passed as NULL, respectively, to the relevant functions. |
| 1526 | |
| 1527 | |
| 1528 | THE ALTERNATIVE MATCHING FUNCTION |
| 1529 | |
| 1530 | By default, pcre2test uses the standard PCRE2 matching function, |
| 1531 | pcre2_match(), to match each subject line. PCRE2 also supports an |
| 1532 | alternative matching function, pcre2_dfa_match(), which operates in a |
| 1533 | different way, and has some restrictions. The differences between the |
| 1534 | two functions are described in the pcre2matching documentation. |
| 1535 | |
| 1536 | If the dfa modifier is set, the alternative matching function is used. |
| 1537 | This function finds all possible matches at a given point in the sub- |
| 1538 | ject. If, however, the dfa_shortest modifier is set, processing stops |
| 1539 | after the first match is found. This is always the shortest possible |
| 1540 | match. |
| 1541 | |
| 1542 | |
| 1543 | DEFAULT OUTPUT FROM pcre2test |
| 1544 | |
| 1545 | This section describes the output when the normal matching function, |
| 1546 | pcre2_match(), is being used. |
| 1547 | |
| 1548 | When a match succeeds, pcre2test outputs the list of captured sub- |
| 1549 | strings, starting with number 0 for the string that matched the whole |
| 1550 | pattern. Otherwise, it outputs "No match" when the return is PCRE2_ER- |
| 1551 | ROR_NOMATCH, or "Partial match:" followed by the partially matching |
| 1552 | substring when the return is PCRE2_ERROR_PARTIAL. (Note that this is |
| 1553 | the entire substring that was inspected during the partial match; it |
| 1554 | may include characters before the actual match start if a lookbehind |
| 1555 | assertion, \K, \b, or \B was involved.) |
| 1556 | |
| 1557 | For any other return, pcre2test outputs the PCRE2 negative error number |
| 1558 | and a short descriptive phrase. If the error is a failed UTF string |
| 1559 | check, the code unit offset of the start of the failing character is |
| 1560 | also output. Here is an example of an interactive pcre2test run. |
| 1561 | |
| 1562 | $ pcre2test |
| 1563 | PCRE2 version 10.22 2016-07-29 |
| 1564 | |
| 1565 | re> /^abc(\d+)/ |
| 1566 | data> abc123 |
| 1567 | 0: abc123 |
| 1568 | 1: 123 |
| 1569 | data> xyz |
| 1570 | No match |
| 1571 | |
| 1572 | Unset capturing substrings that are not followed by one that is set are |
| 1573 | not shown by pcre2test unless the allcaptures modifier is specified. In |
| 1574 | the following example, there are two capturing substrings, but when the |
| 1575 | first data line is matched, the second, unset substring is not shown. |
| 1576 | An "internal" unset substring is shown as "<unset>", as for the second |
| 1577 | data line. |
| 1578 | |
| 1579 | re> /(a)|(b)/ |
| 1580 | data> a |
| 1581 | 0: a |
| 1582 | 1: a |
| 1583 | data> b |
| 1584 | 0: b |
| 1585 | 1: <unset> |
| 1586 | 2: b |
| 1587 | |
| 1588 | If the strings contain any non-printing characters, they are output as |
| 1589 | \xhh escapes if the value is less than 256 and UTF mode is not set. |
| 1590 | Otherwise they are output as \x{hh...} escapes. See below for the defi- |
| 1591 | nition of non-printing characters. If the aftertext modifier is set, |
| 1592 | the output for substring 0 is followed by the rest of the subject |
| 1593 | string, identified by "0+" like this: |
| 1594 | |
| 1595 | re> /cat/aftertext |
| 1596 | data> cataract |
| 1597 | 0: cat |
| 1598 | 0+ aract |
| 1599 | |
| 1600 | If global matching is requested, the results of successive matching at- |
| 1601 | tempts are output in sequence, like this: |
| 1602 | |
| 1603 | re> /\Bi(\w\w)/g |
| 1604 | data> Mississippi |
| 1605 | 0: iss |
| 1606 | 1: ss |
| 1607 | 0: iss |
| 1608 | 1: ss |
| 1609 | 0: ipp |
| 1610 | 1: pp |
| 1611 | |
| 1612 | "No match" is output only if the first match attempt fails. Here is an |
| 1613 | example of a failure message (the offset 4 that is specified by the |
| 1614 | offset modifier is past the end of the subject string): |
| 1615 | |
| 1616 | re> /xyz/ |
| 1617 | data> xyz\=offset=4 |
| 1618 | Error -24 (bad offset value) |
| 1619 | |
| 1620 | Note that whereas patterns can be continued over several lines (a plain |
| 1621 | ">" prompt is used for continuations), subject lines may not. However, |
| 1622 | newlines can be included in a subject by means of the \n escape (or \r, |
| 1623 | \r\n, etc., depending on the newline sequence setting). |
| 1624 | |
| 1625 | |
| 1626 | OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION |
| 1627 | |
| 1628 | When the alternative matching function, pcre2_dfa_match(), is used, the |
| 1629 | output consists of a list of all the matches that start at the first |
| 1630 | point in the subject where there is at least one match. For example: |
| 1631 | |
| 1632 | re> /(tang|tangerine|tan)/ |
| 1633 | data> yellow tangerine\=dfa |
| 1634 | 0: tangerine |
| 1635 | 1: tang |
| 1636 | 2: tan |
| 1637 | |
| 1638 | Using the normal matching function on this data finds only "tang". The |
| 1639 | longest matching string is always given first (and numbered zero). Af- |
| 1640 | ter a PCRE2_ERROR_PARTIAL return, the output is "Partial match:", fol- |
| 1641 | lowed by the partially matching substring. Note that this is the entire |
| 1642 | substring that was inspected during the partial match; it may include |
| 1643 | characters before the actual match start if a lookbehind assertion, \b, |
| 1644 | or \B was involved. (\K is not supported for DFA matching.) |
| 1645 | |
| 1646 | If global matching is requested, the search for further matches resumes |
| 1647 | at the end of the longest match. For example: |
| 1648 | |
| 1649 | re> /(tang|tangerine|tan)/g |
| 1650 | data> yellow tangerine and tangy sultana\=dfa |
| 1651 | 0: tangerine |
| 1652 | 1: tang |
| 1653 | 2: tan |
| 1654 | 0: tang |
| 1655 | 1: tan |
| 1656 | 0: tan |
| 1657 | |
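The ordering and resumption rules shown in the two examples above can be reproduced with a small model. This is a sketch under strong assumptions, not the DFA algorithm: the pattern is reduced to a list of distinct, non-empty literal alternatives, and the function names all_matches_at_first_point and global_matches are hypothetical.

```python
def all_matches_at_first_point(alternatives, subject, start=0):
    """All alternatives matching at the first position where any matches."""
    for pos in range(start, len(subject)):
        found = [alt for alt in alternatives if subject.startswith(alt, pos)]
        if found:
            found.sort(key=len, reverse=True)   # longest match first
            return pos, found
    return None

def global_matches(alternatives, subject):
    """Repeat the search, resuming at the end of the longest match."""
    results, start = [], 0
    while True:
        hit = all_matches_at_first_point(alternatives, subject, start)
        if hit is None:
            return results
        pos, found = hit
        results.append(found)
        start = pos + len(found[0])

alts = ["tang", "tangerine", "tan"]
for group in global_matches(alts, "yellow tangerine and tangy sultana"):
    print(group)
```

The three groups printed correspond to the three blocks of output in the global example: tangerine, tang, tan; then tang, tan; then tan.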
| 1658 | The alternative matching function does not support substring capture, |
| 1659 | so the modifiers that are concerned with captured substrings are not |
| 1660 | relevant. |
| 1661 | |
| 1662 | |
| 1663 | RESTARTING AFTER A PARTIAL MATCH |
| 1664 | |
| 1665 | When the alternative matching function has given the PCRE2_ERROR_PAR- |
| 1666 | TIAL return, indicating that the subject partially matched the pattern, |
| 1667 | you can restart the match with additional subject data by means of the |
| 1668 | dfa_restart modifier. For example: |
| 1669 | |
| 1670 | re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
| 1671 | data> 23ja\=ps,dfa |
| 1672 | Partial match: 23ja |
| 1673 | data> n05\=dfa,dfa_restart |
| 1674 | 0: n05 |
| 1675 | |
| 1676 | For further information about partial matching, see the pcre2partial |
| 1677 | documentation. |
| 1678 | |
| 1679 | |
| 1680 | CALLOUTS |
| 1681 | |
| 1682 | If the pattern contains any callout requests, pcre2test's callout func- |
| 1683 | tion is called during matching unless callout_none is specified. This |
| 1684 | works with both matching functions, and with JIT, though there are some |
| 1685 | differences in behaviour. The output for callouts with numerical argu- |
| 1686 | ments and those with string arguments is slightly different. |
| 1687 | |
| 1688 | Callouts with numerical arguments |
| 1689 | |
| 1690 | By default, the callout function displays the callout number, the start |
| 1691 | and current positions in the subject text at the callout time, and the |
| 1692 | next pattern item to be tested. For example: |
| 1693 | |
| 1694 | --->pqrabcdef |
| 1695 |   0    ^  ^     \d |
| 1696 | |
| 1697 | This output indicates that callout number 0 occurred for a match at- |
| 1698 | tempt starting at the fourth character of the subject string, when the |
| 1699 | pointer was at the seventh character, and when the next pattern item |
| 1700 | was \d. Just one circumflex is output if the start and current posi- |
| 1701 | tions are the same, or if the current position precedes the start posi- |
| 1702 | tion, which can happen if the callout is in a lookbehind assertion. |
| 1703 | |
| 1704 | Callouts numbered 255 are assumed to be automatic callouts, inserted as |
| 1705 | a result of the auto_callout pattern modifier. In this case, instead of |
| 1706 | showing the callout number, the offset in the pattern, preceded by a |
| 1707 | plus, is output. For example: |
| 1708 | |
| 1709 | re> /\d?[A-E]\*/auto_callout |
| 1710 | data> E* |
| 1711 | --->E* |
| 1712 | +0 ^ \d? |
| 1713 | +3 ^ [A-E] |
| 1714 | +8 ^^ \* |
| 1715 | +10 ^ ^ |
| 1716 | 0: E* |
| 1717 | |
| 1718 | If a pattern contains (*MARK) items, an additional line is output when- |
| 1719 | ever a change of latest mark is passed to the callout function. For ex- |
| 1720 | ample: |
| 1721 | |
| 1722 | re> /a(*MARK:X)bc/auto_callout |
| 1723 | data> abc |
| 1724 | --->abc |
| 1725 | +0 ^ a |
| 1726 | +1 ^^ (*MARK:X) |
| 1727 | +10 ^^ b |
| 1728 | Latest Mark: X |
| 1729 | +11 ^ ^ c |
| 1730 | +12 ^ ^ |
| 1731 | 0: abc |
| 1732 | |
| 1733 | The mark changes between matching "a" and "b", but stays the same for |
| 1734 | the rest of the match, so nothing more is output. If, as a result of |
| 1735 | backtracking, the mark reverts to being unset, the text "<unset>" is |
| 1736 | output. |
| 1737 | |
| 1738 | Callouts with string arguments |
| 1739 | |
| 1740 | The output for a callout with a string argument is similar, except that |
| 1741 | instead of outputting a callout number before the position indicators, |
| 1742 | the callout string and its offset in the pattern string are output be- |
| 1743 | fore the reflection of the subject string, and the subject string is |
| 1744 | reflected for each callout. For example: |
| 1745 | |
| 1746 | re> /^ab(?C'first')cd(?C"second")ef/ |
| 1747 | data> abcdefg |
| 1748 | Callout (7): 'first' |
| 1749 | --->abcdefg |
| 1750 | ^ ^ c |
| 1751 | Callout (20): "second" |
| 1752 | --->abcdefg |
| 1753 | ^ ^ e |
| 1754 | 0: abcdef |
| 1755 | |
| 1756 | |
| 1757 | Callout modifiers |
| 1758 | |
| 1759 | The callout function in pcre2test returns zero (carry on matching) by |
| 1760 | default, but you can use a callout_fail modifier in a subject line to |
| 1761 | change this and other parameters of the callout (see below). |
| 1762 | |
| 1763 | If the callout_capture modifier is set, the current captured groups are |
| 1764 | output when a callout occurs. This is useful only for non-DFA matching, |
| 1765 | as pcre2_dfa_match() does not support capturing, so no captures are |
| 1766 | ever shown. |
| 1767 | |
| 1768 | The normal callout output, showing the callout number or pattern offset |
| 1769 | (as described above), is suppressed if the callout_no_where modifier is |
| 1770 | set. |
| 1771 | |
| 1772 | When using the interpretive matching function pcre2_match() without |
| 1773 | JIT, setting the callout_extra modifier causes additional output from |
| 1774 | pcre2test's callout function to be generated. For the first callout in |
| 1775 | a match attempt at a new starting position in the subject, "New match |
| 1776 | attempt" is output. If there has been a backtrack since the last call- |
| 1777 | out (or start of matching if this is the first callout), "Backtrack" is |
| 1778 | output, followed by "No other matching paths" if the backtrack ended |
| 1779 | the previous match attempt. For example: |
| 1780 | |
| 1781 | re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess |
| 1782 | data> aac\=callout_extra |
| 1783 | New match attempt |
| 1784 | --->aac |
| 1785 | +0 ^ ( |
| 1786 | +1 ^ a+ |
| 1787 | +3 ^ ^ ) |
| 1788 | +4 ^ ^ b |
| 1789 | Backtrack |
| 1790 | --->aac |
| 1791 | +3 ^^ ) |
| 1792 | +4 ^^ b |
| 1793 | Backtrack |
| 1794 | No other matching paths |
| 1795 | New match attempt |
| 1796 | --->aac |
| 1797 | +0 ^ ( |
| 1798 | +1 ^ a+ |
| 1799 | +3 ^^ ) |
| 1800 | +4 ^^ b |
| 1801 | Backtrack |
| 1802 | No other matching paths |
| 1803 | New match attempt |
| 1804 | --->aac |
| 1805 | +0 ^ ( |
| 1806 | +1 ^ a+ |
| 1807 | Backtrack |
| 1808 | No other matching paths |
| 1809 | New match attempt |
| 1810 | --->aac |
| 1811 | +0 ^ ( |
| 1812 | +1 ^ a+ |
| 1813 | No match |
| 1814 | |
| 1815 | Notice that various optimizations must be turned off if you want all |
| 1816 | possible matching paths to be scanned. If no_start_optimize is not |
| 1817 | used, there is an immediate "no match", without any callouts, because |
| 1818 | the starting optimization fails to find "b" in the subject, which it |
| 1819 | knows must be present for any match. If no_auto_possess is not used, |
| 1820 | the "a+" item is turned into "a++", which reduces the number of back- |
| 1821 | tracks. |
| 1822 | |
| 1823 | The callout_extra modifier has no effect if used with the DFA matching |
| 1824 | function, or with JIT. |
| 1825 | |
| 1826 | Return values from callouts |
| 1827 | |
| 1828 | The default return from the callout function is zero, which allows |
| 1829 | matching to continue. The callout_fail modifier can be given one or two |
| 1830 | numbers. If there is only one number, 1 is returned instead of 0 (caus- |
| 1831 | ing matching to backtrack) when a callout of that number is reached. If |
| 1832 | two numbers (<n>:<m>) are given, 1 is returned when callout <n> is |
| 1833 | reached and there have been at least <m> callouts. The callout_error |
| 1834 | modifier is similar, except that PCRE2_ERROR_CALLOUT is returned, caus- |
| 1835 | ing the entire matching process to be aborted. If both these modifiers |
| 1836 | are set for the same callout number, callout_error takes precedence. |
| 1837 | Note that callouts with string arguments are always given the number |
| 1838 | zero. |
| 1839 | |
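The decision described above can be summarized in a short sketch. It is not pcre2test's code: the names callout_return, triggered, CONTINUE, BACKTRACK, and ABORT are hypothetical, ABORT is a symbolic stand-in for PCRE2_ERROR_CALLOUT rather than its numeric value, and `count` is assumed to be the number of callouts so far including the current one. A modifier given as a single number n is represented as the pair (n, 0).

```python
CONTINUE = 0                    # carry on matching
BACKTRACK = 1                   # force a backtrack at this point
ABORT = "PCRE2_ERROR_CALLOUT"   # symbolic stand-in for the error return

def triggered(spec, number, count):
    """True if an (n, m) setting applies to this callout."""
    if spec is None:
        return False
    n, m = spec
    return number == n and count >= m

def callout_return(number, count, fail=None, error=None):
    """Value returned for callout `number`, the `count`-th callout so far."""
    if triggered(error, number, count):
        return ABORT            # callout_error takes precedence
    if triggered(fail, number, count):
        return BACKTRACK
    return CONTINUE

print(callout_return(3, 1, fail=(3, 0)))
print(callout_return(3, 1, fail=(3, 2)))
print(callout_return(3, 2, fail=(3, 2), error=(3, 0)))
```

The second call returns the default because fewer than two callouts have occurred; the third returns the error stand-in because both modifiers apply to the same callout number.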
| 1840 | The callout_data modifier can be given an unsigned or a negative num- |
| 1841 | ber. This is set as the "user data" that is passed to the matching |
| 1842 | function, and passed back when the callout function is invoked. Any |
| 1843 | value other than zero is used as a return from pcre2test's callout |
| 1844 | function. |
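
The return-value rules above can be summarized as a small decision function. The following Python sketch is only a model of the behaviour described in this section, not pcre2test's actual source. The function name, its parameters, the choice to count the current callout in "count", and the choice to check callout_data before the other modifiers are all assumptions made for illustration. The numeric value of PCRE2_ERROR_CALLOUT is taken from pcre2.h.

```python
# Illustrative model of the callout return rules; not pcre2test's real code.
PCRE2_ERROR_CALLOUT = -37  # value of this constant in pcre2.h


def callout_return(number, count, fail=None, error=None, data=0):
    """Return the value a callout would hand back to the matcher.

    number: the callout number reached (0 for callouts with string arguments)
    count:  callouts seen so far, including this one (an assumption)
    fail:   None, or an (n, m) tuple for callout_fail=<n>:<m>;
            a single number n is written as (n, 0)
    error:  None, or an (n, m) tuple for callout_error, same convention
    data:   the callout_data value; non-zero is returned as-is
            (checking it first is an assumption)
    """
    if data != 0:
        return data
    # callout_error takes precedence over callout_fail for the same number.
    if error is not None and number == error[0] and count >= error[1]:
        return PCRE2_ERROR_CALLOUT
    if fail is not None and number == fail[0] and count >= fail[1]:
        return 1  # causes the matcher to backtrack
    return 0      # matching continues


# callout_fail=1:3 leaves callout 1 alone until the third callout.
print(callout_return(1, 2, fail=(1, 3)))
print(callout_return(1, 3, fail=(1, 3)))
```

With both modifiers set for callout 1, the model returns PCRE2_ERROR_CALLOUT rather than 1, mirroring the precedence rule stated above.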
| 1845 | |
| 1846 | Inserting callouts can be helpful when using pcre2test to check compli- |
| 1847 | cated regular expressions. For further information about callouts, see |
| 1848 | the pcre2callout documentation. |
| 1849 | |
| 1850 | |
| 1851 | NON-PRINTING CHARACTERS |
| 1852 | |
| 1853 | When pcre2test is outputting text in the compiled version of a pattern, |
| 1854 | bytes other than 32-126 are always treated as non-printing characters |
| 1855 | and are therefore shown as hex escapes. |
| 1856 | |
| 1857 | When pcre2test is outputting text that is a matched part of a subject |
| 1858 | string, it behaves in the same way, unless a different locale has been |
| 1859 | set for the pattern (using the locale modifier). In this case, the is- |
| 1860 | print() function is used to distinguish printing and non-printing char- |
| 1861 | acters. |
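
The default rule (no locale set) can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and the exact escape spelling (backslash, "x", two lower-case hex digits) is an assumption, since the text above says only that hex escapes are used.

```python
def show_bytes(data: bytes) -> str:
    """Render bytes the way the default rule describes: values 32-126
    are printed as themselves, everything else as a hex escape."""
    out = []
    for b in data:
        if 32 <= b <= 126:
            out.append(chr(b))
        else:
            # Escape format assumed for illustration.
            out.append("\\x%02x" % b)
    return "".join(out)


print(show_bytes(b"a\tb"))  # prints a\x09b
```

Note that 127 (DEL) falls outside the printable range and is escaped, while space (32) and tilde (126) are printed unchanged.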
| 1862 | |
| 1863 | |
| 1864 | SAVING AND RESTORING COMPILED PATTERNS |
| 1865 | |
| 1866 | It is possible to save compiled patterns on disc or elsewhere, and |
| 1867 | reload them later, subject to a number of restrictions. JIT data cannot |
| 1868 | be saved. The host on which the patterns are reloaded must be running |
| 1869 | the same version of PCRE2, with the same code unit width, and must also |
| 1870 | have the same endianness, pointer width and PCRE2_SIZE type. Before |
| 1871 | compiled patterns can be saved they must be serialized, that is, con- |
| 1872 | verted to a stream of bytes. A single byte stream may contain any num- |
| 1873 | ber of compiled patterns, but they must all use the same character ta- |
| 1874 | bles. A single copy of the tables is included in the byte stream (its |
| 1875 | size is 1088 bytes). |
| 1876 | |
| 1877 | The functions whose names begin with pcre2_serialize_ are used for se- |
| 1878 | rializing and de-serializing. They are described in the pcre2serialize |
| 1879 | documentation. In this section we describe the features of pcre2test |
| 1880 | that can be used to test these functions. |
| 1881 | |
| 1882 | Note that "serialization" in PCRE2 does not convert compiled patterns |
| 1883 | to an abstract format, as serialization in Java or .NET does. It just |
| 1884 | makes a reloadable byte code stream. Hence the restrictions on reloading mentioned above. |
| 1885 | |
| 1886 | In pcre2test, when a pattern with the push modifier is successfully com- |
| 1887 | piled, it is pushed onto a stack of compiled patterns, and pcre2test |
| 1888 | expects the next line to contain a new pattern (or command) instead of |
| 1889 | a subject line. By contrast, the pushcopy modifier causes a copy of the |
| 1890 | compiled pattern to be stacked, leaving the original available for im- |
| 1891 | mediate matching. By using push and/or pushcopy, a number of patterns |
| 1892 | can be compiled and retained. These modifiers are incompatible with |
| 1893 | posix, and control modifiers that act at match time are ignored (with a |
| 1894 | message) for the stacked patterns. The jitverify modifier applies only |
| 1895 | at compile time. |
| 1896 | |
| 1897 | The command |
| 1898 | |
| 1899 | #save <filename> |
| 1900 | |
| 1901 | causes all the stacked patterns to be serialized and the result written |
| 1902 | to the named file. Afterwards, all the stacked patterns are freed. The |
| 1903 | command |
| 1904 | |
| 1905 | #load <filename> |
| 1906 | |
| 1907 | reads the data in the file, and then arranges for it to be de-serial- |
| 1908 | ized, with the resulting compiled patterns added to the pattern stack. |
| 1909 | The pattern on the top of the stack can be retrieved by the #pop com- |
| 1910 | mand, which must be followed by lines of subjects that are to be |
| 1911 | matched with the pattern, terminated as usual by an empty line or end |
| 1912 | of file. This command may be followed by a modifier list containing |
| 1913 | only control modifiers that act after a pattern has been compiled. In |
| 1914 | particular, hex, posix, posix_nosub, push, and pushcopy are not al- |
| 1915 | lowed, nor are any option-setting modifiers. The JIT modifiers are, |
| 1916 | however, permitted. Here is an example that saves and reloads two pat- |
| 1917 | terns. |
| 1918 | |
| 1919 | /abc/push |
| 1920 | /xyz/push |
| 1921 | #save tempfile |
| 1922 | #load tempfile |
| 1923 | #pop info |
| 1924 | xyz |
| 1925 | |
| 1926 | #pop jit,bincode |
| 1927 | abc |
| 1928 | |
| 1929 | If jitverify is used with #pop, it does not automatically imply jit, |
| 1930 | which is different behaviour from when it is used on a pattern. |
| 1931 | |
| 1932 | The #popcopy command is analogous to the pushcopy modifier in that it |
| 1933 | makes current a copy of the topmost stack pattern, leaving the original |
| 1934 | still on the stack. |
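
The stack behaviour of push, #save, #load, #pop, and #popcopy can be modelled with a short Python sketch. This is a toy model under stated assumptions: the class and method names are hypothetical, plain strings stand in for compiled patterns, and an in-memory dictionary stands in for files on disc. It does not call the PCRE2 serialization functions.

```python
import copy


class PatternStack:
    """Toy model of pcre2test's stack of compiled patterns."""

    def __init__(self):
        self.stack = []   # the top of the stack is the end of the list
        self.files = {}   # hypothetical stand-in for saved files

    def push(self, pattern):
        # Models the push modifier: the compiled pattern is stacked.
        self.stack.append(pattern)

    def save(self, filename):
        # Models #save: every stacked pattern is written out,
        # and afterwards the stack is emptied.
        self.files[filename] = list(self.stack)
        self.stack.clear()

    def load(self, filename):
        # Models #load: the reloaded patterns are added to the stack.
        self.stack.extend(self.files[filename])

    def pop(self):
        # Models #pop: the top pattern is removed and made current.
        return self.stack.pop()

    def popcopy(self):
        # Models #popcopy: a copy of the top pattern is made current,
        # and the original stays on the stack.
        return copy.copy(self.stack[-1])


s = PatternStack()
s.push("abc")
s.push("xyz")
s.save("tempfile")
s.load("tempfile")
print(s.pop())  # prints xyz
print(s.pop())  # prints abc
```

The usage lines mirror the save-and-reload example given earlier in this section: the pattern pushed last is the first one retrieved after reloading.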
| 1935 | |
| 1936 | |
| 1937 | SEE ALSO |
| 1938 | |
| 1939 | pcre2(3), pcre2api(3), pcre2callout(3), pcre2jit(3), pcre2matching(3), |
| 1940 | pcre2partial(3), pcre2pattern(3), pcre2serialize(3). |
| 1941 | |
| 1942 | |
| 1943 | AUTHOR |
| 1944 | |
| 1945 | Philip Hazel |
| 1946 | Retired from University Computing Service |
| 1947 | Cambridge, England. |
| 1948 | |
| 1949 | |
| 1950 | REVISION |
| 1951 | |
| 1952 | Last updated: 12 January 2022 |
| 1953 | Copyright (c) 1997-2022 University of Cambridge. |