Blame - doc/pcre2test.txt - platform/external/pcre

Elliott Hughes

5b80804

2021-10-01 10:56:10 -0700


PCRE2TEST(1) General Commands Manual PCRE2TEST(1)

NAME

pcre2test - a program for testing Perl-compatible regular expressions.

SYNOPSIS

pcre2test [options] [input file [output file]]

pcre2test is a test program for the PCRE2 regular expression libraries,
but it can also be used for experimenting with regular expressions.
This document describes the features of the test program; for details
of the regular expressions themselves, see the pcre2pattern
documentation. For details of the PCRE2 library function calls and
their options, see the pcre2api documentation.

The input for pcre2test is a sequence of regular expression patterns
and subject strings to be matched. There are also command lines for
setting defaults and controlling some special actions. The output shows
the result of each match attempt. Modifiers on external or internal
command lines, the patterns, and the subject lines specify PCRE2
function options, control how the subject is processed, and what output
is produced.

There are many obscure modifiers, some of which are specifically
designed for use in conjunction with the test script and data files
that are distributed as part of PCRE2. All the modifiers are documented
here, some without much justification, but many of them are unlikely to
be of use except when testing the libraries.

PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES

Different versions of the PCRE2 library can be built to support
character strings that are encoded in 8-bit, 16-bit, or 32-bit code
units. One, two, or all three of these libraries may be simultaneously
installed. The pcre2test program can be used to test all the libraries.
However, its own input and output are always in 8-bit format. When
testing the 16-bit or 32-bit libraries, patterns and subject strings
are converted to 16-bit or 32-bit format before being passed to the
library functions. Results are converted back to 8-bit code units for
output.

In the rest of this document, the names of library functions and
structures are given in generic form, for example, pcre2_compile(). The
actual names used in the libraries have a suffix _8, _16, or _32, as
appropriate.

INPUT ENCODING

Input to pcre2test is processed line by line, either by calling the C
library's fgets() function, or via the libreadline or libedit library.
In some Windows environments character 26 (hex 1A) causes an immediate
end of file, and no further data is read, so this character should be
avoided unless you really want that action.

The input is processed using C's string functions, so must not contain
binary zeros, even though in Unix-like environments, fgets() treats any
bytes other than newline as data characters. An error is generated if a
binary zero is encountered. By default subject lines are processed for
backslash escapes, which makes it possible to include any data value in
strings that are passed to the library for matching. For patterns,
there is a facility for specifying some or all of the 8-bit input
characters as hexadecimal pairs, which makes it possible to include
binary zeros.

Input for the 16-bit and 32-bit libraries

When testing the 16-bit or 32-bit libraries, there is a need to be able
to generate character code points greater than 255 in the strings that
are passed to the library. For subject lines, backslash escapes can be
used. In addition, when the utf modifier (see "Setting compilation
options" below) is set, the pattern and any following subject lines are
interpreted as UTF-8 strings and translated to UTF-16 or UTF-32 as
appropriate.

For non-UTF testing of wide characters, the utf8_input modifier can be
used. This is mutually exclusive with utf, and is allowed only in
16-bit or 32-bit mode. It causes the pattern and following subject
lines to be treated as UTF-8 according to the original definition (RFC
2279), which allows for character values up to 0x7fffffff. Each
character is placed in one 16-bit or 32-bit code unit (in the 16-bit
case, values greater than 0xffff cause an error to occur).

UTF-8 (in its original definition) is not capable of encoding values
greater than 0x7fffffff, but such values can be handled by the 32-bit
library. When testing this library in non-UTF mode with utf8_input set,
if any character is preceded by the byte 0xff (which is an invalid byte
in UTF-8) 0x80000000 is added to the character's value. This is the
only way of passing such code points in a pattern string. For subject
strings, using an escape sequence is preferable.
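
The utf8_input rules above can be illustrated with a small Python
sketch. It is not pcre2test's own code: the function name
decode_utf8_input and its width parameter are made up for this
example, and the error handling is an assumption. It decodes
original-definition UTF-8 (one to six bytes per character), treats a
0xff byte as the "add 0x80000000" prefix in 32-bit mode, and rejects
values above 0xffff in 16-bit mode.

```python
def decode_utf8_input(data: bytes, width: int = 32) -> list:
    """Decode original-definition UTF-8 into one value per code unit."""
    if width not in (16, 32):
        raise ValueError("utf8_input is allowed only in 16-bit or 32-bit mode")
    values = []
    i = 0
    while i < len(data):
        extra = 0
        if data[i] == 0xFF:
            # 0xff is never valid UTF-8, so it can act as a prefix marker.
            if width != 32:
                raise ValueError("the 0xff prefix needs the 32-bit library")
            extra = 0x80000000
            i += 1
            if i == len(data):
                raise ValueError("0xff is not followed by a character")
        lead = data[i]
        if lead < 0x80:
            count, value = 0, lead
        elif 0xC0 <= lead < 0xE0:
            count, value = 1, lead & 0x1F
        elif 0xE0 <= lead < 0xF0:
            count, value = 2, lead & 0x0F
        elif 0xF0 <= lead < 0xF8:
            count, value = 3, lead & 0x07
        elif 0xF8 <= lead < 0xFC:
            count, value = 4, lead & 0x03
        elif 0xFC <= lead < 0xFE:
            count, value = 5, lead & 0x01
        else:
            raise ValueError("invalid lead byte 0x%02x" % lead)
        tail = data[i + 1:i + 1 + count]
        if len(tail) != count or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError("truncated or malformed sequence")
        for b in tail:
            value = (value << 6) | (b & 0x3F)
        value += extra
        if width == 16 and value > 0xFFFF:
            raise ValueError("value does not fit in a 16-bit code unit")
        values.append(value)
        i += 1 + count
    return values
```

For instance, the two bytes 0xff 0x41 decode to the single value
0x80000041 when width is 32.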

COMMAND LINE OPTIONS

-8        If the 8-bit library has been built, this option causes it to
          be used (this is the default). If the 8-bit library has not
          been built, this option causes an error.

-16       If the 16-bit library has been built, this option causes it
          to be used. If only the 16-bit library has been built, this
          is the default. If the 16-bit library has not been built,
          this option causes an error.

-32       If the 32-bit library has been built, this option causes it
          to be used. If only the 32-bit library has been built, this
          is the default. If the 32-bit library has not been built,
          this option causes an error.

-ac       Behave as if each pattern has the auto_callout modifier, that
          is, insert automatic callouts into every pattern that is
          compiled.

-AC       As for -ac, but in addition behave as if each subject line
          has the callout_extra modifier, that is, show additional
          information from callouts.

-b        Behave as if each pattern has the fullbincode modifier; the
          full internal binary form of the pattern is output after
          compilation.

-C        Output the version number of the PCRE2 library, and all
          available information about the optional features that are
          included, and then exit with zero exit code. All other
          options are ignored. If both -C and -LM are present,
          whichever is first is recognized.

-C option Output information about a specific build-time option, then
          exit. This functionality is intended for use in scripts such
          as RunTest. The following options output the value and set
          the exit code as indicated:

            ebcdic-nl  the code for LF (= NL) in an EBCDIC environment:
                         0x15 or 0x25
                         0 if used in an ASCII environment
                         exit code is always 0
            linksize   the configured internal link size (2, 3, or 4)
                         exit code is set to the link size
            newline    the default newline setting:
                         CR, LF, CRLF, ANYCRLF, ANY, or NUL
                         exit code is always 0
            bsr        the default setting for what \R matches:
                         ANYCRLF or ANY
                         exit code is always 0

          The following options output 1 for true or 0 for false, and
          set the exit code to the same value:

            backslash-C  \C is supported (not locked out)
            ebcdic       compiled for an EBCDIC environment
            jit          just-in-time support is available
            pcre2-16     the 16-bit library was built
            pcre2-32     the 32-bit library was built
            pcre2-8      the 8-bit library was built
            unicode      Unicode support is available

          If an unknown option is given, an error message is output;
          the exit code is 0.

-d        Behave as if each pattern has the debug modifier; the
          internal form and information about the compiled pattern is
          output after compilation; -d is equivalent to -b -i.

-dfa      Behave as if each subject line has the dfa modifier; matching
          is done using the pcre2_dfa_match() function instead of the
          default pcre2_match().

-error number[,number,...]
          Call pcre2_get_error_message() for each of the error numbers
          in the comma-separated list, display the resulting messages
          on the standard output, then exit with zero exit code. The
          numbers may be positive or negative. This is a convenience
          facility for PCRE2 maintainers.

-help     Output a brief summary of these options and then exit.

-i        Behave as if each pattern has the info modifier; information
          about the compiled pattern is given after compilation.

-jit      Behave as if each pattern line has the jit modifier; after
          successful compilation, each pattern is passed to the
          just-in-time compiler, if available.

-jitfast  Behave as if each pattern line has the jitfast modifier;
          after successful compilation, each pattern is passed to the
          just-in-time compiler, if available, and each subject line is
          passed directly to the JIT matcher via its "fast path".

-jitverify
          Behave as if each pattern line has the jitverify modifier;
          after successful compilation, each pattern is passed to the
          just-in-time compiler, if available, and the use of JIT for
          matching is verified.

-LM       List modifiers: write a list of available pattern and subject
          modifiers to the standard output, then exit with zero exit
          code. All other options are ignored. If both -C and any -Lx
          options are present, whichever is first is recognized.

-LP       List properties: write a list of recognized Unicode
          properties to the standard output, then exit with zero exit
          code. All other options are ignored. If both -C and any -Lx
          options are present, whichever is first is recognized.

-LS       List scripts: write a list of recognized Unicode script names
          to the standard output, then exit with zero exit code. All
          other options are ignored. If both -C and any -Lx options are
          present, whichever is first is recognized.

-pattern modifier-list
          Behave as if each pattern line contains the given modifiers.

-q        Do not output the version number of pcre2test at the start of
          execution.

-S size   On Unix-like systems, set the size of the run-time stack to
          size mebibytes (units of 1024*1024 bytes).

-subject modifier-list
          Behave as if each subject line contains the given modifiers.

-t        Run each compile and match many times with a timer, and
          output the resulting times per compile or match. When JIT is
          used, separate times are given for the initial compile and
          the JIT compile. You can control the number of iterations
          that are used for timing by following -t with a number (as a
          separate item on the command line). For example, "-t 1000"
          iterates 1000 times. The default is to iterate 500,000 times.

-tm       This is like -t except that it times only the matching phase,
          not the compile phase.

-T -TM    These behave like -t and -tm, but in addition, at the end of
          a run, the total times for all compiles and matches are
          output.

-version  Output the PCRE2 version number and then exit.
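
As an illustration of how a script might consume the "-C option" probes
listed above, here is a hedged Python sketch. The helper names
interpret_probe and run_probe are invented for this example, and it
assumes a pcre2test binary is on the PATH when run_probe is called. It
follows the exit-code rules stated under -C: the yes/no options report
their answer in the exit code, linksize reports the link size in the
exit code, and the remaining options are read from the output text.

```python
import subprocess

# Options that output 1 or 0 and set the exit code to the same value.
BOOLEAN_OPTIONS = {"backslash-C", "ebcdic", "jit",
                   "pcre2-16", "pcre2-32", "pcre2-8", "unicode"}


def interpret_probe(option, stdout, returncode):
    """Turn the result of 'pcre2test -C <option>' into a Python value."""
    if option in BOOLEAN_OPTIONS:
        return returncode == 1
    if option == "linksize":
        return returncode          # exit code is set to the link size
    return stdout.strip()          # ebcdic-nl, newline, bsr: exit code is 0


def run_probe(option, program="pcre2test"):
    """Run the probe; 'program' is assumed to name a pcre2test binary."""
    result = subprocess.run([program, "-C", option],
                            capture_output=True, text=True)
    return interpret_probe(option, result.stdout, result.returncode)
```

Because an unknown option also exits with code 0, a script built this
way cannot tell "false" from "not recognized" by exit code alone.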

DESCRIPTION

If pcre2test is given two filename arguments, it reads from the first
and writes to the second. If the first name is "-", input is taken from
the standard input. If pcre2test is given only one argument, it reads
from that file and writes to stdout. Otherwise, it reads from stdin and
writes to stdout.
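
The file-argument rules in the previous paragraph can be summarized by
the following Python sketch. The function name resolve_streams is
hypothetical, and treating "-" as standard input in the one-argument
case is an assumption; the text states it only for the first name.

```python
def resolve_streams(filenames):
    """Return (input, output); None stands for the standard stream."""
    if len(filenames) > 2:
        raise ValueError("at most two file names are expected")
    source = filenames[0] if filenames else None
    if source == "-":
        source = None              # "-" means read the standard input
    target = filenames[1] if len(filenames) == 2 else None
    return source, target
```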

When pcre2test is built, a configuration option can specify that it
should be linked with the libreadline or libedit library. When this is
done, if the input is from a terminal, it is read using the readline()
function. This provides line-editing and history facilities. The output
from the -help option states whether or not readline() will be used.

The program handles any number of tests, each of which consists of a
set of input lines. Each set starts with a regular expression pattern,
followed by any number of subject lines to be matched against that
pattern. In between sets of test data, command lines that begin with #
may appear. This file format, with some restrictions, can also be
processed by the perltest.sh script that is distributed with PCRE2 as a
means of checking that the behaviour of PCRE2 and Perl is the same. For
a specification of perltest.sh, see the comments near its beginning.
See also the #perltest command below.

When the input is a terminal, pcre2test prompts for each line of input,
using "re>" to prompt for regular expression patterns, and "data>" to
prompt for subject lines. Command lines starting with # can be entered
only in response to the "re>" prompt.

Each subject line is matched separately and independently. If you want
to do multi-line matches, you have to use the \n escape sequence (or \r
or \r\n, etc., depending on the newline setting) in a single line of
input to encode the newline sequences. There is no limit on the length
of subject lines; the input buffer is automatically extended if it is
too small. There are replication features that make it possible to
generate long repetitive pattern or subject lines without having to
supply them explicitly.

An empty line or the end of the file signals the end of the subject
lines for a test, at which point a new pattern or command line is
expected if there is still input to be read.

COMMAND LINES

In between sets of test data, a line that begins with # is interpreted
as a command line. If the first character is followed by white space or
an exclamation mark, the line is treated as a comment, and ignored.
Otherwise, the following commands are recognized:

#forbid_utf

Subsequent patterns automatically have the PCRE2_NEVER_UTF and
PCRE2_NEVER_UCP options set, which locks out the use of the PCRE2_UTF
and PCRE2_UCP options and the use of (*UTF) and (*UCP) at the start of
patterns. This command also forces an error if a subsequent pattern
contains any occurrences of \P, \p, or \X, which are still supported
when PCRE2_UTF is not set, but which require Unicode property support
to be included in the library.

This is a trigger guard that is used in test files to ensure that UTF
or Unicode property tests are not accidentally added to files that are
used when Unicode support is not included in the library. Setting
PCRE2_NEVER_UTF and PCRE2_NEVER_UCP as a default can also be obtained
by the use of #pattern; the difference is that #forbid_utf cannot be
unset, and the automatic options are not displayed in pattern
information, to avoid cluttering up test output.

#load <filename>

This command is used to load a set of precompiled patterns from a file,
as described in the section entitled "Saving and restoring compiled
patterns" below.

#loadtables <filename>

This command is used to load a set of binary character tables that can
be accessed by the tables=3 qualifier. Such tables can be created by
the pcre2_dftables program with the -b option.

#newline_default [<newline-list>]

When PCRE2 is built, a default newline convention can be specified.
This determines which characters and/or character pairs are recognized
as indicating a newline in a pattern or subject string. The default can
be overridden when a pattern is compiled. The standard test files
contain tests of various newline conventions, but the majority of the
tests expect a single linefeed to be recognized as a newline by
default. Without special action the tests would fail when PCRE2 is
compiled with either CR or CRLF as the default newline.

The #newline_default command specifies a list of newline types that are
acceptable as the default. The types must be one of CR, LF, CRLF,
ANYCRLF, ANY, or NUL (in upper or lower case), for example:

  #newline_default LF Any anyCRLF

If the default newline is in the list, this command has no effect.
Otherwise, except when testing the POSIX API, a newline modifier that
specifies the first newline convention in the list (LF in the above
example) is added to any pattern that does not already have a newline
modifier. If the newline list is empty, the feature is turned off. This
command is present in a number of the standard test input files.

When the POSIX API is being tested there is no way to override the
default newline convention, though it is possible to set the newline
convention from within the pattern. A warning is given if the posix or
posix_nosub modifier is used when #newline_default would set a default
for the non-POSIX API.
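
The decision that #newline_default makes for each pattern can be
sketched in Python as follows. This is an illustration of the rules
described above, not pcre2test's implementation; the function name
newline_modifier_to_add and its parameters are hypothetical, and the
POSIX warning is left out.

```python
NEWLINE_TYPES = ("CR", "LF", "CRLF", "ANYCRLF", "ANY", "NUL")


def newline_modifier_to_add(build_default, acceptable, pattern_modifiers,
                            posix=False):
    """Return the modifier to add to a pattern, or None if none is added."""
    wanted = [name.upper() for name in acceptable]
    for name in wanted:
        if name not in NEWLINE_TYPES:
            raise ValueError("unknown newline type: " + name)
    if not wanted:
        return None                # empty list: the feature is turned off
    if build_default.upper() in wanted:
        return None                # the build default is acceptable
    if posix:
        return None                # no override when testing the POSIX API
    for modifier in pattern_modifiers:
        if modifier.split("=", 1)[0].strip() == "newline":
            return None            # the pattern already sets a newline
    return "newline=" + wanted[0].lower()
```

With a library built for CR and the list "LF Any anyCRLF", a pattern
that has no newline modifier of its own would gain newline=lf.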

#pattern <modifier-list>

This command sets a default modifier list that applies to all
subsequent patterns. Modifiers on a pattern can change these settings.

#perltest

This line is used in test files that can also be processed by
perltest.sh to confirm that Perl gives the same results as PCRE2.
Subsequent tests are checked for the use of pcre2test features that are
incompatible with the perltest.sh script.

Patterns must use '/' as their delimiter, and only certain modifiers
are supported. Comment lines, #pattern commands, and #subject commands
that set or unset "mark" are recognized and acted on. The #perltest,
#forbid_utf, and #newline_default commands, which are needed in the
relevant pcre2test files, are silently ignored. All other command lines
are ignored, but give a warning message. The #perltest command helps
detect tests that are accidentally put in the wrong file or use the
wrong delimiter. For more details of the perltest.sh script see the
comments it contains.

#pop [<modifiers>]
#popcopy [<modifiers>]

These commands are used to manipulate the stack of compiled patterns,
as described in the section entitled "Saving and restoring compiled
patterns" below.

#save <filename>

This command is used to save a set of compiled patterns to a file, as
described in the section entitled "Saving and restoring compiled
patterns" below.

#subject <modifier-list>

This command sets a default modifier list that applies to all
subsequent subject lines. Modifiers on a subject line can change these
settings.
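
The way a line beginning with # is classified, as described at the
start of this section, can be sketched in Python. The function name
classify_command_line is made up for this example, the command names
are those listed in this section, and treating a bare "#" as a comment
is an assumption.

```python
KNOWN_COMMANDS = {"forbid_utf", "load", "loadtables", "newline_default",
                  "pattern", "perltest", "pop", "popcopy", "save", "subject"}


def classify_command_line(line):
    """Return (kind, name, argument) for a line that starts with '#'."""
    if not line.startswith("#"):
        raise ValueError("not a command line")
    rest = line[1:]
    # '#' followed by white space or '!' introduces a comment.
    if rest == "" or rest[0].isspace() or rest[0] == "!":
        return ("comment", "", "")
    parts = rest.split(None, 1)
    name = parts[0]
    argument = parts[1].strip() if len(parts) > 1 else ""
    kind = "command" if name in KNOWN_COMMANDS else "unrecognized"
    return (kind, name, argument)
```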

MODIFIER SYNTAX

Modifier lists are used with both pattern and subject lines. Items in a
list are separated by commas followed by optional white space. Trailing
whitespace in a modifier list is ignored. Some modifiers may be given
for both patterns and subject lines, whereas others are valid only for
one or the other. Each modifier has a long name, for example
"anchored", and some of them must be followed by an equals sign and a
value, for example, "offset=12". Values cannot contain comma
characters, but may contain spaces. Modifiers that do not take values
may be preceded by a minus sign to turn off a previous setting.

A few of the more common modifiers can also be specified as single
letters, for example "i" for "caseless". In documentation, following
the Perl convention, these are written with a slash ("the /i modifier")
for clarity. Abbreviated modifiers must all be concatenated in the
first item of a modifier list. If the first item is not recognized as a
long modifier name, it is interpreted as a sequence of these
abbreviations. For example:

  /abc/ig,newline=cr,jit=3

This is a pattern line whose modifier list starts with two one-letter
modifiers (/i and /g). The lower-case abbreviated modifiers are the
same as used in Perl.
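
The modifier-list rules above can be illustrated with a short Python
sketch. It is only an approximation: the function name
parse_modifier_list is hypothetical, the tables hold a small
illustrative subset of names, the long name "global" for /g is an
assumption not stated in this section, and the special handling of a
doubled x is not modelled.

```python
# Illustrative subsets only; the real lists are much longer.
ABBREVIATIONS = {"i": "caseless", "s": "dotall", "x": "extended",
                 "m": "multiline", "n": "no_auto_capture", "g": "global"}
KNOWN_LONG_NAMES = {"anchored", "caseless", "dotall", "extended", "global",
                    "jit", "multiline", "newline", "no_auto_capture",
                    "offset"}


def parse_modifier_list(text, known=KNOWN_LONG_NAMES):
    """Return a list of (name, value) pairs; value is True, False or str."""
    result = []
    for index, raw in enumerate(text.rstrip().split(",")):
        item = raw.lstrip()        # white space after a comma is optional
        if not item:
            continue
        name, has_value, value = item.partition("=")
        negated = name.startswith("-")
        bare = name[1:] if negated else name
        if bare in known:
            if has_value and negated:
                raise ValueError("cannot negate a modifier with a value")
            result.append((bare, value if has_value else not negated))
        elif index == 0 and not has_value and not negated:
            # First item is not a long name: treat it as abbreviations.
            for letter in item:
                if letter not in ABBREVIATIONS:
                    raise ValueError("unknown abbreviation: " + letter)
                result.append((ABBREVIATIONS[letter], True))
        else:
            raise ValueError("unknown modifier: " + item)
    return result
```

Applied to the list from the example pattern line, "ig,newline=cr,jit=3",
the sketch yields caseless and global as switches, followed by newline
with the value "cr" and jit with the value "3".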

PATTERN SYNTAX

A pattern line must start with one of the following characters (common
symbols, excluding pattern meta-characters):

  / ! " ' ` - = _ : ; , % & @ ~

This is interpreted as the pattern's delimiter. A regular expression
may be continued over several input lines, in which case the newline
characters are included within it. It is possible to include the
delimiter as a literal within the pattern by escaping it with a
backslash, for example

  /abc\/def/

If you do this, the escape and the delimiter form part of the pattern,
but since the delimiters are all non-alphanumeric, the inclusion of the
backslash does not affect the pattern's interpretation. Note, however,
that this trick does not work within \Q...\E literal bracketing because
the backslash will itself be interpreted as a literal. If the
terminating delimiter is immediately followed by a backslash, for
example,

  /abc/\

then a backslash is added to the end of the pattern. This is done to
provide a way of testing the error condition that arises if a pattern
finishes with a backslash, because

  /abc\/

is interpreted as the first line of a pattern that starts with "abc/",
causing pcre2test to read the next line as a continuation of the
regular expression.

A pattern can be followed by a modifier list (details below).
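
The delimiter rules in this section can be sketched in Python as shown
below. The function name split_pattern_line is hypothetical and the
sketch handles a single input line only: it returns None when the
closing delimiter has not yet been seen, which is the case where
pcre2test would go on to read a continuation line.

```python
DELIMITERS = set("/!\"'`-=_:;,%&@~")


def split_pattern_line(line):
    """Return (pattern, modifier_text), or None if the pattern continues."""
    if not line or line[0] not in DELIMITERS:
        raise ValueError("line does not start with a pattern delimiter")
    delimiter = line[0]
    i = 1
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            i += 2                 # escape and next character stay in pattern
            continue
        if ch == delimiter:
            pattern, rest = line[1:i], line[i + 1:]
            if rest.startswith("\\"):
                # A backslash right after the closing delimiter is
                # appended to the pattern.
                pattern += "\\"
                rest = rest[1:]
            return pattern, rest
        i += 1
    return None
```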

SUBJECT LINE SYNTAX

Before each subject line is passed to pcre2_match(), pcre2_dfa_match(),
or pcre2_jit_match(), leading and trailing white space is removed, and
the line is scanned for backslash escapes, unless the subject_literal
modifier was set for the pattern. The following provide a means of
encoding non-printing characters in a visible way:

  \a         alarm (BEL, \x07)
  \b         backspace (\x08)
  \e         escape (\x1b)
  \f         form feed (\x0c)
  \n         newline (\x0a)
  \r         carriage return (\x0d)
  \t         tab (\x09)
  \v         vertical tab (\x0b)
  \nnn       octal character (up to 3 octal digits); always
               a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode
  \o{dd...}  octal character (any number of octal digits)
  \xhh       hexadecimal byte (up to 2 hex digits)
  \x{hh...}  hexadecimal character (any number of hex digits)

The use of \x{hh...} is not dependent on the use of the utf modifier on
the pattern. It is recognized always. There may be any number of
hexadecimal digits inside the braces; invalid values provoke error
messages.

Note that \xhh specifies one byte rather than one character in UTF-8
mode; this makes it possible to construct invalid UTF-8 sequences for
testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
character in UTF-8 mode, generating more than one byte if the value is
greater than 127. When testing the 8-bit library not in UTF-8 mode,
\x{hh} generates one byte for values less than 256, and causes an error
for greater values.
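
The difference between \xhh and \x{hh} for the 8-bit library can be
made concrete with a small Python sketch. The function name
x_escape_bytes and its parameters are invented for this illustration,
and it only handles values that Python can encode as UTF-8.

```python
def x_escape_bytes(hex_digits, braced, utf):
    """Bytes produced for \\xhh (braced=False) or \\x{hh...} (braced=True)."""
    value = int(hex_digits, 16)
    if not braced:
        if not 1 <= len(hex_digits) <= 2:
            raise ValueError("\\xhh takes at most two hex digits")
        return bytes([value])      # always exactly one byte
    if utf:
        return chr(value).encode("utf-8")   # one character, 1 or more bytes
    if value > 0xFF:
        raise ValueError("value too large for a non-UTF 8-bit subject")
    return bytes([value])
```

So in UTF-8 mode \xe9 is the single (invalid on its own) byte 0xe9,
whereas \x{e9} becomes the two bytes 0xc3 0xa9.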

In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
possible to construct invalid UTF-16 sequences for testing purposes.

In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
makes it possible to construct invalid UTF-32 sequences for testing
purposes.

There is a special backslash sequence that specifies replication of one
or more characters:

  \[<characters>]{<count>}

This makes it possible to test long strings without having to provide
them as part of the file. For example:

  \[abc]{4}

is converted to "abcabcabcabc". This feature does not support nesting.
To include a closing square bracket in the characters, code it as \x5D.
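
A minimal Python sketch of the replication expansion follows. It uses a
regular expression rather than pcre2test's own scanner, the name
expand_replication is hypothetical, and it does not check whether the
introducing backslash is itself escaped. Other escapes inside the
brackets are left untouched for later processing.

```python
import re

# \[<characters>]{<count>}: no nesting, so the characters cannot contain ']'.
_REPLICATION = re.compile(r"\\\[([^\]]*)\]\{(\d+)\}")


def expand_replication(line):
    """Expand every \\[...]{n} item in a subject line."""
    return _REPLICATION.sub(lambda m: m.group(1) * int(m.group(2)), line)
```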

A backslash followed by an equals sign marks the end of the subject
string and the start of a modifier list. For example:

  abc\=notbol,notempty

If the subject string is empty and \= is followed by whitespace, the
line is treated as a comment line, and is not used for matching. For
example:

  \= This is a comment.
  abc\= This is an invalid modifier list.
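
How a subject line divides at \= can be sketched in Python as follows.
The function name split_subject_line is hypothetical; the sketch
assumes leading and trailing white space has already been removed, and
it skips over other backslash pairs so that an escaped backslash
followed by "=" is not mistaken for the separator.

```python
def split_subject_line(line):
    """Return (subject, modifier_text), or None for a comment line."""
    i = 0
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            if line[i + 1] == "=":
                subject, rest = line[:i], line[i + 2:]
                if subject == "" and rest[:1].isspace():
                    return None    # empty subject, then white space
                return subject, rest
            i += 2                 # skip any other escaped character
            continue
        i += 1
    return line, ""
```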

A backslash followed by any other non-alphanumeric character just
escapes that character. A backslash followed by anything else causes an
error. However, if the very last character in the line is a backslash
(and there is no modifier list), it is ignored. This gives a way of
passing an empty line as data, since a real empty line terminates the
data input.

If the subject_literal modifier is set for a pattern, all subject lines
that follow are treated as literals, with no special treatment of
backslashes. No replication is possible, and any subject modifiers must
be set as defaults by a #subject command.

PATTERN MODIFIERS

There are several types of modifier that can appear in pattern lines.
Except where noted below, they may also be used in #pattern commands. A
pattern's modifier list can add to or override default modifiers that
were set by a previous #pattern command.

Setting compilation options

The following modifiers set options for pcre2_compile(). Most of them
set bits in the options argument of that function, but those whose
names start with PCRE2_EXTRA are additional options that are set in the
compile context. For the main options, there are some single-letter
abbreviations that are the same as Perl options. There is special
handling for /x: if a second x is present, PCRE2_EXTENDED is converted
into PCRE2_EXTENDED_MORE as in Perl. A third appearance adds
PCRE2_EXTENDED as well, though this makes no difference to the way
pcre2_compile() behaves. See pcre2api for a description of the effects
of these options.

      allow_empty_class         set PCRE2_ALLOW_EMPTY_CLASS
      allow_lookaround_bsk      set PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
      allow_surrogate_escapes   set PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
      alt_bsux                  set PCRE2_ALT_BSUX
      alt_circumflex            set PCRE2_ALT_CIRCUMFLEX
      alt_verbnames             set PCRE2_ALT_VERBNAMES
      anchored                  set PCRE2_ANCHORED
      auto_callout              set PCRE2_AUTO_CALLOUT
      bad_escape_is_literal     set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
  /i  caseless                  set PCRE2_CASELESS
      dollar_endonly            set PCRE2_DOLLAR_ENDONLY
  /s  dotall                    set PCRE2_DOTALL
      dupnames                  set PCRE2_DUPNAMES
      endanchored               set PCRE2_ENDANCHORED
      escaped_cr_is_lf          set PCRE2_EXTRA_ESCAPED_CR_IS_LF
  /x  extended                  set PCRE2_EXTENDED
  /xx extended_more             set PCRE2_EXTENDED_MORE
      extra_alt_bsux            set PCRE2_EXTRA_ALT_BSUX
      firstline                 set PCRE2_FIRSTLINE
      literal                   set PCRE2_LITERAL
      match_line                set PCRE2_EXTRA_MATCH_LINE
      match_invalid_utf         set PCRE2_MATCH_INVALID_UTF
      match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
      match_word                set PCRE2_EXTRA_MATCH_WORD
  /m  multiline                 set PCRE2_MULTILINE
      never_backslash_c         set PCRE2_NEVER_BACKSLASH_C
      never_ucp                 set PCRE2_NEVER_UCP
      never_utf                 set PCRE2_NEVER_UTF
  /n  no_auto_capture           set PCRE2_NO_AUTO_CAPTURE
      no_auto_possess           set PCRE2_NO_AUTO_POSSESS
      no_dotstar_anchor         set PCRE2_NO_DOTSTAR_ANCHOR
      no_start_optimize         set PCRE2_NO_START_OPTIMIZE
      no_utf_check              set PCRE2_NO_UTF_CHECK
      ucp                       set PCRE2_UCP
      ungreedy                  set PCRE2_UNGREEDY
      use_offset_limit          set PCRE2_USE_OFFSET_LIMIT
      utf                       set PCRE2_UTF

As well as turning on the PCRE2_UTF option, the utf modifier causes all
non-printing characters in output strings to be printed using the
\x{hh...} notation. Otherwise, those less than 0x100 are output in hex
without the curly brackets. Setting utf in 16-bit or 32-bit mode also
causes pattern and subject strings to be translated to UTF-16 or
UTF-32, respectively, before being passed to library functions.

Setting compilation controls

The following modifiers affect the compilation process or request
information about the pattern. There are single-letter abbreviations
for some that are heavily used in the test files.

      bsr=[anycrlf|unicode]     specify \R handling
  /B  bincode                   show binary code without lengths
      callout_info              show callout information
      convert=<options>         request foreign pattern conversion
      convert_glob_escape=c     set glob escape character
      convert_glob_separator=c  set glob separator character
      convert_length            set convert buffer length
      debug                     same as info,fullbincode
      framesize                 show matching frame size
      fullbincode               show binary code with lengths
  /I  info                      show info about compiled pattern
      hex                       unquoted characters are hexadecimal
      jit[=<number>]            use JIT
      jitfast                   use JIT fast path
      jitverify                 verify JIT use
      locale=<name>             use this locale
      max_pattern_length=<n>    set the maximum pattern length
      memory                    show memory used
      newline=<type>            set newline type
      null_context              compile with a NULL context
      parens_nest_limit=<n>     set maximum parentheses depth
      posix                     use the POSIX API
      posix_nosub               use the POSIX API with REG_NOSUB
      push                      push compiled pattern onto the stack
      pushcopy                  push a copy onto the stack
      stackguard=<number>       test the stackguard feature
      subject_literal           treat all subject lines as literal
      tables=[0|1|2|3]          select internal tables
      use_length                do not zero-terminate the pattern
      utf8_input                treat input as UTF-8

The effects of these modifiers are described in the following sections.

Newline and \R handling

The bsr modifier specifies what \R in a pattern should match. If it is
set to "anycrlf", \R matches CR, LF, or CRLF only. If it is set to
"unicode", \R matches any Unicode newline sequence. The default can be
specified when PCRE2 is built; if it is not, the default is set to
Unicode.

The newline modifier specifies which characters are to be interpreted
as newlines, both in the pattern and in subject lines. The type must be
one of CR, LF, CRLF, ANYCRLF, ANY, or NUL (in upper or lower case).

Information about a pattern

The debug modifier is a shorthand for info,fullbincode, requesting all
available information.

The bincode modifier causes a representation of the compiled code to be
output after compilation. This information does not contain length and
offset values, which ensures that the same output is generated for
different internal link sizes and different code unit widths. By using
bincode, the same regression tests can be used in different
environments.

The fullbincode modifier, by contrast, does include length and offset
values. This is used in a few special tests that run only for specific
code unit widths and link sizes, and is also useful for one-off tests.

The info modifier requests information about the compiled pattern
(whether it is anchored, has a fixed first character, and so on). The
information is obtained from the pcre2_pattern_info() function. Here
are some typical examples:

    re> /(?i)(^a|^b)/m,info
  Capture group count = 1
  Compile options: multiline
  Overall options: caseless multiline
  First code unit at start or follows newline
  Subject length lower bound = 1

    re> /(?i)abc/info
  Capture group count = 0
  Compile options: <none>
  Overall options: caseless
  First code unit = 'a' (caseless)
  Last code unit = 'c' (caseless)
  Subject length lower bound = 3

690

691

"Compile options" are those specified by modifiers; "overall options"

692

have added options that are taken or deduced from the pattern. If both

693

sets of options are the same, just a single "options" line is output;

694

if there are no options, the line is omitted. "First code unit" is

695

where any match must start; if there is more than one they are listed

696

as "starting code units". "Last code unit" is the last literal code

697

unit that must be present in any match. This is not necessarily the

698

last character. These lines are omitted if no starting or ending code

699

units are recorded. The subject length line is omitted when

700

no_start_optimize is set because the minimum length is not calculated

701

when it can never be used.

702

703

The framesize modifier shows the size, in bytes, of the storage frames

704

used by pcre2_match() for handling backtracking. The size depends on

705

the number of capturing parentheses in the pattern.

706

707

The callout_info modifier requests information about all the callouts

708

in the pattern. A list of them is output at the end of any other infor-

709

mation that is requested. For each callout, either its number or string

710

is given, followed by the item that follows it in the pattern.

Passing a NULL context

Normally, pcre2test passes a context block to pcre2_compile(). If the
null_context modifier is set, however, NULL is passed. This is for
testing that pcre2_compile() behaves correctly in this case (it uses
default values).

Specifying pattern characters in hexadecimal

The hex modifier specifies that the characters of the pattern, except
for substrings enclosed in single or double quotes, are to be
interpreted as pairs of hexadecimal digits. This feature is provided as
a way of creating patterns that contain binary zeros and other
non-printing characters. White space is permitted between pairs of
digits. For example, this pattern contains three characters:

  /ab 32 59/hex

Parts of such a pattern are taken literally if quoted. This pattern
contains nine characters, only two of which are specified in
hexadecimal:

  /ab "literal" 32/hex

Either single or double quotes may be used. There is no way of
including the delimiter within a substring. The hex and expand
modifiers are mutually exclusive.
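
The decoding rule for hex patterns can be modelled in a few lines. The
following Python sketch is not taken from pcre2test; the function name
decode_hex_pattern is hypothetical, and treating quoted text as one byte
per character is an assumption made for the illustration.

```python
def decode_hex_pattern(text: str) -> bytes:
    """Model of the hex modifier: hex digit pairs plus quoted literal runs."""
    hex_digits = "0123456789abcdefABCDEF"
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1                       # white space between pairs is skipped
        elif ch in "'\"":
            end = text.find(ch, i + 1)   # a quoted run ends at the same quote
            if end < 0:
                raise ValueError("unterminated quoted substring")
            # Assumption: each quoted character becomes one byte.
            out += text[i + 1:end].encode("latin-1")
            i = end + 1
        else:
            pair = text[i:i + 2]
            if len(pair) != 2 or any(c not in hex_digits for c in pair):
                raise ValueError(f"expected two hex digits at offset {i}")
            out.append(int(pair, 16))
            i += 2
    return bytes(out)


three = decode_hex_pattern("ab 32 59")          # the first example pattern
nine = decode_hex_pattern('ab "literal" 32')    # the second example pattern
print(len(three), len(nine))
```

The two calls correspond to the bodies of the example patterns shown in
this section, without the delimiters and the hex modifier.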

Specifying the pattern's length

By default, patterns are passed to the compiling functions as
zero-terminated strings. The use_length modifier causes them to be
passed by length instead. Using a length happens automatically (whether
or not use_length is set) when hex is set, because patterns specified
in hexadecimal may contain binary zeros.

If hex or use_length is used with the POSIX wrapper API (see "Using the
POSIX wrapper API" below), the REG_PEND extension is used to pass the
pattern's length.

Specifying wide characters in 16-bit and 32-bit modes

In 16-bit and 32-bit modes, all input is automatically treated as UTF-8
and translated to UTF-16 or UTF-32 when the utf modifier is set. For
testing the 16-bit and 32-bit libraries in non-UTF mode, the utf8_input
modifier can be used. It is mutually exclusive with utf. Input lines
are interpreted as UTF-8 as a means of specifying wide characters. More
details are given in "Input encoding" above.

Generating long repetitive patterns

Some tests use long patterns that are very repetitive. Instead of
creating a very long input line for such a pattern, you can use a
special repetition feature, similar to the one described for subject
lines above. If the expand modifier is present on a pattern, parts of
the pattern that have the form

  \[<characters>]{<count>}

are expanded before the pattern is passed to pcre2_compile(). For
example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This
construction cannot be nested. An initial "\[" sequence is recognized
only if "]{" followed by decimal digits and "}" is found later in the
pattern. If not, the characters remain in the pattern unaltered. The
expand and hex modifiers are mutually exclusive.

If part of an expanded pattern looks like an expansion, but is really
part of the actual pattern, unwanted expansion can be avoided by giving
two values in the quantifier. For example, \[AB]{6000,6000} is not
recognized as an expansion item.

If the info modifier is set on an expanded pattern, the result of the
expansion is included in the information that is output.
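
The expansion rule can be approximated with a regular expression. The
sketch below is written in Python purely as an illustration; the name
expand_pattern is hypothetical, and the assumption that the repeated
characters never include "]" is a simplification, not something the
text above states.

```python
import re

# "\[" + characters + "]{" + decimal digits + "}". A two-value quantifier
# such as {6000,6000} does not match this shape, so it is left unaltered.
_EXPANSION = re.compile(r"\\\[([^\]]*)\]\{(\d+)\}")


def expand_pattern(pattern: str) -> str:
    """Replace each \\[chars]{count} item by chars repeated count times."""
    return _EXPANSION.sub(lambda m: m.group(1) * int(m.group(2)), pattern)


print(expand_pattern(r"x\[AB]{3}y"))        # one expansion item
print(expand_pattern(r"\[AB]{3,3}"))        # not an expansion item
print(len(expand_pattern(r"\[AB]{6000}")))  # length of the expanded pattern
```

An opening "\[" with no later "]{digits}" simply fails to match, so
those characters stay in the pattern, which mirrors the behaviour
described above.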

JIT compilation

Just-in-time (JIT) compiling is a heavyweight optimization that can
greatly speed up pattern matching. See the pcre2jit documentation for
details. JIT compiling happens, optionally, after a pattern has been
successfully compiled into an internal form. The JIT compiler converts
this to optimized machine code. It needs to know whether the match-time
options PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT are going to be used,
because different code is generated for the different cases. See the
partial_hard and partial_soft modifiers in "Subject Modifiers" below
for details of how these options are specified for each match attempt.

JIT compilation is requested by the jit pattern modifier, which may
optionally be followed by an equals sign and a number in the range 0 to
7. The three bits that make up the number specify which of the three
JIT operating modes are to be compiled:

  1  compile JIT code for non-partial matching
  2  compile JIT code for soft partial matching
  4  compile JIT code for hard partial matching

The possible values for the jit modifier are therefore:

  0  disable JIT
  1  normal matching only
  2  soft partial matching only
  3  normal and soft partial matching
  4  hard partial matching only
  5  normal and hard partial matching
  6  soft and hard partial matching only
  7  all three modes

If no number is given, 7 is assumed. The phrase "partial matching"
means a call to pcre2_match() with either the PCRE2_PARTIAL_SOFT or the
PCRE2_PARTIAL_HARD option set. Note that such a call may return a
complete match; the options enable the possibility of a partial match,
but do not require it. Note also that if you request JIT compilation
only for partial matching (for example, jit=2) but do not set a partial
matching modifier on a subject line, that match will not use JIT code
because none was compiled for non-partial matching.
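
The bit arithmetic behind the jit values can be checked with a short
Python sketch. The constant and function names here are hypothetical
and exist only for this illustration; they are not PCRE2 identifiers.

```python
# Bit values as listed above for the three JIT operating modes.
JIT_NORMAL, JIT_PARTIAL_SOFT, JIT_PARTIAL_HARD = 1, 2, 4

_MODE_NAMES = {
    JIT_NORMAL: "normal",
    JIT_PARTIAL_SOFT: "soft partial",
    JIT_PARTIAL_HARD: "hard partial",
}


def jit_modes(value: int = 7) -> set:
    """Return the modes selected by jit=<value>; 7 is the default."""
    if not 0 <= value <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    return {name for bit, name in _MODE_NAMES.items() if value & bit}


def jit_code_available(value: int, match_kind: str) -> bool:
    """True if JIT code was compiled for this kind of match."""
    return match_kind in jit_modes(value)


print(sorted(jit_modes(6)))
print(jit_code_available(2, "normal"))   # jit=2 with a non-partial match
```

The last call models the case described above: with jit=2 and no partial
matching modifier on the subject line, no suitable JIT code exists.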

If JIT compilation is successful, the compiled JIT code will
automatically be used when an appropriate type of match is run, except
when incompatible run-time options are specified. For more details, see
the pcre2jit documentation. See also the jitstack modifier below for a
way of setting the size of the JIT stack.

If the jitfast modifier is specified, matching is done using the JIT
"fast path" interface, pcre2_jit_match(), which skips some of the
sanity checks that are done by pcre2_match(), and of course does not
work when JIT is not supported. If jitfast is specified without jit,
jit=7 is assumed.

If the jitverify modifier is specified, information about the compiled
pattern shows whether JIT compilation was or was not successful. If
jitverify is specified without jit, jit=7 is assumed. If JIT
compilation is successful when jitverify is set, the text "(JIT)" is
added to the first output line after a match or non match when
JIT-compiled code was actually used in the match.

Setting a locale

The locale modifier must specify the name of a locale, for example:

  /pattern/locale=fr_FR

The given locale is set, pcre2_maketables() is called to build a set of
character tables for the locale, and this is then passed to
pcre2_compile() when compiling the regular expression. The same tables
are used when matching the following subject lines. The locale modifier
applies only to the pattern on which it appears, but can be given in a
#pattern command if a default is needed. Setting a locale and alternate
character tables are mutually exclusive.

Showing pattern memory

The memory modifier causes the size in bytes of the memory used to hold
the compiled pattern to be output. This does not include the size of
the pcre2_code block; it is just the actual compiled data. If the
pattern is subsequently passed to the JIT compiler, the size of the JIT
compiled code is also output. Here is an example:

    re> /a(b)c/jit,memory
  Memory allocation (code space): 21
  Memory allocation (JIT code): 1910

Limiting nested parentheses

The parens_nest_limit modifier sets a limit on the depth of nested
parentheses in a pattern. Breaching the limit causes a compilation
error. The default for the library is set when PCRE2 is built, but
pcre2test sets its own default of 220, which is required for running
the standard test suite.

Limiting the pattern length

The max_pattern_length modifier sets a limit, in code units, to the
length of pattern that pcre2_compile() will accept. Breaching the limit
causes a compilation error. The default is the largest number a
PCRE2_SIZE variable can hold (essentially unlimited).

Using the POSIX wrapper API

The posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
the POSIX wrapper API rather than its native API. When posix_nosub is
used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX
wrapper supports only the 8-bit library. Note that it does not imply
POSIX matching semantics; for more detail see the pcre2posix
documentation. The following pattern modifiers set options for the
regcomp() function:

  caseless           REG_ICASE
  multiline          REG_NEWLINE
  dotall             REG_DOTALL     )
  ungreedy           REG_UNGREEDY   ) These options are not part of
  ucp                REG_UCP        )   the POSIX standard
  utf                REG_UTF8       )

The regerror_buffsize modifier specifies a size for the error buffer
that is passed to regerror() in the event of a compilation error. For
example:

  /abc/posix,regerror_buffsize=20

This provides a means of testing the behaviour of regerror() when the
buffer is too small for the error message. If this modifier has not
been set, a large buffer is used.

The aftertext and allaftertext subject modifiers work as described
below. All other modifiers are either ignored, with a warning message,
or cause an error.

The pattern is passed to regcomp() as a zero-terminated string by
default, but if the use_length or hex modifiers are set, the REG_PEND
extension is used to pass it by length.

Testing the stack guard feature

The stackguard modifier is used to test the use of
pcre2_set_compile_recursion_guard(), a function that is provided to
enable stack availability to be checked during compilation (see the
pcre2api documentation for details). If the number specified by the
modifier is greater than zero, pcre2_set_compile_recursion_guard() is
called to set up a callback from pcre2_compile() to a local function.
The argument it receives is the current nesting parenthesis depth; if
this is greater than the value given by the modifier, non-zero is
returned, causing the compilation to be aborted.
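
The guard logic amounts to a single comparison. The Python sketch below
models it; make_guard and compile_would_abort are hypothetical names,
and the toy scanner that counts unescaped parentheses is only an
assumption about when a compiler might call the guard. It ignores
character classes and is not how pcre2_compile() works internally.

```python
def make_guard(limit: int):
    """Build a guard like the local function described above."""
    def guard(depth: int) -> int:
        # Non-zero means "abort compilation".
        return 1 if depth > limit else 0
    return guard


def compile_would_abort(pattern: str, guard) -> bool:
    """Toy model: call the guard each time a parenthesis is opened."""
    depth = 0
    escaped = False
    for ch in pattern:
        if escaped:
            escaped = False          # the character after a backslash is literal
        elif ch == "\\":
            escaped = True
        elif ch == "(":
            depth += 1
            if guard(depth) != 0:
                return True
        elif ch == ")":
            depth -= 1
    return False


print(compile_would_abort("((a))", make_guard(1)))
print(compile_would_abort("((a))", make_guard(2)))
```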

Using alternative character tables

The value specified for the tables modifier must be one of the digits
0, 1, 2, or 3. It causes a specific set of built-in character tables to
be passed to pcre2_compile(). This is used in the PCRE2 tests to check
behaviour with different character tables. The digit specifies the
tables as follows:

  0   do not pass any special character tables
  1   the default ASCII tables, as distributed in
        pcre2_chartables.c.dist
  2   a set of tables defining ISO 8859 characters
  3   a set of tables loaded by the #loadtables command

In tables 2, some characters whose codes are greater than 128 are
identified as letters, digits, spaces, etc. Tables 3 can be used only
after a #loadtables command has loaded them from a binary file. Setting
alternate character tables and a locale are mutually exclusive.

Setting certain match controls

The following modifiers are really subject modifiers, and are described
under "Subject Modifiers" below. However, they may be included in a
pattern's modifier list, in which case they are applied to every
subject line that is processed with that pattern. These modifiers do
not affect the compilation process.

      aftertext                   show text after match
      allaftertext                show text after captures
      allcaptures                 show all captures
      allvector                   show the entire ovector
      allusedtext                 show all consulted text
      altglobal                   alternative global matching
  /g  global                      global matching
      jitstack=<n>                set size of JIT stack
      mark                        show mark values
      replace=<string>            specify a replacement string
      startchar                   show starting character when relevant
      substitute_callout          use substitution callouts
      substitute_extended         use PCRE2_SUBSTITUTE_EXTENDED
      substitute_literal          use PCRE2_SUBSTITUTE_LITERAL
      substitute_matched          use PCRE2_SUBSTITUTE_MATCHED
      substitute_overflow_length  use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
      substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
      substitute_skip=<n>         skip substitution <n>
      substitute_stop=<n>         skip substitution <n> and following
      substitute_unknown_unset    use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
      substitute_unset_empty      use PCRE2_SUBSTITUTE_UNSET_EMPTY

These modifiers may not appear in a #pattern command. If you want them
as defaults, set them in a #subject command.

Specifying literal subject lines

If the subject_literal modifier is present on a pattern, all the
subject lines that it matches are taken as literal strings, with no
interpretation of backslashes. It is not possible to set subject
modifiers on such lines, but any that are set as defaults by a #subject
command are recognized.

Saving a compiled pattern

When a pattern with the push modifier is successfully compiled, it is
pushed onto a stack of compiled patterns, and pcre2test expects the
next line to contain a new pattern (or a command) instead of a subject
line. This facility is used when saving compiled patterns to a file, as
described in the section entitled "Saving and restoring compiled
patterns" below. If pushcopy is used instead of push, a copy of the
compiled pattern is stacked, leaving the original as current, ready to
match the following input lines. This provides a way of testing the
pcre2_code_copy() function. The push and pushcopy modifiers are
incompatible with pattern modifiers such as global that act at match
time. Any that are specified are ignored (for the stacked copy), with a
warning message, except for replace, which causes an error. Note that
jitverify, which is allowed, does not carry through to any subsequent
matching that uses a stacked pattern.

Testing foreign pattern conversion

The experimental foreign pattern conversion functions in PCRE2 can be
tested by setting the convert modifier. Its argument is a
colon-separated list of options, which set the equivalent option for
the pcre2_pattern_convert() function:

  glob                    PCRE2_CONVERT_GLOB
  glob_no_starstar        PCRE2_CONVERT_GLOB_NO_STARSTAR
  glob_no_wild_separator  PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR
  posix_basic             PCRE2_CONVERT_POSIX_BASIC
  posix_extended          PCRE2_CONVERT_POSIX_EXTENDED
  unset                   Unset all options

The "unset" value is useful for turning off a default that has been set
by a #pattern command. When one of these options is set, the input
pattern is passed to pcre2_pattern_convert(). If the conversion is
successful, the result is reflected in the output and then passed to
pcre2_compile(). The normal utf and no_utf_check options, if set, cause
the PCRE2_CONVERT_UTF and PCRE2_CONVERT_NO_UTF_CHECK options to be
passed to pcre2_pattern_convert().

By default, the conversion function is allowed to allocate a buffer for
its output. However, if the convert_length modifier is set to a value
greater than zero, pcre2test passes a buffer of the given length. This
makes it possible to test the length check.

The convert_glob_escape and convert_glob_separator modifiers can be
used to specify the escape and separator characters for glob
processing, overriding the defaults, which are operating-system
dependent.

SUBJECT MODIFIERS

The modifiers that can appear in subject lines and the #subject command
are of two types.

Setting match options

The following modifiers set options for pcre2_match() or
pcre2_dfa_match(). See pcre2api for a description of their effects.

      anchored                  set PCRE2_ANCHORED
      endanchored               set PCRE2_ENDANCHORED
      dfa_restart               set PCRE2_DFA_RESTART
      dfa_shortest              set PCRE2_DFA_SHORTEST
      no_jit                    set PCRE2_NO_JIT
      no_utf_check              set PCRE2_NO_UTF_CHECK
      notbol                    set PCRE2_NOTBOL
      notempty                  set PCRE2_NOTEMPTY
      notempty_atstart          set PCRE2_NOTEMPTY_ATSTART
      noteol                    set PCRE2_NOTEOL
      partial_hard (or ph)      set PCRE2_PARTIAL_HARD
      partial_soft (or ps)      set PCRE2_PARTIAL_SOFT

The partial matching modifiers are provided with abbreviations because
they appear frequently in tests.

If the posix or posix_nosub modifier was present on the pattern,
causing the POSIX wrapper API to be used, the only option-setting
modifiers that have any effect are notbol, notempty, and noteol,
causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be
passed to regexec(). The other modifiers are ignored, with a warning
message.

There is one additional modifier that can be used with the POSIX
wrapper. It is ignored (with a warning) if used for non-POSIX matching.

      posix_startend=<n>[:<m>]

This causes the subject string to be passed to regexec() using the
REG_STARTEND option, which uses offsets to specify which part of the
string is searched. If only one number is given, the end offset is
passed as the end of the subject string. For more detail of
REG_STARTEND, see the pcre2posix documentation. If the subject string
contains binary zeros (coded as escapes such as \x{00} because
pcre2test does not support actual binary zeros in its input), you must
use posix_startend to specify its length.

Setting match controls

The following modifiers affect the matching process or request
additional information. Some of them may also be specified on a pattern
line (see above), in which case they apply to every subject line that
is matched against that pattern, but can be overridden by modifiers on
the subject.

      aftertext                  show text after match
      allaftertext               show text after captures
      allcaptures                show all captures
      allvector                  show the entire ovector
      allusedtext                show all consulted text (non-JIT only)
      altglobal                  alternative global matching
      callout_capture            show captures at callout time
      callout_data=<n>           set a value to pass via callouts
      callout_error=<n>[:<m>]    control callout error
      callout_extra              show extra callout information
      callout_fail=<n>[:<m>]     control callout failure
      callout_no_where           do not show position of a callout
      callout_none               do not supply a callout function
      copy=<number or name>      copy captured substring
      depth_limit=<n>            set a depth limit
      dfa                        use pcre2_dfa_match()
      find_limits                find match and depth limits
      get=<number or name>       extract captured substring
      getall                     extract all captured substrings
  /g  global                     global matching
      heap_limit=<n>             set a limit on heap memory (Kbytes)
      jitstack=<n>               set size of JIT stack
      mark                       show mark values
      match_limit=<n>            set a match limit
      memory                     show heap memory usage
      null_context               match with a NULL context
      null_replacement           substitute with NULL replacement
      null_subject               match with NULL subject
      offset=<n>                 set starting offset
      offset_limit=<n>           set offset limit
      ovector=<n>                set size of output vector
      recursion_limit=<n>        obsolete synonym for depth_limit
      replace=<string>           specify a replacement string
      startchar                  show startchar when relevant
      startoffset=<n>            same as offset=<n>
      substitute_callout         use substitution callouts
      substitute_extended        use PCRE2_SUBSTITUTE_EXTENDED
      substitute_literal         use PCRE2_SUBSTITUTE_LITERAL
      substitute_matched         use PCRE2_SUBSTITUTE_MATCHED
      substitute_overflow_length use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
      substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
      substitute_skip=<n>        skip substitution number n
      substitute_stop=<n>        skip substitution number n and greater
      substitute_unknown_unset   use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
      substitute_unset_empty     use PCRE2_SUBSTITUTE_UNSET_EMPTY
      zero_terminate             pass the subject as zero-terminated

The effects of these modifiers are described in the following sections.
When matching via the POSIX wrapper API, the aftertext, allaftertext,
and ovector subject modifiers work as described below. All other
modifiers are either ignored, with a warning message, or cause an
error.

Showing more text

The aftertext modifier requests that as well as outputting the part of
the subject string that matched the entire pattern, pcre2test should in
addition output the remainder of the subject string. This is useful for
tests where the subject contains multiple copies of the same substring.
The allaftertext modifier requests the same action for captured
substrings as well as the main matched substring. In each case the
remainder is output on the following line with a plus character
following the capture number.

The allusedtext modifier requests that all the text that was consulted
during a successful pattern match by the interpreter should be shown,
for both full and partial matches. This feature is not supported for
JIT matching, and if requested with JIT it is ignored (with a warning
message). Setting this modifier affects the output if there is a
lookbehind at the start of a match, or, for a complete match, a
lookahead at the end, or if \K is used in the pattern. Characters that
precede or follow the start and end of the actual match are indicated
in the output by '<' or '>' characters underneath them. Here is an
example:

    re> /(?<=pqr)abc(?=xyz)/
  data> 123pqrabcxyz456\=allusedtext
   0: pqrabcxyz
      <<<   >>>
  data> 123pqrabcxy\=ph,allusedtext
  Partial match: pqrabcxy
                 <<<

The first, complete match shows that the matched string is "abc", with
the preceding and following strings "pqr" and "xyz" having been
consulted during the match (when processing the assertions). The
partial match can indicate only the preceding string.
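
The marker line under the consulted text follows directly from three
lengths. The Python sketch below reproduces the layout for the two
results in the example; show_used_text is a hypothetical helper written
for this illustration, not part of pcre2test.

```python
def show_used_text(prefix: str, consulted: str, before: int, after: int = 0):
    """Return the output line and the '<' / '>' marker line beneath it.

    before: consulted characters preceding the actual match
    after:  consulted characters following the actual match
    """
    matched = len(consulted) - before - after
    markers = "<" * before + " " * matched + ">" * after
    # Markers are indented so that they sit under the consulted text.
    return [prefix + consulted, (" " * len(prefix) + markers).rstrip()]


for line in show_used_text(" 0: ", "pqrabcxyz", before=3, after=3):
    print(line)
for line in show_used_text("Partial match: ", "pqrabcxy", before=3):
    print(line)
```

For the partial match only the leading markers appear, because nothing
after the end of the partial match has been consulted.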

The startchar modifier requests that the starting character for the
match be indicated, if it is different to the start of the matched
string. The only time when this occurs is when \K has been processed as
part of the match. In this situation, the output for the matched string
is displayed from the starting character instead of from the match
point, with circumflex characters under the earlier characters. For
example:

    re> /abc\Kxyz/
  data> abcxyz\=startchar
   0: abcxyz
      ^^^

Unlike allusedtext, the startchar modifier can be used with JIT.
However, these two modifiers are mutually exclusive.

Showing the value of all capture groups

The allcaptures modifier requests that the values of all potential
captured parentheses be output after a match. By default, only those up
to the highest one actually used in the match are output (corresponding
to the return code from pcre2_match()). Groups that did not take part
in the match are output as "<unset>". This modifier is not relevant for
DFA matching (which does no capturing) and does not apply when replace
is specified; it is ignored, with a warning message, if present.

Showing the entire ovector, for all outcomes

The allvector modifier requests that the entire ovector be shown,
whatever the outcome of the match. Compare allcaptures, which shows
only up to the maximum number of capture groups for the pattern, and
then only for a successful complete non-DFA match. This modifier, which
acts after any match result, and also for DFA matching, provides a
means of checking that there are no unexpected modifications to ovector
fields. Before each match attempt, the ovector is filled with a special
value, and if this is found in both elements of a capturing pair,
"<unchanged>" is output. After a successful match, this applies to all
groups after the maximum capture group for the pattern. In other cases
it applies to the entire ovector. After a partial match, the first two
elements are the only ones that should be set. After a DFA match, the
amount of ovector that is used depends on the number of matches that
were found.

Testing pattern callouts

A callout function is supplied when pcre2test calls the library
matching functions, unless callout_none is specified. Its behaviour can
be controlled by various modifiers listed above whose names begin with
callout_. Details are given in the section entitled "Callouts" below.
Testing callouts from pcre2_substitute() is described separately in
"Testing the substitution function" below.

Finding all matches in a string

Searching for all possible matches within a subject can be requested by
the global or altglobal modifier. After finding a match, the matching
function is called again to search the remainder of the subject. The
difference between global and altglobal is that the former uses the
start_offset argument to pcre2_match() or pcre2_dfa_match() to start
searching at a new point within the entire string (which is what Perl
does), whereas the latter passes over a shortened subject. This makes a
difference to the matching process if the pattern begins with a
lookbehind assertion (including \b or \B).

If an empty string is matched, the next match is done with the
PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
for another, non-empty, match at the same point in the subject. If this
match fails, the start offset is advanced, and the normal match is
retried. This imitates the way Perl handles such cases when using the
/g modifier or the split() function. Normally, the start offset is
advanced by one character, but if the newline convention recognizes
CRLF as a newline, and the current character is CR followed by LF, an
advance of two characters occurs.
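
The rule for advancing the start offset after a failed retry can be
written as a small function. This Python sketch covers only that one
step, not the retry with PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED; the
function name and the boolean parameter are hypothetical.

```python
def next_start_after_empty_match(subject: str, offset: int,
                                 crlf_is_newline: bool) -> int:
    """Offset at which the normal match is retried after an empty match."""
    if crlf_is_newline and subject[offset:offset + 2] == "\r\n":
        return offset + 2    # step over CR and LF together
    return offset + 1        # otherwise advance by one character


print(next_start_after_empty_match("a\r\nb", 1, crlf_is_newline=True))
print(next_start_after_empty_match("a\r\nb", 1, crlf_is_newline=False))
```

Stepping over both characters avoids restarting the search between the
CR and the LF of a single newline.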

Testing substring extraction functions

The copy and get modifiers can be used to test the
pcre2_substring_copy_xxx() and pcre2_substring_get_xxx() functions.
They can be given more than once, and each can specify a capture group
name or number, for example:

   abcd\=copy=1,copy=3,get=G1

If the #subject command is used to set default copy and/or get lists,
these can be unset by specifying a negative number to cancel all
numbered groups and an empty name to cancel all named groups.

The getall modifier tests pcre2_substring_list_get(), which extracts
all captured substrings.

If the subject line is successfully matched, the substrings extracted
by the convenience functions are output with C, G, or L after the
string number instead of a colon. This is in addition to the normal
full list. The string length (that is, the return from the extraction
function) is given in parentheses after each substring, followed by the
name when the extraction was by name.

1280

1281

Testing the substitution function

1282

1283

If the replace modifier is set, the pcre2_substitute() function is

1284

called instead of one of the matching functions (or after one call of

1285

pcre2_match() in the case of PCRE2_SUBSTITUTE_MATCHED). Note that re-

1286

placement strings cannot contain commas, because a comma signifies the

1287

end of a modifier. This is not thought to be an issue in a test pro-

1288

gram.

1289

1290

Specifying a completely empty replacement string disables this modi-

1291

fier. However, it is possible to specify an empty replacement by pro-

1292

viding a buffer length, as described below, for an otherwise empty re-

1293

placement.

1294

1295

Unlike subject strings, pcre2test does not process replacement strings

1296

for escape sequences. In UTF mode, a replacement string is checked to

1297

see if it is a valid UTF-8 string. If so, it is correctly converted to

1298

a UTF string of the appropriate code unit width. If it is not a valid

1299

UTF-8 string, the individual code units are copied directly. This pro-

1300

vides a means of passing an invalid UTF-8 string for testing purposes.

1301

1302

The following modifiers set options (in addition to the normal match
options) for pcre2_substitute():

  global                       PCRE2_SUBSTITUTE_GLOBAL
  substitute_extended          PCRE2_SUBSTITUTE_EXTENDED
  substitute_literal           PCRE2_SUBSTITUTE_LITERAL
  substitute_matched           PCRE2_SUBSTITUTE_MATCHED
  substitute_overflow_length   PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
  substitute_replacement_only  PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
  substitute_unknown_unset     PCRE2_SUBSTITUTE_UNKNOWN_UNSET
  substitute_unset_empty       PCRE2_SUBSTITUTE_UNSET_EMPTY

See the pcre2api documentation for details of these options.

After a successful substitution, the modified string is output,
preceded by the number of replacements. This may be zero if there were
no matches. Here is a simple example of a substitution test:

  /abc/replace=xxx
      =abc=abc=
   1: =xxx=abc=
      =abc=abc=\=global
   2: =xxx=xxx=
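
For readers who know scripting-language regex APIs better than PCRE2, the difference between a single and a global substitution can be mirrored with Python's standard re module. This is only an analogy, not PCRE2 itself, and the helper name substitute is hypothetical.

```python
import re

def substitute(pattern, replacement, subject, global_=False):
    """Return (count, result), the same order as pcre2test's output line."""
    # count=0 asks re.subn for every match; count=1 stops after the first,
    # which mirrors a substitution without the global modifier.
    result, count = re.subn(pattern, replacement, subject,
                            count=0 if global_ else 1)
    return count, result

print(substitute("abc", "xxx", "=abc=abc="))                # (1, '=xxx=abc=')
print(substitute("abc", "xxx", "=abc=abc=", global_=True))  # (2, '=xxx=xxx=')
```

The two calls correspond to the two subject lines in the example above.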

Subject and replacement strings should be kept relatively short (fewer
than 256 characters) for substitution tests, as fixed-size buffers are
used. To make it easy to test for buffer overflow, if the replacement
string starts with a number in square brackets, that number is passed
to pcre2_substitute() as the size of the output buffer, with the
replacement string starting at the next character. Here is an example
that tests the edge case:

  /abc/
      123abc123\=replace=[10]XYZ
   1: 123XYZ123
      123abc123\=replace=[9]XYZ
  Failed: error -47: no more memory

The default action of pcre2_substitute() is to return
PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
substitute_overflow_length modifier), pcre2_substitute() continues to
go through the motions of matching and substituting (but not doing any
callouts), in order to compute the size of buffer that is required.
When this happens, pcre2test shows the required buffer length (which
includes space for the trailing zero) as part of the error message.
For example:

  /abc/substitute_overflow_length
      123abc123\=replace=[9]XYZ
  Failed: error -47: no more memory: 10 code units are needed
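
The buffer arithmetic behind the two examples above can be sketched in Python. This models only the size check (result length plus one code unit for the trailing zero). Both function names are hypothetical, and the messages are abbreviated rather than copied from pcre2test.

```python
import re

def split_buffer_prefix(replacement):
    """Split '[n]text' into (n, 'text'); return (None, text) without a prefix."""
    m = re.match(r"\[(\d+)\](.*)", replacement, re.S)
    if m is None:
        return None, replacement
    return int(m.group(1)), m.group(2)

def check_output_buffer(result, buffer_units, overflow_length=False):
    """Model the size check only; `result` is the already-substituted text."""
    needed = len(result) + 1          # code units plus the trailing zero
    if needed <= buffer_units:
        return result
    if overflow_length:
        return f"no more memory: {needed} code units are needed"
    return "no more memory"

size, text = split_buffer_prefix("[9]XYZ")
result = "123abc123".replace("abc", text)                     # "123XYZ123"
print(check_output_buffer(result, 10))                        # 123XYZ123
print(check_output_buffer(result, size))                      # no more memory
print(check_output_buffer(result, size, overflow_length=True))
```

The substituted string is 9 code units long, so a buffer of 10 is the smallest that succeeds.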

A replacement string is ignored with POSIX and DFA matching.
Specifying partial matching provokes an error return ("bad option
value") from pcre2_substitute().

Testing substitute callouts

If the substitute_callout modifier is set, a substitution callout
function is set up. The null_context modifier must not be set, because
the address of the callout function is passed in a match context. When
the callout function is called (after each substitution), details of
the input and output strings are output. For example:

  /abc/g,replace=<$0>,substitute_callout
      abcdefabcpqr
   1(1) Old 0 3 "abc" New 0 5 "<abc>"
   2(1) Old 6 9 "abc" New 8 13 "<abc>"
   2: <abc>def<abc>pqr

The first number on each callout line is the count of matches. The
parenthesized number is the number of pairs that are set in the ovector
(that is, one more than the number of capturing groups that were set).
The offsets of the old substring and its contents are listed next,
followed by the same for the replacement.
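
The "Old" and "New" offsets can be reproduced with a short Python sketch. It uses Python's re module as a stand-in for PCRE2 and handles only the $0 placeholder. The function name callout_lines is hypothetical.

```python
import re

def callout_lines(pattern, template, subject):
    """Return (match_no, old_start, old_end, new_start, new_end, new_text)."""
    out_len = 0      # length of the output string built so far
    last = 0         # end of the previous match in the subject
    lines = []
    for n, m in enumerate(re.finditer(pattern, subject), start=1):
        out_len += m.start() - last            # unmatched text is copied through
        new_text = template.replace("$0", m.group(0))
        lines.append((n, m.start(), m.end(),
                      out_len, out_len + len(new_text), new_text))
        out_len += len(new_text)
        last = m.end()
    return lines

for line in callout_lines("abc", "<$0>", "abcdefabcpqr"):
    print(line)
# (1, 0, 3, 0, 5, '<abc>')
# (2, 6, 9, 8, 13, '<abc>')
```

The second replacement starts at offset 8 rather than 6 because the first replacement made the output two code units longer than the input.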

By default, the substitution callout function returns zero, which
accepts the replacement and causes matching to continue if /g was used.
Two further modifiers can be used to test other return values. If
substitute_skip is set to a value greater than zero, the callout
function returns +1 for the match of that number, and similarly
substitute_stop returns -1. Both cause the replacement to be rejected,
and -1 also causes no further matching to take place. If either of
them is set, substitute_callout is assumed. For example:

  /abc/g,replace=<$0>,substitute_skip=1
      abcdefabcpqr
   1(1) Old 0 3 "abc" New 0 5 "<abc> SKIPPED"
   2(1) Old 6 9 "abc" New 6 11 "<abc>"
   2: abcdef<abc>pqr
      abcdefabcpqr\=substitute_stop=1
   1(1) Old 0 3 "abc" New 0 5 "<abc> STOPPED"
   1: abcdefabcpqr

If both are set for the same number, stop takes precedence. Only a
single skip or stop is supported, which is sufficient for testing that
the feature works.
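
The effect of the two return values can be sketched in Python, again with the re module standing in for PCRE2 and only $0 supported in the template. The function name is hypothetical. Counting a rejected match in the total is inferred from the example output above, not from a stated rule.

```python
import re

def substitute_with_callout(pattern, template, subject, skip=0, stop=0):
    """Global substitution where match number `skip` or `stop` is rejected."""
    out, last, count = [], 0, 0
    for m in re.finditer(pattern, subject):
        count += 1
        out.append(subject[last:m.start()])    # text between matches
        last = m.end()
        if count == stop:                      # callout returns -1; stop wins
            out.append(m.group(0))             # keep the original text
            break                              # and do no further matching
        if count == skip:                      # callout returns +1
            out.append(m.group(0))             # keep the original text
        else:
            out.append(template.replace("$0", m.group(0)))
    out.append(subject[last:])
    return count, "".join(out)

print(substitute_with_callout("abc", "<$0>", "abcdefabcpqr", skip=1))
# (2, 'abcdef<abc>pqr')
print(substitute_with_callout("abc", "<$0>", "abcdefabcpqr", stop=1))
# (1, 'abcdefabcpqr')
```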

Setting the JIT stack size

The jitstack modifier provides a way of setting the maximum stack size
that is used by the just-in-time optimization code. It is ignored if
JIT optimization is not being used. The value is a number of kibibytes
(units of 1024 bytes). Setting zero reverts to the default of 32KiB.
Providing a stack that is larger than the default is necessary only for
very complicated patterns. If jitstack is set non-zero on a subject
line it overrides any value that was set on the pattern.

Setting heap, match, and depth limits

The heap_limit, match_limit, and depth_limit modifiers set the
appropriate limits in the match context. These values are ignored when
the find_limits modifier is specified.

Finding minimum limits

If the find_limits modifier is present on a subject line, pcre2test
calls the relevant matching function several times, setting different
values in the match context via pcre2_set_heap_limit(),
pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the
minimum value of each parameter that allows the match to complete
without error. If JIT is being used, only the match limit is relevant.
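
This document does not spell out how the repeated calls are organized. One plausible approach is to double the limit until the match completes and then bisect. The Python sketch below illustrates that idea under the assumption that success is monotonic in the limit. The names find_minimum_limit and succeeds are hypothetical, and succeeds stands in for "run the match with this limit and report whether it finished without a limit error".

```python
def find_minimum_limit(succeeds, start=0, ceiling=2**32 - 1):
    """Smallest limit in [start, ceiling] for which succeeds(limit) is true."""
    low = high = start
    while not succeeds(high):                  # grow until the match completes
        if high >= ceiling:
            raise ValueError("no limit up to the ceiling is sufficient")
        low = high + 1                         # everything up to `high` failed
        high = min(max(1, high * 2), ceiling)
    while low < high:                          # bisect between failure and success
        mid = (low + high) // 2
        if succeeds(mid):
            high = mid
        else:
            low = mid + 1
    return high

# A stand-in for a match that needs a limit of at least 37.
print(find_minimum_limit(lambda limit: limit >= 37))   # 37
```

The number of trial matches grows only logarithmically with the answer, which matters because each trial is a complete match attempt.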

When using this modifier, the pattern should not contain any limit
settings such as (*LIMIT_MATCH=...) within it. If such a setting is
present and is lower than the minimum matching value, the minimum value
cannot be found because pcre2_set_match_limit() etc. are only able to
reduce the value of an in-pattern limit; they cannot increase it.

For non-DFA matching, the minimum depth_limit number is a measure of
how much nested backtracking happens (that is, how deeply the pattern's
tree is searched). In the case of DFA matching, depth_limit controls
the depth of recursive calls of the internal function that is used for
handling pattern recursion, lookaround assertions, and atomic groups.

For non-DFA matching, the match_limit number is a measure of the amount
of backtracking that takes place, and learning the minimum value can be
instructive. For most simple matches, the number is quite small, but
for patterns with very large numbers of matching possibilities, it can
become large very quickly with increasing length of subject string. In
the case of DFA matching, match_limit controls the total number of
calls, both recursive and non-recursive, to the internal matching
function, thus controlling the overall amount of computing resource
that is used.

For both kinds of matching, the heap_limit number, which is in
kibibytes (units of 1024 bytes), limits the amount of heap memory used
for matching. A value of zero disables the use of any heap memory; many
simple pattern matches can be done without using the heap, so zero is
not an unreasonable setting.

Showing MARK names

The mark modifier causes the names from backtracking control verbs that
are returned from calls to pcre2_match() to be displayed. If a mark is
returned for a match, non-match, or partial match, pcre2test shows it.
For a match, it is on a line by itself, tagged with "MK:". Otherwise,
it is added to the non-match message.

Showing memory usage

The memory modifier causes pcre2test to log the sizes of all heap
memory allocation and freeing calls that occur during a call to
pcre2_match() or pcre2_dfa_match(). These occur only when a match
requires a bigger vector than the default for remembering backtracking
points (pcre2_match()) or for internal workspace (pcre2_dfa_match()).
In many cases there will be no heap memory used and therefore no
additional output. No heap memory is allocated during matching with
JIT, so in that case the memory modifier never has any effect. For this
modifier to work, the null_context modifier must not be set on both the
pattern and the subject, though it can be set on one or the other.

Setting a starting offset

The offset modifier sets an offset in the subject string at which
matching starts. Its value is a number of code units, not characters.
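
The distinction between code units and characters matters as soon as the subject contains anything outside ASCII. The following Python sketch computes the code unit offset of a given character for each of the three library widths. The function name code_unit_offset is hypothetical.

```python
def code_unit_offset(subject, char_index, width=8):
    """Offset, in code units, of the character at `char_index`."""
    # The -le variants avoid a byte order mark being added to the output.
    encoding = {8: "utf-8", 16: "utf-16-le", 32: "utf-32-le"}[width]
    prefix = subject[:char_index].encode(encoding)
    return len(prefix) // (width // 8)

s = "a\u00e9\U0001F600b"      # a, e-acute, a character outside the BMP, b
print(code_unit_offset(s, 3, 8))    # 7  (1 + 2 + 4 bytes precede "b")
print(code_unit_offset(s, 3, 16))   # 4  (the third character is a surrogate pair)
print(code_unit_offset(s, 3, 32))   # 3
```

Only in the 32-bit library do code unit offsets and character offsets always coincide.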

Setting an offset limit

The offset_limit modifier sets a limit for unanchored matches. If a
match cannot be found starting at or before this offset in the subject,
a "no match" return is given. The data value is a number of code units,
not characters. When this modifier is used, the use_offset_limit
modifier must have been set for the pattern; if not, an error is
generated.
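
The effect of an offset limit can be imitated in Python: because an unanchored search returns the leftmost match, it is enough to check where that match starts. This is an analogy using the re module, whose offsets count characters rather than code units. The function name is hypothetical.

```python
import re

def search_with_offset_limit(pattern, subject, offset_limit, start=0):
    """Leftmost match that starts at or before `offset_limit`, else None."""
    m = re.compile(pattern).search(subject, start)
    if m is None or m.start() > offset_limit:
        return None          # corresponds to a "no match" return
    return m

print(search_with_offset_limit("abc", "xyzabc", 3).group(0))   # abc
print(search_with_offset_limit("abc", "xyzabc", 2))            # None
```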

Setting the size of the output vector

The ovector modifier applies only to the subject line in which it
appears, though of course it can also be used to set a default in a
#subject command. It specifies the number of pairs of offsets that are
available for storing matching information. The default is 15.

A value of zero is useful when testing the POSIX API because it causes
regexec() to be called with a NULL capture vector. When not testing the
POSIX API, a value of zero is used to cause
pcre2_match_data_create_from_pattern() to be called, in order to create
a match block of exactly the right size for the pattern. (It is not
possible to create a match block with a zero-length ovector; there is
always at least one pair of offsets.)

Passing the subject as zero-terminated

By default, the subject string is passed to a native API matching
function with its correct length. In order to test the facility for
passing a zero-terminated string, the zero_terminate modifier is
provided. It causes the length to be passed as PCRE2_ZERO_TERMINATED.
When matching via the POSIX interface, this modifier is ignored, with a
warning.

When testing pcre2_substitute(), this modifier also has the effect of
passing the replacement string as zero-terminated.


Passing a NULL context, subject, or replacement


Normally, pcre2test passes a context block to pcre2_match(),
pcre2_dfa_match(), pcre2_jit_match() or pcre2_substitute(). If the
null_context modifier is set, however, NULL is passed. This is for
testing that the matching and substitution functions behave correctly
in this case (they use default values). This modifier cannot be used
with the find_limits or substitute_callout modifiers.


Similarly, for testing purposes, if the null_subject or
null_replacement modifier is set, the subject or replacement string
pointers are passed as NULL, respectively, to the relevant functions.


THE ALTERNATIVE MATCHING FUNCTION

By default, pcre2test uses the standard PCRE2 matching function,
pcre2_match() to match each subject line. PCRE2 also supports an
alternative matching function, pcre2_dfa_match(), which operates in a
different way, and has some restrictions. The differences between the
two functions are described in the pcre2matching documentation.

If the dfa modifier is set, the alternative matching function is used.
This function finds all possible matches at a given point in the
subject. If, however, the dfa_shortest modifier is set, processing
stops after the first match is found. This is always the shortest
possible match.

DEFAULT OUTPUT FROM pcre2test

This section describes the output when the normal matching function,
pcre2_match(), is being used.

When a match succeeds, pcre2test outputs the list of captured
substrings, starting with number 0 for the string that matched the
whole pattern. Otherwise, it outputs "No match" when the return is
PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
this is the entire substring that was inspected during the partial
match; it may include characters before the actual match start if a
lookbehind assertion, \K, \b, or \B was involved.)

For any other return, pcre2test outputs the PCRE2 negative error number
and a short descriptive phrase. If the error is a failed UTF string
check, the code unit offset of the start of the failing character is
also output. Here is an example of an interactive pcre2test run.

  $ pcre2test
  PCRE2 version 10.22 2016-07-29

    re> /^abc(\d+)/
  data> abc123
   0: abc123
   1: 123
  data> xyz
  No match

Unset capturing substrings that are not followed by one that is set are
not shown by pcre2test unless the allcaptures modifier is specified. In
the following example, there are two capturing substrings, but when the
first data line is matched, the second, unset substring is not shown.
An "internal" unset substring is shown as "<unset>", as for the second
data line.

    re> /(a)|(b)/
  data> a
   0: a
   1: a
  data> b
   0: b
   1: <unset>
   2: b
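
The rule for which capture lines appear can be sketched in Python, using the re module in place of PCRE2. The function name format_captures is hypothetical. Showing trailing unset groups as "<unset>" when allcaptures is on is an assumption of this sketch.

```python
import re

def format_captures(pattern, subject, allcaptures=False):
    """Return output lines in the style of the example above."""
    m = re.search(pattern, subject)
    if m is None:
        return ["No match"]
    groups = [m.group(0), *m.groups()]         # unset groups are None
    if not allcaptures:
        while groups[-1] is None:              # drop trailing unset groups;
            groups.pop()                       # group 0 is always set
    return [f"{i:2d}: {'<unset>' if g is None else g}"
            for i, g in enumerate(groups)]

print(format_captures(r"(a)|(b)", "a"))   # [' 0: a', ' 1: a']
print(format_captures(r"(a)|(b)", "b"))   # [' 0: b', ' 1: <unset>', ' 2: b']
```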

If the strings contain any non-printing characters, they are output as
\xhh escapes if the value is less than 256 and UTF mode is not set.
Otherwise they are output as \x{hh...} escapes. See below for the
definition of non-printing characters. If the aftertext modifier is
set, the output for substring 0 is followed by the rest of the subject
string, identified by "0+" like this:

    re> /cat/aftertext
  data> cataract
   0: cat
   0+ aract

If global matching is requested, the results of successive matching
attempts are output in sequence, like this:

    re> /\Bi(\w\w)/g
  data> Mississippi
   0: iss
   1: ss
   0: iss
   1: ss
   0: ipp
   1: pp

"No match" is output only if the first match attempt fails. Here is an
example of a failure message (the offset 4 that is specified by the
offset modifier is past the end of the subject string):

    re> /xyz/
  data> xyz\=offset=4
  Error -24 (bad offset value)

Note that whereas patterns can be continued over several lines (a plain
">" prompt is used for continuations), subject lines may not. However,
newlines can be included in a subject by means of the \n escape (or \r,
\r\n, etc., depending on the newline sequence setting).

OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION

When the alternative matching function, pcre2_dfa_match(), is used, the
output consists of a list of all the matches that start at the first
point in the subject where there is at least one match. For example:

    re> /(tang|tangerine|tan)/
  data> yellow tangerine\=dfa
   0: tangerine
   1: tang
   2: tan

Using the normal matching function on this data finds only "tang". The
longest matching string is always given first (and numbered zero).
After a PCRE2_ERROR_PARTIAL return, the output is "Partial match:",
followed by the partially matching substring. Note that this is the
entire substring that was inspected during the partial match; it may
include characters before the actual match start if a lookbehind
assertion, \b, or \B was involved. (\K is not supported for DFA
matching.)

If global matching is requested, the search for further matches resumes
at the end of the longest match. For example:

    re> /(tang|tangerine|tan)/g
  data> yellow tangerine and tangy sultana\=dfa
   0: tangerine
   1: tang
   2: tan
   0: tang
   1: tan
   0: tan
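
The shape of this output can be reproduced by brute force in Python: at each starting point, try every possible end point, longest first. This only emulates the results. It is not how pcre2_dfa_match() works internally, and both function names are hypothetical.

```python
import re

def matches_at(pattern, subject, start):
    """All matches that begin at `start`, longest first."""
    compiled = re.compile(pattern)
    return [subject[start:end]
            for end in range(len(subject), start - 1, -1)
            if compiled.fullmatch(subject, start, end)]

def dfa_style_global(pattern, subject):
    """Groups of matches, resuming after the longest match of each group."""
    results, pos = [], 0
    while pos <= len(subject):
        found = matches_at(pattern, subject, pos)
        if found:
            results.append(found)
            pos += max(len(found[0]), 1)   # always advance, even on empty matches
        else:
            pos += 1
    return results

pattern = r"(tang|tangerine|tan)"
print(dfa_style_global(pattern, "yellow tangerine and tangy sultana"))
# [['tangerine', 'tang', 'tan'], ['tang', 'tan'], ['tan']]
print(re.search(pattern, "yellow tangerine").group(0))   # tang
```

The last line shows the contrast with a backtracking matcher, which stops at the first alternative that succeeds.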

The alternative matching function does not support substring capture,
so the modifiers that are concerned with captured substrings are not
relevant.

RESTARTING AFTER A PARTIAL MATCH

When the alternative matching function has given the
PCRE2_ERROR_PARTIAL return, indicating that the subject partially
matched the pattern, you can restart the match with additional subject
data by means of the dfa_restart modifier. For example:

    re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data> 23ja\=ps,dfa
  Partial match: 23ja
  data> n05\=dfa,dfa_restart
   0: n05

For further information about partial matching, see the pcre2partial
documentation.

CALLOUTS

If the pattern contains any callout requests, pcre2test's callout
function is called during matching unless callout_none is specified.
This works with both matching functions, and with JIT, though there are
some differences in behaviour. The output for callouts with numerical
arguments and those with string arguments is slightly different.

Callouts with numerical arguments

By default, the callout function displays the callout number, the start
and current positions in the subject text at the callout time, and the
next pattern item to be tested. For example:

  --->pqrabcdef
    0    ^  ^       \d

This output indicates that callout number 0 occurred for a match
attempt starting at the fourth character of the subject string, when
the pointer was at the seventh character, and when the next pattern
item was \d. Just one circumflex is output if the start and current
positions are the same, or if the current position precedes the start
position, which can happen if the callout is in a lookbehind assertion.
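
To make the position indicators easier to read, the following Python sketch builds a marker line for a given start and current offset. The function name position_line is hypothetical. The width of the gap before the next pattern item, and where the single circumflex goes in the lookbehind case, are assumptions of this sketch rather than documented behaviour.

```python
def position_line(label, start, current, subject_length, next_item=""):
    """Build the marker line that sits under a '--->subject' line."""
    prefix = f"{label:>3} "                 # lines up with the 4-character "--->"
    if current > start:
        markers = " " * start + "^" + " " * (current - start - 1) + "^"
    else:
        # Same position, or current before start (lookbehind): one circumflex.
        # Placing it at the smaller offset is an assumption of this sketch.
        markers = " " * min(start, current) + "^"
    gap = " " * (subject_length - max(start, current) + 4)   # assumed spacing
    return (prefix + markers + gap + next_item).rstrip()

print("--->pqrabcdef")
print(position_line("0", 3, 6, 9, r"\d"))
```

Each circumflex lands directly under the subject character at that offset: here under "a" (offset 3) and "d" (offset 6).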

Callouts numbered 255 are assumed to be automatic callouts, inserted as
a result of the auto_callout pattern modifier. In this case, instead of
showing the callout number, the offset in the pattern, preceded by a
plus, is output. For example:

    re> /\d?[A-E]\*/auto_callout
  data> E*
  --->E*
   +0 ^      \d?
   +3 ^      [A-E]
   +8 ^^     \*
  +10 ^ ^
   0: E*

If a pattern contains (*MARK) items, an additional line is output
whenever a change of latest mark is passed to the callout function. For
example:

    re> /a(*MARK:X)bc/auto_callout
  data> abc
  --->abc
   +0 ^       a
   +1 ^^      (*MARK:X)
  +10 ^^      b
  Latest Mark: X
  +11 ^ ^     c
  +12 ^  ^
   0: abc

The mark changes between matching "a" and "b", but stays the same for
the rest of the match, so nothing more is output. If, as a result of
backtracking, the mark reverts to being unset, the text "<unset>" is
output.

Callouts with string arguments

The output for a callout with a string argument is similar, except that
instead of outputting a callout number before the position indicators,
the callout string and its offset in the pattern string are output
before the reflection of the subject string, and the subject string is
reflected for each callout. For example:

    re> /^ab(?C'first')cd(?C"second")ef/
  data> abcdefg
  Callout (7): 'first'
  --->abcdefg
      ^ ^         c
  Callout (20): "second"
  --->abcdefg
      ^   ^       e
   0: abcdef

Callout modifiers

The callout function in pcre2test returns zero (carry on matching) by
default, but you can use a callout_fail modifier in a subject line to
change this and other parameters of the callout (see below).

If the callout_capture modifier is set, the current captured groups are
output when a callout occurs. This is useful only for non-DFA matching,
as pcre2_dfa_match() does not support capturing, so no captures are
ever shown.

The normal callout output, showing the callout number or pattern offset
(as described above) is suppressed if the callout_no_where modifier is
set.

When using the interpretive matching function pcre2_match() without
JIT, setting the callout_extra modifier causes additional output from
pcre2test's callout function to be generated. For the first callout in
a match attempt at a new starting position in the subject, "New match
attempt" is output. If there has been a backtrack since the last
callout (or start of matching if this is the first callout),
"Backtrack" is output, followed by "No other matching paths" if the
backtrack ended the previous match attempt. For example:

    re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess
  data> aac\=callout_extra
  New match attempt
  --->aac
   +0 ^       (
   +1 ^       a+
   +3 ^ ^     )
   +4 ^ ^     b
  Backtrack
  --->aac
   +3 ^^      )
   +4 ^^      b
  Backtrack
  No other matching paths
  New match attempt
  --->aac
   +0  ^      (
   +1  ^      a+
   +3  ^^     )
   +4  ^^     b
  Backtrack
  No other matching paths
  New match attempt
  --->aac
   +0   ^     (
   +1   ^     a+
  Backtrack
  No other matching paths
  New match attempt
  --->aac
   +0    ^    (
   +1    ^    a+
  No match

Notice that various optimizations must be turned off if you want all
possible matching paths to be scanned. If no_start_optimize is not
used, there is an immediate "no match", without any callouts, because
the starting optimization fails to find "b" in the subject, which it
knows must be present for any match. If no_auto_possess is not used,
the "a+" item is turned into "a++", which reduces the number of
backtracks.

The callout_extra modifier has no effect if used with the DFA matching
function, or with JIT.

Return values from callouts

The default return from the callout function is zero, which allows
matching to continue. The callout_fail modifier can be given one or two
numbers. If there is only one number, 1 is returned instead of 0
(causing matching to backtrack) when a callout of that number is
reached. If two numbers (<n>:<m>) are given, 1 is returned when callout
<n> is reached and there have been at least <m> callouts. The
callout_error modifier is similar, except that PCRE2_ERROR_CALLOUT is
returned, causing the entire matching process to be aborted. If both
these modifiers are set for the same callout number, callout_error
takes precedence. Note that callouts with string arguments are always
given the number zero.
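
The decision described above can be written as a small Python function. This is a sketch of the rules as stated in this section, not code taken from pcre2test. The function name and the symbolic return string are hypothetical. Treating a single number as <n>:0, and counting the current callout in the total, are assumptions of this sketch.

```python
def callout_return(number, count, fail=None, error=None):
    """Return value of the test callout function.

    number: the callout's number (0 for callouts with string arguments)
    count:  how many callouts have happened so far, including this one
    fail, error: (n, m) pairs from callout_fail / callout_error, or None
    """
    def triggered(setting):
        if setting is None:
            return False
        n, m = setting
        return number == n and count >= m

    if triggered(error):               # callout_error takes precedence
        return "PCRE2_ERROR_CALLOUT"   # aborts the entire match
    if triggered(fail):
        return 1                       # forces matching to backtrack
    return 0                           # carry on matching

print(callout_return(2, 1, fail=(2, 0)))                # 1
print(callout_return(2, 1, fail=(2, 3)))                # 0
print(callout_return(2, 3, fail=(2, 3), error=(2, 0)))  # PCRE2_ERROR_CALLOUT
```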

The callout_data modifier can be given an unsigned or a negative
number. This is set as the "user data" that is passed to the matching
function, and passed back when the callout function is invoked. Any
value other than zero is used as a return from pcre2test's callout
function.

Inserting callouts can be helpful when using pcre2test to check
complicated regular expressions. For further information about
callouts, see the pcre2callout documentation.

NON-PRINTING CHARACTERS

When pcre2test is outputting text in the compiled version of a pattern,
bytes other than 32-126 are always treated as non-printing characters
and are therefore shown as hex escapes.

When pcre2test is outputting text that is a matched part of a subject
string, it behaves in the same way, unless a different locale has been
set for the pattern (using the locale modifier). In this case, the
isprint() function is used to distinguish printing and non-printing
characters.
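
Combining the 32-126 rule with the \xhh and \x{hh...} escape forms described under DEFAULT OUTPUT gives the following Python sketch. The function name show_char and the optional isprint parameter (standing in for a locale-specific test) are hypothetical. The use of lowercase hex digits, at least two of them, is an assumption.

```python
def show_char(value, utf=False, isprint=None):
    """Render one character value the way the rules above describe."""
    printable = isprint(value) if isprint else 32 <= value <= 126
    if printable:
        return chr(value)
    if value < 256 and not utf:
        return f"\\x{value:02x}"          # \xhh form
    return f"\\x{{{value:02x}}}"          # \x{hh...} form

print(show_char(0x41))              # A
print(show_char(0x0A))              # \x0a
print(show_char(0x0A, utf=True))    # \x{0a}
print(show_char(0x100))             # \x{100}
```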

SAVING AND RESTORING COMPILED PATTERNS

It is possible to save compiled patterns on disc or elsewhere, and
reload them later, subject to a number of restrictions. JIT data cannot
be saved. The host on which the patterns are reloaded must be running
the same version of PCRE2, with the same code unit width, and must also
have the same endianness, pointer width and PCRE2_SIZE type. Before
compiled patterns can be saved they must be serialized, that is,
converted to a stream of bytes. A single byte stream may contain any
number of compiled patterns, but they must all use the same character
tables. A single copy of the tables is included in the byte stream (its
size is 1088 bytes).

The functions whose names begin with pcre2_serialize_ are used for
serializing and de-serializing. They are described in the
pcre2serialize documentation. In this section we describe the features
of pcre2test that can be used to test these functions.

Note that "serialization" in PCRE2 does not convert compiled patterns
to an abstract format like Java or .NET. It just makes a reloadable
byte code stream. Hence the restrictions on reloading mentioned above.

In pcre2test, when a pattern with the push modifier is successfully
compiled, it is pushed onto a stack of compiled patterns, and pcre2test
expects the next line to contain a new pattern (or command) instead of
a subject line. By contrast, the pushcopy modifier causes a copy of the
compiled pattern to be stacked, leaving the original available for
immediate matching. By using push and/or pushcopy, a number of patterns
can be compiled and retained. These modifiers are incompatible with
posix, and control modifiers that act at match time are ignored (with a
message) for the stacked patterns. The jitverify modifier applies only
at compile time.

The command

  #save <filename>

causes all the stacked patterns to be serialized and the result written
to the named file. Afterwards, all the stacked patterns are freed. The
command

  #load <filename>

reads the data in the file, and then arranges for it to be
de-serialized, with the resulting compiled patterns added to the
pattern stack. The pattern on the top of the stack can be retrieved by
the #pop command, which must be followed by lines of subjects that are
to be matched with the pattern, terminated as usual by an empty line or
end of file. This command may be followed by a modifier list containing
only control modifiers that act after a pattern has been compiled. In
particular, hex, posix, posix_nosub, push, and pushcopy are not
allowed, nor are any option-setting modifiers. The JIT modifiers are,
however, permitted. Here is an example that saves and reloads two
patterns.

  /abc/push
  /xyz/push
  #save tempfile
  #load tempfile
  #pop info
  xyz

  #pop jit,bincode
  abc

If jitverify is used with #pop, it does not automatically imply jit,
which is different behaviour from when it is used on a pattern.

The #popcopy command is analogous to the pushcopy modifier in that it
makes current a copy of the topmost stack pattern, leaving the original
still on the stack.
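
The stack behaviour of push, #save, #load, #pop, and #popcopy can be modelled in a few lines of Python. This is a toy model only: pattern source strings stand in for compiled patterns, a dictionary stands in for the file system, and JSON stands in for the PCRE2 byte stream. The class and method names are hypothetical.

```python
import json

class PatternStack:
    """Toy model of pcre2test's stack of compiled patterns."""

    def __init__(self):
        self.stack = []

    def push(self, pattern):
        self.stack.append(pattern)

    def save(self, store, name):
        store[name] = json.dumps(self.stack)   # serialize every stacked pattern
        self.stack.clear()                     # the stacked patterns are freed

    def load(self, store, name):
        self.stack.extend(json.loads(store[name]))

    def pop(self):
        return self.stack.pop()                # removes the top pattern

    def popcopy(self):
        return self.stack[-1]                  # the original stays on the stack

files = {}
patterns = PatternStack()
patterns.push("abc")
patterns.push("xyz")
patterns.save(files, "tempfile")
patterns.load(files, "tempfile")
print(patterns.pop())       # xyz
print(patterns.popcopy())   # abc
print(patterns.pop())       # abc
```

As in the example above, the pattern pushed last is the first one to be popped after reloading.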