README file for PCRE2 (Perl-compatible regular expression library)
------------------------------------------------------------------

PCRE2 is a re-working of the original PCRE1 library to provide an entirely new
API. Since its initial release in 2015, there has been further development of
the code and it now differs from PCRE1 in more than just the API. There are new
features, and the internals have been improved. The original PCRE1 library is
now obsolete and should not be used in new projects. The latest release of
PCRE2 is available in .tar.gz, .tar.bz2, or .zip form from this GitHub
repository:

https://github.com/PhilipHazel/pcre2/releases

There is a mailing list for discussion about the development of PCRE2 at
pcre2-dev@googlegroups.com. You can subscribe by sending an email to
pcre2-dev+subscribe@googlegroups.com.

You can access the archives and also subscribe or manage your subscription
here:

https://groups.google.com/pcre2-dev

Please read the NEWS file if you are upgrading from a previous release. The
contents of this README file are:

  The PCRE2 APIs
  Documentation for PCRE2
  Contributions by users of PCRE2
  Building PCRE2 on non-Unix-like systems
  Building PCRE2 without using autotools
  Building PCRE2 using autotools
  Retrieving configuration information
  Shared libraries
  Cross-compiling using autotools
  Making new tarballs
  Testing PCRE2
  Character tables
  File manifest


The PCRE2 APIs
--------------

PCRE2 is written in C, and it has its own API. There are three sets of
functions, one for the 8-bit library, which processes strings of bytes, one for
the 16-bit library, which processes strings of 16-bit values, and one for the
32-bit library, which processes strings of 32-bit values. Unlike PCRE1, there
are no C++ wrappers.

The distribution does contain a set of C wrapper functions for the 8-bit
library that are based on the POSIX regular expression API (see the pcre2posix
man page). These are built into a library called libpcre2-posix. Note that this
just provides a POSIX calling interface to PCRE2; the regular expressions
themselves still follow Perl syntax and semantics. The POSIX API is restricted,
and does not give full access to all of PCRE2's facilities.

The header file for the POSIX-style functions is called pcre2posix.h. The
official POSIX name is regex.h, but I did not want to risk possible problems
with existing files of that name by distributing it that way. To use PCRE2 with
an existing program that uses the POSIX API, pcre2posix.h will have to be
renamed or pointed at by a link (or the program modified, of course). See the
pcre2posix documentation for more details.

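One possible way to take the "pointed at by a link" route without touching any
system header is to put a link named regex.h in a private include directory.
The following is only a sketch: it assumes the default /usr/local installation
prefix, and the directory name pcre2-regex-shim is a hypothetical choice made
for this example.

```shell
#!/bin/sh
# Sketch: expose pcre2posix.h under the name regex.h via a symbolic link.
# PREFIX is assumed to be the PCRE2 installation prefix (default /usr/local).
# SHIM_DIR is a hypothetical private directory used only in this example.
PREFIX="${PREFIX:-/usr/local}"
SHIM_DIR="${SHIM_DIR:-./pcre2-regex-shim}"

mkdir -p "$SHIM_DIR"
# -f replaces a link left over from an earlier run. The link is created even
# if the target header has not been installed yet.
ln -sf "$PREFIX/include/pcre2posix.h" "$SHIM_DIR/regex.h"

# A program that includes <regex.h> could then be compiled with something like:
#   cc -I"$SHIM_DIR" prog.c -L"$PREFIX/lib" -lpcre2-posix -lpcre2-8
echo "regex.h -> $(readlink "$SHIM_DIR/regex.h")"
```

Because the link lives in its own directory, the system's regex.h is left
alone, and only programs compiled with that -I option see the PCRE2 version.
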

Documentation for PCRE2
-----------------------

If you install PCRE2 in the normal way on a Unix-like system, you will end up
with a set of man pages whose names all start with "pcre2". The one that is
just called "pcre2" lists all the others. In addition to these man pages, the
PCRE2 documentation is supplied in two other forms:

  1. There are files called doc/pcre2.txt, doc/pcre2grep.txt, and
     doc/pcre2test.txt in the source distribution. The first of these is a
     concatenation of the text forms of all the section 3 man pages except the
     listing of pcre2demo.c and those that summarize individual functions. The
     other two are the text forms of the section 1 man pages for the pcre2grep
     and pcre2test commands. These text forms are provided for ease of scanning
     with text editors or similar tools. They are installed in
     <prefix>/share/doc/pcre2, where <prefix> is the installation prefix
     (defaulting to /usr/local).

  2. A set of files containing all the documentation in HTML form, hyperlinked
     in various ways, and rooted in a file called index.html, is distributed in
     doc/html and installed in <prefix>/share/doc/pcre2/html.


Building PCRE2 on non-Unix-like systems
---------------------------------------

For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
your system supports the use of "configure" and "make" you may be able to build
PCRE2 using autotools in the same way as for many Unix-like systems.

PCRE2 can also be configured using CMake, which can be run in various ways
(command line, GUI, etc). This creates Makefiles, solution files, etc. The file
NON-AUTOTOOLS-BUILD has information about CMake.

PCRE2 has been compiled on many different operating systems. It should be
straightforward to build PCRE2 on any system that has a Standard C compiler and
library, because it uses only Standard C functions.


Building PCRE2 without using autotools
--------------------------------------

The use of autotools (in particular, libtool) is problematic in some
environments, even some that are Unix or Unix-like. See the NON-AUTOTOOLS-BUILD
file for ways of building PCRE2 without using autotools.


Building PCRE2 using autotools
------------------------------

The following instructions assume the use of the widely used "configure; make;
make install" (autotools) process.

To build PCRE2 on a system that supports autotools, first run the
"configure" command from the PCRE2 distribution directory, with your current
directory set to the directory where you want the files to be created. This
command is a standard GNU "autoconf" configuration script, for which generic
instructions are supplied in the file INSTALL.

Most commonly, people build PCRE2 within its own distribution directory, and in
this case, on many systems, just running "./configure" is sufficient. However,
the usual methods of changing standard defaults are available. For example:

CFLAGS='-O2 -Wall' ./configure --prefix=/opt/local

This command specifies that the C compiler should be run with the flags '-O2
-Wall' instead of the default, and that "make install" should install PCRE2
under /opt/local instead of the default /usr/local.

If you want to build in a different directory, just run "configure" with that
directory as current. For example, suppose you have unpacked the PCRE2 source
into /source/pcre2/pcre2-xxx, but you want to build it in
/build/pcre2/pcre2-xxx:

cd /build/pcre2/pcre2-xxx
/source/pcre2/pcre2-xxx/configure

PCRE2 is written in C and is normally compiled as a C library. However, it is
possible to build it as a C++ library, though the provided building apparatus
does not have any features to support this.

There are some optional features that can be included or omitted from the PCRE2
library. They are also documented in the pcre2build man page.

. By default, both shared and static libraries are built. You can change this
  by adding one of these options to the "configure" command:

  --disable-shared
  --disable-static

  (See also "Shared libraries" below.)

. By default, only the 8-bit library is built. If you add --enable-pcre2-16 to
  the "configure" command, the 16-bit library is also built. If you add
  --enable-pcre2-32 to the "configure" command, the 32-bit library is also
  built. If you want only the 16-bit or 32-bit library, use --disable-pcre2-8
  to disable building the 8-bit library.

. If you want to include support for just-in-time (JIT) compiling, which can
  give large performance improvements on certain platforms, add --enable-jit to
  the "configure" command. This support is available only for certain hardware
  architectures. If you try to enable it on an unsupported architecture, there
  will be a compile time error. If in doubt, use --enable-jit=auto, which
  enables JIT only if the current hardware is supported.

. If you are enabling JIT under an SELinux environment, you may also want to
  add --enable-jit-sealloc, which enables the use of an executable memory
  allocator that is compatible with SELinux. Warning: this allocator is
  experimental! It does not support the fork() operation and may crash when no
  disk space is available. This option has no effect if JIT is disabled.

. If you do not want to make use of the default support for UTF-8 Unicode
  character strings in the 8-bit library, UTF-16 Unicode character strings in
  the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
  library, you can add --disable-unicode to the "configure" command. This
  reduces the size of the libraries. It is not possible to configure one
  library with Unicode support, and another without, in the same configuration.
  It is also not possible to use --enable-ebcdic (see below) with Unicode
  support, so if this option is set, you must also use --disable-unicode.

  When Unicode support is available, the use of a UTF encoding still has to be
  enabled by setting the PCRE2_UTF option at run time or starting a pattern
  with (*UTF). When PCRE2 is compiled with Unicode support, its input can be
  only ASCII or UTF-8/16/32, even when running on EBCDIC platforms.

  As well as supporting UTF strings, Unicode support includes support for the
  \P, \p, and \X sequences that recognize Unicode character properties.
  However, only the basic two-letter properties such as Lu are supported.
  Escape sequences such as \d and \w in patterns do not by default make use of
  Unicode properties, but can be made to do so by setting the PCRE2_UCP option
  or starting a pattern with (*UCP).

. You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
  of the preceding, or any of the Unicode newline sequences, or the NUL (zero)
  character as indicating the end of a line. Whatever you specify at build time
  is the default; the caller of PCRE2 can change the selection at run time. The
  default newline indicator is a single LF character (the Unix standard). You
  can specify the default newline indicator by adding --enable-newline-is-cr,
  --enable-newline-is-lf, --enable-newline-is-crlf,
  --enable-newline-is-anycrlf, --enable-newline-is-any, or
  --enable-newline-is-nul to the "configure" command, respectively.

. By default, the sequence \R in a pattern matches any Unicode line ending
  sequence. This is independent of the option specifying what PCRE2 considers
  to be the end of a line (see above). However, the caller of PCRE2 can
  restrict \R to match only CR, LF, or CRLF. You can make this the default by
  adding --enable-bsr-anycrlf to the "configure" command (bsr = "backslash R").

. In a pattern, the escape sequence \C matches a single code unit, even in a
  UTF mode. This can be dangerous because it breaks up multi-code-unit
  characters. You can build PCRE2 with the use of \C permanently locked out by
  adding --enable-never-backslash-C (note the upper case C) to the "configure"
  command. When \C is allowed by the library, individual applications can lock
  it out by calling pcre2_compile() with the PCRE2_NEVER_BACKSLASH_C option.

. PCRE2 has a counter that limits the depth of nesting of parentheses in a
  pattern. This limits the amount of system stack that a pattern uses when it
  is compiled. The default is 250, but you can change it by setting, for
  example,

  --with-parens-nest-limit=500

. PCRE2 has a counter that can be set to limit the amount of computing resource
  it uses when matching a pattern. If the limit is exceeded during a match, the
  match fails. The default is ten million. You can change the default by
  setting, for example,

  --with-match-limit=500000

  on the "configure" command. This is just the default; individual calls to
  pcre2_match() or pcre2_dfa_match() can supply their own value. There is more
  discussion in the pcre2api man page (search for pcre2_set_match_limit).

. There is a separate counter that limits the depth of nested backtracking
  (pcre2_match()) or nested function calls (pcre2_dfa_match()) during a
  matching process, which indirectly limits the amount of heap memory that is
  used, and in the case of pcre2_dfa_match() the amount of stack as well. This
  counter also has a default of ten million, which is essentially "unlimited".
  You can change the default by setting, for example,

  --with-match-limit-depth=5000

  There is more discussion in the pcre2api man page (search for
  pcre2_set_depth_limit).

. You can also set an explicit limit on the amount of heap memory used by
  the pcre2_match() and pcre2_dfa_match() interpreters:

  --with-heap-limit=500

  The units are kibibytes (units of 1024 bytes). This limit does not apply when
  the JIT optimization (which has its own memory control features) is used.
  There is more discussion on the pcre2api man page (search for
  pcre2_set_heap_limit).

. In the 8-bit library, the default maximum compiled pattern size is around
  64 kibibytes. You can increase this by adding --with-link-size=3 to the
  "configure" command. PCRE2 then uses three bytes instead of two for offsets
  to different parts of the compiled pattern. In the 16-bit library,
  --with-link-size=3 is the same as --with-link-size=4, which (in both
  libraries) uses four-byte offsets. Increasing the internal link size reduces
  performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
  link size setting is ignored, as 4-byte offsets are always used.

. For speed, PCRE2 uses four tables for manipulating and identifying characters
  whose code point values are less than 256. By default, it uses a set of
  tables for ASCII encoding that is part of the distribution. If you specify

  --enable-rebuild-chartables

  a program called pcre2_dftables is compiled and run in the default C locale
  when you run "make". It builds a source file called pcre2_chartables.c. If
  you do not specify this option, pcre2_chartables.c is created as a copy of
  pcre2_chartables.c.dist. See "Character tables" below for further
  information.

. It is possible to compile PCRE2 for use on systems that use EBCDIC as their
  character code (as opposed to ASCII/Unicode) by specifying

  --enable-ebcdic --disable-unicode

  This automatically implies --enable-rebuild-chartables (see above). However,
  when PCRE2 is built this way, it always operates in EBCDIC. It cannot support
  both EBCDIC and UTF-8/16/32. There is a second option, --enable-ebcdic-nl25,
  which specifies that the code value for the EBCDIC NL character is 0x25
  instead of the default 0x15.

. If you specify --enable-debug, additional debugging code is included in the
  build. This option is intended for use by the PCRE2 maintainers.

. In environments where valgrind is installed, if you specify

  --enable-valgrind

  PCRE2 will use valgrind annotations to mark certain memory regions as
  unaddressable. This allows it to detect invalid memory accesses, and is
  mostly useful for debugging PCRE2 itself.

. In environments where the gcc compiler is used and lcov is installed, if you
  specify

  --enable-coverage

  the build process implements a code coverage report for the test suite. The
  report is generated by running "make coverage". If ccache is installed on
  your system, it must be disabled when building PCRE2 for coverage reporting.
  You can do this by setting the environment variable CCACHE_DISABLE=1 before
  running "make" to build PCRE2. There is more information about coverage
  reporting in the "pcre2build" documentation.

. When JIT support is enabled, pcre2grep automatically makes use of it, unless
  you add --disable-pcre2grep-jit to the "configure" command.

. There is support for calling external programs during matching in the
  pcre2grep command, using PCRE2's callout facility with string arguments. This
  support can be disabled by adding --disable-pcre2grep-callout to the
  "configure" command. There are two kinds of callout: one that generates
  output from inbuilt code, and another that calls an external program. The
  latter has special support for Windows and VMS; otherwise it assumes the
  existence of the fork() function. This facility can be disabled by adding
  --disable-pcre2grep-callout-fork to the "configure" command.

. The pcre2grep program currently supports only 8-bit data files, and so
  requires the 8-bit PCRE2 library. It is possible to compile pcre2grep to use
  libz and/or libbz2, in order to read .gz and .bz2 files (respectively), by
  specifying one or both of

  --enable-pcre2grep-libz
  --enable-pcre2grep-libbz2

  Of course, the relevant libraries must be installed on your system.

. The default starting size (in bytes) of the internal buffer used by pcre2grep
  can be set by, for example:

  --with-pcre2grep-bufsize=51200

  The value must be a plain integer. The default is 20480. The amount of memory
  used by pcre2grep is actually three times this number, to allow for "before"
  and "after" lines. If very long lines are encountered, the buffer is
  automatically enlarged, up to a fixed maximum size.

. The default maximum size of pcre2grep's internal buffer can be set by, for
  example:

  --with-pcre2grep-max-bufsize=2097152

  The default is either 1048576 or the value of --with-pcre2grep-bufsize,
  whichever is the larger.

. It is possible to compile pcre2test so that it links with the libreadline
  or libedit libraries, by specifying, respectively,

  --enable-pcre2test-libreadline or --enable-pcre2test-libedit

  If this is done, when pcre2test's input is from a terminal, it reads it using
  the readline() function. This provides line-editing and history facilities.
  Note that libreadline is GPL-licenced, so if you distribute a binary of
  pcre2test linked in this way, there may be licensing issues. These can be
  avoided by linking with libedit (which has a BSD licence) instead.

  Enabling libreadline causes the -lreadline option to be added to the
  pcre2test build. In many operating environments with a system-installed
  readline library this is sufficient. However, in some environments (e.g. if
  an unmodified distribution version of readline is in use), it may be
  necessary to specify something like LIBS="-lncurses" as well. This is
  because, to quote the readline INSTALL, "Readline uses the termcap functions,
  but does not link with the termcap or curses library itself, allowing
  applications which link with readline the to choose an appropriate library."
  If you get error messages about missing functions tgetstr, tgetent, tputs,
  tgetflag, or tgoto, this is the problem, and linking with the ncurses library
  should fix it.

. The C99 standard defines formatting modifiers z and t for size_t and
  ptrdiff_t values, respectively. By default, PCRE2 uses these modifiers in
  environments other than Microsoft Visual Studio when __STDC_VERSION__ is
  defined and has a value greater than or equal to 199901L (indicating C99).
  However, there is at least one environment that claims to be C99 but does not
  support these modifiers. If --disable-percent-zt is specified, no use is made
  of the z or t modifiers. Instead of %td or %zu, %lu is used, with a cast for
  size_t values.

. There is a special option called --enable-fuzz-support for use by people who
  want to run fuzzing tests on PCRE2. At present this applies only to the 8-bit
  library. If set, it causes an extra library called libpcre2-fuzzsupport.a to
  be built, but not installed. This contains a single function called
  LLVMFuzzerTestOneInput() whose arguments are a pointer to a string and the
  length of the string. When called, this function tries to compile the string
  as a pattern, and if that succeeds, to match it. This is done both with no
  options and with some random options bits that are generated from the string.
  Setting --enable-fuzz-support also causes a binary called pcre2fuzzcheck to
  be created. This is normally run under valgrind or used when PCRE2 is
  compiled with address sanitizing enabled. It calls the fuzzing function and
  outputs information about what it is doing. The input strings are specified
  by arguments: if an argument starts with "=" the rest of it is a literal
  input string. Otherwise, it is assumed to be a file name, and the contents of
  the file are the test string.

. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
  which caused pcre2_match() to use individual blocks on the heap for
  backtracking instead of recursive function calls (which use the stack). This
  is now obsolete since pcre2_match() was refactored always to use the heap (in
  a much more efficient way than before). This option is retained for backwards
  compatibility, but has no effect other than to output a warning.

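Several of the options listed above can be combined on one "configure" command
line. The following sketch uses only options named in this README, but the
particular selection and values are an arbitrary example rather than a
recommendation. It prints the command instead of running it when no
"configure" script is present in the current directory.

```shell
#!/bin/sh
# Sketch: assemble a "configure" command line from options described above.
# The selection and the values are examples only.
CONFIGURE_ARGS="--prefix=/opt/local"
# Build the 16-bit and 32-bit libraries as well as the default 8-bit one.
CONFIGURE_ARGS="$CONFIGURE_ARGS --enable-pcre2-16 --enable-pcre2-32"
# Enable JIT only if the current hardware is supported.
CONFIGURE_ARGS="$CONFIGURE_ARGS --enable-jit=auto"
# Treat CR, LF, or CRLF as the default end-of-line indicator.
CONFIGURE_ARGS="$CONFIGURE_ARGS --enable-newline-is-anycrlf"
# Lower the default match limit and set a heap limit (in kibibytes).
CONFIGURE_ARGS="$CONFIGURE_ARGS --with-match-limit=500000 --with-heap-limit=500"

if [ -x ./configure ]; then
    # Word splitting of $CONFIGURE_ARGS is intentional here.
    ./configure $CONFIGURE_ARGS
else
    echo "Would run: ./configure $CONFIGURE_ARGS"
fi
```

The limits set this way are only defaults; as noted above, callers can still
supply their own values at run time.
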
The "configure" script builds the following files for the basic C library:

. Makefile             the makefile that builds the library
. src/config.h         build-time configuration options for the library
. src/pcre2.h          the public PCRE2 header file
. pcre2-config         script that shows the building settings such as CFLAGS
                         that were set for "configure"
. libpcre2-8.pc        )
. libpcre2-16.pc       ) data for the pkg-config command
. libpcre2-32.pc       )
. libpcre2-posix.pc    )
. libtool              script that builds shared and/or static libraries

Versions of config.h and pcre2.h are distributed in the src directory of PCRE2
tarballs under the names config.h.generic and pcre2.h.generic. These are
provided for those who have to build PCRE2 without using "configure" or CMake.
If you use "configure" or CMake, the .generic versions are not used.

The "configure" script also creates config.status, which is an executable
script that can be run to recreate the configuration, and config.log, which
contains compiler output from tests that "configure" runs.

Once "configure" has run, you can run "make". This builds whichever of the
libraries libpcre2-8, libpcre2-16 and libpcre2-32 are configured, and a test
program called pcre2test. If you enabled JIT support with --enable-jit, another
test program called pcre2_jit_test is built as well. If the 8-bit library is
built, libpcre2-posix and the pcre2grep command are also built. Running
"make" with the -j option may speed up compilation on multiprocessor systems.

The command "make check" runs all the appropriate tests. Details of the PCRE2
tests are given below in a separate section of this document. The -j option of
"make" can also be used when running the tests.

You can use "make install" to install PCRE2 into live directories on your
system. The following are installed (file names are all relative to the
<prefix> that is set when "configure" is run):

  Commands (bin):
    pcre2test
    pcre2grep (if 8-bit support is enabled)
    pcre2-config

  Libraries (lib):
    libpcre2-8      (if 8-bit support is enabled)
    libpcre2-16     (if 16-bit support is enabled)
    libpcre2-32     (if 32-bit support is enabled)
    libpcre2-posix  (if 8-bit support is enabled)

  Configuration information (lib/pkgconfig):
    libpcre2-8.pc
    libpcre2-16.pc
    libpcre2-32.pc
    libpcre2-posix.pc

  Header files (include):
    pcre2.h
    pcre2posix.h

  Man pages (share/man/man{1,3}):
    pcre2grep.1
    pcre2test.1
    pcre2-config.1
    pcre2.3
    pcre2*.3 (lots more pages, all starting "pcre2")

  HTML documentation (share/doc/pcre2/html):
    index.html
    *.html (lots more pages, hyperlinked from index.html)

  Text file documentation (share/doc/pcre2):
    AUTHORS
    COPYING
    ChangeLog
    LICENCE
    NEWS
    README
    pcre2.txt         (a concatenation of the man(3) pages)
    pcre2test.txt     the pcre2test man page
    pcre2grep.txt     the pcre2grep man page
    pcre2-config.txt  the pcre2-config man page

If you want to remove PCRE2 from your system, you can run "make uninstall".
This removes all the files that "make install" installed. However, it does not
remove any directories, because these are often shared with other programs.

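The build, test, and install steps described in this section can be sketched
as a short script. This is an illustration under assumptions, not part of the
distribution: it assumes "configure" has already been run in the current
directory, and it uses getconf to guess a job count for the -j option, which
may not be available on every system.

```shell
#!/bin/sh
# Sketch: build, test, and install after "configure" has been run.
# JOBS is the number of parallel make jobs. If getconf cannot report a
# processor count, fall back to a single job.
JOBS="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"

if [ -f Makefile ]; then
    # Stop at the first step that fails.
    make -j"$JOBS" &&
    make -j"$JOBS" check &&
    make install    # may need elevated privileges, depending on <prefix>
else
    echo "No Makefile here; run \"configure\" first (would use -j$JOBS)"
fi
```

Chaining the steps with && means that nothing is installed unless the tests
have passed.
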

Retrieving configuration information
------------------------------------

Running "make install" installs the command pcre2-config, which can be used to
recall information about the PCRE2 configuration and installation. For example:

  pcre2-config --version

prints the version number, and

  pcre2-config --libs8

outputs information about where the 8-bit library is installed. This command
can be included in makefiles for programs that use PCRE2, saving the programmer
from having to remember too many details. Run pcre2-config with no arguments to
obtain a list of possible arguments.

The pkg-config command is another system for saving and retrieving information
about installed libraries. Instead of separate commands for each library, a
single command is used. For example:

  pkg-config --libs libpcre2-16

The data is held in *.pc files that are installed in a directory called
<prefix>/lib/pkgconfig.

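As an illustration of how these commands might be used when compiling a
program against the 8-bit library, the sketch below tries pkg-config first and
then pcre2-config. The source file name myprog.c is hypothetical, and the
final fallback of a bare -lpcre2-8 is an assumption that the library is on the
default search path.

```shell
#!/bin/sh
# Sketch: collect compiler and linker flags for the 8-bit PCRE2 library.
if command -v pkg-config >/dev/null 2>&1 && pkg-config --exists libpcre2-8; then
    PCRE2_CFLAGS="$(pkg-config --cflags libpcre2-8)"
    PCRE2_LIBS="$(pkg-config --libs libpcre2-8)"
elif command -v pcre2-config >/dev/null 2>&1; then
    PCRE2_CFLAGS="$(pcre2-config --cflags)"
    PCRE2_LIBS="$(pcre2-config --libs8)"
else
    # Assumption: the library is installed where the linker looks by default.
    PCRE2_CFLAGS=""
    PCRE2_LIBS="-lpcre2-8"
fi

# myprog.c is a hypothetical source file; the command is printed, not run.
echo "cc $PCRE2_CFLAGS myprog.c $PCRE2_LIBS -o myprog"
```

The same two variables could be set in a makefile by running the commands
there instead.
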

Shared libraries
----------------

The default distribution builds PCRE2 as shared libraries and static libraries,
as long as the operating system supports shared libraries. Shared library
support relies on the "libtool" script which is built as part of the
"configure" process.

The libtool script is used to compile and link both shared and static
libraries. They are placed in a subdirectory called .libs when they are newly
built. The programs pcre2test and pcre2grep are built to use these uninstalled
libraries (by means of wrapper scripts in the case of shared libraries). When
you use "make install" to install shared libraries, pcre2grep and pcre2test are
automatically re-built to use the newly installed shared libraries before being
installed themselves. However, the versions left in the build directory still
use the uninstalled libraries.

To build PCRE2 using static libraries only you must use --disable-shared when
configuring it. For example:

./configure --prefix=/usr/gnu --disable-shared

Then run "make" in the usual way. Similarly, you can use --disable-static to
build only shared libraries.


Cross-compiling using autotools
-------------------------------

You can specify CC and CFLAGS in the normal way to the "configure" command, in
order to cross-compile PCRE2 for some other host. However, you should NOT
specify --enable-rebuild-chartables, because if you do, the pcre2_dftables.c
source file is compiled and run on the local host, in order to generate the
inbuilt character tables (the pcre2_chartables.c file). This will probably not
work, because pcre2_dftables.c needs to be compiled with the local compiler,
not the cross compiler.

When --enable-rebuild-chartables is not specified, pcre2_chartables.c is
created by making a copy of pcre2_chartables.c.dist, which is a default set of
tables that assumes ASCII code. Cross-compiling with the default tables should
not be a problem.

If you need to modify the character tables when cross-compiling, you should
move pcre2_chartables.c.dist out of the way, then compile pcre2_dftables.c by
hand and run it on the local host to make a new version of
pcre2_chartables.c.dist. See the pcre2build section "Creating character tables
at build time" for more details.

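A cross-compiling "configure" call that keeps the default character tables
might look like the sketch below. The target triplet is a hypothetical
example, and --host is the standard autoconf way of naming the target system,
which this README does not itself mention, so treat both as assumptions to
check against your toolchain.

```shell
#!/bin/sh
# Sketch: cross-compiling with the distributed default character tables.
# TARGET is a hypothetical triplet; use the one that matches your toolchain.
TARGET="${TARGET:-arm-linux-gnueabihf}"
CROSS_ARGS="--host=$TARGET CC=$TARGET-gcc"

# Note that --enable-rebuild-chartables is deliberately absent.
if [ -x ./configure ] && command -v "$TARGET-gcc" >/dev/null 2>&1; then
    # Word splitting of $CROSS_ARGS is intentional here.
    ./configure $CROSS_ARGS
else
    echo "Would run: ./configure $CROSS_ARGS"
fi
```

The command is only printed unless both a "configure" script and the cross
compiler are found.
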

Making new tarballs
-------------------

The command "make dist" creates three PCRE2 tarballs, in tar.gz, tar.bz2, and
zip formats. The command "make distcheck" does the same, but then does a trial
build of the new distribution to ensure that it works.

If you have modified any of the man page sources in the doc directory, you
should first run the PrepareRelease script before making a distribution. This
script creates the .txt and HTML forms of the documentation from the man pages.


Testing PCRE2
-------------

To test the basic PCRE2 library on a Unix-like system, run the RunTest script.
There is another script called RunGrepTest that tests the pcre2grep command.
When JIT support is enabled, a third test program called pcre2_jit_test is
built. Both the scripts and all the program tests are run if you run "make
check". For other environments, see the instructions in NON-AUTOTOOLS-BUILD.

The RunTest script runs the pcre2test test program (which is documented in its
own man page) on each of the relevant testinput files in the testdata
directory, and compares the output with the contents of the corresponding
testoutput files. RunTest uses a file called testtry to hold the main output
from pcre2test. Other files whose names begin with "test" are used as working
files in some tests.

Some tests are relevant only when certain build-time options were selected. For
example, the tests for UTF-8/16/32 features are run only when Unicode support
is available. RunTest outputs a comment when it skips a test.

Many (but not all) of the tests that are not skipped are run twice if JIT
support is available. On the second run, JIT compilation is forced. This
testing can be suppressed by putting "nojit" on the RunTest command line.

The entire set of tests is run once for each of the 8-bit, 16-bit and 32-bit
libraries that are enabled. If you want to run just one set of tests, call
RunTest with either the -8, -16 or -32 option.

If valgrind is installed, you can run the tests under it by putting "valgrind"
on the RunTest command line. To run pcre2test on just one or more specific test
files, give their numbers as arguments to RunTest, for example:

  RunTest 2 7 11

You can also specify ranges of tests such as 3-6 or 3- (meaning 3 to the
end), or a number preceded by ~ to exclude a test. For example:

  RunTest 3-15 ~10

This runs tests 3 to 15, excluding test 10, and just ~13 runs all the tests
except test 13. Whatever order the arguments are in, the tests are always run
in numerical order.

You can also call RunTest with the single argument "list" to cause it to output
a list of tests.

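The arguments described above can be combined on one command line. The sketch
below shows a few such invocations. It assumes that the current directory is a
configured and built PCRE2 tree, and the "run" helper is defined only for this
example: it prints each command and executes it only when the RunTest script
is actually present.

```shell
#!/bin/sh
# Sketch: a few RunTest invocations combining the arguments described above.
RUNTEST="${RUNTEST:-./RunTest}"

run() {
    # Print the command, then execute it only if RunTest is available.
    echo "+ $*"
    if [ -x "$RUNTEST" ]; then "$@"; fi
}

run "$RUNTEST" list               # show the available tests
run "$RUNTEST" -8 1-5             # tests 1 to 5, 8-bit library only
# The ~ argument is quoted so that the shell cannot apply tilde expansion.
run "$RUNTEST" nojit 3-15 '~10'   # tests 3 to 15 except 10, no forced-JIT run
```
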
630The test sequence starts with "test 0", which is a special test that has no
631input file, and whose output is not checked. This is because it will be
632different on different hardware and with different configurations. The test
633exists in order to exercise some of pcre2test's code that would not otherwise
634be run.
635
636Tests 1 and 2 can always be run, as they expect only plain text strings (not
637UTF) and make no use of Unicode properties. The first test file can be fed
638directly into the perltest.sh script to check that Perl gives the same results.
639The only difference you should see is in the first few lines, where the Perl
640version is given instead of the PCRE2 version. The second set of tests check
641auxiliary functions, error detection, and run-time flags that are specific to
642PCRE2. It also uses the debugging flags to check some of the internals of
643pcre2_compile().
644
645If you build PCRE2 with a locale setting that is not the standard C locale, the
646character tables may be different (see next paragraph). In some cases, this may
647cause failures in the second set of tests. For example, in a locale where the
648isprint() function yields TRUE for characters in the range 128-255, the use of
649[:isascii:] inside a character class defines a different set of characters, and
650this shows up in this test as a difference in the compiled code, which is being
651listed for checking. For example, where the comparison test output contains
652[\x00-\x7f] the test might contain [\x00-\xff], and similarly in some other
653cases. This is not a bug in PCRE2.
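To see whether a given locale has the property described above, one can ask
the C library directly. The following is a minimal sketch using only standard
C. The function name count_high_printable is an assumption and is not part of
PCRE2 or its test scripts.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <string.h>

/* Count the code points in 128-255 that isprint() accepts under the named
   locale. Returns -1 if the locale is unavailable. The previous LC_CTYPE
   setting is restored before returning. */
int count_high_printable(const char *locale_name)
{
    char saved[256];
    const char *current = setlocale(LC_CTYPE, NULL);
    int count = 0;

    assert(locale_name != NULL);
    /* Copy the name: later setlocale() calls may overwrite the buffer. */
    if (current == NULL || strlen(current) >= sizeof saved) return -1;
    strcpy(saved, current);

    if (setlocale(LC_CTYPE, locale_name) == NULL) return -1;
    for (int c = 128; c < 256; c++) {
        if (isprint(c)) count++;
    }
    setlocale(LC_CTYPE, saved);
    return count;
}
```

A result of zero, as in the standard C locale, means the locale does not have
this property. A non-zero result indicates a locale in which the differences
described above could appear.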
654
Test 3 checks pcre2_maketables(), the facility for building a set of character
tables for a specific locale and using them instead of the default tables. The
script uses the "locale" command to check for the availability of the "fr_FR",
"french", or "fr" locale, and uses the first one that it finds. If the "locale"
command fails, or if its output doesn't include "fr_FR", "french", or "fr" in
the list of available locales, the third test cannot be run, and a comment is
output to say why. If running this test produces an error like this:

  ** Failed to set locale "fr_FR"

it means that the given locale is not available on your system, despite being
listed by "locale". This does not mean that PCRE2 is broken. There are three
alternative output files for the third test, because three different versions
of the French locale have been encountered. The test passes if its output
matches any one of them.
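The "first one that it finds" logic can also be expressed in C. Note that this
sketch uses a different technique from the script: instead of parsing the
output of the "locale" command, it asks setlocale() whether each name can
actually be set. The function name first_available_locale is an assumption
made for illustration.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Return the first name in candidates[0..n-1] that setlocale() accepts for
   LC_CTYPE, or NULL if none is accepted. On success the LC_CTYPE locale is
   left set to the returned name. */
const char *first_available_locale(const char *const candidates[], size_t n)
{
    assert(candidates != NULL);
    for (size_t i = 0; i < n; i++) {
        /* setlocale() returns NULL when the request cannot be honoured */
        if (setlocale(LC_CTYPE, candidates[i]) != NULL)
            return candidates[i];
    }
    return NULL;
}
```

Called with the candidates "fr_FR", "french", and "fr", a NULL result
corresponds to the situation in which the third test cannot be run.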
670
Tests 4 and 5 check UTF and Unicode property support, test 4 being compatible
with the perltest.sh script, and test 5 checking PCRE2-specific things.

Tests 6 and 7 check the pcre2_dfa_match() alternative matching function, in
non-UTF mode and UTF-mode with Unicode property support, respectively.

Test 8 checks some internal offsets and code size features, but it is run only
when Unicode support is enabled. The output is different in 8-bit, 16-bit, and
32-bit modes and for different link sizes, so there are different output files
for each mode and link size.

Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
16-bit and 32-bit modes. These are tests that generate different output in
8-bit mode. Each pair covers general cases and Unicode support, respectively.

Test 13 checks the handling of non-UTF characters greater than 255 by
pcre2_dfa_match() in 16-bit and 32-bit modes.

Test 14 contains some special UTF and UCP tests that give different output for
different code unit widths.

Test 15 contains a number of tests that must not be run with JIT. They check,
among other non-JIT things, the match-limiting features of the interpretive
matcher.

Test 16 is run only when JIT support is not available. It checks that an
attempt to use JIT has the expected behaviour.

Test 17 is run only when JIT support is available. It checks JIT complete and
partial modes, match-limiting under JIT, and other JIT-specific features.

Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
the 8-bit library, without and with Unicode support, respectively.

Test 20 checks the serialization functions by writing a set of compiled
patterns to a file, and then reloading and checking them.

Tests 21 and 22 test \C support when the use of \C is not locked out, without
and with UTF support, respectively. Test 23 tests \C when it is locked out.

Tests 24 and 25 test the experimental pattern conversion functions, without and
with UTF support, respectively.


Character tables
----------------

For speed, PCRE2 uses four tables for manipulating and identifying characters
whose code point values are less than 256. By default, a set of tables that is
built into the library is used. The pcre2_maketables() function can be called
by an application to create a new set of tables in the current locale. These
are passed to PCRE2 by calling pcre2_set_character_tables() to put a pointer
into a compile context.

The source file called pcre2_chartables.c contains the default set of tables.
By default, this is created as a copy of pcre2_chartables.c.dist, which
contains tables for ASCII coding. However, if --enable-rebuild-chartables is
specified for ./configure, a new version of pcre2_chartables.c is built by the
program pcre2_dftables (compiled from pcre2_dftables.c), which uses the ANSI C
character handling functions such as isalnum(), isalpha(), isupper(),
islower(), etc. to build the table sources. This means that the default C
locale that is set for your system will control the contents of these default
tables. You can change the default tables by editing pcre2_chartables.c and
then re-building PCRE2. If you do this, you should take care to ensure that the
file does not get automatically re-generated. The best way to do this is to
move pcre2_chartables.c.dist out of the way and replace it with your customized
tables.
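As an illustration of how 256-entry tables can be derived from the ANSI C
character handling functions, here is a minimal sketch that fills a
lower-casing table and a case-flipping table. The function name
build_case_tables is an assumption. This is not a copy of the pcre2_dftables
source, only an example of the general approach.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Fill a 256-entry lower-casing table and a 256-entry case-flipping table
   using the ctype functions, which follow the current LC_CTYPE locale. */
void build_case_tables(unsigned char lower[256], unsigned char flip[256])
{
    assert(lower != NULL && flip != NULL);
    for (int c = 0; c < 256; c++) {
        lower[c] = (unsigned char)tolower(c);
        /* Lower case letters flip to upper case; everything else goes
           through tolower(), which leaves non-letters unchanged. */
        flip[c] = (unsigned char)(islower(c) ? toupper(c) : tolower(c));
    }
}
```

Because tolower(), toupper(), and islower() consult the current locale, the
contents of the tables depend on the locale in force when the function runs.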
738
When the pcre2_dftables program is run as a result of specifying
--enable-rebuild-chartables, it uses the default C locale that is set on your
system. It does not pay attention to the LC_xxx environment variables. In other
words, it uses the system's default locale rather than whatever the compiling
user happens to have set. If you really do want to build a source set of
character tables in a locale that is specified by the LC_xxx variables, you can
run the pcre2_dftables program by hand with the -L option. For example:

  ./pcre2_dftables -L pcre2_chartables.c.special

The second argument names the file where the source code for the tables is
written. The first two 256-byte tables provide lower casing and case flipping
functions, respectively. The next table consists of a number of 32-byte bit
maps which identify certain character classes such as digits, "word"
characters, white space, etc. These are used when building 32-byte bit maps
that represent character classes for code points less than 256. The final
256-byte table has bits indicating various character types, as follows:

    1   white space character
    2   letter
    4   lower case letter
    8   decimal digit
   16   alphanumeric or '_'
762
You can also specify -b (with or without -L) when running pcre2_dftables. This
causes the tables to be written in binary instead of as source code. A set of
binary tables can be loaded into memory by an application and passed to
pcre2_compile() in the same way as tables created dynamically by calling
pcre2_maketables(). The tables are just a string of bytes, independent of
hardware characteristics such as endianness. This means they can be bundled
with an application that runs in different environments, to ensure consistent
behaviour.

See also the pcre2build section "Creating character tables at build time".


File manifest
-------------

The distribution should contain the files listed below.

(A) Source files for the PCRE2 library functions and their headers are found in
    the src directory:

  src/pcre2_dftables.c     auxiliary program for building pcre2_chartables.c
                           when --enable-rebuild-chartables is specified

  src/pcre2_chartables.c.dist  a default set of character tables that assume
                           ASCII coding; unless --enable-rebuild-chartables is
                           specified, used by copying to pcre2_chartables.c

  src/pcre2posix.c         )
  src/pcre2_auto_possess.c )
  src/pcre2_compile.c      )
  src/pcre2_config.c       )
  src/pcre2_context.c      )
  src/pcre2_convert.c      )
  src/pcre2_dfa_match.c    )
  src/pcre2_error.c        )
  src/pcre2_extuni.c       )
  src/pcre2_find_bracket.c )
  src/pcre2_jit_compile.c  )
  src/pcre2_jit_match.c    ) sources for the functions in the library,
  src/pcre2_jit_misc.c     )   and some internal functions that they use
  src/pcre2_maketables.c   )
  src/pcre2_match.c        )
  src/pcre2_match_data.c   )
  src/pcre2_newline.c      )
  src/pcre2_ord2utf.c      )
  src/pcre2_pattern_info.c )
  src/pcre2_script_run.c   )
  src/pcre2_serialize.c    )
  src/pcre2_string_utils.c )
  src/pcre2_study.c        )
  src/pcre2_substitute.c   )
  src/pcre2_substring.c    )
  src/pcre2_tables.c       )
  src/pcre2_ucd.c          )
  src/pcre2_valid_utf.c    )
  src/pcre2_xclass.c       )

  src/pcre2_printint.c     debugging function that is used by pcre2test
  src/pcre2_fuzzsupport.c  function for (optional) fuzzing support

  src/config.h.in          template for config.h, when built by "configure"
  src/pcre2.h.in           template for pcre2.h when built by "configure"
  src/pcre2posix.h         header for the external POSIX wrapper API
  src/pcre2_internal.h     header for internal use
  src/pcre2_intmodedep.h   a mode-specific internal header
  src/pcre2_ucp.h          header for Unicode property handling

  sljit/*                  source files for the JIT compiler

(B) Source files for programs that use PCRE2:

  src/pcre2demo.c          simple demonstration of coding calls to PCRE2
  src/pcre2grep.c          source of a grep utility that uses PCRE2
  src/pcre2test.c          comprehensive test program
  src/pcre2_jit_test.c     JIT test program

(C) Auxiliary files:

  132html                  script to turn "man" pages into HTML
  AUTHORS                  information about the author of PCRE2
  ChangeLog                log of changes to the code
  CleanTxt                 script to clean nroff output for txt man pages
  Detrail                  script to remove trailing spaces
  HACKING                  some notes about the internals of PCRE2
  INSTALL                  generic installation instructions
  LICENCE                  conditions for the use of PCRE2
  COPYING                  the same, using GNU's standard name
  Makefile.in              ) template for Unix Makefile, which is built by
                           )   "configure"
  Makefile.am              ) the automake input that was used to create
                           )   Makefile.in
  NEWS                     important changes in this release
  NON-AUTOTOOLS-BUILD      notes on building PCRE2 without using autotools
  PrepareRelease           script to make preparations for "make dist"
  README                   this file
  RunTest                  a Unix shell script for running tests
  RunGrepTest              a Unix shell script for pcre2grep tests
  aclocal.m4               m4 macros (generated by "aclocal")
  config.guess             ) files used by libtool,
  config.sub               )   used only when building a shared library
  configure                a configuring shell script (built by autoconf)
  configure.ac             ) the autoconf input that was used to build
                           )   "configure" and config.h
  depcomp                  ) script to find program dependencies, generated by
                           )   automake
  doc/*.3                  man page sources for PCRE2
  doc/*.1                  man page sources for pcre2grep and pcre2test
  doc/index.html.src       the base HTML page
  doc/html/*               HTML documentation
  doc/pcre2.txt            plain text version of the man pages
  doc/pcre2test.txt        plain text documentation of test program
  install-sh               a shell script for installing files
  libpcre2-8.pc.in         template for libpcre2-8.pc for pkg-config
  libpcre2-16.pc.in        template for libpcre2-16.pc for pkg-config
  libpcre2-32.pc.in        template for libpcre2-32.pc for pkg-config
  libpcre2-posix.pc.in     template for libpcre2-posix.pc for pkg-config
  ltmain.sh                file used to build a libtool script
  missing                  ) common stub for a few missing GNU programs while
                           )   installing, generated by automake
  mkinstalldirs            script for making install directories
  perltest.sh              script for running a Perl test program
  pcre2-config.in          source of script which retains PCRE2 information
  testdata/testinput*      test data for main library tests
  testdata/testoutput*     expected test results
  testdata/grep*           input and output for pcre2grep tests
  testdata/*               other supporting test files

(D) Auxiliary files for cmake support

  cmake/COPYING-CMAKE-SCRIPTS
  cmake/FindPackageHandleStandardArgs.cmake
  cmake/FindEditline.cmake
  cmake/FindReadline.cmake
  CMakeLists.txt
  config-cmake.h.in

(E) Auxiliary files for building PCRE2 "by hand"

  src/pcre2.h.generic      ) a version of the public PCRE2 header file
                           )   for use in non-"configure" environments
  src/config.h.generic     ) a version of config.h for use in non-"configure"
                           )   environments

Philip Hazel
Email local part: Philip.Hazel
Email domain: gmail.com
Last updated: 27 August 2021