README file for PCRE2 (Perl-compatible regular expression library)
------------------------------------------------------------------

PCRE2 is a re-working of the original PCRE1 library to provide an entirely new
API. Since its initial release in 2015, there has been further development of
the code and it now differs from PCRE1 in more than just the API. There are new
features, and the internals have been improved. The original PCRE1 library is
now obsolete and no longer maintained. The latest release of PCRE2 is available
in .tar.gz, .tar.bz2, or .zip form from this GitHub repository:

https://github.com/PhilipHazel/pcre2/releases

There is a mailing list for discussion about the development of PCRE2 at
pcre2-dev@googlegroups.com. You can subscribe by sending an email to
pcre2-dev+subscribe@googlegroups.com.

You can access the archives and also subscribe or manage your subscription
here:

https://groups.google.com/pcre2-dev

Please read the NEWS file if you are upgrading from a previous release. The
contents of this README file are:

  The PCRE2 APIs
  Documentation for PCRE2
  Contributions by users of PCRE2
  Building PCRE2 on non-Unix-like systems
  Building PCRE2 without using autotools
  Building PCRE2 using autotools
  Retrieving configuration information
  Shared libraries
  Cross-compiling using autotools
  Making new tarballs
  Testing PCRE2
  Character tables
  File manifest


The PCRE2 APIs
--------------

PCRE2 is written in C, and it has its own API. There are three sets of
functions, one for the 8-bit library, which processes strings of bytes, one for
the 16-bit library, which processes strings of 16-bit values, and one for the
32-bit library, which processes strings of 32-bit values. Unlike PCRE1, there
are no C++ wrappers.

The distribution does contain a set of C wrapper functions for the 8-bit
library that are based on the POSIX regular expression API (see the pcre2posix
man page). These are built into a library called libpcre2-posix. Note that this
just provides a POSIX calling interface to PCRE2; the regular expressions
themselves still follow Perl syntax and semantics. The POSIX API is restricted,
and does not give full access to all of PCRE2's facilities.

The header file for the POSIX-style functions is called pcre2posix.h. The
official POSIX name is regex.h, but I did not want to risk possible problems
with existing files of that name by distributing it that way. To use PCRE2 with
an existing program that uses the POSIX API, pcre2posix.h will have to be
renamed or pointed at by a link (or the program modified, of course). See the
pcre2posix documentation for more details.
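
As a rough illustration of the "pointed at by a link" approach, the shell
sketch below creates a private include directory in which regex.h is a symbolic
link to pcre2posix.h. The helper name make_regex_link and both directory
arguments are hypothetical and are not part of PCRE2; adjust them to your own
installation.

```shell
# Sketch only: make_regex_link is a hypothetical helper, not part of PCRE2.
# $1 = absolute path of the directory that contains pcre2posix.h
#      (for example <prefix>/include)
# $2 = private include directory in which to create the regex.h link
make_regex_link() {
    hdr_dir=$1
    shim_dir=$2
    # Refuse to create a dangling link.
    [ -f "$hdr_dir/pcre2posix.h" ] || {
        echo "pcre2posix.h not found in $hdr_dir" >&2
        return 1
    }
    mkdir -p "$shim_dir" &&
    ln -sf "$hdr_dir/pcre2posix.h" "$shim_dir/regex.h"
}
```

An existing program could then, in principle, be compiled with the shim
directory on its include path and linked with libpcre2-posix and libpcre2-8.
Whether the shim directory is searched ahead of the system regex.h depends on
your compiler, so treat this as a starting point rather than a recipe.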


Documentation for PCRE2
-----------------------

If you install PCRE2 in the normal way on a Unix-like system, you will end up
with a set of man pages whose names all start with "pcre2". The one that is
just called "pcre2" lists all the others. In addition to these man pages, the
PCRE2 documentation is supplied in two other forms:

  1. There are files called doc/pcre2.txt, doc/pcre2grep.txt, and
     doc/pcre2test.txt in the source distribution. The first of these is a
     concatenation of the text forms of all the section 3 man pages except the
     listing of pcre2demo.c and those that summarize individual functions. The
     other two are the text forms of the section 1 man pages for the pcre2grep
     and pcre2test commands. These text forms are provided for ease of scanning
     with text editors or similar tools. They are installed in
     <prefix>/share/doc/pcre2, where <prefix> is the installation prefix
     (defaulting to /usr/local).

  2. A set of files containing all the documentation in HTML form, hyperlinked
     in various ways, and rooted in a file called index.html, is distributed in
     doc/html and installed in <prefix>/share/doc/pcre2/html.


Building PCRE2 on non-Unix-like systems
---------------------------------------

For a non-Unix-like system, please read the file NON-AUTOTOOLS-BUILD, though if
your system supports the use of "configure" and "make" you may be able to build
PCRE2 using autotools in the same way as for many Unix-like systems.

PCRE2 can also be configured using CMake, which can be run in various ways
(command line, GUI, etc). This creates Makefiles, solution files, etc. The file
NON-AUTOTOOLS-BUILD has information about CMake.

PCRE2 has been compiled on many different operating systems. It should be
straightforward to build PCRE2 on any system that has a Standard C compiler and
library, because it uses only Standard C functions.


Building PCRE2 without using autotools
--------------------------------------

The use of autotools (in particular, libtool) is problematic in some
environments, even some that are Unix or Unix-like. See the NON-AUTOTOOLS-BUILD
file for ways of building PCRE2 without using autotools.


Building PCRE2 using autotools
------------------------------

The following instructions assume the use of the widely used "configure; make;
make install" (autotools) process.

If you have downloaded and unpacked a PCRE2 release tarball, run the
"configure" command from the PCRE2 directory, with your current directory set
to the directory where you want the files to be created. This command is a
standard GNU "autoconf" configuration script, for which generic instructions
are supplied in the file INSTALL.

The files in the GitHub repository do not contain "configure". If you have
downloaded the PCRE2 source files from GitHub, before you can run "configure"
you must run the shell script called autogen.sh. This runs a number of
autotools to create a "configure" script (you must of course have the autotools
commands installed in order to do this).

Most commonly, people build PCRE2 within its own distribution directory, and in
this case, on many systems, just running "./configure" is sufficient. However,
the usual methods of changing standard defaults are available. For example:

CFLAGS='-O2 -Wall' ./configure --prefix=/opt/local

This command specifies that the C compiler should be run with the flags '-O2
-Wall' instead of the default, and that "make install" should install PCRE2
under /opt/local instead of the default /usr/local.

If you want to build in a different directory, just run "configure" with that
directory as current. For example, suppose you have unpacked the PCRE2 source
into /source/pcre2/pcre2-xxx, but you want to build it in
/build/pcre2/pcre2-xxx:

cd /build/pcre2/pcre2-xxx
/source/pcre2/pcre2-xxx/configure
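
The steps described so far (run autogen.sh if "configure" is missing, then run
"configure" from the build directory) can be sketched as a small shell
function. The name configure_out_of_tree is hypothetical, and the function
assumes that the source directory is given as an absolute path.

```shell
# Sketch only: configure_out_of_tree is a hypothetical helper.
# $1 = absolute path of the PCRE2 source directory
# $2 = build directory (created if necessary)
# Any further arguments are passed to "configure" unchanged.
configure_out_of_tree() {
    src_dir=$1
    build_dir=$2
    shift 2
    if [ ! -x "$src_dir/configure" ]; then
        # A GitHub checkout has no "configure"; autogen.sh creates it.
        (cd "$src_dir" && ./autogen.sh) || return 1
    fi
    mkdir -p "$build_dir" &&
    (cd "$build_dir" && "$src_dir/configure" "$@")
}
```

For the directories used in the example above, the call would be
"configure_out_of_tree /source/pcre2/pcre2-xxx /build/pcre2/pcre2-xxx",
optionally followed by options such as --prefix=/opt/local.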

PCRE2 is written in C and is normally compiled as a C library. However, it is
possible to build it as a C++ library, though the provided building apparatus
does not have any features to support this.

There are some optional features that can be included or omitted from the PCRE2
library. They are also documented in the pcre2build man page.

. By default, both shared and static libraries are built. You can change this
  by adding one of these options to the "configure" command:

  --disable-shared
  --disable-static

  (See also "Shared libraries" below.)

. By default, only the 8-bit library is built. If you add --enable-pcre2-16 to
  the "configure" command, the 16-bit library is also built. If you add
  --enable-pcre2-32 to the "configure" command, the 32-bit library is also
  built. If you want only the 16-bit or 32-bit library, use --disable-pcre2-8
  to disable building the 8-bit library.

. If you want to include support for just-in-time (JIT) compiling, which can
  give large performance improvements on certain platforms, add --enable-jit to
  the "configure" command. This support is available only for certain hardware
  architectures. If you try to enable it on an unsupported architecture, there
  will be a compile time error. If in doubt, use --enable-jit=auto, which
  enables JIT only if the current hardware is supported.

. If you are enabling JIT under an SELinux environment you may also want to add
  --enable-jit-sealloc, which enables the use of an executable memory allocator
  that is compatible with SELinux. Warning: this allocator is experimental!
  It does not support the fork() operation and may crash when no disk space is
  available. This option has no effect if JIT is disabled.

. If you do not want to make use of the default support for UTF-8 Unicode
  character strings in the 8-bit library, UTF-16 Unicode character strings in
  the 16-bit library, or UTF-32 Unicode character strings in the 32-bit
  library, you can add --disable-unicode to the "configure" command. This
  reduces the size of the libraries. It is not possible to configure one
  library with Unicode support, and another without, in the same configuration.
  It is also not possible to use --enable-ebcdic (see below) with Unicode
  support, so if this option is set, you must also use --disable-unicode.

  When Unicode support is available, the use of a UTF encoding still has to be
  enabled by setting the PCRE2_UTF option at run time or starting a pattern
  with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
  be either ASCII or UTF-8/16/32, even when running on EBCDIC platforms.

  As well as supporting UTF strings, Unicode support includes support for the
  \P, \p, and \X sequences that recognize Unicode character properties.
  However, only a subset of Unicode properties are supported; see the
  pcre2pattern man page for details. Escape sequences such as \d and \w in
  patterns do not by default make use of Unicode properties, but can be made to
  do so by setting the PCRE2_UCP option or starting a pattern with (*UCP).

. You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
  of the preceding, or any of the Unicode newline sequences, or the NUL (zero)
  character as indicating the end of a line. Whatever you specify at build time
  is the default; the caller of PCRE2 can change the selection at run time. The
  default newline indicator is a single LF character (the Unix standard). You
  can specify the default newline indicator by adding --enable-newline-is-cr,
  --enable-newline-is-lf, --enable-newline-is-crlf,
  --enable-newline-is-anycrlf, --enable-newline-is-any, or
  --enable-newline-is-nul to the "configure" command, respectively.

. By default, the sequence \R in a pattern matches any Unicode line ending
  sequence. This is independent of the option specifying what PCRE2 considers
  to be the end of a line (see above). However, the caller of PCRE2 can
  restrict \R to match only CR, LF, or CRLF. You can make this the default by
  adding --enable-bsr-anycrlf to the "configure" command (bsr = "backslash R").

. In a pattern, the escape sequence \C matches a single code unit, even in a
  UTF mode. This can be dangerous because it breaks up multi-code-unit
  characters. You can build PCRE2 with the use of \C permanently locked out by
  adding --enable-never-backslash-C (note the upper case C) to the "configure"
  command. When \C is allowed by the library, individual applications can lock
  it out by calling pcre2_compile() with the PCRE2_NEVER_BACKSLASH_C option.

. PCRE2 has a counter that limits the depth of nesting of parentheses in a
  pattern. This limits the amount of system stack that a pattern uses when it
  is compiled. The default is 250, but you can change it by setting, for
  example,

  --with-parens-nest-limit=500

. PCRE2 has a counter that can be set to limit the amount of computing resource
  it uses when matching a pattern. If the limit is exceeded during a match, the
  match fails. The default is ten million. You can change the default by
  setting, for example,

  --with-match-limit=500000

  on the "configure" command. This is just the default; individual calls to
  pcre2_match() or pcre2_dfa_match() can supply their own value. There is more
  discussion in the pcre2api man page (search for pcre2_set_match_limit).

. There is a separate counter that limits the depth of nested backtracking
  (pcre2_match()) or nested function calls (pcre2_dfa_match()) during a
  matching process, which indirectly limits the amount of heap memory that is
  used, and in the case of pcre2_dfa_match() the amount of stack as well. This
  counter also has a default of ten million, which is essentially "unlimited".
  You can change the default by setting, for example,

  --with-match-limit-depth=5000

  There is more discussion in the pcre2api man page (search for
  pcre2_set_depth_limit).

. You can also set an explicit limit on the amount of heap memory used by
  the pcre2_match() and pcre2_dfa_match() interpreters:

  --with-heap-limit=500

  The units are kibibytes (units of 1024 bytes). This limit does not apply when
  the JIT optimization (which has its own memory control features) is used.
  There is more discussion on the pcre2api man page (search for
  pcre2_set_heap_limit).

. In the 8-bit library, the default maximum compiled pattern size is around
  64 kibibytes. You can increase this by adding --with-link-size=3 to the
  "configure" command. PCRE2 then uses three bytes instead of two for offsets
  to different parts of the compiled pattern. In the 16-bit library,
  --with-link-size=3 is the same as --with-link-size=4, which (in both
  libraries) uses four-byte offsets. Increasing the internal link size reduces
  performance in the 8-bit and 16-bit libraries. In the 32-bit library, the
  link size setting is ignored, as 4-byte offsets are always used.

. For speed, PCRE2 uses four tables for manipulating and identifying characters
  whose code point values are less than 256. By default, it uses a set of
  tables for ASCII encoding that is part of the distribution. If you specify

  --enable-rebuild-chartables

  a program called pcre2_dftables is compiled and run in the default C locale
  when you obey "make". It builds a source file called pcre2_chartables.c. If
  you do not specify this option, pcre2_chartables.c is created as a copy of
  pcre2_chartables.c.dist. See "Character tables" below for further
  information.

. It is possible to compile PCRE2 for use on systems that use EBCDIC as their
  character code (as opposed to ASCII/Unicode) by specifying

  --enable-ebcdic --disable-unicode

  This automatically implies --enable-rebuild-chartables (see above). However,
  when PCRE2 is built this way, it always operates in EBCDIC. It cannot support
  both EBCDIC and UTF-8/16/32. There is a second option, --enable-ebcdic-nl25,
  which specifies that the code value for the EBCDIC NL character is 0x25
  instead of the default 0x15.

. If you specify --enable-debug, additional debugging code is included in the
  build. This option is intended for use by the PCRE2 maintainers.

. In environments where valgrind is installed, if you specify

  --enable-valgrind

  PCRE2 will use valgrind annotations to mark certain memory regions as
  unaddressable. This allows it to detect invalid memory accesses, and is
  mostly useful for debugging PCRE2 itself.

. In environments where the gcc compiler is used and lcov is installed, if you
  specify

  --enable-coverage

  the build process implements a code coverage report for the test suite. The
  report is generated by running "make coverage". If ccache is installed on
  your system, it must be disabled when building PCRE2 for coverage reporting.
  You can do this by setting the environment variable CCACHE_DISABLE=1 before
  running "make" to build PCRE2. There is more information about coverage
  reporting in the "pcre2build" documentation.

. When JIT support is enabled, pcre2grep automatically makes use of it, unless
  you add --disable-pcre2grep-jit to the "configure" command.

. There is support for calling external programs during matching in the
  pcre2grep command, using PCRE2's callout facility with string arguments. This
  support can be disabled by adding --disable-pcre2grep-callout to the
  "configure" command. There are two kinds of callout: one that generates
  output from inbuilt code, and another that calls an external program. The
  latter has special support for Windows and VMS; otherwise it assumes the
  existence of the fork() function. This facility can be disabled by adding
  --disable-pcre2grep-callout-fork to the "configure" command.

. The pcre2grep program currently supports only 8-bit data files, and so
  requires the 8-bit PCRE2 library. It is possible to compile pcre2grep to use
  libz and/or libbz2, in order to read .gz and .bz2 files (respectively), by
  specifying one or both of

  --enable-pcre2grep-libz
  --enable-pcre2grep-libbz2

  Of course, the relevant libraries must be installed on your system.

. The default starting size (in bytes) of the internal buffer used by pcre2grep
  can be set by, for example:

  --with-pcre2grep-bufsize=51200

  The value must be a plain integer. The default is 20480. The amount of memory
  used by pcre2grep is actually three times this number, to allow for "before"
  and "after" lines. If very long lines are encountered, the buffer is
  automatically enlarged, up to a fixed maximum size.

. The default maximum size of pcre2grep's internal buffer can be set by, for
  example:

  --with-pcre2grep-max-bufsize=2097152

  The default is either 1048576 or the value of --with-pcre2grep-bufsize,
  whichever is the larger.

. It is possible to compile pcre2test so that it links with the libreadline
  or libedit libraries, by specifying, respectively,

  --enable-pcre2test-libreadline or --enable-pcre2test-libedit

  If this is done, when pcre2test's input is from a terminal, it reads it using
  the readline() function. This provides line-editing and history facilities.
  Note that libreadline is GPL-licenced, so if you distribute a binary of
  pcre2test linked in this way, there may be licensing issues. These can be
  avoided by linking with libedit (which has a BSD licence) instead.

  Enabling libreadline causes the -lreadline option to be added to the
  pcre2test build. In many operating environments with a system-installed
  readline library this is sufficient. However, in some environments (e.g. if
  an unmodified distribution version of readline is in use), it may be
  necessary to specify something like LIBS="-lncurses" as well. This is
  because, to quote the readline INSTALL, "Readline uses the termcap functions,
  but does not link with the termcap or curses library itself, allowing
  applications which link with readline the to choose an appropriate library."
  If you get error messages about missing functions tgetstr, tgetent, tputs,
  tgetflag, or tgoto, this is the problem, and linking with the ncurses library
  should fix it.

. The C99 standard defines formatting modifiers z and t for size_t and
  ptrdiff_t values, respectively. By default, PCRE2 uses these modifiers in
  environments other than Microsoft Visual Studio versions earlier than 2013
  when __STDC_VERSION__ is defined and has a value greater than or equal to
  199901L (indicating C99). However, there is at least one environment that
  claims to be C99 but does not support these modifiers. If
  --disable-percent-zt is specified, no use is made of the z or t modifiers.
  Instead of %td or %zu, %lu is used, with a cast for size_t values.

. There is a special option called --enable-fuzz-support for use by people who
  want to run fuzzing tests on PCRE2. At present this applies only to the 8-bit
  library. If set, it causes an extra library called libpcre2-fuzzsupport.a to
  be built, but not installed. This contains a single function called
  LLVMFuzzerTestOneInput() whose arguments are a pointer to a string and the
  length of the string. When called, this function tries to compile the string
  as a pattern, and if that succeeds, to match it. This is done both with no
  options and with some random option bits that are generated from the string.
  Setting --enable-fuzz-support also causes a binary called pcre2fuzzcheck to
  be created. This is normally run under valgrind or used when PCRE2 is
  compiled with address sanitizing enabled. It calls the fuzzing function and
  outputs information about what it is doing. The input strings are specified
  by arguments: if an argument starts with "=" the rest of it is a literal
  input string. Otherwise, it is assumed to be a file name, and the contents of
  the file are the test string.

. Releases before 10.30 could be compiled with --disable-stack-for-recursion,
  which caused pcre2_match() to use individual blocks on the heap for
  backtracking instead of recursive function calls (which use the stack). This
  is now obsolete since pcre2_match() was refactored always to use the heap (in
  a much more efficient way than before). This option is retained for backwards
  compatibility, but has no effect other than to output a warning.
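
To make a few of the options above concrete, the following shell sketch puts
several of them on one "configure" command line and works out two quantities
that the text describes in words. The particular combination of options and
the variable names are illustrative choices only; the numeric values are the
ones used in the examples above.

```shell
# Illustrative values only; they mirror the examples given above.
heap_limit_kib=500        # --with-heap-limit is expressed in kibibytes
grep_bufsize=51200        # --with-pcre2grep-bufsize is expressed in bytes

configure_args="--enable-pcre2-16 --enable-pcre2-32 --enable-jit=auto \
--with-match-limit=500000 --with-heap-limit=$heap_limit_kib \
--with-pcre2grep-bufsize=$grep_bufsize"

# Derived quantities, following the rules stated in the text:
heap_limit_bytes=$((heap_limit_kib * 1024))   # kibibytes to bytes
grep_memory_bytes=$((grep_bufsize * 3))       # three times the buffer size

# Print the command instead of running it, so the sketch is safe anywhere.
echo "./configure $configure_args"
echo "heap limit: $heap_limit_bytes bytes"
echo "pcre2grep starting memory: $grep_memory_bytes bytes"
```

With these values the heap limit works out at 512000 bytes and the starting
memory used by pcre2grep at 153600 bytes.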

The "configure" script builds the following files for the basic C library:

. Makefile                  the makefile that builds the library
. src/config.h              build-time configuration options for the library
. src/pcre2.h               the public PCRE2 header file
. pcre2-config              script that shows the building settings such as CFLAGS
                              that were set for "configure"
. libpcre2-8.pc        )
. libpcre2-16.pc       )    data for the pkg-config command
. libpcre2-32.pc       )
. libpcre2-posix.pc    )
. libtool                   script that builds shared and/or static libraries

Versions of config.h and pcre2.h are distributed in the src directory of PCRE2
tarballs under the names config.h.generic and pcre2.h.generic. These are
provided for those who have to build PCRE2 without using "configure" or CMake.
If you use "configure" or CMake, the .generic versions are not used.

The "configure" script also creates config.status, which is an executable
script that can be run to recreate the configuration, and config.log, which
contains compiler output from tests that "configure" runs.

Once "configure" has run, you can run "make". This builds whichever of the
libraries libpcre2-8, libpcre2-16 and libpcre2-32 are configured, and a test
program called pcre2test. If you enabled JIT support with --enable-jit, another
test program called pcre2_jit_test is built as well. If the 8-bit library is
built, libpcre2-posix and the pcre2grep command are also built. Running
"make" with the -j option may speed up compilation on multiprocessor systems.

The command "make check" runs all the appropriate tests. Details of the PCRE2
tests are given below in a separate section of this document. The -j option of
"make" can also be used when running the tests.
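
One possible way of choosing a value for the -j option is sketched below. The
helper names make_jobs and build_and_check are hypothetical. The sketch assumes
that "getconf _NPROCESSORS_ONLN" reports the number of online processors, which
is true on many but not all systems, and it falls back to a single job when
that is not the case.

```shell
# Sketch only: both helpers are hypothetical, not part of PCRE2.
# Print a job count for "make -j", falling back to 1 if it cannot be found.
make_jobs() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case $n in
        ''|*[!0-9]*) n=1 ;;    # empty or non-numeric answer
    esac
    echo "$n"
}

# Build the libraries and programs, then run the tests with the same -j.
build_and_check() {
    jobs=$(make_jobs)
    make -j"$jobs" && make -j"$jobs" check
}
```

Calling build_and_check in a directory where "configure" has already been run
performs the two steps described above.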

You can use "make install" to install PCRE2 into live directories on your
system. The following are installed (file names are all relative to the
<prefix> that is set when "configure" is run):

  Commands (bin):
    pcre2test
    pcre2grep (if 8-bit support is enabled)
    pcre2-config

  Libraries (lib):
    libpcre2-8      (if 8-bit support is enabled)
    libpcre2-16     (if 16-bit support is enabled)
    libpcre2-32     (if 32-bit support is enabled)
    libpcre2-posix  (if 8-bit support is enabled)

  Configuration information (lib/pkgconfig):
    libpcre2-8.pc
    libpcre2-16.pc
    libpcre2-32.pc
    libpcre2-posix.pc

  Header files (include):
    pcre2.h
    pcre2posix.h

  Man pages (share/man/man{1,3}):
    pcre2grep.1
    pcre2test.1
    pcre2-config.1
    pcre2.3
    pcre2*.3 (lots more pages, all starting "pcre2")

  HTML documentation (share/doc/pcre2/html):
    index.html
    *.html (lots more pages, hyperlinked from index.html)

  Text file documentation (share/doc/pcre2):
    AUTHORS
    COPYING
    ChangeLog
    LICENCE
    NEWS
    README
    pcre2.txt         (a concatenation of the man(3) pages)
    pcre2test.txt     the pcre2test man page
    pcre2grep.txt     the pcre2grep man page
    pcre2-config.txt  the pcre2-config man page

If you want to remove PCRE2 from your system, you can run "make uninstall".
This removes all the files that "make install" installed. However, it does not
remove any directories, because these are often shared with other programs.


Retrieving configuration information
------------------------------------

Running "make install" installs the command pcre2-config, which can be used to
recall information about the PCRE2 configuration and installation. For example:

  pcre2-config --version

prints the version number, and

  pcre2-config --libs8

outputs information about where the 8-bit library is installed. This command
can be included in makefiles for programs that use PCRE2, saving the programmer
from having to remember too many details. Run pcre2-config with no arguments to
obtain a list of possible arguments.

The pkg-config command is another system for saving and retrieving information
about installed libraries. Instead of separate commands for each library, a
single command is used. For example:

  pkg-config --libs libpcre2-16

The data is held in *.pc files that are installed in a directory called
<prefix>/lib/pkgconfig.
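
A build script might combine the two mechanisms along the following lines.
This is a sketch rather than a recommended recipe: the function name
pcre2_build_flags is hypothetical, and the order of preference (pkg-config
first, then pcre2-config) is an arbitrary choice made for the example.

```shell
# Sketch only: pcre2_build_flags is a hypothetical helper.
# Print compiler and linker flags for the 8-bit library on one line.
# Tries pkg-config first, then pcre2-config; returns 1 if neither works.
pcre2_build_flags() {
    if command -v pkg-config >/dev/null 2>&1 &&
       pkg-config --exists libpcre2-8; then
        pkg-config --cflags --libs libpcre2-8
    elif command -v pcre2-config >/dev/null 2>&1; then
        echo "$(pcre2-config --cflags) $(pcre2-config --libs8)"
    else
        echo "PCRE2 8-bit library not found" >&2
        return 1
    fi
}
```

The output can then be passed to the compiler when building a program that
uses PCRE2, for example as the trailing arguments of a "cc" command.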


Shared libraries
----------------

The default distribution builds PCRE2 as shared libraries and static libraries,
as long as the operating system supports shared libraries. Shared library
support relies on the "libtool" script which is built as part of the
"configure" process.

The libtool script is used to compile and link both shared and static
libraries. They are placed in a subdirectory called .libs when they are newly
built. The programs pcre2test and pcre2grep are built to use these uninstalled
libraries (by means of wrapper scripts in the case of shared libraries). When
you use "make install" to install shared libraries, pcre2grep and pcre2test are
automatically re-built to use the newly installed shared libraries before being
installed themselves. However, the versions left in the build directory still
use the uninstalled libraries.

To build PCRE2 using static libraries only you must use --disable-shared when
configuring it. For example:

./configure --prefix=/usr/gnu --disable-shared

Then run "make" in the usual way. Similarly, you can use --disable-static to
build only shared libraries.


Cross-compiling using autotools
-------------------------------

You can specify CC and CFLAGS in the normal way to the "configure" command, in
order to cross-compile PCRE2 for some other host. However, you should NOT
specify --enable-rebuild-chartables, because if you do, the pcre2_dftables.c
source file is compiled and run on the local host, in order to generate the
inbuilt character tables (the pcre2_chartables.c file). This will probably not
work, because pcre2_dftables.c needs to be compiled with the local compiler,
not the cross compiler.

When --enable-rebuild-chartables is not specified, pcre2_chartables.c is
created by making a copy of pcre2_chartables.c.dist, which is a default set of
tables that assumes ASCII code. Cross-compiling with the default tables should
not be a problem.

If you need to modify the character tables when cross-compiling, you should
move pcre2_chartables.c.dist out of the way, then compile pcre2_dftables.c by
hand and run it on the local host to make a new version of
pcre2_chartables.c.dist. See the pcre2build section "Creating character tables
at build time" for more details.
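
The advice above can be turned into a small guard, sketched here in shell. The
function name cross_configure_args is hypothetical, the "<host>-gcc" compiler
name is only a common toolchain naming convention, and --host is the generic
autoconf option for naming the target system rather than anything specific to
PCRE2.

```shell
# Sketch only: cross_configure_args is a hypothetical helper.
# $1 = host triplet of the cross toolchain (an assumption of this example)
# Remaining arguments are extra "configure" options. The function prints one
# argument per line, or fails if --enable-rebuild-chartables is requested.
cross_configure_args() {
    host=$1
    shift
    for arg in "$@"; do
        case $arg in
            --enable-rebuild-chartables)
                echo "refusing $arg when cross-compiling" >&2
                return 1 ;;
        esac
    done
    printf '%s\n' "--host=$host" "CC=$host-gcc" "$@"
}
```

For example, "cross_configure_args arm-linux-gnueabihf --disable-shared" prints
three arguments that could be passed to "configure"; the triplet shown is just
a placeholder for whatever toolchain you actually use.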


Making new tarballs
-------------------

The command "make dist" creates three PCRE2 tarballs, in tar.gz, tar.bz2, and
zip formats. The command "make distcheck" does the same, but then does a trial
build of the new distribution to ensure that it works.

If you have modified any of the man page sources in the doc directory, you
should first run the PrepareRelease script before making a distribution. This
script creates the .txt and HTML forms of the documentation from the man pages.


Testing PCRE2
-------------

To test the basic PCRE2 library on a Unix-like system, run the RunTest script.
There is another script called RunGrepTest that tests the pcre2grep command.
When JIT support is enabled, a third test program called pcre2_jit_test is
built. Both the scripts and all the program tests are run if you obey "make
check". For other environments, see the instructions in NON-AUTOTOOLS-BUILD.

The RunTest script runs the pcre2test test program (which is documented in its
own man page) on each of the relevant testinput files in the testdata
directory, and compares the output with the contents of the corresponding
testoutput files. RunTest uses a file called testtry to hold the main output
from pcre2test. Other files whose names begin with "test" are used as working
files in some tests.

Some tests are relevant only when certain build-time options were selected. For
example, the tests for UTF-8/16/32 features are run only when Unicode support
is available. RunTest outputs a comment when it skips a test.

Many (but not all) of the tests that are not skipped are run twice if JIT
support is available. On the second run, JIT compilation is forced. This
testing can be suppressed by putting "-nojit" on the RunTest command line.

The entire set of tests is run once for each of the 8-bit, 16-bit and 32-bit
libraries that are enabled. If you want to run just one set of tests, call
RunTest with either the -8, -16 or -32 option.

If valgrind is installed, you can run the tests under it by putting "-valgrind"
on the RunTest command line. To run pcre2test on just one or more specific test
files, give their numbers as arguments to RunTest, for example:

  RunTest 2 7 11

You can also specify ranges of tests such as 3-6 or 3- (meaning 3 to the
end), or a number preceded by ~ to exclude a test. For example:

  RunTest 3-15 ~10

This runs tests 3 to 15, excluding test 10, and just ~13 runs all the tests
except test 13. Whatever order the arguments are in, the tests are always run
in numerical order.
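
The selection rules just described can be modelled with a short shell function.
This is an illustration of the rules as stated here, not RunTest's own code.
The name expand_tests is hypothetical, and the number of the last test is
passed in explicitly because it is not given above. The model assumes that
numbering starts at test 0 and it ignores options such as -8 or -nojit.

```shell
# Sketch only: expand_tests is a hypothetical model of the selection rules.
# $1 = number of the last available test (an assumption of this example)
# Remaining arguments: single numbers, ranges such as 3-6 or 3-, and ~N.
expand_tests() {
    last=$1
    shift
    selected=" "
    excluded=" "
    any_positive=no
    for arg in "$@"; do
        case $arg in
            "~"*)                       # ~N excludes test N
                excluded="$excluded${arg#"~"} " ;;
            *-*)                        # A-B, or A- meaning "A to the end"
                lo=${arg%%-*}
                hi=${arg#*-}
                [ -n "$hi" ] || hi=$last
                any_positive=yes
                i=$lo
                while [ "$i" -le "$hi" ]; do
                    selected="$selected$i "
                    i=$((i + 1))
                done ;;
            *)                          # a single test number
                any_positive=yes
                selected="$selected$arg " ;;
        esac
    done
    # Walk the tests in numerical order, whatever order the arguments had.
    # With only exclusions given, every test is a candidate.
    i=0
    out=
    while [ "$i" -le "$last" ]; do
        case $any_positive:$selected in
            no:*|yes:*" $i "*)
                case $excluded in
                    *" $i "*) ;;
                    *) out="$out${out:+ }$i" ;;
                esac ;;
        esac
        i=$((i + 1))
    done
    echo "$out"
}
```

For instance, "expand_tests 20 3-15 '~10'" prints the numbers 3 to 15 without
10, and "expand_tests 20 11 2 7" prints "2 7 11". The ~ arguments are quoted
here so that the shell passes them through unchanged.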
631
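The selection rules described above can be modelled in a few lines of C. The
following is a minimal sketch, not RunTest's actual code (RunTest is a shell
script). The names sketch_parse_number, sketch_select_tests and
SKETCH_MAX_TEST are invented for this illustration, and the upper limit of 25
is an assumption based on the highest test number described in this file.

```c
#include <stdlib.h>

/* Assumption: 25 is the highest test number described in this README. */
#define SKETCH_MAX_TEST 25

/* Parse a decimal test number at s. Store it in *out and return a pointer to
   the first unparsed character, or NULL if s does not start with a digit. */
static const char *sketch_parse_number(const char *s, long *out)
{
    char *end;
    if (*s < '0' || *s > '9') return NULL;
    *out = strtol(s, &end, 10);
    return end;
}

/* Interpret arguments such as "7", "3-6", "3-" and "~10". On success, set
   selected[n] to 1 for each test n that should run and return 0. Return -1
   for a malformed or out-of-range argument. */
int sketch_select_tests(int argc, const char *const argv[],
                        unsigned char selected[SKETCH_MAX_TEST + 1])
{
    unsigned char include[SKETCH_MAX_TEST + 1] = {0};
    unsigned char exclude[SKETCH_MAX_TEST + 1] = {0};
    int any_included = 0;

    for (int i = 0; i < argc; i++) {
        const char *arg = argv[i];
        const char *rest;
        long lo, hi;

        if (arg[0] == '~') {                   /* "~N": exclude test N */
            rest = sketch_parse_number(arg + 1, &lo);
            if (rest == NULL || *rest != '\0') return -1;
            if (lo > SKETCH_MAX_TEST) return -1;
            exclude[lo] = 1;
            continue;
        }

        rest = sketch_parse_number(arg, &lo);
        if (rest == NULL) return -1;
        if (*rest == '\0') {
            hi = lo;                           /* "N": a single test */
        } else if (rest[0] == '-' && rest[1] == '\0') {
            hi = SKETCH_MAX_TEST;              /* "N-": N to the end */
        } else if (rest[0] == '-') {           /* "N-M": a range */
            rest = sketch_parse_number(rest + 1, &hi);
            if (rest == NULL || *rest != '\0') return -1;
        } else {
            return -1;
        }
        if (lo > hi || hi > SKETCH_MAX_TEST) return -1;
        for (long n = lo; n <= hi; n++) include[n] = 1;
        any_included = 1;
    }

    /* With no positive selection, every test is a candidate. */
    for (int n = 0; n <= SKETCH_MAX_TEST; n++) {
        int wanted = any_included ? include[n] : 1;
        selected[n] = (unsigned char)(wanted && !exclude[n]);
    }
    return 0;
}
```

Because the result is an array indexed by test number, walking it from 0
upwards gives numerical order whatever order the arguments were given in.
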
You can also call RunTest with the single argument "list" to cause it to output
a list of tests.

The test sequence starts with "test 0", which is a special test that has no
input file, and whose output is not checked. This is because it will be
different on different hardware and with different configurations. The test
exists in order to exercise some of pcre2test's code that would not otherwise
be run.

Tests 1 and 2 can always be run, as they expect only plain text strings (not
UTF) and make no use of Unicode properties. The first test file can be fed
directly into the perltest.sh script to check that Perl gives the same results.
The only difference you should see is in the first few lines, where the Perl
version is given instead of the PCRE2 version. The second set of tests checks
auxiliary functions, error detection, and run-time flags that are specific to
PCRE2. It also uses the debugging flags to check some of the internals of
pcre2_compile().

If you build PCRE2 with a locale setting that is not the standard C locale, the
character tables may be different (see the "Character tables" section below).
In some cases, this may cause failures in the second set of tests. For example,
in a locale where the isprint() function yields TRUE for characters in the
range 128-255, the use of [:isascii:] inside a character class defines a
different set of characters, and this shows up in this test as a difference in
the compiled code, which is being listed for checking. Thus, where the expected
output contains [\x00-\x7f], the actual output might contain [\x00-\xff], and
similarly in some other cases. This is not a bug in PCRE2.

Test 3 checks pcre2_maketables(), the facility for building a set of character
tables for a specific locale and using them instead of the default tables. The
script uses the "locale" command to check for the availability of the "fr_FR",
"french", or "fr" locale, and uses the first one that it finds. If the "locale"
command fails, or if its output doesn't include "fr_FR", "french", or "fr" in
the list of available locales, the third test cannot be run, and a comment is
output to say why. If running this test produces an error like this:

  ** Failed to set locale "fr_FR"

it means that the given locale is not available on your system, despite being
listed by "locale". This does not mean that PCRE2 is broken. There are three
alternative output files for the third test, because three different versions
of the French locale have been encountered. The test passes if its output
matches any one of them.

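The "matches any one of them" rule for test 3 amounts to a simple disjunction.
As a minimal sketch in C (the function name sketch_matches_any is hypothetical,
and RunTest itself is a shell script that compares files rather than in-memory
strings):

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if the actual output is identical to at least one of the
   'count' alternative expected outputs, and 0 otherwise. */
int sketch_matches_any(const char *actual,
                       const char *const expected[], size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(actual, expected[i]) == 0) return 1;
    }
    return 0;
}
```

A difference from one alternative is therefore not a failure in itself. Only
a difference from every alternative counts as one.
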
Tests 4 and 5 check UTF and Unicode property support, test 4 being compatible
with the perltest.sh script, and test 5 checking PCRE2-specific things.

Tests 6 and 7 check the pcre2_dfa_match() alternative matching function, in
non-UTF mode and UTF mode with Unicode property support, respectively.

Test 8 checks some internal offsets and code size features, but it is run only
when Unicode support is enabled. The output is different in 8-bit, 16-bit, and
32-bit modes and for different link sizes, so there are different output files
for each mode and link size.

Tests 9 and 10 are run only in 8-bit mode, and tests 11 and 12 are run only in
16-bit and 32-bit modes. These are tests that generate different output in
8-bit mode. In each pair, the first test is for general cases and the second
is for Unicode support.

Test 13 checks the handling of non-UTF characters greater than 255 by
pcre2_dfa_match() in 16-bit and 32-bit modes.

Test 14 contains some special UTF and UCP tests that give different output for
different code unit widths.

Test 15 contains a number of tests that must not be run with JIT. They check,
among other non-JIT things, the match-limiting features of the interpretive
matcher.

Test 16 is run only when JIT support is not available. It checks that an
attempt to use JIT has the expected behaviour.

Test 17 is run only when JIT support is available. It checks JIT complete and
partial modes, match-limiting under JIT, and other JIT-specific features.

Tests 18 and 19 are run only in 8-bit mode. They check the POSIX interface to
the 8-bit library, without and with Unicode support, respectively.

Test 20 checks the serialization functions by writing a set of compiled
patterns to a file, and then reloading and checking them.

Tests 21 and 22 test \C support when the use of \C is not locked out, without
and with UTF support, respectively. Test 23 tests \C when it is locked out.

Tests 24 and 25 test the experimental pattern conversion functions, without and
with UTF support, respectively.


Character tables
----------------

For speed, PCRE2 uses four tables for manipulating and identifying characters
whose code point values are less than 256. By default, a set of tables that is
built into the library is used. The pcre2_maketables() function can be called
by an application to create a new set of tables in the current locale. These
are passed to PCRE2 by calling pcre2_set_character_tables() to put a pointer
into a compile context.

The source file called pcre2_chartables.c contains the default set of tables.
By default, this is created as a copy of pcre2_chartables.c.dist, which
contains tables for ASCII coding. However, if --enable-rebuild-chartables is
specified for ./configure, a new version of pcre2_chartables.c is built by the
program pcre2_dftables (compiled from pcre2_dftables.c), which uses the ANSI C
character handling functions such as isalnum(), isalpha(), isupper(),
islower(), etc. to build the table sources. This means that the default C
locale that is set for your system will control the contents of these default
tables. You can change the default tables by editing pcre2_chartables.c and
then re-building PCRE2. If you do this, you should take care to ensure that the
file does not get automatically re-generated. The best way to do this is to
move pcre2_chartables.c.dist out of the way and replace it with your customized
tables.

When the pcre2_dftables program is run as a result of specifying
--enable-rebuild-chartables, it uses the default C locale that is set on your
system. It does not pay attention to the LC_xxx environment variables. In other
words, it uses the system's default locale rather than whatever the compiling
user happens to have set. If you really do want to build a source set of
character tables in a locale that is specified by the LC_xxx variables, you can
run the pcre2_dftables program by hand with the -L option. For example:

  ./pcre2_dftables -L pcre2_chartables.c.special

The second argument names the file where the source code for the tables is
written. The first two 256-byte tables provide lower casing and case flipping
functions, respectively. The next table consists of a number of 32-byte bit
maps which identify certain character classes such as digits, "word"
characters, white space, etc. These are used when building 32-byte bit maps
that represent character classes for code points less than 256. The final
256-byte table has bits indicating various character types, as follows:

    1   white space character
    2   letter
    4   lower case letter
    8   decimal digit
   16   alphanumeric or '_'

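To make the layout described above more concrete, here is a minimal sketch in
C of how such tables can be derived from the standard character handling
functions. It is an illustration under stated assumptions, not the code of
pcre2_dftables or pcre2_maketables(): the sketch_tables structure, the
function names, and the CT_ constants are invented for this example, and only
one of the 32-byte bit maps (the one for digits) is built.

```c
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Bit values for the character-type table, as listed above. */
enum {
    CT_SPACE  = 1,    /* white space character */
    CT_LETTER = 2,    /* letter */
    CT_LOWER  = 4,    /* lower case letter */
    CT_DIGIT  = 8,    /* decimal digit */
    CT_WORD   = 16    /* alphanumeric or '_' */
};

/* Hypothetical container used only in this sketch. */
typedef struct {
    uint8_t lower[256];     /* lower casing */
    uint8_t flip[256];      /* case flipping */
    uint8_t digit_map[32];  /* bit map: bit c is set if c is a digit */
    uint8_t ctypes[256];    /* character-type bits */
} sketch_tables;

/* Fill the tables using the ctype functions in the current locale. */
void sketch_build_tables(sketch_tables *t)
{
    memset(t, 0, sizeof *t);
    for (int c = 0; c < 256; c++) {
        uint8_t bits = 0;

        t->lower[c] = (uint8_t)tolower(c);
        t->flip[c] = (uint8_t)(islower(c) ? toupper(c) : tolower(c));

        /* 256 bits fit in 32 bytes: byte c/8, bit c%8. */
        if (isdigit(c)) t->digit_map[c / 8] |= (uint8_t)(1u << (c % 8));

        if (isspace(c)) bits |= CT_SPACE;
        if (isalpha(c)) bits |= CT_LETTER;
        if (islower(c)) bits |= CT_LOWER;
        if (isdigit(c)) bits |= CT_DIGIT;
        if (isalnum(c) || c == '_') bits |= CT_WORD;
        t->ctypes[c] = bits;
    }
}

/* Test whether code point c (0-255) is present in a 32-byte bit map. */
int sketch_in_map(const uint8_t map[32], int c)
{
    return (map[c / 8] >> (c % 8)) & 1;
}
```

A C program starts in the "C" locale unless it calls setlocale(), so calling
sketch_build_tables() without doing so gives 'a' the type value 2+4+16 and '7'
the value 8+16. Calling setlocale(LC_ALL, "") first would make the result
depend on the LC_xxx environment variables instead.
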
You can also specify -b (with or without -L) when running pcre2_dftables. This
causes the tables to be written in binary instead of as source code. A set of
binary tables can be loaded into memory by an application and passed to
pcre2_compile() in the same way as tables created dynamically by calling
pcre2_maketables(). The tables are just a string of bytes, independent of
hardware characteristics such as endianness. This means they can be bundled
with an application that runs in different environments, to ensure consistent
behaviour.

See also the pcre2build section "Creating character tables at build time".


File manifest
-------------

The distribution should contain the files listed below.

(A) Source files for the PCRE2 library functions and their headers are found in
    the src directory:

  src/pcre2_dftables.c     auxiliary program for building pcre2_chartables.c
                           when --enable-rebuild-chartables is specified

  src/pcre2_chartables.c.dist  a default set of character tables that assume
                           ASCII coding; unless --enable-rebuild-chartables is
                           specified, used by copying to pcre2_chartables.c

  src/pcre2posix.c         )
  src/pcre2_auto_possess.c )
  src/pcre2_compile.c      )
  src/pcre2_config.c       )
  src/pcre2_context.c      )
  src/pcre2_convert.c      )
  src/pcre2_dfa_match.c    )
  src/pcre2_error.c        )
  src/pcre2_extuni.c       )
  src/pcre2_find_bracket.c )
  src/pcre2_jit_compile.c  )
  src/pcre2_jit_match.c    ) sources for the functions in the library,
  src/pcre2_jit_misc.c     )   and some internal functions that they use
  src/pcre2_maketables.c   )
  src/pcre2_match.c        )
  src/pcre2_match_data.c   )
  src/pcre2_newline.c      )
  src/pcre2_ord2utf.c      )
  src/pcre2_pattern_info.c )
  src/pcre2_script_run.c   )
  src/pcre2_serialize.c    )
  src/pcre2_string_utils.c )
  src/pcre2_study.c        )
  src/pcre2_substitute.c   )
  src/pcre2_substring.c    )
  src/pcre2_tables.c       )
  src/pcre2_ucd.c          )
  src/pcre2_valid_utf.c    )
  src/pcre2_xclass.c       )

  src/pcre2_printint.c     debugging function that is used by pcre2test
  src/pcre2_fuzzsupport.c  function for (optional) fuzzing support

  src/config.h.in          template for config.h, when built by "configure"
  src/pcre2.h.in           template for pcre2.h when built by "configure"
  src/pcre2posix.h         header for the external POSIX wrapper API
  src/pcre2_internal.h     header for internal use
  src/pcre2_intmodedep.h   a mode-specific internal header
  src/pcre2_ucp.h          header for Unicode property handling

  sljit/*                  source files for the JIT compiler

(B) Source files for programs that use PCRE2:

  src/pcre2demo.c          simple demonstration of coding calls to PCRE2
  src/pcre2grep.c          source of a grep utility that uses PCRE2
  src/pcre2test.c          comprehensive test program
  src/pcre2_jit_test.c     JIT test program

(C) Auxiliary files:

  132html                  script to turn "man" pages into HTML
  AUTHORS                  information about the author of PCRE2
  ChangeLog                log of changes to the code
  CleanTxt                 script to clean nroff output for txt man pages
  Detrail                  script to remove trailing spaces
  HACKING                  some notes about the internals of PCRE2
  INSTALL                  generic installation instructions
  LICENCE                  conditions for the use of PCRE2
  COPYING                  the same, using GNU's standard name
  Makefile.in              ) template for Unix Makefile, which is built by
                           )   "configure"
  Makefile.am              ) the automake input that was used to create
                           )   Makefile.in
  NEWS                     important changes in this release
  NON-AUTOTOOLS-BUILD      notes on building PCRE2 without using autotools
  PrepareRelease           script to make preparations for "make dist"
  README                   this file
  RunTest                  a Unix shell script for running tests
  RunGrepTest              a Unix shell script for pcre2grep tests
  aclocal.m4               m4 macros (generated by "aclocal")
  config.guess             ) files used by libtool,
  config.sub               )   used only when building a shared library
  configure                a configuring shell script (built by autoconf)
  configure.ac             ) the autoconf input that was used to build
                           )   "configure" and config.h
  depcomp                  ) script to find program dependencies, generated by
                           )   automake
  doc/*.3                  man page sources for PCRE2
  doc/*.1                  man page sources for pcre2grep and pcre2test
  doc/index.html.src       the base HTML page
  doc/html/*               HTML documentation
  doc/pcre2.txt            plain text version of the man pages
  doc/pcre2test.txt        plain text documentation of test program
  install-sh               a shell script for installing files
  libpcre2-8.pc.in         template for libpcre2-8.pc for pkg-config
  libpcre2-16.pc.in        template for libpcre2-16.pc for pkg-config
  libpcre2-32.pc.in        template for libpcre2-32.pc for pkg-config
  libpcre2-posix.pc.in     template for libpcre2-posix.pc for pkg-config
  ltmain.sh                file used to build a libtool script
  missing                  ) common stub for a few missing GNU programs while
                           )   installing, generated by automake
  mkinstalldirs            script for making install directories
  perltest.sh              script for running a Perl test program
  pcre2-config.in          source of script which retains PCRE2 information
  testdata/testinput*      test data for main library tests
  testdata/testoutput*     expected test results
  testdata/grep*           input and output for pcre2grep tests
  testdata/*               other supporting test files

(D) Auxiliary files for cmake support

  cmake/COPYING-CMAKE-SCRIPTS
  cmake/FindPackageHandleStandardArgs.cmake
  cmake/FindEditline.cmake
  cmake/FindReadline.cmake
  CMakeLists.txt
  config-cmake.h.in

(E) Auxiliary files for building PCRE2 "by hand"

  src/pcre2.h.generic      ) a version of the public PCRE2 header file
                           )   for use in non-"configure" environments
  src/config.h.generic     ) a version of config.h for use in non-"configure"
                           )   environments

Philip Hazel
Email local part: Philip.Hazel
Email domain: gmail.com
Last updated: 15 April 2022