.TH PCRE2POSIX 3 "26 April 2021" "PCRE2 10.37"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "SYNOPSIS"
.rs
.sp
.B #include <pcre2posix.h>
.PP
.nf
.B int pcre2_regcomp(regex_t *\fIpreg\fP, const char *\fIpattern\fP,
.B "     int \fIcflags\fP);"
.sp
.B int pcre2_regexec(const regex_t *\fIpreg\fP, const char *\fIstring\fP,
.B "     size_t \fInmatch\fP, regmatch_t \fIpmatch\fP[], int \fIeflags\fP);"
.sp
.B "size_t pcre2_regerror(int \fIerrcode\fP, const regex_t *\fIpreg\fP,"
.B "     char *\fIerrbuf\fP, size_t \fIerrbuf_size\fP);"
.sp
.B void pcre2_regfree(regex_t *\fIpreg\fP);
.fi
.
.SH DESCRIPTION
.rs
.sp
This set of functions provides a POSIX-style API for the PCRE2 regular
expression 8-bit library. There are no POSIX-style wrappers for PCRE2's 16-bit
and 32-bit libraries. See the
.\" HREF
\fBpcre2api\fP
.\"
documentation for a description of PCRE2's native API, which contains much
additional functionality.
.P
The functions described here are wrapper functions that ultimately call the
PCRE2 native API. Their prototypes are defined in the \fBpcre2posix.h\fP header
file, and they all have unique names starting with \fBpcre2_\fP. However, the
\fBpcre2posix.h\fP header also contains macro definitions that convert the
standard POSIX names such as \fBregcomp()\fP into \fBpcre2_regcomp()\fP etc. This
means that a program can use the usual POSIX names without running the risk of
accidentally linking with POSIX functions from a different library.
.P
On Unix-like systems the PCRE2 POSIX library is called \fBlibpcre2-posix\fP, so
can be accessed by adding \fB-lpcre2-posix\fP to the command for linking an
application. Because the POSIX functions call the native ones, it is also
necessary to add \fB-lpcre2-8\fP.
.P
Although they were not defined as prototypes in \fBpcre2posix.h\fP, releases
10.33 to 10.36 of the library contained functions with the POSIX names
\fBregcomp()\fP etc. These simply passed their arguments to the PCRE2
functions. These functions were provided for backwards compatibility with
earlier versions of PCRE2, which had only POSIX names. However, this has proved
troublesome in situations where a program links with several libraries, some of
which use PCRE2's POSIX interface while others use the real POSIX functions.
For this reason, the POSIX names were removed in release 10.37.
.P
Calling the header file \fBpcre2posix.h\fP avoids any conflict with other POSIX
libraries. It can, of course, be renamed or aliased as \fBregex.h\fP, which is
the "correct" name, if there is no clash. It provides two structure types,
\fIregex_t\fP for compiled internal forms, and \fIregmatch_t\fP for returning
captured substrings. It also defines some constants whose names start with
"REG_"; these are used for setting options and identifying error codes.
.
.
.SH "USING THE POSIX FUNCTIONS"
.rs
.sp
Those POSIX option bits that can reasonably be mapped to PCRE2 native options
have been implemented. In addition, the option REG_EXTENDED is defined with the
value zero. This has no effect, but since programs that are written to the
POSIX interface often use it, this makes it easier to slot in PCRE2 as a
replacement library. Other POSIX options are not even defined.
.P
There are also some options that are not defined by POSIX. These have been
added at the request of users who want to make use of certain PCRE2-specific
features via the POSIX calling interface or to add BSD or GNU functionality.
.P
When PCRE2 is called via these functions, it is only the API that is POSIX-like
in style. The syntax and semantics of the regular expressions themselves are
still those of Perl, subject to the setting of various PCRE2 options, as
described below. "POSIX-like in style" means that the API approximates to the
POSIX definition; it is not fully POSIX-compatible, and in multi-unit encoding
domains it is probably even less compatible.
.P
The descriptions below use the actual names of the functions, but, as described
above, the standard POSIX names (without the \fBpcre2_\fP prefix) may also be
used.
.
.
.SH "COMPILING A PATTERN"
.rs
.sp
The function \fBpcre2_regcomp()\fP is called to compile a pattern into an
internal form. By default, the pattern is a C string terminated by a binary
zero (but see REG_PEND below). The \fIpreg\fP argument is a pointer to a
\fBregex_t\fP structure that is used as a base for storing information about
the compiled regular expression. (It is also used for input when REG_PEND is
set.)
.P
The argument \fIcflags\fP is either zero, or contains one or more of the bits
defined by the following macros:
.sp
  REG_DOTALL
.sp
The PCRE2_DOTALL option is set when the regular expression is passed for
compilation to the native function. Note that REG_DOTALL is not part of the
POSIX standard.
.sp
  REG_ICASE
.sp
The PCRE2_CASELESS option is set when the regular expression is passed for
compilation to the native function.
.sp
  REG_NEWLINE
.sp
The PCRE2_MULTILINE option is set when the regular expression is passed for
compilation to the native function. Note that this does \fInot\fP mimic the
defined POSIX behaviour for REG_NEWLINE (see the following section).
.sp
  REG_NOSPEC
.sp
The PCRE2_LITERAL option is set when the regular expression is passed for
compilation to the native function. This disables all meta characters in the
pattern, causing it to be treated as a literal string. The only other options
that are allowed with REG_NOSPEC are REG_ICASE, REG_NOSUB, REG_PEND, and
REG_UTF. Note that REG_NOSPEC is not part of the POSIX standard.
.sp
  REG_NOSUB
.sp
When a pattern that is compiled with this flag is passed to
\fBpcre2_regexec()\fP for matching, the \fInmatch\fP and \fIpmatch\fP arguments
are ignored, and no captured strings are returned. Versions of the PCRE2 library
prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this
no longer happens because it disables the use of backreferences.
.sp
  REG_PEND
.sp
If this option is set, the \fBre_endp\fP field in the \fIpreg\fP structure
(which has the type const char *) must be set to point to the character beyond
the end of the pattern before calling \fBpcre2_regcomp()\fP. The pattern itself
may now contain binary zeros, which are treated as data characters. Without
REG_PEND, a binary zero terminates the pattern and the \fBre_endp\fP field is
ignored. This is a GNU extension to the POSIX standard and should be used with
caution in software intended to be portable to other systems.
.sp
  REG_UCP
.sp
The PCRE2_UCP option is set when the regular expression is passed for
compilation to the native function. This causes PCRE2 to use Unicode properties
when matching \ed, \ew, etc., instead of just recognizing ASCII values. Note
that REG_UCP is not part of the POSIX standard.
.sp
  REG_UNGREEDY
.sp
The PCRE2_UNGREEDY option is set when the regular expression is passed for
compilation to the native function. Note that REG_UNGREEDY is not part of the
POSIX standard.
.sp
  REG_UTF
.sp
The PCRE2_UTF option is set when the regular expression is passed for
compilation to the native function. This causes the pattern itself and all data
strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF
is not part of the POSIX standard.
.P
In the absence of these flags, no options are passed to the native function.
This means that the regex is compiled with PCRE2 default semantics. In
particular, the way it handles newline characters in the subject string is the
Perl way, not the POSIX way. Note that setting PCRE2_MULTILINE has only
\fIsome\fP of the effects specified for REG_NEWLINE. It does not affect the way
newlines are matched by the dot metacharacter (they are not) or by a negative
class such as [^a] (they are).
.P
The yield of \fBpcre2_regcomp()\fP is zero on success, and non-zero otherwise.
The \fIpreg\fP structure is filled in on success, and one other member of the
structure (as well as \fIre_endp\fP) is public: \fIre_nsub\fP contains the
number of capturing subpatterns in the regular expression. Various error codes
are defined in the header file.
.P
NOTE: If the yield of \fBpcre2_regcomp()\fP is non-zero, you must not attempt
to use the contents of the \fIpreg\fP structure. If, for example, you pass it
to \fBpcre2_regexec()\fP, the result is undefined and your program is likely to
crash.
.
.
.SH "MATCHING NEWLINE CHARACTERS"
.rs
.sp
This area is not simple, because POSIX and Perl take different views of things.
It is not possible to get PCRE2 to obey POSIX semantics, but then PCRE2 was
never intended to be a POSIX engine. The following table lists the different
possibilities for matching newline characters in Perl and PCRE2:
.sp
                          Default   Change with
.sp
  . matches newline          no     PCRE2_DOTALL
  newline matches [^a]       yes    not changeable
  $ matches \en at end        yes    PCRE2_DOLLAR_ENDONLY
  $ matches \en in middle     no     PCRE2_MULTILINE
  ^ matches \en in middle     no     PCRE2_MULTILINE
.sp
This is the equivalent table for a POSIX-compatible pattern matcher:
.sp
                          Default   Change with
.sp
  . matches newline          yes    REG_NEWLINE
  newline matches [^a]       yes    REG_NEWLINE
  $ matches \en at end        no     REG_NEWLINE
  $ matches \en in middle     no     REG_NEWLINE
  ^ matches \en in middle     no     REG_NEWLINE
.sp
This behaviour is not what happens when PCRE2 is called via its POSIX
API. By default, PCRE2's behaviour is the same as Perl's, except that there is
no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2 and Perl, there
is no way to stop newline from matching [^a].
.P
Default POSIX newline handling can be obtained by setting PCRE2_DOTALL and
PCRE2_DOLLAR_ENDONLY when calling \fBpcre2_compile()\fP directly, but there is
no way to make PCRE2 behave exactly as for the REG_NEWLINE action. When using
the POSIX API, passing REG_NEWLINE to PCRE2's \fBpcre2_regcomp()\fP function
causes PCRE2_MULTILINE to be passed to \fBpcre2_compile()\fP, and REG_DOTALL
passes PCRE2_DOTALL. There is no way to pass PCRE2_DOLLAR_ENDONLY.
.
.
.SH "MATCHING A PATTERN"
.rs
.sp
The function \fBpcre2_regexec()\fP is called to match a compiled pattern
\fIpreg\fP against a given \fIstring\fP, which is by default terminated by a
zero byte (but see REG_STARTEND below), subject to the options in \fIeflags\fP.
These can be:
.sp
  REG_NOTBOL
.sp
The PCRE2_NOTBOL option is set when calling the underlying PCRE2 matching
function.
.sp
  REG_NOTEMPTY
.sp
The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2 matching
function. Note that REG_NOTEMPTY is not part of the POSIX standard. However,
setting this option can give more POSIX-like behaviour in some situations.
.sp
  REG_NOTEOL
.sp
The PCRE2_NOTEOL option is set when calling the underlying PCRE2 matching
function.
.sp
  REG_STARTEND
.sp
When this option is set, the subject string starts at \fIstring\fP +
\fIpmatch[0].rm_so\fP and ends at \fIstring\fP + \fIpmatch[0].rm_eo\fP, which
should point to the first character beyond the string. There may be binary
zeros within the subject string, and indeed, using REG_STARTEND is the only
way to pass a subject string that contains a binary zero.
.P
Whatever the value of \fIpmatch[0].rm_so\fP, the offsets of the matched string
and any captured substrings are still given relative to the start of
\fIstring\fP itself. (Before PCRE2 release 10.30 these were given relative to
\fIstring\fP + \fIpmatch[0].rm_so\fP, but this differs from other
implementations.)
.P
This is a BSD extension, compatible with but not specified by IEEE Standard
1003.2 (POSIX.2), and should be used with caution in software intended to be
portable to other systems. Note that a non-zero \fIrm_so\fP does not imply
REG_NOTBOL; REG_STARTEND affects only the location and length of the string,
not how it is matched. Setting REG_STARTEND and passing \fIpmatch\fP as NULL
are mutually exclusive; if both are done, the error REG_INVARG is returned.
.P
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
strings is returned. The \fInmatch\fP and \fIpmatch\fP arguments of
\fBpcre2_regexec()\fP are ignored (except possibly as input for REG_STARTEND).
.P
The value of \fInmatch\fP may be zero, and the value of \fIpmatch\fP may be NULL
(unless REG_STARTEND is set); in both these cases no data about any matched
strings is returned.
.P
Otherwise, the portion of the string that was matched, and also any captured
substrings, are returned via the \fIpmatch\fP argument, which points to an
array of \fInmatch\fP structures of type \fIregmatch_t\fP, containing the
members \fIrm_so\fP and \fIrm_eo\fP. These contain the byte offset to the first
character of each substring and the offset to the first character after the end
of each substring, respectively. The 0th element of the vector relates to the
entire portion of \fIstring\fP that was matched; subsequent elements relate to
the capturing subpatterns of the regular expression. Unused entries in the
array have both structure members set to -1.
.P
A successful match yields a zero return; various error codes are defined in the
header file, of which REG_NOMATCH is the "expected" failure code.
.
.
.SH "ERROR MESSAGES"
.rs
.sp
The \fBpcre2_regerror()\fP function maps a non-zero error code from either
\fBpcre2_regcomp()\fP or \fBpcre2_regexec()\fP to a printable message. If
\fIpreg\fP is not NULL, the error should have arisen from the use of that
structure. A message terminated by a binary zero is placed in \fIerrbuf\fP. If
the buffer is too short, only the first \fIerrbuf_size\fP - 1 characters of the
error message are used. The yield of the function is the size of buffer needed
to hold the whole message, including the terminating zero. This value is
greater than \fIerrbuf_size\fP if the message was truncated.
.
.
.SH MEMORY USAGE
.rs
.sp
Compiling a regular expression causes memory to be allocated and associated
with the \fIpreg\fP structure. The function \fBpcre2_regfree()\fP frees all
such memory, after which \fIpreg\fP may no longer be used as a compiled
expression.
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 26 April 2021
Copyright (c) 1997-2021 University of Cambridge.
.fi
Nick Kralevichf73ff172014-09-27 12:41:49 -0700329.fi