| .TH PCRE2SYNTAX 3 "11 February 2019" "PCRE2 10.33" |
| .SH NAME |
| PCRE2 - Perl-compatible regular expressions (revised API) |
| .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY" |
| .rs |
| .sp |
| The full syntax and semantics of the regular expressions that are supported by |
| PCRE2 are described in the |
| .\" HREF |
| \fBpcre2pattern\fP |
| .\" |
| documentation. This document contains a quick-reference summary of the syntax. |
| . |
| . |
| .SH "QUOTING" |
| .rs |
| .sp |
| \ex where x is non-alphanumeric is a literal x |
| \eQ...\eE treat enclosed characters as literal |
| . |
| . |
| .SH "ESCAPED CHARACTERS" |
| .rs |
| .sp |
| This table applies to ASCII and Unicode environments. An unrecognized escape |
| sequence causes an error. |
| .sp |
| \ea alarm, that is, the BEL character (hex 07) |
| \ecx "control-x", where x is any ASCII printing character |
| \ee escape (hex 1B) |
| \ef form feed (hex 0C) |
| \en newline (hex 0A) |
| \er carriage return (hex 0D) |
| \et tab (hex 09) |
| \e0dd character with octal code 0dd |
| \eddd character with octal code ddd, or backreference |
| \eo{ddd..} character with octal code ddd.. |
| \eN{U+hh..} character with Unicode code point hh.. (Unicode mode only) |
| \exhh character with hex code hh |
| \ex{hh..} character with hex code hh.. |
| .sp |
| If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the |
| following are also recognized: |
| .sp |
| \eU the character "U" |
| \euhhhh character with hex code hhhh |
| \eu{hh..} character with hex code hh.. but only for EXTRA_ALT_BSUX |
| .sp |
| When \ex is not followed by {, from zero to two hexadecimal digits are read, |
| but in ALT_BSUX mode \ex must be followed by two hexadecimal digits to be |
| recognized as a hexadecimal escape; otherwise it matches a literal "x". |
| Likewise, if \eu (in ALT_BSUX mode) is not followed by four hexadecimal digits |
| or (in EXTRA_ALT_BSUX mode) a sequence of hex digits in curly brackets, it |
| matches a literal "u". |
| .P |
| Note that \e0dd is always an octal code. The treatment of backslash followed by |
| a non-zero digit is complicated; for details see the section |
| .\" HTML <a href="pcre2pattern.html#digitsafterbackslash"> |
| .\" </a> |
| "Non-printing characters" |
| .\" |
| in the |
| .\" HREF |
| \fBpcre2pattern\fP |
| .\" |
| documentation, where details of escape processing in EBCDIC environments are |
| also given. \eN{U+hh..} is synonymous with \ex{hh..} in PCRE2 but is not |
| supported in EBCDIC environments. Note that \eN not followed by an opening |
| curly bracket has a different meaning (see below). |
| . |
| . |
| .SH "CHARACTER TYPES" |
| .rs |
| .sp |
| . any character except newline; |
| in dotall mode, any character whatsoever |
| \eC one code unit, even in UTF mode (best avoided) |
| \ed a decimal digit |
| \eD a character that is not a decimal digit |
| \eh a horizontal white space character |
| \eH a character that is not a horizontal white space character |
| \eN a character that is not a newline |
| \ep{\fIxx\fP} a character with the \fIxx\fP property |
| \eP{\fIxx\fP} a character without the \fIxx\fP property |
| \eR a newline sequence |
| \es a white space character |
| \eS a character that is not a white space character |
| \ev a vertical white space character |
| \eV a character that is not a vertical white space character |
| \ew a "word" character |
| \eW a "non-word" character |
| \eX a Unicode extended grapheme cluster |
| .sp |
| \eC is dangerous because it may leave the current matching point in the middle |
| of a UTF-8 or UTF-16 character. The application can lock out the use of \eC by |
| setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2 |
| with the use of \eC permanently disabled. |
| .P |
| By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode |
| or in the 16-bit and 32-bit libraries. However, if locale-specific matching is |
| happening, \es and \ew may also match characters with code points in the range |
| 128-255. If the PCRE2_UCP option is set, the behaviour of these escape |
| sequences is changed to use Unicode properties and they match many more |
| characters. |
| . |
| . |
| .SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP" |
| .rs |
| .sp |
| C Other |
| Cc Control |
| Cf Format |
| Cn Unassigned |
| Co Private use |
| Cs Surrogate |
| .sp |
| L Letter |
| Ll Lower case letter |
| Lm Modifier letter |
| Lo Other letter |
| Lt Title case letter |
| Lu Upper case letter |
| L& Ll, Lu, or Lt |
| .sp |
| M Mark |
| Mc Spacing mark |
| Me Enclosing mark |
| Mn Non-spacing mark |
| .sp |
| N Number |
| Nd Decimal number |
| Nl Letter number |
| No Other number |
| .sp |
| P Punctuation |
| Pc Connector punctuation |
| Pd Dash punctuation |
| Pe Close punctuation |
| Pf Final punctuation |
| Pi Initial punctuation |
| Po Other punctuation |
| Ps Open punctuation |
| .sp |
| S Symbol |
| Sc Currency symbol |
| Sk Modifier symbol |
| Sm Mathematical symbol |
| So Other symbol |
| .sp |
| Z Separator |
| Zl Line separator |
| Zp Paragraph separator |
| Zs Space separator |
| . |
| . |
| .SH "PCRE2 SPECIAL CATEGORY PROPERTIES FOR \ep and \eP" |
| .rs |
| .sp |
| Xan Alphanumeric: union of properties L and N |
| Xps POSIX space: property Z or tab, NL, VT, FF, CR |
| Xsp Perl space: property Z or tab, NL, VT, FF, CR |
| Xuc Universally-named character: one that can be |
| represented by a Universal Character Name |
| Xwd Perl word: property Xan or underscore |
| .sp |
| Perl and POSIX space are now the same. Perl added VT to its space character set |
| at release 5.18. |
| . |
| . |
| .SH "SCRIPT NAMES FOR \ep AND \eP" |
| .rs |
| .sp |
| Adlam, |
| Ahom, |
| Anatolian_Hieroglyphs, |
| Arabic, |
| Armenian, |
| Avestan, |
| Balinese, |
| Bamum, |
| Bassa_Vah, |
| Batak, |
| Bengali, |
| Bhaiksuki, |
| Bopomofo, |
| Brahmi, |
| Braille, |
| Buginese, |
| Buhid, |
| Canadian_Aboriginal, |
| Carian, |
| Caucasian_Albanian, |
| Chakma, |
| Cham, |
| Cherokee, |
| Common, |
| Coptic, |
| Cuneiform, |
| Cypriot, |
| Cyrillic, |
| Deseret, |
| Devanagari, |
| Dogra, |
| Duployan, |
| Egyptian_Hieroglyphs, |
| Elbasan, |
| Ethiopic, |
| Georgian, |
| Glagolitic, |
| Gothic, |
| Grantha, |
| Greek, |
| Gujarati, |
| Gunjala_Gondi, |
| Gurmukhi, |
| Han, |
| Hangul, |
| Hanifi_Rohingya, |
| Hanunoo, |
| Hatran, |
| Hebrew, |
| Hiragana, |
| Imperial_Aramaic, |
| Inherited, |
| Inscriptional_Pahlavi, |
| Inscriptional_Parthian, |
| Javanese, |
| Kaithi, |
| Kannada, |
| Katakana, |
| Kayah_Li, |
| Kharoshthi, |
| Khmer, |
| Khojki, |
| Khudawadi, |
| Lao, |
| Latin, |
| Lepcha, |
| Limbu, |
| Linear_A, |
| Linear_B, |
| Lisu, |
| Lycian, |
| Lydian, |
| Mahajani, |
| Makasar, |
| Malayalam, |
| Mandaic, |
| Manichaean, |
| Marchen, |
| Masaram_Gondi, |
| Medefaidrin, |
| Meetei_Mayek, |
| Mende_Kikakui, |
| Meroitic_Cursive, |
| Meroitic_Hieroglyphs, |
| Miao, |
| Modi, |
| Mongolian, |
| Mro, |
| Multani, |
| Myanmar, |
| Nabataean, |
| New_Tai_Lue, |
| Newa, |
| Nko, |
| Nushu, |
| Ogham, |
| Ol_Chiki, |
| Old_Hungarian, |
| Old_Italic, |
| Old_North_Arabian, |
| Old_Permic, |
| Old_Persian, |
| Old_Sogdian, |
| Old_South_Arabian, |
| Old_Turkic, |
| Oriya, |
| Osage, |
| Osmanya, |
| Pahawh_Hmong, |
| Palmyrene, |
| Pau_Cin_Hau, |
| Phags_Pa, |
| Phoenician, |
| Psalter_Pahlavi, |
| Rejang, |
| Runic, |
| Samaritan, |
| Saurashtra, |
| Sharada, |
| Shavian, |
| Siddham, |
| SignWriting, |
| Sinhala, |
| Sogdian, |
| Sora_Sompeng, |
| Soyombo, |
| Sundanese, |
| Syloti_Nagri, |
| Syriac, |
| Tagalog, |
| Tagbanwa, |
| Tai_Le, |
| Tai_Tham, |
| Tai_Viet, |
| Takri, |
| Tamil, |
| Tangut, |
| Telugu, |
| Thaana, |
| Thai, |
| Tibetan, |
| Tifinagh, |
| Tirhuta, |
| Ugaritic, |
| Vai, |
| Warang_Citi, |
| Yi, |
| Zanabazar_Square. |
| . |
| . |
| .SH "CHARACTER CLASSES" |
| .rs |
| .sp |
| [...] positive character class |
| [^...] negative character class |
| [x-y] range (can be used for hex characters) |
| [[:xxx:]] positive POSIX named set |
| [[:^xxx:]] negative POSIX named set |
| .sp |
| alnum alphanumeric |
| alpha alphabetic |
| ascii 0-127 |
| blank space or tab |
| cntrl control character |
| digit decimal digit |
| graph printing, excluding space |
| lower lower case letter |
| print printing, including space |
| punct printing, excluding alphanumeric |
| space white space |
| upper upper case letter |
| word same as \ew |
| xdigit hexadecimal digit |
| .sp |
| In PCRE2, POSIX character set names recognize only ASCII characters by default, |
| but some of them use Unicode properties if PCRE2_UCP is set. You can use |
| \eQ...\eE inside a character class. |
| . |
| . |
| .SH "QUANTIFIERS" |
| .rs |
| .sp |
| ? 0 or 1, greedy |
| ?+ 0 or 1, possessive |
| ?? 0 or 1, lazy |
| * 0 or more, greedy |
| *+ 0 or more, possessive |
| *? 0 or more, lazy |
| + 1 or more, greedy |
| ++ 1 or more, possessive |
| +? 1 or more, lazy |
| {n} exactly n |
| {n,m} at least n, no more than m, greedy |
| {n,m}+ at least n, no more than m, possessive |
| {n,m}? at least n, no more than m, lazy |
| {n,} n or more, greedy |
| {n,}+ n or more, possessive |
| {n,}? n or more, lazy |
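| .P |
| As a rough illustration of the greedy and lazy forms listed above, the sketch |
| below uses Python's standard re module as a stand-in. That module is not |
| PCRE2; it is assumed here only because it treats these particular quantifiers |
| in the same way. The subject strings are made up for the example. |
| .sp |
```python
import re

# Assumption: Python's re module stands in for PCRE2; it is a different
# engine that happens to share these greedy and lazy quantifier forms.
subject = "<a><b>"

# Greedy: .+ takes as much as it can, then backs off to the last ">".
greedy = re.search("<.+>", subject).group()

# Lazy: .+? takes as little as it can, so it stops at the first ">".
lazy = re.search("<.+?>", subject).group()

# {n,m} is greedy by default; adding ? makes it lazy.
bounded_greedy = re.search("a{2,3}", "aaaa").group()
bounded_lazy = re.search("a{2,3}?", "aaaa").group()

print(greedy, lazy, bounded_greedy, bounded_lazy)
```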
| . |
| . |
| .SH "ANCHORS AND SIMPLE ASSERTIONS" |
| .rs |
| .sp |
| \eb word boundary |
| \eB not a word boundary |
| ^ start of subject |
| also after an internal newline in multiline mode |
| (after any newline if PCRE2_ALT_CIRCUMFLEX is set) |
| \eA start of subject |
| $ end of subject |
| also before newline at end of subject |
| also before internal newline in multiline mode |
| \eZ end of subject |
| also before newline at end of subject |
| \ez end of subject |
| \eG first matching position in subject |
| . |
| . |
| .SH "REPORTED MATCH POINT SETTING" |
| .rs |
| .sp |
| \eK set reported start of match |
| .sp |
| \eK is honoured in positive assertions, but ignored in negative ones. |
| . |
| . |
| .SH "ALTERNATION" |
| .rs |
| .sp |
| expr|expr|expr... |
| . |
| . |
| .SH "CAPTURING" |
| .rs |
| .sp |
| (...) capture group |
| (?<name>...) named capture group (Perl) |
| (?'name'...) named capture group (Perl) |
| (?P<name>...) named capture group (Python) |
| (?:...) non-capture group |
| (?|...) non-capture group; reset group numbers for |
| capture groups in each alternative |
| .sp |
| In non-UTF modes, names may contain underscores and ASCII letters and digits; |
| in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In |
| both cases, a name must not start with a digit. |
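| .P |
| The sketch below shows a named capture group and a non-capture group. It uses |
| Python's standard re module as a stand-in, which is an assumption of this |
| example: that module is not PCRE2 and accepts only the Python spelling |
| (?P<name>...) of the named forms listed above. The group names are invented. |
| .sp |
```python
import re

# Assumption: Python's re module stands in for PCRE2 and supports only the
# (?P<name>...) spelling of a named capture group.
date = re.compile("(?P<year>[0-9]{4})-(?P<month>[0-9]{2})")
fields = date.match("2019-02").groupdict()

# (?:...) groups without capturing, so only "(c)" receives a number.
groups = re.match("(?:ab)+(c)", "ababc").groups()

# A name that starts with a digit is rejected when the pattern is compiled.
try:
    re.compile("(?P<1x>a)")
    digit_first_ok = True
except re.error:
    digit_first_ok = False

print(fields, groups, digit_first_ok)
```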
| . |
| . |
| .SH "ATOMIC GROUPS" |
| .rs |
| .sp |
| (?>...) atomic non-capture group |
| (*atomic:...) atomic non-capture group |
| . |
| . |
| .SH "COMMENT" |
| .rs |
| .sp |
| (?#....) comment (not nestable) |
| . |
| . |
| .SH "OPTION SETTING" |
| .rs |
| .sp |
| Changes of these options within a group are automatically cancelled at the end |
| of the group. |
| .sp |
| (?i) caseless |
| (?J) allow duplicate names |
| (?m) multiline |
| (?n) no auto capture |
| (?s) single line (dotall) |
| (?U) default ungreedy (lazy) |
| (?x) extended: ignore white space except in classes |
| (?xx) as (?x) but also ignore space and tab in classes |
| (?-...) unset option(s) |
| (?^) unset imnsx options |
| .sp |
| Unsetting x or xx unsets both. Several options may be set at once, and a |
| mixture of setting and unsetting such as (?i-x) is allowed, but there may be |
| only one hyphen. Setting (but not unsetting) is allowed after (?^, for example |
| (?^in). An option setting may appear at the start of a non-capture group, for |
| example (?i:...). |
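| .P |
| To illustrate how an option set at the start of a non-capture group is |
| cancelled at the end of that group, the sketch below uses Python's standard |
| re module as a stand-in. That module is not PCRE2 and supports only some of |
| the options listed above; the (?i) and (?i:...) forms are assumed to behave |
| alike in both. |
| .sp |
```python
import re

# Assumption: Python's re module stands in for PCRE2 for these two forms.

# (?i:...) makes only "ab" caseless; the setting ends with the group,
# so the following "c" is still matched case-sensitively.
scoped = re.compile("(?i:ab)c")
scoped_hit = scoped.fullmatch("ABc") is not None
scoped_miss = scoped.fullmatch("ABC") is not None

# (?i) at the very start applies to the whole pattern.
whole_hit = re.fullmatch("(?i)abc", "ABC") is not None

print(scoped_hit, scoped_miss, whole_hit)
```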
| .P |
| The following are recognized only at the very start of a pattern or after one |
| of the newline or \eR options with similar syntax. More than one of them may |
| appear. For the first three, d is a decimal number. |
| .sp |
| (*LIMIT_DEPTH=d) set the backtracking limit to d |
| (*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes |
| (*LIMIT_MATCH=d) set the match limit to d |
| (*NOTEMPTY) set PCRE2_NOTEMPTY when matching |
| (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching |
| (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS) |
| (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR) |
| (*NO_JIT) disable JIT optimization |
| (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE) |
| (*UTF) set appropriate UTF mode for the library in use |
| (*UCP) set PCRE2_UCP (use Unicode properties for \ed etc) |
| .sp |
| Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the value of |
| the limits set by the caller of \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP, |
| not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The |
| application can lock out the use of (*UTF) and (*UCP) by setting the |
| PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time. |
| . |
| . |
| .SH "NEWLINE CONVENTION" |
| .rs |
| .sp |
| These are recognized only at the very start of the pattern or after option |
| settings with a similar syntax. |
| .sp |
| (*CR) carriage return only |
| (*LF) linefeed only |
| (*CRLF) carriage return followed by linefeed |
| (*ANYCRLF) all three of the above |
| (*ANY) any Unicode newline sequence |
| (*NUL) the NUL character (binary zero) |
| . |
| . |
| .SH "WHAT \eR MATCHES" |
| .rs |
| .sp |
| These are recognized only at the very start of the pattern or after option |
| settings with a similar syntax. |
| .sp |
| (*BSR_ANYCRLF) CR, LF, or CRLF |
| (*BSR_UNICODE) any Unicode newline sequence |
| . |
| . |
| .SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS" |
| .rs |
| .sp |
|   (?=...)                     ) |
|   (*pla:...)                  ) positive lookahead |
|   (*positive_lookahead:...)   ) |
| .sp |
|   (?!...)                     ) |
|   (*nla:...)                  ) negative lookahead |
|   (*negative_lookahead:...)   ) |
| .sp |
|   (?<=...)                    ) |
|   (*plb:...)                  ) positive lookbehind |
|   (*positive_lookbehind:...)  ) |
| .sp |
|   (?<!...)                    ) |
|   (*nlb:...)                  ) negative lookbehind |
|   (*negative_lookbehind:...)  ) |
| .sp |
| Each top-level branch of a lookbehind must be of a fixed length. |
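| .P |
| The sketch below shows the (?=...), (?<=...) and (?<!...) forms with |
| fixed-length assertions. It uses Python's standard re module as a stand-in, |
| which is an assumption of this example: that module is not PCRE2, does not |
| accept the (*pla:...) style names, and is stricter about lookbehind length |
| than the rule stated above. The subject string is made up. |
| .sp |
```python
import re

# Assumption: Python's re module stands in for PCRE2; only the (?=...),
# (?<=...) and (?<!...) spellings are used, each with a fixed length.
subject = "USD100 EUR200 USD300"

# Positive lookahead: a three-letter code only when "100" follows it.
before_100 = re.findall("[A-Z]{3}(?=100)", subject)

# Positive lookbehind: digits only when "USD" comes immediately before.
after_usd = re.findall("(?<=USD)[0-9]+", subject)

# Negative lookbehind: digits preceded neither by "USD" nor by a digit.
not_usd = re.findall("(?<!USD)(?<![0-9])[0-9]+", subject)

print(before_100, after_usd, not_usd)
```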
| . |
| . |
| .SH "SCRIPT RUNS" |
| .rs |
| .sp |
|   (*script_run:...)           ) script run, can be backtracked into |
|   (*sr:...)                   ) |
| .sp |
|   (*atomic_script_run:...)    ) atomic script run |
|   (*asr:...)                  ) |
| . |
| . |
| .SH "BACKREFERENCES" |
| .rs |
| .sp |
| \en reference by number (can be ambiguous) |
| \egn reference by number |
| \eg{n} reference by number |
| \eg+n relative reference by number (PCRE2 extension) |
| \eg-n relative reference by number |
| \eg{+n} relative reference by number (PCRE2 extension) |
| \eg{-n} relative reference by number |
| \ek<name> reference by name (Perl) |
| \ek'name' reference by name (Perl) |
| \eg{name} reference by name (Perl) |
| \ek{name} reference by name (.NET) |
| (?P=name) reference by name (Python) |
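| .P |
| As an illustration of a reference by name, the sketch below finds a doubled |
| word using the Python form (?P=name). It uses Python's standard re module as |
| a stand-in, which is an assumption of this example: that module is not PCRE2 |
| and does not accept the \ek or \eg forms listed above. The group name w and |
| the subject strings are invented. |
| .sp |
```python
import re

# Assumption: Python's re module stands in for PCRE2; only the
# (?P<name>...) and (?P=name) spellings are used.
doubled = re.compile("(?P<w>[a-z]+) (?P=w)")

# The backreference must match the same text that group "w" captured.
m = doubled.search("it is is fine")
whole = m.group(0)
word = m.group("w")

# With no repeated word there is nothing for the reference to match.
no_match = doubled.search("it is fine") is None

print(whole, word, no_match)
```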
| . |
| . |
| .SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)" |
| .rs |
| .sp |
| (?R) recurse whole pattern |
| (?n) call subroutine by absolute number |
| (?+n) call subroutine by relative number |
| (?-n) call subroutine by relative number |
| (?&name) call subroutine by name (Perl) |
| (?P>name) call subroutine by name (Python) |
| \eg<name> call subroutine by name (Oniguruma) |
| \eg'name' call subroutine by name (Oniguruma) |
| \eg<n> call subroutine by absolute number (Oniguruma) |
| \eg'n' call subroutine by absolute number (Oniguruma) |
| \eg<+n> call subroutine by relative number (PCRE2 extension) |
| \eg'+n' call subroutine by relative number (PCRE2 extension) |
| \eg<-n> call subroutine by relative number (PCRE2 extension) |
| \eg'-n' call subroutine by relative number (PCRE2 extension) |
| . |
| . |
| .SH "CONDITIONAL PATTERNS" |
| .rs |
| .sp |
| (?(condition)yes-pattern) |
| (?(condition)yes-pattern|no-pattern) |
| .sp |
| (?(n) absolute reference condition |
| (?(+n) relative reference condition |
| (?(-n) relative reference condition |
| (?(<name>) named reference condition (Perl) |
| (?('name') named reference condition (Perl) |
| (?(name) named reference condition (PCRE2, deprecated) |
| (?(R) overall recursion condition |
| (?(Rn) specific numbered group recursion condition |
| (?(R&name) specific named group recursion condition |
| (?(DEFINE) define groups for reference |
| (?(VERSION[>]=n.m) test PCRE2 version |
| (?(assert) assertion condition |
| .sp |
| Note the ambiguity of (?(R) and (?(Rn), which might be named reference |
| conditions or recursion tests. Such a condition is interpreted as a reference |
| condition if the relevant named group exists. |
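| .P |
| The sketch below shows an absolute reference condition, (?(1)...), that |
| requires a closing bracket only if an opening one was captured. It uses |
| Python's standard re module as a stand-in, which is an assumption of this |
| example: that module is not PCRE2 and supports only the numbered and plain |
| named conditions, not the recursion, DEFINE, VERSION or assertion forms. |
| .sp |
```python
import re

# Assumption: Python's re module stands in for PCRE2 for the (?(n)...) form.
# Group 1 optionally captures "<"; the condition (?(1)>) then demands ">"
# only when group 1 took part in the match.
tag = re.compile("(<)?[a-z]+(?(1)>)")

both = tag.fullmatch("<abc>") is not None
neither = tag.fullmatch("abc") is not None
open_only = tag.fullmatch("<abc") is not None
close_only = tag.fullmatch("abc>") is not None

print(both, neither, open_only, close_only)
```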
| . |
| . |
| .SH "BACKTRACKING CONTROL" |
| .rs |
| .sp |
| All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the |
| name is mandatory; for the others it is optional. (*SKIP) changes its behaviour |
| if :NAME is present. The others just set a name for passing back to the caller, |
| but this is not a name that (*SKIP) can see. The following act as soon as they |
| are reached: |
| .sp |
| (*ACCEPT) force successful match |
| (*FAIL) force backtrack; synonym (*F) |
| (*MARK:NAME) set name to be passed back; synonym (*:NAME) |
| .sp |
| The following act only when a subsequent match failure causes a backtrack to |
| reach them. They all force a match failure, but they differ in what happens |
| afterwards. Those that advance the start-of-match point do so only if the |
| pattern is not anchored. |
| .sp |
| (*COMMIT) overall failure, no advance of starting point |
| (*PRUNE) advance to next starting character |
| (*SKIP) advance to current matching position |
| (*SKIP:NAME) advance to position corresponding to an earlier |
| (*MARK:NAME); if not found, the (*SKIP) is ignored |
| (*THEN) local failure, backtrack to next alternation |
| .sp |
| The effect of one of these verbs in a group called as a subroutine is confined |
| to the subroutine call. |
| . |
| . |
| .SH "CALLOUTS" |
| .rs |
| .sp |
| (?C) callout (assumed number 0) |
| (?Cn) callout with numerical data n |
| (?C"text") callout with string data |
| .sp |
| The allowed string delimiters are ` ' " ^ % # $ (which are the same for the |
| start and the end), and the starting delimiter { matched with the ending |
| delimiter }. To encode the ending delimiter within the string, double it. |
| . |
| . |
| .SH "SEE ALSO" |
| .rs |
| .sp |
| \fBpcre2pattern\fP(3), \fBpcre2api\fP(3), \fBpcre2callout\fP(3), |
| \fBpcre2matching\fP(3), \fBpcre2\fP(3). |
| . |
| . |
| .SH AUTHOR |
| .rs |
| .sp |
| .nf |
| Philip Hazel |
| University Computing Service |
| Cambridge, England. |
| .fi |
| . |
| . |
| .SH REVISION |
| .rs |
| .sp |
| .nf |
| Last updated: 11 February 2019 |
| Copyright (c) 1997-2019 University of Cambridge. |
| .fi |