.TH PCRE2UNICODE 3 "22 December 2021" "PCRE2 10.40"
.SH NAME
PCRE - Perl-compatible regular expressions (revised API)
.SH "UNICODE AND UTF SUPPORT"
.rs
.sp
PCRE2 is normally built with Unicode support, though if you do not need it, you
can build it without, in which case the library will be smaller. With Unicode
support, PCRE2 has knowledge of Unicode character properties and can process
strings of text in UTF-8, UTF-16, and UTF-32 format (depending on the code unit
width), but this is not the default. Unless specifically requested, PCRE2
treats each code unit in a string as one character.
.P
There are two ways of telling PCRE2 to switch to UTF mode, where characters may
consist of more than one code unit and the range of values is constrained. The
program can call
.\" HREF
\fBpcre2_compile()\fP
.\"
with the PCRE2_UTF option, or the pattern may start with the sequence (*UTF).
However, the latter facility can be locked out by the PCRE2_NEVER_UTF option.
That is, the programmer can prevent the supplier of the pattern from switching
to UTF mode.
.P
Note that the PCRE2_MATCH_INVALID_UTF option (see
.\" HTML <a href="#matchinvalid">
.\" </a>
below)
.\"
forces PCRE2_UTF to be set.
.P
In UTF mode, both the pattern and any subject strings that are matched against
it are treated as UTF strings instead of strings of individual one-code-unit
characters. There are also some other changes to the way characters are
handled, as documented below.
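.P
The difference between code units and characters can be illustrated with a
small stand-alone sketch. It does not call PCRE2; the function name
count_utf8_characters is hypothetical, and the input is assumed to be valid
UTF-8. In non-UTF mode the character count is simply the number of code units
(the length), whereas in UTF mode continuation bytes do not start a new
character.

```c
#include <assert.h>
#include <stddef.h>

/* Count characters in a UTF-8 string that is assumed to be valid.
 * Every byte that is not a continuation byte (0b10xxxxxx) starts a
 * new character. In non-UTF mode the answer would just be len. */
size_t count_utf8_characters(const unsigned char *s, size_t len)
{
    size_t count = 0;
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80)
            count++;
    }
    return count;
}
```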
.
.
.SH "UNICODE PROPERTY SUPPORT"
.rs
.sp
When PCRE2 is built with Unicode support, the escape sequences \ep{..},
\eP{..}, and \eX can be used. This is not dependent on the PCRE2_UTF setting.
The Unicode properties that can be tested are a subset of those that Perl
supports. Currently they are limited to the general category properties such as
Lu for an upper case letter or Nd for a decimal number, the Unicode script
names such as Arabic or Han, Bidi_Class, Bidi_Control, and the derived
properties Any and LC (synonym L&). Full lists are given in the
.\" HREF
\fBpcre2pattern\fP
.\"
and
.\" HREF
\fBpcre2syntax\fP
.\"
documentation. In general, only the short names for properties are supported.
For example, \ep{L} matches a letter. Its longer synonym, \ep{Letter}, is not
supported. Furthermore, in Perl, many properties may optionally be prefixed by
"Is", for compatibility with Perl 5.6. PCRE2 does not support this.
.
.
.SH "WIDE CHARACTERS AND UTF MODES"
.rs
.sp
Code points less than 256 can be specified in patterns by either braced or
unbraced hexadecimal escape sequences (for example, \ex{b3} or \exb3). Larger
values have to use braced sequences. Unbraced octal code points up to \e777 are
also recognized; larger ones can be coded using \eo{...}.
.P
The escape sequence \eN{U+<hex digits>} is recognized as another way of
specifying a Unicode character by code point in a UTF mode. It is not allowed
in non-UTF mode.
.P
In UTF mode, repeat quantifiers apply to complete UTF characters, not to
individual code units.
.P
In UTF mode, the dot metacharacter matches one UTF character instead of a
single code unit.
.P
In UTF mode, capture group names are not restricted to ASCII, and may contain
any Unicode letters and decimal digits, as well as underscore.
.P
The escape sequence \eC can be used to match a single code unit in UTF mode,
but its use can lead to some strange effects because it breaks up multi-unit
characters (see the description of \eC in the
.\" HREF
\fBpcre2pattern\fP
.\"
documentation). For this reason, there is a build-time option that disables
support for \eC completely. There is also a less draconian compile-time option
for locking out the use of \eC when a pattern is compiled.
.P
The use of \eC is not supported by the alternative matching function
\fBpcre2_dfa_match()\fP when in UTF-8 or UTF-16 mode, that is, when a character
may consist of more than one code unit. The use of \eC in these modes provokes
a match-time error. Also, the JIT optimization does not support \eC in these
modes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
contains \eC, it will not succeed, and so when \fBpcre2_match()\fP is called,
the matching will be carried out by the interpretive function.
.P
The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly test
characters of any code value, but, by default, the characters that PCRE2
recognizes as digits, spaces, or word characters remain the same set as in
non-UTF mode, all with code points less than 256. This remains true even when
PCRE2 is built to include Unicode support, because to do otherwise would slow
down matching in many common cases. Note that this also applies to \eb
and \eB, because they are defined in terms of \ew and \eW. If you want
to test for a wider sense of, say, "digit", you can use explicit Unicode
property tests such as \ep{Nd}. Alternatively, if you set the PCRE2_UCP option,
the way that the character escapes work is changed so that Unicode properties
are used to determine which characters match. There are more details in the
section on
.\" HTML <a href="pcre2pattern.html#genericchartypes">
.\" </a>
generic character types
.\"
in the
.\" HREF
\fBpcre2pattern\fP
.\"
documentation.
.P
Similarly, characters that match the POSIX named character classes are all
low-valued characters, unless the PCRE2_UCP option is set.
.P
However, the special horizontal and vertical white space matching escapes (\eh,
\eH, \ev, and \eV) do match all the appropriate Unicode characters, whether or
not PCRE2_UCP is set.
.
.
.SH "UNICODE CASE-EQUIVALENCE"
.rs
.sp
If either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing makes use
of Unicode properties except for characters whose code points are less than 128
and that have at most two case-equivalent values. For these, a direct table
lookup is used for speed. A few Unicode characters such as Greek sigma have
more than two code points that are case-equivalent, and these are treated
specially. Setting PCRE2_UCP without PCRE2_UTF allows Unicode-style case
processing for non-UTF character encodings such as UCS-2.
.
.
.\" HTML <a name="scriptruns"></a>
.SH "SCRIPT RUNS"
.rs
.sp
The pattern constructs (*script_run:...) and (*atomic_script_run:...), with
synonyms (*sr:...) and (*asr:...), verify that the string matched within the
parentheses is a script run. In concept, a script run is a sequence of
characters that are all from the same Unicode script. However, because some
scripts are commonly used together, and because some diacritical and other
marks are used with multiple scripts, it is not that simple.
.P
Every Unicode character has a Script property, mostly with a value
corresponding to the name of a script, such as Latin, Greek, or Cyrillic. There
are also three special values:
.P
"Unknown" is used for code points that have not been assigned, and also for the
surrogate code points. In the PCRE2 32-bit library, characters whose code
points are greater than the Unicode maximum (U+10FFFF), which are accessible
only in non-UTF mode, are assigned the Unknown script.
.P
"Common" is used for characters that are used with many scripts. These include
punctuation, emoji, mathematical, musical, and currency symbols, and the ASCII
digits 0 to 9.
.P
"Inherited" is used for characters such as diacritical marks that modify a
previous character. These are considered to take on the script of the character
that they modify.
.P
Some Inherited characters are used with many scripts, but many of them are only
normally used with a small number of scripts. For example, U+102E0 (Coptic
Epact thousands mark) is used only with Arabic and Coptic. In order to make it
possible to check this, a Unicode property called Script Extension exists. Its
value is a list of scripts that apply to the character. For the majority of
characters, the list contains just one script, the same one as the Script
property. However, for characters such as U+102E0 more than one Script is
listed. There are also some Common characters that have a single, non-Common
script in their Script Extension list.
.P
The next section describes the basic rules for deciding whether a given string
of characters is a script run. Note, however, that there are some special cases
involving the Chinese Han script, and an additional constraint for decimal
digits. These are covered in subsequent sections.
.
.
.SS "Basic script run rules"
.rs
.sp
A string that is less than two characters long is a script run. This is the
only case in which an Unknown character can be part of a script run. Longer
strings are checked using only the Script Extensions property, not the basic
Script property.
.P
If a character's Script Extension property is the single value "Inherited", it
is always accepted as part of a script run. This is also true for the property
"Common", subject to the checking of decimal digits described below. All the
remaining characters in a script run must have at least one script in common in
their Script Extension lists. In set-theoretic terminology, the intersection of
all the sets of scripts must not be empty.
.P
A simple example is an Internet name such as "google.com". The letters are all
in the Latin script, and the dot is Common, so this string is a script run.
However, the Cyrillic letter "o" looks exactly the same as the Latin "o"; a
string that looks the same, but with Cyrillic "o"s is not a script run.
.P
More interesting examples involve characters with more than one script in their
Script Extension. Consider the following characters:
.sp
  U+060C  Arabic comma
  U+06D4  Arabic full stop
.sp
The first has the Script Extension list Arabic, Hanifi Rohingya, Syriac, and
Thaana; the second has just Arabic and Hanifi Rohingya. Both of them could
appear in script runs of either Arabic or Hanifi Rohingya. The first could also
appear in Syriac or Thaana script runs, but the second could not.
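.P
The intersection rule can be sketched in a few lines of C. This is only an
illustration of the basic rule, not PCRE2's implementation: the SX_ bit values
and the function is_script_run are hypothetical, each character is represented
by a bitmask of its Script Extension list, and the Han special cases and the
decimal digit constraint are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit assignments for this sketch only. */
#define SX_LATIN     0x01u
#define SX_CYRILLIC  0x02u
#define SX_ARABIC    0x04u
#define SX_ROHINGYA  0x08u
#define SX_SYRIAC    0x10u
#define SX_THAANA    0x20u
#define SX_COMMON    0x40u
#define SX_INHERITED 0x80u

/* ext[i] is the Script Extension set of character i, as a bitmask.
 * Returns 1 if the characters form a script run under the basic rule. */
int is_script_run(const uint32_t *ext, size_t n)
{
    uint32_t shared = ~(uint32_t)0;   /* running intersection */
    assert(ext != NULL || n == 0);
    if (n < 2)
        return 1;                     /* short strings always qualify */
    for (size_t i = 0; i < n; i++) {
        /* A lone Common or Inherited value is always accepted. */
        if (ext[i] == SX_COMMON || ext[i] == SX_INHERITED)
            continue;
        shared &= ext[i];
        if (shared == 0)
            return 0;                 /* no script left in common */
    }
    return 1;
}
```
.P
With these masks, an Arabic letter followed by U+060C and U+06D4 is accepted,
while a Syriac letter followed by U+06D4 is rejected because the two sets do
not intersect.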
.
.
.SS "The Chinese Han script"
.rs
.sp
The Chinese Han script is commonly used in conjunction with other scripts for
writing certain languages. Japanese uses the Hiragana and Katakana scripts
together with Han; Korean uses Hangul and Han; Taiwanese Mandarin uses Bopomofo
and Han. These three combinations are treated as special cases when checking
script runs and are, in effect, "virtual scripts". Thus, a script run may
contain a mixture of Hiragana, Katakana, and Han, or a mixture of Hangul and
Han, or a mixture of Bopomofo and Han, but not, for example, a mixture of
Hangul and Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical
Standard 39 ("Unicode Security Mechanisms", http://unicode.org/reports/tr39/)
in allowing such mixtures.
.
.
.SS "Decimal digits"
.rs
.sp
Unicode contains many sets of 10 decimal digits in different scripts, and some
scripts (including the Common script) contain more than one set. Some of these
decimal digits are visually indistinguishable from the common ASCII
digits. In addition to the script checking described above, if a script run
contains any decimal digits, they must all come from the same set of 10
adjacent characters.
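.P
A minimal sketch of the decimal digit constraint follows. It assumes a small
sample table of digit sets (ASCII, Arabic-Indic, Extended Arabic-Indic,
Devanagari, and Fullwidth); a real implementation would consult the full
Unicode data. The names digit_set_base, digit_set_of, and digits_from_one_set
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First code point of a few sets of ten adjacent decimal digits.
 * This is a sample only, not the complete Unicode list. */
static const uint32_t digit_set_base[] = {
    0x0030, /* ASCII digits */
    0x0660, /* Arabic-Indic digits */
    0x06F0, /* Extended Arabic-Indic digits */
    0x0966, /* Devanagari digits */
    0xFF10  /* Fullwidth digits */
};

/* Index of the sample set containing cp, or -1 if cp is not a digit
 * from one of the sample sets. */
static int digit_set_of(uint32_t cp)
{
    size_t count = sizeof digit_set_base / sizeof digit_set_base[0];
    for (size_t i = 0; i < count; i++) {
        if (cp >= digit_set_base[i] && cp < digit_set_base[i] + 10)
            return (int)i;
    }
    return -1;
}

/* Returns 1 if every decimal digit in cp[0..n-1] comes from the same
 * set of ten adjacent characters, 0 otherwise. */
int digits_from_one_set(const uint32_t *cp, size_t n)
{
    int seen = -1;
    assert(cp != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int set = digit_set_of(cp[i]);
        if (set < 0)
            continue;        /* not a digit: ignored by this check */
        if (seen < 0)
            seen = set;      /* first digit fixes the set */
        else if (set != seen)
            return 0;        /* digits from two different sets */
    }
    return 1;
}
```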
.
.
.SH "VALIDITY OF UTF STRINGS"
.rs
.sp
When the PCRE2_UTF option is set, the strings passed as patterns and subjects
are (by default) checked for validity on entry to the relevant functions. If an
invalid UTF string is passed, a negative error code is returned. The code unit
offset to the offending character can be extracted from the match data block by
calling \fBpcre2_get_startchar()\fP, which is used for this purpose after a UTF
error.
.P
In some situations, you may already know that your strings are valid, and
therefore want to skip these checks in order to improve performance, for
example in the case of a long subject string that is being scanned repeatedly.
If you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
PCRE2 assumes that the pattern or subject it is given (respectively) contains
only valid UTF code unit sequences.
.P
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
is undefined and your program may crash or loop indefinitely or give incorrect
results. There is, however, one mode of matching that can handle invalid UTF
subject strings. This is enabled by passing PCRE2_MATCH_INVALID_UTF to
\fBpcre2_compile()\fP and is discussed below in the next section. The rest of
this section covers the case when PCRE2_MATCH_INVALID_UTF is not set.
.P
Passing PCRE2_NO_UTF_CHECK to \fBpcre2_compile()\fP just disables the UTF check
for the pattern; it does not also apply to subject strings. If you want to
disable the check for a subject string you must pass this same option to
\fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP.
.P
UTF-16 and UTF-32 strings can indicate their endianness by a special code known
as a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
strings to be in host byte order.
.P
Unless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any other
processing takes place. In the case of \fBpcre2_match()\fP and
\fBpcre2_dfa_match()\fP calls with a non-zero starting offset, the check is
applied only to that part of the subject that could be inspected during
matching, and there is a check that the starting offset points to the first
code unit of a character or to the end of the subject. If there are no
lookbehind assertions in the pattern, the check starts at the starting offset.
Otherwise, it starts at the length of the longest lookbehind before the
starting offset, or at the start of the subject if there are not that many
characters before the starting offset. Note that the sequences \eb and \eB are
one-character lookbehinds.
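.P
The choice of where the validity check begins can be pictured with the
following sketch for UTF-8. It is an illustration of the rule described above,
not PCRE2's code; the function name utf8_check_start and its parameters are
hypothetical. It steps back one character at a time, skipping continuation
bytes, until it has moved back by the longest lookbehind or reached the start
of the subject.

```c
#include <assert.h>
#include <stddef.h>

/* Return the code unit offset at which a validity check would begin,
 * given the starting offset of the match and the longest lookbehind
 * (in characters) in the pattern. */
size_t utf8_check_start(const unsigned char *subject, size_t start_offset,
                        size_t max_lookbehind)
{
    size_t pos = start_offset;
    assert(subject != NULL);
    for (size_t k = 0; k < max_lookbehind && pos > 0; k++) {
        pos--;
        /* Skip back over continuation bytes (0b10xxxxxx) to reach
         * the first code unit of the previous character. */
        while (pos > 0 && (subject[pos] & 0xc0) == 0x80)
            pos--;
    }
    return pos;
}
```
.P
With no lookbehind the result is the starting offset itself. A pattern that
uses \eb or \eB behaves as if max_lookbehind were at least 1.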
.P
In addition to checking the format of the string, there is a check to ensure
that all code points lie in the range U+0 to U+10FFFF, excluding the surrogate
area. The so-called "non-character" code points are not excluded because
Unicode corrigendum #9 makes it clear that they should not be.
.P
Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16,
where they are used in pairs to encode code points with values greater than
0xFFFF. The code points that are encoded by UTF-16 pairs are available
independently in the UTF-8 and UTF-32 encodings. (In other words, the whole
surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
UTF-32.)
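.P
As a worked illustration of how surrogate pairs encode values greater than
0xFFFF, the sketch below converts one code point to UTF-16. It follows the
standard UTF-16 arithmetic; the function name encode_utf16 is hypothetical and
is not part of the PCRE2 API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16. Returns the number of code units
 * written to out (1 or 2), or 0 if cp is a surrogate or lies above
 * U+10FFFF and therefore cannot be encoded. */
int encode_utf16(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```
.P
For example, U+1F600 becomes the pair 0xD83D, 0xDE00.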
.P
Setting PCRE2_NO_UTF_CHECK at compile time does not disable the error that is
given if an escape sequence for an invalid Unicode code point is encountered in
the pattern. If you want to allow escape sequences such as \ex{d800} (a
surrogate code point) you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
option. However, this is possible only in UTF-8 and UTF-32 modes, because these
values are not representable in UTF-16.
.
.
.\" HTML <a name="utf8strings"></a>
.SS "Errors in UTF-8 strings"
.rs
.sp
The following negative error codes are given for invalid UTF-8 strings:
.sp
  PCRE2_ERROR_UTF8_ERR1
  PCRE2_ERROR_UTF8_ERR2
  PCRE2_ERROR_UTF8_ERR3
  PCRE2_ERROR_UTF8_ERR4
  PCRE2_ERROR_UTF8_ERR5
.sp
The string ends with a truncated UTF-8 character; the code specifies how many
bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8 characters to be
no longer than 4 bytes, the encoding scheme (originally defined by RFC 2279)
allows for up to 6 bytes, and this is checked first; hence the possibility of
4 or 5 missing bytes.
.sp
  PCRE2_ERROR_UTF8_ERR6
  PCRE2_ERROR_UTF8_ERR7
  PCRE2_ERROR_UTF8_ERR8
  PCRE2_ERROR_UTF8_ERR9
  PCRE2_ERROR_UTF8_ERR10
.sp
The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of the
character do not have the binary value 0b10 (that is, either the most
significant bit is 0, or the next bit is 1).
.sp
  PCRE2_ERROR_UTF8_ERR11
  PCRE2_ERROR_UTF8_ERR12
.sp
A character that is valid by the RFC 2279 rules is either 5 or 6 bytes long;
these code points are excluded by RFC 3629.
.sp
  PCRE2_ERROR_UTF8_ERR13
.sp
A 4-byte character has a value greater than 0x10ffff; these code points are
excluded by RFC 3629.
.sp
  PCRE2_ERROR_UTF8_ERR14
.sp
A 3-byte character has a value in the range 0xd800 to 0xdfff; this range of
code points is reserved by RFC 3629 for use with UTF-16, and so is excluded
from UTF-8.
.sp
  PCRE2_ERROR_UTF8_ERR15
  PCRE2_ERROR_UTF8_ERR16
  PCRE2_ERROR_UTF8_ERR17
  PCRE2_ERROR_UTF8_ERR18
  PCRE2_ERROR_UTF8_ERR19
.sp
A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes for a
value that can be represented by fewer bytes, which is invalid. For example,
the two bytes 0xc0, 0xae give the value 0x2e, whose correct coding uses just
one byte.
.sp
  PCRE2_ERROR_UTF8_ERR20
.sp
The two most significant bits of the first byte of a character have the binary
value 0b10 (that is, the most significant bit is 1 and the second is 0). Such a
byte can only validly occur as the second or subsequent byte of a multi-byte
character.
.sp
  PCRE2_ERROR_UTF8_ERR21
.sp
The first byte of a character has the value 0xfe or 0xff. These values can
never occur in a valid UTF-8 string.
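.P
The categories above can be made concrete with a stand-alone validator sketch.
It is not PCRE2's checker: the u8_status names are hypothetical and are not
the PCRE2_ERROR_UTF8_ERRn constants, the number of missing bytes and the
position of a bad continuation byte are not distinguished, and the order in
which overlapping faults are reported is an assumption of this sketch and may
differ from what PCRE2 reports.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch only. */
typedef enum {
    U8_OK = 0,
    U8_TRUNCATED,        /* string ends inside a character */
    U8_BAD_CONTINUATION, /* a following byte is not 0b10xxxxxx */
    U8_TOO_LONG,         /* 5- or 6-byte form (RFC 2279 only) */
    U8_OUT_OF_RANGE,     /* 4-byte value above 0x10ffff */
    U8_SURROGATE,        /* 3-byte value in 0xd800 to 0xdfff */
    U8_OVERLONG,         /* value fits in fewer bytes */
    U8_ISOLATED,         /* first byte is 0b10xxxxxx */
    U8_BAD_LEAD          /* first byte is 0xfe or 0xff */
} u8_status;

/* Check s[0..len-1]. On failure, *err_offset is the offset of the
 * first byte of the offending character; it is meaningful only when
 * the result is not U8_OK. */
u8_status check_utf8(const unsigned char *s, size_t len, size_t *err_offset)
{
    /* Smallest value that legitimately needs n bytes, for n = 2..4. */
    static const uint32_t min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t i = 0;
    assert(s != NULL && err_offset != NULL);
    while (i < len) {
        unsigned char c = s[i];
        size_t n;      /* total bytes in this character */
        uint32_t v;    /* decoded value */
        *err_offset = i;
        if (c < 0x80) { i++; continue; }
        if (c < 0xc0) return U8_ISOLATED;
        if (c >= 0xfe) return U8_BAD_LEAD;
        if (c < 0xe0)      { n = 2; v = c & 0x1f; }
        else if (c < 0xf0) { n = 3; v = c & 0x0f; }
        else if (c < 0xf8) { n = 4; v = c & 0x07; }
        else if (c < 0xfc) { n = 5; v = c & 0x03; }
        else               { n = 6; v = c & 0x01; }
        if (len - i < n) return U8_TRUNCATED;
        for (size_t k = 1; k < n; k++) {
            if ((s[i + k] & 0xc0) != 0x80) return U8_BAD_CONTINUATION;
            v = (v << 6) | (uint32_t)(s[i + k] & 0x3f);
        }
        if (n > 4) return U8_TOO_LONG;
        if (v < min_value[n]) return U8_OVERLONG;
        if (v > 0x10ffff) return U8_OUT_OF_RANGE;
        if (v >= 0xd800 && v <= 0xdfff) return U8_SURROGATE;
        i += n;
    }
    return U8_OK;
}
```
.P
Applied to the two bytes 0xc0, 0xae from the example above, this sketch decodes
the value 0x2e and reports it as overlong.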
.
.
.\" HTML <a name="utf16strings"></a>
.SS "Errors in UTF-16 strings"
.rs
.sp
The following negative error codes are given for invalid UTF-16 strings:
.sp
  PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
  PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
  PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate
.sp
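.P
The three UTF-16 conditions can be sketched as a short scan over 16-bit code
units in host byte order. The u16_status names and the function check_utf16
are hypothetical and stand in for the PCRE2 error codes listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch only. */
typedef enum {
    U16_OK = 0,
    U16_MISSING_LOW,   /* high surrogate is the last code unit */
    U16_INVALID_LOW,   /* high surrogate not followed by a low one */
    U16_ISOLATED_LOW   /* low surrogate without a preceding high one */
} u16_status;

/* Check s[0..len-1]. On failure, *err_offset is the code unit offset
 * of the offending surrogate. */
u16_status check_utf16(const uint16_t *s, size_t len, size_t *err_offset)
{
    assert((s != NULL || len == 0) && err_offset != NULL);
    for (size_t i = 0; i < len; i++) {
        uint16_t c = s[i];
        if ((c & 0xF800) != 0xD800)
            continue;                     /* not a surrogate */
        *err_offset = i;
        if ((c & 0x0400) != 0)
            return U16_ISOLATED_LOW;      /* 0xDC00 to 0xDFFF */
        if (i + 1 == len)
            return U16_MISSING_LOW;
        if ((s[i + 1] & 0xFC00) != 0xDC00)
            return U16_INVALID_LOW;
        i++;                              /* skip the low surrogate */
    }
    return U16_OK;
}
```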
.
.
.\" HTML <a name="utf32strings"></a>
.SS "Errors in UTF-32 strings"
.rs
.sp
The following negative error codes are given for invalid UTF-32 strings:
.sp
  PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
  PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff
.sp
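.P
For UTF-32 the check reduces to a range test on each code unit, as in this
sketch. The u32_status names and the function check_utf32 are hypothetical.
Non-character code points such as U+FFFE are accepted, in line with the
validity rules described earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes for this sketch only. */
typedef enum {
    U32_OK = 0,
    U32_SURROGATE,     /* 0xd800 to 0xdfff */
    U32_OUT_OF_RANGE   /* greater than 0x10ffff */
} u32_status;

/* Check s[0..len-1]. On failure, *err_offset is the offset of the
 * offending code unit. */
u32_status check_utf32(const uint32_t *s, size_t len, size_t *err_offset)
{
    assert((s != NULL || len == 0) && err_offset != NULL);
    for (size_t i = 0; i < len; i++) {
        *err_offset = i;
        if (s[i] >= 0xD800 && s[i] <= 0xDFFF)
            return U32_SURROGATE;
        if (s[i] > 0x10FFFF)
            return U32_OUT_OF_RANGE;
    }
    return U32_OK;
}
```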
.
.
.\" HTML <a name="matchinvalid"></a>
.SH "MATCHING IN INVALID UTF STRINGS"
.rs
.sp
You can run pattern matches on subject strings that may contain invalid UTF
sequences if you call \fBpcre2_compile()\fP with the PCRE2_MATCH_INVALID_UTF
option. This is supported by \fBpcre2_match()\fP, including JIT matching, but
not by \fBpcre2_dfa_match()\fP. When PCRE2_MATCH_INVALID_UTF is set, it forces
PCRE2_UTF to be set as well. Note, however, that the pattern itself must be a
valid UTF string.
.P
Setting PCRE2_MATCH_INVALID_UTF does not affect what \fBpcre2_compile()\fP
generates, but if \fBpcre2_jit_compile()\fP is subsequently called, it does
generate different code. If JIT is not used, the option affects the behaviour
of the interpretive code in \fBpcre2_match()\fP. When PCRE2_MATCH_INVALID_UTF
is set at compile time, PCRE2_NO_UTF_CHECK is ignored at match time.
.P
In this mode, an invalid code unit sequence in the subject never matches any
pattern item. It does not match dot, it does not match \ep{Any}, it does not
even match negative items such as [^X]. A lookbehind assertion fails if it
encounters an invalid sequence while moving the current point backwards. In
other words, an invalid UTF code unit sequence acts as a barrier which no match
can cross.
.P
You can also think of this as the subject being split up into fragments of
valid UTF, delimited internally by invalid code unit sequences. The pattern is
matched fragment by fragment. The result of a successful match, however, is
given as code unit offsets in the entire subject string in the usual way. There
are a few points to consider:
.P
The internal boundaries are not interpreted as the beginnings or ends of lines
and so do not match circumflex or dollar characters in the pattern.
.P
If \fBpcre2_match()\fP is called with an offset that points to an invalid
UTF-sequence, that sequence is skipped, and the match starts at the next valid
UTF character, or the end of the subject.
.P
At internal fragment boundaries, \eb and \eB behave in the same way as at the
beginning and end of the subject. For example, a sequence such as \ebWORD\eb
would match an instance of WORD that is surrounded by invalid UTF code units.
.P
Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbitrary
data, knowing that any matched strings that are returned are valid UTF. This
can be useful when searching for UTF text in executable or other binary files.
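.P
The "fragments of valid UTF" view can be sketched for UTF-8 as follows. This
only models how a subject divides into fragments; it is not how PCRE2
implements the option. The function names are hypothetical, and the sketch
assumes that each byte which cannot start a valid character is treated as a
one-byte barrier. Offsets are reported relative to the whole subject, as the
text above describes for match results.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length of the valid UTF-8 character at s (RFC 3629 rules), or 0 if
 * the bytes at s do not form one. */
size_t valid_char_len(const unsigned char *s, size_t avail)
{
    size_t n;
    uint32_t v, min;
    if (avail == 0) return 0;
    if (s[0] < 0x80) return 1;
    if (s[0] >= 0xc2 && s[0] <= 0xdf)      { n = 2; v = s[0] & 0x1f; min = 0x80; }
    else if ((s[0] & 0xf0) == 0xe0)        { n = 3; v = s[0] & 0x0f; min = 0x800; }
    else if (s[0] >= 0xf0 && s[0] <= 0xf4) { n = 4; v = s[0] & 0x07; min = 0x10000; }
    else return 0;
    if (avail < n) return 0;
    for (size_t k = 1; k < n; k++) {
        if ((s[k] & 0xc0) != 0x80) return 0;
        v = (v << 6) | (uint32_t)(s[k] & 0x3f);
    }
    if (v < min || v > 0x10ffff || (v >= 0xd800 && v <= 0xdfff)) return 0;
    return n;
}

/* Find the next fragment of valid UTF-8 at or after *pos. On success,
 * returns 1, sets [*start, *end) to the fragment's offsets within the
 * whole subject, and advances *pos to *end. Returns 0 when no
 * fragment remains. */
int next_fragment(const unsigned char *s, size_t len, size_t *pos,
                  size_t *start, size_t *end)
{
    size_t i, n;
    assert(s != NULL && pos != NULL && start != NULL && end != NULL);
    i = *pos;
    while (i < len && valid_char_len(s + i, len - i) == 0)
        i++;                               /* skip barrier bytes */
    if (i >= len) { *pos = len; return 0; }
    *start = i;
    while (i < len && (n = valid_char_len(s + i, len - i)) != 0)
        i += n;                            /* extend over valid chars */
    *end = i;
    *pos = i;
    return 1;
}
```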
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
Retired from University Computing Service
Cambridge, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 22 December 2021
Copyright (c) 1997-2021 University of Cambridge.
.fi