/*!
| 2 | This crate provides a library for parsing, compiling, and executing regular |
| 3 | expressions. Its syntax is similar to Perl-style regular expressions, but lacks |
| 4 | a few features like look around and backreferences. In exchange, all searches |
| 5 | execute in linear time with respect to the size of the regular expression and |
| 6 | search text. |
| 7 | |
| 8 | This crate's documentation provides some simple examples, describes |
| 9 | [Unicode support](#unicode) and exhaustively lists the |
| 10 | [supported syntax](#syntax). |
| 11 | |
| 12 | For more specific details on the API for regular expressions, please see the |
| 13 | documentation for the [`Regex`](struct.Regex.html) type. |
| 14 | |
| 15 | # Usage |
| 16 | |
| 17 | This crate is [on crates.io](https://crates.io/crates/regex) and can be |
| 18 | used by adding `regex` to your dependencies in your project's `Cargo.toml`. |
| 19 | |
| 20 | ```toml |
| 21 | [dependencies] |
| 22 | regex = "1" |
| 23 | ``` |
| 24 | |
# Example: find a date
| 26 | |
| 27 | General use of regular expressions in this package involves compiling an |
| 28 | expression and then using it to search, split or replace text. For example, |
| 29 | to confirm that some text resembles a date: |
| 30 | |
| 31 | ```rust |
| 32 | use regex::Regex; |
| 33 | let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap(); |
| 34 | assert!(re.is_match("2014-01-01")); |
| 35 | ``` |
| 36 | |
| 37 | Notice the use of the `^` and `$` anchors. In this crate, every expression |
| 38 | is executed with an implicit `.*?` at the beginning and end, which allows |
| 39 | it to match anywhere in the text. Anchors can be used to ensure that the |
| 40 | full text matches an expression. |
| 41 | |
| 42 | This example also demonstrates the utility of |
| 43 | [raw strings](https://doc.rust-lang.org/stable/reference/tokens.html#raw-string-literals) |
| 44 | in Rust, which |
| 45 | are just like regular strings except they are prefixed with an `r` and do |
| 46 | not process any escape sequences. For example, `"\\d"` is the same |
| 47 | expression as `r"\d"`. |
| 48 | |
| 49 | # Example: Avoid compiling the same regex in a loop |
| 50 | |
| 51 | It is an anti-pattern to compile the same regular expression in a loop |
| 52 | since compilation is typically expensive. (It takes anywhere from a few |
| 53 | microseconds to a few **milliseconds** depending on the size of the |
| 54 | regex.) Not only is compilation itself expensive, but this also prevents |
| 55 | optimizations that reuse allocations internally to the matching engines. |
| 56 | |
| 57 | In Rust, it can sometimes be a pain to pass regular expressions around if |
| 58 | they're used from inside a helper function. Instead, we recommend using the |
| 59 | [`lazy_static`](https://crates.io/crates/lazy_static) crate to ensure that |
| 60 | regular expressions are compiled exactly once. |
| 61 | |
| 62 | For example: |
| 63 | |
```rust
use lazy_static::lazy_static;
use regex::Regex;

fn some_helper_function(text: &str) -> bool {
    lazy_static! {
        static ref RE: Regex = Regex::new("...").unwrap();
    }
    RE.is_match(text)
}

fn main() {}
```
| 77 | |
| 78 | Specifically, in this example, the regex will be compiled when it is used for |
| 79 | the first time. On subsequent uses, it will reuse the previous compilation. |
| 80 | |
| 81 | # Example: iterating over capture groups |
| 82 | |
| 83 | This crate provides convenient iterators for matching an expression |
| 84 | repeatedly against a search string to find successive non-overlapping |
| 85 | matches. For example, to find all dates in a string and be able to access |
| 86 | them by their component pieces: |
| 87 | |
```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
let text = "2012-03-14, 2013-01-01 and 2014-07-05";
for cap in re.captures_iter(text) {
    println!("Month: {} Day: {} Year: {}", &cap[2], &cap[3], &cap[1]);
}
// Output:
// Month: 03 Day: 14 Year: 2012
// Month: 01 Day: 01 Year: 2013
// Month: 07 Day: 05 Year: 2014
# }
```
| 102 | |
| 103 | Notice that the year is in the capture group indexed at `1`. This is |
| 104 | because the *entire match* is stored in the capture group at index `0`. |
| 105 | |
| 106 | # Example: replacement with named capture groups |
| 107 | |
| 108 | Building on the previous example, perhaps we'd like to rearrange the date |
| 109 | formats. This can be done with text replacement. But to make the code |
| 110 | clearer, we can *name* our capture groups and use those names as variables |
| 111 | in our replacement text: |
| 112 | |
```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})").unwrap();
let before = "2012-03-14, 2013-01-01 and 2014-07-05";
let after = re.replace_all(before, "$m/$d/$y");
assert_eq!(after, "03/14/2012, 01/01/2013 and 07/05/2014");
# }
```
| 122 | |
| 123 | The `replace` methods are actually polymorphic in the replacement, which |
| 124 | provides more flexibility than is seen here. (See the documentation for |
| 125 | `Regex::replace` for more details.) |
| 126 | |
| 127 | Note that if your regex gets complicated, you can use the `x` flag to |
| 128 | enable insignificant whitespace mode, which also lets you write comments: |
| 129 | |
```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(?x)
  (?P<y>\d{4}) # the year
  -
  (?P<m>\d{2}) # the month
  -
  (?P<d>\d{2}) # the day
").unwrap();
let before = "2012-03-14, 2013-01-01 and 2014-07-05";
let after = re.replace_all(before, "$m/$d/$y");
assert_eq!(after, "03/14/2012, 01/01/2013 and 07/05/2014");
# }
```
| 145 | |
If you wish to match against whitespace in this mode, you can still use `\s`,
`\n`, `\t`, etc. For escaping a single space character, you can escape it
directly with `\ `, use its hex character code `\x20` or temporarily disable
the `x` flag, e.g., `(?-x: )`.
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 150 | |
| 151 | # Example: match multiple regular expressions simultaneously |
| 152 | |
| 153 | This demonstrates how to use a `RegexSet` to match multiple (possibly |
| 154 | overlapping) regular expressions in a single scan of the search text: |
| 155 | |
| 156 | ```rust |
| 157 | use regex::RegexSet; |
| 158 | |
| 159 | let set = RegexSet::new(&[ |
| 160 | r"\w+", |
| 161 | r"\d+", |
| 162 | r"\pL+", |
| 163 | r"foo", |
| 164 | r"bar", |
| 165 | r"barfoo", |
| 166 | r"foobar", |
| 167 | ]).unwrap(); |
| 168 | |
| 169 | // Iterate over and collect all of the matches. |
| 170 | let matches: Vec<_> = set.matches("foobar").into_iter().collect(); |
| 171 | assert_eq!(matches, vec![0, 2, 3, 4, 6]); |
| 172 | |
| 173 | // You can also test whether a particular regex matched: |
| 174 | let matches = set.matches("foobar"); |
| 175 | assert!(!matches.matched(5)); |
| 176 | assert!(matches.matched(6)); |
| 177 | ``` |
| 178 | |
| 179 | # Pay for what you use |
| 180 | |
| 181 | With respect to searching text with a regular expression, there are three |
| 182 | questions that can be asked: |
| 183 | |
| 184 | 1. Does the text match this expression? |
| 185 | 2. If so, where does it match? |
| 186 | 3. Where did the capturing groups match? |
| 187 | |
| 188 | Generally speaking, this crate could provide a function to answer only #3, |
| 189 | which would subsume #1 and #2 automatically. However, it can be significantly |
| 190 | more expensive to compute the location of capturing group matches, so it's best |
| 191 | not to do it if you don't need to. |
| 192 | |
| 193 | Therefore, only use what you need. For example, don't use `find` if you |
| 194 | only need to test if an expression matches a string. (Use `is_match` |
| 195 | instead.) |
| 196 | |
| 197 | # Unicode |
| 198 | |
| 199 | This implementation executes regular expressions **only** on valid UTF-8 |
| 200 | while exposing match locations as byte indices into the search string. (To |
| 201 | relax this restriction, use the [`bytes`](bytes/index.html) sub-module.) |
| 202 | |
| 203 | Only simple case folding is supported. Namely, when matching |
| 204 | case-insensitively, the characters are first mapped using the "simple" case |
| 205 | folding rules defined by Unicode. |
| 206 | |
| 207 | Regular expressions themselves are **only** interpreted as a sequence of |
| 208 | Unicode scalar values. This means you can use Unicode characters directly |
| 209 | in your expression: |
| 210 | |
```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(?i)Δ+").unwrap();
let mat = re.find("ΔδΔ").unwrap();
assert_eq!((mat.start(), mat.end()), (0, 6));
# }
```
| 219 | |
| 220 | Most features of the regular expressions in this crate are Unicode aware. Here |
| 221 | are some examples: |
| 222 | |
| 223 | * `.` will match any valid UTF-8 encoded Unicode scalar value except for `\n`. |
| 224 | (To also match `\n`, enable the `s` flag, e.g., `(?s:.)`.) |
| 225 | * `\w`, `\d` and `\s` are Unicode aware. For example, `\s` will match all forms |
| 226 | of whitespace categorized by Unicode. |
| 227 | * `\b` matches a Unicode word boundary. |
| 228 | * Negated character classes like `[^a]` match all Unicode scalar values except |
| 229 | for `a`. |
| 230 | * `^` and `$` are **not** Unicode aware in multi-line mode. Namely, they only |
| 231 | recognize `\n` and not any of the other forms of line terminators defined |
| 232 | by Unicode. |
| 233 | |
| 234 | Unicode general categories, scripts, script extensions, ages and a smattering |
| 235 | of boolean properties are available as character classes. For example, you can |
| 236 | match a sequence of numerals, Greek or Cherokee letters: |
| 237 | |
```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"[\pN\p{Greek}\p{Cherokee}]+").unwrap();
let mat = re.find("abcΔᎠβⅠᏴγδⅡxyz").unwrap();
assert_eq!((mat.start(), mat.end()), (3, 23));
# }
```
| 246 | |
For a more detailed breakdown of Unicode support with respect to
[UTS#18](https://unicode.org/reports/tr18/),
please see the
[UNICODE](https://github.com/rust-lang/regex/blob/master/UNICODE.md)
document in the root of the regex repository.
| 252 | |
| 253 | # Opt out of Unicode support |
| 254 | |
| 255 | The `bytes` sub-module provides a `Regex` type that can be used to match |
| 256 | on `&[u8]`. By default, text is interpreted as UTF-8 just like it is with |
| 257 | the main `Regex` type. However, this behavior can be disabled by turning |
| 258 | off the `u` flag, even if doing so could result in matching invalid UTF-8. |
| 259 | For example, when the `u` flag is disabled, `.` will match any byte instead |
| 260 | of any Unicode scalar value. |
| 261 | |
| 262 | Disabling the `u` flag is also possible with the standard `&str`-based `Regex` |
| 263 | type, but it is only allowed where the UTF-8 invariant is maintained. For |
| 264 | example, `(?-u:\w)` is an ASCII-only `\w` character class and is legal in an |
| 265 | `&str`-based `Regex`, but `(?-u:\xFF)` will attempt to match the raw byte |
| 266 | `\xFF`, which is invalid UTF-8 and therefore is illegal in `&str`-based |
| 267 | regexes. |
| 268 | |
| 269 | Finally, since Unicode support requires bundling large Unicode data |
| 270 | tables, this crate exposes knobs to disable the compilation of those |
| 271 | data tables, which can be useful for shrinking binary size and reducing |
| 272 | compilation times. For details on how to do that, see the section on [crate |
| 273 | features](#crate-features). |
| 274 | |
| 275 | # Syntax |
| 276 | |
| 277 | The syntax supported in this crate is documented below. |
| 278 | |
| 279 | Note that the regular expression parser and abstract syntax are exposed in |
| 280 | a separate crate, [`regex-syntax`](https://docs.rs/regex-syntax). |
| 281 | |
| 282 | ## Matching one character |
| 283 | |
| 284 | <pre class="rust"> |
| 285 | . any character except new line (includes new line with s flag) |
| 286 | \d digit (\p{Nd}) |
| 287 | \D not digit |
| 288 | \pN One-letter name Unicode character class |
| 289 | \p{Greek} Unicode character class (general category or script) |
| 290 | \PN Negated one-letter name Unicode character class |
| 291 | \P{Greek} negated Unicode character class (general category or script) |
| 292 | </pre> |
| 293 | |
| 294 | ### Character classes |
| 295 | |
| 296 | <pre class="rust"> |
| 297 | [xyz] A character class matching either x, y or z (union). |
| 298 | [^xyz] A character class matching any character except x, y and z. |
| 299 | [a-z] A character class matching any character in range a-z. |
| 300 | [[:alpha:]] ASCII character class ([A-Za-z]) |
| 301 | [[:^alpha:]] Negated ASCII character class ([^A-Za-z]) |
| 302 | [x[^xyz]] Nested/grouping character class (matching any character except y and z) |
| 303 | [a-y&&xyz] Intersection (matching x or y) |
| 304 | [0-9&&[^4]] Subtraction using intersection and negation (matching 0-9 except 4) |
| 305 | [0-9--4] Direct subtraction (matching 0-9 except 4) |
| 306 | [a-g~~b-h] Symmetric difference (matching `a` and `h` only) |
| 307 | [\[\]] Escaping in character classes (matching [ or ]) |
| 308 | </pre> |
| 309 | |
| 310 | Any named character class may appear inside a bracketed `[...]` character |
| 311 | class. For example, `[\p{Greek}[:digit:]]` matches any Greek or ASCII |
| 312 | digit. `[\p{Greek}&&\pL]` matches Greek letters. |
| 313 | |
| 314 | Precedence in character classes, from most binding to least: |
| 315 | |
| 316 | 1. Ranges: `a-cd` == `[a-c]d` |
| 317 | 2. Union: `ab&&bc` == `[ab]&&[bc]` |
| 318 | 3. Intersection: `^a-z&&b` == `^[a-z&&b]` |
| 319 | 4. Negation |
| 320 | |
| 321 | ## Composites |
| 322 | |
| 323 | <pre class="rust"> |
| 324 | xy concatenation (x followed by y) |
| 325 | x|y alternation (x or y, prefer x) |
| 326 | </pre> |
| 327 | |
| 328 | ## Repetitions |
| 329 | |
| 330 | <pre class="rust"> |
| 331 | x* zero or more of x (greedy) |
| 332 | x+ one or more of x (greedy) |
| 333 | x? zero or one of x (greedy) |
| 334 | x*? zero or more of x (ungreedy/lazy) |
| 335 | x+? one or more of x (ungreedy/lazy) |
| 336 | x?? zero or one of x (ungreedy/lazy) |
| 337 | x{n,m} at least n x and at most m x (greedy) |
| 338 | x{n,} at least n x (greedy) |
| 339 | x{n} exactly n x |
| 340 | x{n,m}? at least n x and at most m x (ungreedy/lazy) |
| 341 | x{n,}? at least n x (ungreedy/lazy) |
| 342 | x{n}? exactly n x |
| 343 | </pre> |
| 344 | |
| 345 | ## Empty matches |
| 346 | |
| 347 | <pre class="rust"> |
| 348 | ^ the beginning of text (or start-of-line with multi-line mode) |
| 349 | $ the end of text (or end-of-line with multi-line mode) |
| 350 | \A only the beginning of text (even with multi-line mode enabled) |
| 351 | \z only the end of text (even with multi-line mode enabled) |
| 352 | \b a Unicode word boundary (\w on one side and \W, \A, or \z on other) |
| 353 | \B not a Unicode word boundary |
| 354 | </pre> |
| 355 | |
| 356 | ## Grouping and flags |
| 357 | |
| 358 | <pre class="rust"> |
| 359 | (exp) numbered capture group (indexed by opening parenthesis) |
(?P<name>exp) named (also numbered) capture group (allowed chars: [_0-9a-zA-Z.\[\]])
(?:exp) non-capturing group
| 362 | (?flags) set flags within current group |
| 363 | (?flags:exp) set flags for exp (non-capturing) |
| 364 | </pre> |
| 365 | |
| 366 | Flags are each a single character. For example, `(?x)` sets the flag `x` |
| 367 | and `(?-x)` clears the flag `x`. Multiple flags can be set or cleared at |
| 368 | the same time: `(?xy)` sets both the `x` and `y` flags and `(?x-y)` sets |
| 369 | the `x` flag and clears the `y` flag. |
| 370 | |
| 371 | All flags are by default disabled unless stated otherwise. They are: |
| 372 | |
| 373 | <pre class="rust"> |
| 374 | i case-insensitive: letters match both upper and lower case |
| 375 | m multi-line mode: ^ and $ match begin/end of line |
| 376 | s allow . to match \n |
| 377 | U swap the meaning of x* and x*? |
| 378 | u Unicode support (enabled by default) |
| 379 | x ignore whitespace and allow line comments (starting with `#`) |
| 380 | </pre> |
| 381 | |
| 382 | Flags can be toggled within a pattern. Here's an example that matches |
| 383 | case-insensitively for the first part but case-sensitively for the second part: |
| 384 | |
```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(?i)a+(?-i)b+").unwrap();
let cap = re.captures("AaAaAbbBBBb").unwrap();
assert_eq!(&cap[0], "AaAaAbb");
# }
```
| 393 | |
| 394 | Notice that the `a+` matches either `a` or `A`, but the `b+` only matches |
| 395 | `b`. |
| 396 | |
| 397 | Multi-line mode means `^` and `$` no longer match just at the beginning/end of |
| 398 | the input, but at the beginning/end of lines: |
| 399 | |
| 400 | ``` |
| 401 | # use regex::Regex; |
| 402 | let re = Regex::new(r"(?m)^line \d+").unwrap(); |
| 403 | let m = re.find("line one\nline 2\n").unwrap(); |
| 404 | assert_eq!(m.as_str(), "line 2"); |
| 405 | ``` |
| 406 | |
| 407 | Note that `^` matches after new lines, even at the end of input: |
| 408 | |
| 409 | ``` |
| 410 | # use regex::Regex; |
| 411 | let re = Regex::new(r"(?m)^").unwrap(); |
| 412 | let m = re.find_iter("test\n").last().unwrap(); |
| 413 | assert_eq!((m.start(), m.end()), (5, 5)); |
| 414 | ``` |
| 415 | |
| 416 | Here is an example that uses an ASCII word boundary instead of a Unicode |
| 417 | word boundary: |
| 418 | |
```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(?-u:\b).+(?-u:\b)").unwrap();
let cap = re.captures("$$abc$$").unwrap();
assert_eq!(&cap[0], "abc");
# }
```
| 427 | |
| 428 | ## Escape sequences |
| 429 | |
| 430 | <pre class="rust"> |
| 431 | \* literal *, works for any punctuation character: \.+*?()|[]{}^$ |
| 432 | \a bell (\x07) |
| 433 | \f form feed (\x0C) |
| 434 | \t horizontal tab |
| 435 | \n new line |
| 436 | \r carriage return |
| 437 | \v vertical tab (\x0B) |
| 438 | \123 octal character code (up to three digits) (when enabled) |
| 439 | \x7F hex character code (exactly two digits) |
| 440 | \x{10FFFF} any hex character code corresponding to a Unicode code point |
| 441 | \u007F hex character code (exactly four digits) |
| 442 | \u{7F} any hex character code corresponding to a Unicode code point |
| 443 | \U0000007F hex character code (exactly eight digits) |
| 444 | \U{7F} any hex character code corresponding to a Unicode code point |
| 445 | </pre> |
| 446 | |
| 447 | ## Perl character classes (Unicode friendly) |
| 448 | |
| 449 | These classes are based on the definitions provided in |
[UTS#18](https://www.unicode.org/reports/tr18/#Compatibility_Properties):
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 451 | |
| 452 | <pre class="rust"> |
| 453 | \d digit (\p{Nd}) |
| 454 | \D not digit |
| 455 | \s whitespace (\p{White_Space}) |
| 456 | \S not whitespace |
| 457 | \w word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control}) |
| 458 | \W not word character |
| 459 | </pre> |
| 460 | |
| 461 | ## ASCII character classes |
| 462 | |
| 463 | <pre class="rust"> |
| 464 | [[:alnum:]] alphanumeric ([0-9A-Za-z]) |
| 465 | [[:alpha:]] alphabetic ([A-Za-z]) |
| 466 | [[:ascii:]] ASCII ([\x00-\x7F]) |
| 467 | [[:blank:]] blank ([\t ]) |
| 468 | [[:cntrl:]] control ([\x00-\x1F\x7F]) |
| 469 | [[:digit:]] digits ([0-9]) |
| 470 | [[:graph:]] graphical ([!-~]) |
| 471 | [[:lower:]] lower case ([a-z]) |
| 472 | [[:print:]] printable ([ -~]) |
| 473 | [[:punct:]] punctuation ([!-/:-@\[-`{-~]) |
| 474 | [[:space:]] whitespace ([\t\n\v\f\r ]) |
| 475 | [[:upper:]] upper case ([A-Z]) |
| 476 | [[:word:]] word characters ([0-9A-Za-z_]) |
| 477 | [[:xdigit:]] hex digit ([0-9A-Fa-f]) |
| 478 | </pre> |
| 479 | |
| 480 | # Crate features |
| 481 | |
| 482 | By default, this crate tries pretty hard to make regex matching both as fast |
| 483 | as possible and as correct as it can be, within reason. This means that there |
| 484 | is a lot of code dedicated to performance, the handling of Unicode data and the |
| 485 | Unicode data itself. Overall, this leads to more dependencies, larger binaries |
| 486 | and longer compile times. This trade off may not be appropriate in all cases, |
| 487 | and indeed, even when all Unicode and performance features are disabled, one |
| 488 | is still left with a perfectly serviceable regex engine that will work well |
| 489 | in many cases. |
| 490 | |
| 491 | This crate exposes a number of features for controlling that trade off. Some |
| 492 | of these features are strictly performance oriented, such that disabling them |
| 493 | won't result in a loss of functionality, but may result in worse performance. |
| 494 | Other features, such as the ones controlling the presence or absence of Unicode |
| 495 | data, can result in a loss of functionality. For example, if one disables the |
| 496 | `unicode-case` feature (described below), then compiling the regex `(?i)a` |
will fail since Unicode case insensitivity is enabled by default. Instead,
callers must use `(?i-u)a` to disable Unicode case folding. Stated
| 499 | differently, enabling or disabling any of the features below can only add or |
| 500 | subtract from the total set of valid regular expressions. Enabling or disabling |
| 501 | a feature will never modify the match semantics of a regular expression. |
| 502 | |
| 503 | All features below are enabled by default. |
| 504 | |
| 505 | ### Ecosystem features |
| 506 | |
| 507 | * **std** - |
| 508 | When enabled, this will cause `regex` to use the standard library. Currently, |
| 509 | disabling this feature will always result in a compilation error. It is |
| 510 | intended to add `alloc`-only support to regex in the future. |
| 511 | |
| 512 | ### Performance features |
| 513 | |
| 514 | * **perf** - |
| 515 | Enables all performance related features. This feature is enabled by default |
| 516 | and will always cover all features that improve performance, even if more |
| 517 | are added in the future. |
* **perf-dfa** -
| 519 | Enables the use of a lazy DFA for matching. The lazy DFA is used to compile |
| 520 | portions of a regex to a very fast DFA on an as-needed basis. This can |
| 521 | result in substantial speedups, usually by an order of magnitude on large |
| 522 | haystacks. The lazy DFA does not bring in any new dependencies, but it can |
| 523 | make compile times longer. |
| 524 | * **perf-inline** - |
| 525 | Enables the use of aggressive inlining inside match routines. This reduces |
| 526 | the overhead of each match. The aggressive inlining, however, increases |
| 527 | compile times and binary size. |
| 528 | * **perf-literal** - |
| 529 | Enables the use of literal optimizations for speeding up matches. In some |
| 530 | cases, literal optimizations can result in speedups of _several_ orders of |
| 531 | magnitude. Disabling this drops the `aho-corasick` and `memchr` dependencies. |
* **perf-cache** -
| 533 | This feature used to enable a faster internal cache at the cost of using |
| 534 | additional dependencies, but this is no longer an option. A fast internal |
| 535 | cache is now used unconditionally with no additional dependencies. This may |
| 536 | change in the future. |
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 537 | |
| 538 | ### Unicode features |
| 539 | |
| 540 | * **unicode** - |
| 541 | Enables all Unicode features. This feature is enabled by default, and will |
| 542 | always cover all Unicode features, even if more are added in the future. |
| 543 | * **unicode-age** - |
| 544 | Provide the data for the |
| 545 | [Unicode `Age` property](https://www.unicode.org/reports/tr44/tr44-24.html#Character_Age). |
| 546 | This makes it possible to use classes like `\p{Age:6.0}` to refer to all |
codepoints first introduced in Unicode 6.0.
| 548 | * **unicode-bool** - |
| 549 | Provide the data for numerous Unicode boolean properties. The full list |
| 550 | is not included here, but contains properties like `Alphabetic`, `Emoji`, |
| 551 | `Lowercase`, `Math`, `Uppercase` and `White_Space`. |
| 552 | * **unicode-case** - |
| 553 | Provide the data for case insensitive matching using |
| 554 | [Unicode's "simple loose matches" specification](https://www.unicode.org/reports/tr18/#Simple_Loose_Matches). |
| 555 | * **unicode-gencat** - |
| 556 | Provide the data for |
[Unicode general categories](https://www.unicode.org/reports/tr44/tr44-24.html#General_Category_Values).
This includes, but is not limited to, `Decimal_Number`, `Letter`,
| 559 | `Math_Symbol`, `Number` and `Punctuation`. |
| 560 | * **unicode-perl** - |
| 561 | Provide the data for supporting the Unicode-aware Perl character classes, |
| 562 | corresponding to `\w`, `\s` and `\d`. This is also necessary for using |
| 563 | Unicode-aware word boundary assertions. Note that if this feature is |
| 564 | disabled, the `\s` and `\d` character classes are still available if the |
| 565 | `unicode-bool` and `unicode-gencat` features are enabled, respectively. |
| 566 | * **unicode-script** - |
| 567 | Provide the data for |
| 568 | [Unicode scripts and script extensions](https://www.unicode.org/reports/tr24/). |
| 569 | This includes, but is not limited to, `Arabic`, `Cyrillic`, `Hebrew`, |
| 570 | `Latin` and `Thai`. |
| 571 | * **unicode-segment** - |
| 572 | Provide the data necessary to provide the properties used to implement the |
| 573 | [Unicode text segmentation algorithms](https://www.unicode.org/reports/tr29/). |
| 574 | This enables using classes like `\p{gcb=Extend}`, `\p{wb=Katakana}` and |
| 575 | `\p{sb=ATerm}`. |
| 576 | |
| 577 | |
| 578 | # Untrusted input |
| 579 | |
| 580 | This crate can handle both untrusted regular expressions and untrusted |
| 581 | search text. |
| 582 | |
| 583 | Untrusted regular expressions are handled by capping the size of a compiled |
| 584 | regular expression. |
| 585 | (See [`RegexBuilder::size_limit`](struct.RegexBuilder.html#method.size_limit).) |
| 586 | Without this, it would be trivial for an attacker to exhaust your system's |
| 587 | memory with expressions like `a{100}{100}{100}`. |
| 588 | |
| 589 | Untrusted search text is allowed because the matching engine(s) in this |
| 590 | crate have time complexity `O(mn)` (with `m ~ regex` and `n ~ search |
| 591 | text`), which means there's no way to cause exponential blow-up like with |
| 592 | some other regular expression engines. (We pay for this by disallowing |
| 593 | features like arbitrary look-ahead and backreferences.) |
| 594 | |
| 595 | When a DFA is used, pathological cases with exponential state blow-up are |
| 596 | avoided by constructing the DFA lazily or in an "online" manner. Therefore, |
| 597 | at most one new state can be created for each byte of input. This satisfies |
| 598 | our time complexity guarantees, but can lead to memory growth |
| 599 | proportional to the size of the input. As a stopgap, the DFA is only |
allowed to store a fixed number of states. When the limit is reached, its
states are wiped and it continues on, possibly duplicating previous work. If
| 602 | the limit is reached too frequently, it gives up and hands control off to |
| 603 | another matching engine with fixed memory requirements. |
| 604 | (The DFA size limit can also be tweaked. See |
| 605 | [`RegexBuilder::dfa_size_limit`](struct.RegexBuilder.html#method.dfa_size_limit).) |
| 606 | */ |
| 607 | |
| 608 | #![deny(missing_docs)] |
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 609 | #![cfg_attr(feature = "pattern", feature(pattern))] |
Haibo Huang | 47619dd | 2021-01-08 17:05:43 -0800 | [diff] [blame] | 610 | #![warn(missing_debug_implementations)] |
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 611 | |
| 612 | #[cfg(not(feature = "std"))] |
| 613 | compile_error!("`std` feature is currently required to build this crate"); |
| 614 | |
Joel Galenson | 3874808 | 2021-05-19 16:51:51 -0700 | [diff] [blame^] | 615 | // To check README's example |
| 616 | // TODO: Re-enable this once the MSRV is 1.43 or greater. |
| 617 | // See: https://github.com/rust-lang/regex/issues/684 |
| 618 | // See: https://github.com/rust-lang/regex/issues/685 |
Haibo Huang | 49cbe5f | 2020-05-28 20:14:24 -0700 | [diff] [blame] | 619 | // #[cfg(doctest)] |
| 620 | // doc_comment::doctest!("../README.md"); |
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 621 | |
| 622 | #[cfg(feature = "std")] |
Joel Galenson | 3874808 | 2021-05-19 16:51:51 -0700 | [diff] [blame^] | 623 | pub use crate::error::Error; |
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 624 | #[cfg(feature = "std")] |
Joel Galenson | 3874808 | 2021-05-19 16:51:51 -0700 | [diff] [blame^] | 625 | pub use crate::re_builder::set_unicode::*; |
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 626 | #[cfg(feature = "std")] |
Joel Galenson | 3874808 | 2021-05-19 16:51:51 -0700 | [diff] [blame^] | 627 | pub use crate::re_builder::unicode::*; |
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 628 | #[cfg(feature = "std")] |
Joel Galenson | 3874808 | 2021-05-19 16:51:51 -0700 | [diff] [blame^] | 629 | pub use crate::re_set::unicode::*; |
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 630 | #[cfg(feature = "std")]
Joel Galenson | 3874808 | 2021-05-19 16:51:51 -0700 | [diff] [blame^] | 632 | pub use crate::re_unicode::{ |
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 633 | escape, CaptureLocations, CaptureMatches, CaptureNames, Captures, |
| 634 | Locations, Match, Matches, NoExpand, Regex, Replacer, ReplacerRef, Split, |
| 635 | SplitN, SubCaptureMatches, |
| 636 | }; |
| 637 | |
| 638 | /** |
| 639 | Match regular expressions on arbitrary bytes. |
| 640 | |
| 641 | This module provides a nearly identical API to the one found in the |
| 642 | top-level of this crate. There are two important differences: |
| 643 | |
| 644 | 1. Matching is done on `&[u8]` instead of `&str`. Additionally, `Vec<u8>` |
| 645 | is used where `String` would have been used. |
| 646 | 2. Unicode support can be disabled even when disabling it would result in |
| 647 | matching invalid UTF-8 bytes. |
| 648 | |
| 649 | # Example: match null-terminated strings
| 650 | |
| 651 | This shows how to find all null-terminated strings in a slice of bytes: |
| 652 | |
| 653 | ```rust |
| 654 | # use regex::bytes::Regex; |
| 655 | let re = Regex::new(r"(?-u)(?P<cstr>[^\x00]+)\x00").unwrap(); |
| 656 | let text = b"foo\x00bar\x00baz\x00"; |
| 657 | |
| 658 | // Extract all of the strings without the null terminator from each match. |
| 659 | // The unwrap is OK here since a match requires the `cstr` capture to match. |
| 660 | let cstrs: Vec<&[u8]> = |
| 661 | re.captures_iter(text) |
| 662 | .map(|c| c.name("cstr").unwrap().as_bytes()) |
| 663 | .collect(); |
| 664 | assert_eq!(vec![&b"foo"[..], &b"bar"[..], &b"baz"[..]], cstrs); |
| 665 | ``` |
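As a cross-check on what the pattern above selects, the same extraction can be sketched with the standard library alone. This is not how the crate performs matching. `nul_terminated` is a hypothetical helper introduced only for illustration. It assumes the intended semantics are "every non-empty run of non-NUL bytes that is immediately followed by a NUL", which is what `[^\x00]+\x00` describes.

```rust
/// Hypothetical helper (not part of the `regex` crate): collects every
/// non-empty run of non-NUL bytes that is followed by a NUL terminator.
/// A trailing run without a terminator is ignored, mirroring the pattern
/// `[^\x00]+\x00`.
fn nul_terminated(text: &[u8]) -> Vec<&[u8]> {
    let mut out = Vec::new();
    let mut start = 0;
    for (i, &b) in text.iter().enumerate() {
        if b == 0 {
            // Only keep non-empty runs, since `+` requires at least one byte.
            if i > start {
                out.push(&text[start..i]);
            }
            start = i + 1;
        }
    }
    out
}

fn main() {
    let text = b"foo\x00bar\x00baz\x00";
    let cstrs = nul_terminated(text);
    assert_eq!(vec![&b"foo"[..], &b"bar"[..], &b"baz"[..]], cstrs);
}
```

A hand-written loop like this is useful mainly as a test oracle. The regex version remains easier to extend, for example to restrict which bytes may appear in a string.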
| 666 | |
| 667 | # Example: selectively enable Unicode support |
| 668 | |
| 669 | This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded |
| 670 | string (e.g., to extract a title from a Matroska file): |
| 671 | |
| 672 | ```rust |
| 673 | # use std::str; |
| 674 | # use regex::bytes::Regex; |
| 675 | let re = Regex::new( |
| 676 | r"(?-u)\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))" |
| 677 | ).unwrap(); |
| 678 | let text = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65"; |
| 679 | let caps = re.captures(text).unwrap(); |
| 680 | |
| 681 | // Notice that despite the `.*` at the end, it will only match valid UTF-8 |
| 682 | // because Unicode mode was enabled with the `u` flag. Without the `u` flag, |
| 683 | // the `.*` would match the rest of the bytes. |
| 684 | let mat = caps.get(1).unwrap(); |
| 685 | assert_eq!((7, 10), (mat.start(), mat.end())); |
| 686 | |
| 687 | // If there was a match, Unicode mode guarantees that `title` is valid UTF-8. |
| 688 | let title = str::from_utf8(&caps[1]).unwrap(); |
| 689 | assert_eq!("☃", title); |
| 690 | ``` |
| 691 | |
| 692 | In general, if the Unicode flag is enabled in a capture group and that capture |
| 693 | is part of the overall match, then the capture is *guaranteed* to be valid |
| 694 | UTF-8. |
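To see why the capture in the example above ends at offset 10, it may help to inspect the bytes with the standard library only. The sketch below does not use this crate. `valid_utf8_prefix_len` is a hypothetical helper that reports how many leading bytes of a slice form valid UTF-8. For this particular input, that prefix coincides with what `(?u:(.*))` captures starting at offset 7.

```rust
use std::str;

/// Hypothetical helper (not part of the `regex` crate): returns the length
/// of the longest prefix of `bytes` that is valid UTF-8.
fn valid_utf8_prefix_len(bytes: &[u8]) -> usize {
    match str::from_utf8(bytes) {
        Ok(s) => s.len(),
        Err(e) => e.valid_up_to(),
    }
}

fn main() {
    let text = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65";

    // The capture group in the example starts at offset 7.
    let n = valid_utf8_prefix_len(&text[7..]);

    // `\xe2\x98\x83` is the UTF-8 encoding of U+2603. The next byte, `\x80`,
    // is a continuation byte with no lead byte, so valid UTF-8 stops there.
    assert_eq!(3, n);
    assert_eq!("☃", str::from_utf8(&text[7..7 + n]).unwrap());
}
```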
| 695 | |
| 696 | # Syntax |
| 697 | |
| 698 | The supported syntax is nearly identical to the syntax for Unicode
| 699 | regular expressions, with a few changes that make sense for matching arbitrary
| 700 | bytes:
| 701 | |
| 702 | 1. The `u` flag can be disabled even when disabling it might cause the regex to |
| 703 | match invalid UTF-8. When the `u` flag is disabled, the regex is said to be in |
| 704 | "ASCII compatible" mode. |
| 705 | 2. In ASCII compatible mode, neither Unicode scalar values nor Unicode |
| 706 | character classes are allowed. |
| 707 | 3. In ASCII compatible mode, Perl character classes (`\w`, `\d` and `\s`) |
| 708 | revert to their typical ASCII definition. `\w` maps to `[[:word:]]`, `\d` maps |
| 709 | to `[[:digit:]]` and `\s` maps to `[[:space:]]`. |
| 710 | 4. In ASCII compatible mode, word boundaries use the ASCII compatible `\w` to |
| 711 | determine whether a byte is a word byte or not. |
| 712 | 5. Hexadecimal notation can be used to specify arbitrary bytes instead of |
| 713 | Unicode codepoints. For example, in ASCII compatible mode, `\xFF` matches the |
| 714 | literal byte `\xFF`, while in Unicode mode, `\xFF` is a Unicode codepoint that |
| 715 | matches its UTF-8 encoding of `\xC3\xBF`. The same applies to octal notation
| 716 | when it is enabled.
Chih-Hung Hsieh | 849e445 | 2020-10-26 13:16:47 -0700 | [diff] [blame] | 717 | 6. In ASCII compatible mode, `.` matches any *byte* except for `\n`. When the |
| 718 | `s` flag is additionally enabled, `.` matches any byte. |
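Two of the points above can be checked with the standard library alone. The sketch below does not call into this crate. `is_ascii_word_byte` is a hypothetical helper that mirrors the ASCII definition of `[[:word:]]`, assumed here to be `[0-9A-Za-z_]`. The first half of `main` confirms that the codepoint U+00FF is encoded in UTF-8 as the two bytes `\xC3\xBF`, which is why `\xFF` means different things in the two modes.

```rust
/// Hypothetical helper (not part of the `regex` crate): mirrors the ASCII
/// `[[:word:]]` class, i.e. `[0-9A-Za-z_]`.
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

fn main() {
    // As a Unicode codepoint, U+00FF is encoded in UTF-8 as two bytes.
    let mut buf = [0u8; 4];
    let encoded = '\u{FF}'.encode_utf8(&mut buf).as_bytes();
    assert_eq!(&[0xC3u8, 0xBF][..], encoded);

    // The single raw byte 0xFF is not valid UTF-8 on its own.
    assert!(std::str::from_utf8(&[0xFFu8]).is_err());

    // ASCII word bytes: letters, digits and underscore only.
    assert!(is_ascii_word_byte(b'a'));
    assert!(is_ascii_word_byte(b'_'));
    assert!(!is_ascii_word_byte(b'-'));
    // A UTF-8 lead byte such as 0xC3 is not a word byte under this definition.
    assert!(!is_ascii_word_byte(0xC3));
}
```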
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 719 | |
| 720 | # Performance |
| 721 | |
| 722 | In general, one should expect performance on `&[u8]` to be roughly similar to |
| 723 | performance on `&str`. |
| 724 | */ |
| 725 | #[cfg(feature = "std")] |
| 726 | pub mod bytes { |
Joel Galenson | 3874808 | 2021-05-19 16:51:51 -0700 | [diff] [blame^] | 727 | pub use crate::re_builder::bytes::*; |
| 728 | pub use crate::re_builder::set_bytes::*; |
| 729 | pub use crate::re_bytes::*; |
| 730 | pub use crate::re_set::bytes::*; |
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 731 | } |
| 732 | |
| 733 | mod backtrack; |
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 734 | mod compile; |
| 735 | #[cfg(feature = "perf-dfa")] |
| 736 | mod dfa; |
| 737 | mod error; |
| 738 | mod exec; |
| 739 | mod expand; |
| 740 | mod find_byte; |
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 741 | mod input; |
| 742 | mod literal; |
| 743 | #[cfg(feature = "pattern")] |
| 744 | mod pattern; |
| 745 | mod pikevm; |
Elliott Hughes | ffb6030 | 2021-04-01 17:11:40 -0700 | [diff] [blame] | 746 | mod pool; |
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 747 | mod prog; |
| 748 | mod re_builder; |
| 749 | mod re_bytes; |
| 750 | mod re_set; |
| 751 | mod re_trait; |
| 752 | mod re_unicode; |
| 753 | mod sparse; |
| 754 | mod utf8; |
| 755 | |
| 756 | /// The `internal` module exists to support suspicious activity, such as |
| 757 | /// testing different matching engines and supporting the `regex-debug` CLI |
| 758 | /// utility. |
| 759 | #[doc(hidden)] |
| 760 | #[cfg(feature = "std")] |
| 761 | pub mod internal { |
Joel Galenson | 3874808 | 2021-05-19 16:51:51 -0700 | [diff] [blame^] | 762 | pub use crate::compile::Compiler; |
| 763 | pub use crate::exec::{Exec, ExecBuilder}; |
| 764 | pub use crate::input::{Char, CharInput, Input, InputAt}; |
| 765 | pub use crate::literal::LiteralSearcher; |
| 766 | pub use crate::prog::{EmptyLook, Inst, InstRanges, Program}; |
Chih-Hung Hsieh | e42c505 | 2020-04-16 10:44:21 -0700 | [diff] [blame] | 767 | } |