/*!
This crate provides a library for parsing, compiling, and executing regular
expressions. Its syntax is similar to Perl-style regular expressions, but lacks
a few features like look around and backreferences. In exchange, all searches
execute in linear time with respect to the size of the regular expression and
search text.

This crate's documentation provides some simple examples, describes
[Unicode support](#unicode) and exhaustively lists the
[supported syntax](#syntax).

For more specific details on the API for regular expressions, please see the
documentation for the [`Regex`](struct.Regex.html) type.

# Usage

This crate is [on crates.io](https://crates.io/crates/regex) and can be
used by adding `regex` to your dependencies in your project's `Cargo.toml`.

```toml
[dependencies]
regex = "1"
```

# Example: find a date

General use of regular expressions in this package involves compiling an
expression and then using it to search, split or replace text. For example,
to confirm that some text resembles a date:

```rust
use regex::Regex;
let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap();
assert!(re.is_match("2014-01-01"));
```

Notice the use of the `^` and `$` anchors. In this crate, every expression
is executed with an implicit `.*?` at the beginning and end, which allows
it to match anywhere in the text. Anchors can be used to ensure that the
full text matches an expression.

This example also demonstrates the utility of
[raw strings](https://doc.rust-lang.org/stable/reference/tokens.html#raw-string-literals)
in Rust, which are just like regular strings except they are prefixed with
an `r` and do not process any escape sequences. For example, `"\\d"` is the
same expression as `r"\d"`.

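The equivalence of the two spellings can be checked with the standard library
alone. The sketch below does not touch the `regex` API, and the helper name
`digit_pattern_spellings` is made up for illustration:

```rust
/// Returns the pattern `\d` written first as a normal string literal and
/// then as a raw string literal. (Hypothetical helper, for illustration.)
fn digit_pattern_spellings() -> (&'static str, &'static str) {
    // In a normal literal the backslash must itself be escaped.
    let escaped = "\\d";
    // In a raw literal the backslash is taken as-is.
    let raw = r"\d";
    (escaped, raw)
}
```
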
# Example: Avoid compiling the same regex in a loop

It is an anti-pattern to compile the same regular expression in a loop
since compilation is typically expensive. (It takes anywhere from a few
microseconds to a few **milliseconds** depending on the size of the
regex.) Not only is compilation itself expensive, but this also prevents
optimizations that reuse allocations internally to the matching engines.

In Rust, it can sometimes be a pain to pass regular expressions around if
they're used from inside a helper function. Instead, we recommend using the
[`lazy_static`](https://crates.io/crates/lazy_static) crate to ensure that
regular expressions are compiled exactly once.

For example:

```rust
use lazy_static::lazy_static;
use regex::Regex;

fn some_helper_function(text: &str) -> bool {
    lazy_static! {
        static ref RE: Regex = Regex::new("...").unwrap();
    }
    RE.is_match(text)
}

fn main() {}
```

Specifically, in this example, the regex will be compiled when it is used for
the first time. On subsequent uses, it will reuse the previous compilation.

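The same compile-once behavior can be sketched without any extra crate. This
is only an illustration under assumptions: it relies on `std::sync::OnceLock`,
which needs a newer Rust toolchain than this crate's minimum supported
version, and `Compiled` and `COMPILATIONS` are hypothetical stand-ins for a
real `Regex` so that the number of initializations can be counted:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the stand-in "compilation" below has run.
static COMPILATIONS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a compiled regex; it only does substring search.
struct Compiled {
    needle: String,
}

impl Compiled {
    /// Stand-in for an expensive `Regex::new` call.
    fn new(needle: &str) -> Compiled {
        COMPILATIONS.fetch_add(1, Ordering::SeqCst);
        Compiled { needle: needle.to_string() }
    }

    /// Stand-in for `Regex::is_match`.
    fn is_match(&self, text: &str) -> bool {
        text.contains(self.needle.as_str())
    }
}

/// Builds its `Compiled` value on the first call and reuses it afterwards.
fn some_helper_function(text: &str) -> bool {
    static RE: OnceLock<Compiled> = OnceLock::new();
    RE.get_or_init(|| Compiled::new("foo")).is_match(text)
}
```
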
# Example: iterating over capture groups

This crate provides convenient iterators for matching an expression
repeatedly against a search string to find successive non-overlapping
matches. For example, to find all dates in a string and be able to access
them by their component pieces:

```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
let text = "2012-03-14, 2013-01-01 and 2014-07-05";
for cap in re.captures_iter(text) {
    println!("Month: {} Day: {} Year: {}", &cap[2], &cap[3], &cap[1]);
}
// Output:
// Month: 03 Day: 14 Year: 2012
// Month: 01 Day: 01 Year: 2013
// Month: 07 Day: 05 Year: 2014
# }
```

Notice that the year is in the capture group indexed at `1`. This is
because the *entire match* is stored in the capture group at index `0`.

# Example: replacement with named capture groups

Building on the previous example, perhaps we'd like to rearrange the date
formats. This can be done with text replacement. But to make the code
clearer, we can *name* our capture groups and use those names as variables
in our replacement text:

```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})").unwrap();
let before = "2012-03-14, 2013-01-01 and 2014-07-05";
let after = re.replace_all(before, "$m/$d/$y");
assert_eq!(after, "03/14/2012, 01/01/2013 and 07/05/2014");
# }
```

The `replace` methods are actually polymorphic in the replacement, which
provides more flexibility than is seen here. (See the documentation for
`Regex::replace` for more details.)

Note that if your regex gets complicated, you can use the `x` flag to
enable insignificant whitespace mode, which also lets you write comments:

```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(?x)
  (?P<y>\d{4}) # the year
  -
  (?P<m>\d{2}) # the month
  -
  (?P<d>\d{2}) # the day
").unwrap();
let before = "2012-03-14, 2013-01-01 and 2014-07-05";
let after = re.replace_all(before, "$m/$d/$y");
assert_eq!(after, "03/14/2012, 01/01/2013 and 07/05/2014");
# }
```

If you wish to match against whitespace in this mode, you can still use `\s`,
`\n`, `\t`, etc. For escaping a single space character, you can escape it
directly with `\ `, use its hex character code `\x20` or temporarily disable
the `x` flag, e.g., `(?-x: )`.

# Example: match multiple regular expressions simultaneously

This demonstrates how to use a `RegexSet` to match multiple (possibly
overlapping) regular expressions in a single scan of the search text:

```rust
use regex::RegexSet;

let set = RegexSet::new(&[
    r"\w+",
    r"\d+",
    r"\pL+",
    r"foo",
    r"bar",
    r"barfoo",
    r"foobar",
]).unwrap();

// Iterate over and collect all of the matches.
let matches: Vec<_> = set.matches("foobar").into_iter().collect();
assert_eq!(matches, vec![0, 2, 3, 4, 6]);

// You can also test whether a particular regex matched:
let matches = set.matches("foobar");
assert!(!matches.matched(5));
assert!(matches.matched(6));
```

# Pay for what you use

With respect to searching text with a regular expression, there are three
questions that can be asked:

1. Does the text match this expression?
2. If so, where does it match?
3. Where did the capturing groups match?

Generally speaking, this crate could provide a function to answer only #3,
which would subsume #1 and #2 automatically. However, it can be significantly
more expensive to compute the location of capturing group matches, so it's best
not to do it if you don't need to.

Therefore, only use what you need. For example, don't use `find` if you
only need to test if an expression matches a string. (Use `is_match`
instead.)

# Unicode

This implementation executes regular expressions **only** on valid UTF-8
while exposing match locations as byte indices into the search string. (To
relax this restriction, use the [`bytes`](bytes/index.html) sub-module.)

Only simple case folding is supported. Namely, when matching
case-insensitively, the characters are first mapped using the "simple" case
folding rules defined by Unicode.

Regular expressions themselves are **only** interpreted as a sequence of
Unicode scalar values. This means you can use Unicode characters directly
in your expression:

```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(?i)Δ+").unwrap();
let mat = re.find("ΔδΔ").unwrap();
assert_eq!((mat.start(), mat.end()), (0, 6));
# }
```

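The end offset of `6` in the example above is a byte offset, not a character
count: each of the three Greek letters occupies two bytes in UTF-8, so the
characters start at byte offsets 0, 2 and 4. A small standard-library sketch
makes this visible (the helper name `char_byte_offsets` is made up for
illustration):

```rust
/// Returns the byte offset at which each character of `text` starts.
fn char_byte_offsets(text: &str) -> Vec<usize> {
    text.char_indices().map(|(offset, _)| offset).collect()
}
```
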
Most features of the regular expressions in this crate are Unicode aware. Here
are some examples:

* `.` will match any valid UTF-8 encoded Unicode scalar value except for `\n`.
  (To also match `\n`, enable the `s` flag, e.g., `(?s:.)`.)
* `\w`, `\d` and `\s` are Unicode aware. For example, `\s` will match all forms
  of whitespace categorized by Unicode.
* `\b` matches a Unicode word boundary.
* Negated character classes like `[^a]` match all Unicode scalar values except
  for `a`.
* `^` and `$` are **not** Unicode aware in multi-line mode. Namely, they only
  recognize `\n` and not any of the other forms of line terminators defined
  by Unicode.

Unicode general categories, scripts, script extensions, ages and a smattering
of boolean properties are available as character classes. For example, you can
match a sequence of numerals, Greek or Cherokee letters:

```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"[\pN\p{Greek}\p{Cherokee}]+").unwrap();
let mat = re.find("abcΔᎠβⅠᏴγδⅡxyz").unwrap();
assert_eq!((mat.start(), mat.end()), (3, 23));
# }
```

For a more detailed breakdown of Unicode support with respect to
[UTS#18](https://unicode.org/reports/tr18/), please see the
[UNICODE](https://github.com/rust-lang/regex/blob/master/UNICODE.md)
document in the root of the regex repository.

# Opt out of Unicode support

The `bytes` sub-module provides a `Regex` type that can be used to match
on `&[u8]`. By default, text is interpreted as UTF-8 just like it is with
the main `Regex` type. However, this behavior can be disabled by turning
off the `u` flag, even if doing so could result in matching invalid UTF-8.
For example, when the `u` flag is disabled, `.` will match any byte instead
of any Unicode scalar value.

Disabling the `u` flag is also possible with the standard `&str`-based `Regex`
type, but it is only allowed where the UTF-8 invariant is maintained. For
example, `(?-u:\w)` is an ASCII-only `\w` character class and is legal in an
`&str`-based `Regex`, but `(?-u:\xFF)` will attempt to match the raw byte
`\xFF`, which is invalid UTF-8 and therefore is illegal in `&str`-based
regexes.

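The reason `(?-u:\xFF)` is rejected for `&str` can be seen with the standard
library alone: the single byte `0xFF` is not valid UTF-8, so a match
consisting of it could never be handed back as a `&str`. A minimal sketch,
using a hypothetical helper name:

```rust
/// Reports whether `bytes` could be viewed as a `&str`.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```
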
Finally, since Unicode support requires bundling large Unicode data
tables, this crate exposes knobs to disable the compilation of those
data tables, which can be useful for shrinking binary size and reducing
compilation times. For details on how to do that, see the section on [crate
features](#crate-features).

# Syntax

The syntax supported in this crate is documented below.

Note that the regular expression parser and abstract syntax are exposed in
a separate crate, [`regex-syntax`](https://docs.rs/regex-syntax).

## Matching one character

<pre class="rust">
.             any character except new line (includes new line with s flag)
\d            digit (\p{Nd})
\D            not digit
\pN           One-letter name Unicode character class
\p{Greek}     Unicode character class (general category or script)
\PN           Negated one-letter name Unicode character class
\P{Greek}     negated Unicode character class (general category or script)
</pre>

### Character classes

<pre class="rust">
[xyz]         A character class matching either x, y or z (union).
[^xyz]        A character class matching any character except x, y and z.
[a-z]         A character class matching any character in range a-z.
[[:alpha:]]   ASCII character class ([A-Za-z])
[[:^alpha:]]  Negated ASCII character class ([^A-Za-z])
[x[^xyz]]     Nested/grouping character class (matching any character except y and z)
[a-y&&xyz]    Intersection (matching x or y)
[0-9&&[^4]]   Subtraction using intersection and negation (matching 0-9 except 4)
[0-9--4]      Direct subtraction (matching 0-9 except 4)
[a-g~~b-h]    Symmetric difference (matching `a` and `h` only)
[\[\]]        Escaping in character classes (matching [ or ])
</pre>

Any named character class may appear inside a bracketed `[...]` character
class. For example, `[\p{Greek}[:digit:]]` matches any Greek or ASCII
digit. `[\p{Greek}&&\pL]` matches Greek letters.

Precedence in character classes, from most binding to least:

1. Ranges: `a-cd` == `[a-c]d`
2. Union: `ab&&bc` == `[ab]&&[bc]`
3. Intersection: `^a-z&&b` == `^[a-z&&b]`
4. Negation

## Composites

<pre class="rust">
xy    concatenation (x followed by y)
x|y   alternation (x or y, prefer x)
</pre>

## Repetitions

<pre class="rust">
x*        zero or more of x (greedy)
x+        one or more of x (greedy)
x?        zero or one of x (greedy)
x*?       zero or more of x (ungreedy/lazy)
x+?       one or more of x (ungreedy/lazy)
x??       zero or one of x (ungreedy/lazy)
x{n,m}    at least n x and at most m x (greedy)
x{n,}     at least n x (greedy)
x{n}      exactly n x
x{n,m}?   at least n x and at most m x (ungreedy/lazy)
x{n,}?    at least n x (ungreedy/lazy)
x{n}?     exactly n x
</pre>

## Empty matches

<pre class="rust">
^     the beginning of text (or start-of-line with multi-line mode)
$     the end of text (or end-of-line with multi-line mode)
\A    only the beginning of text (even with multi-line mode enabled)
\z    only the end of text (even with multi-line mode enabled)
\b    a Unicode word boundary (\w on one side and \W, \A, or \z on other)
\B    not a Unicode word boundary
</pre>

## Grouping and flags

<pre class="rust">
(exp)          numbered capture group (indexed by opening parenthesis)
(?P&lt;name&gt;exp)  named (also numbered) capture group (allowed chars: [_0-9a-zA-Z.\[\]])
(?:exp)        non-capturing group
(?flags)       set flags within current group
(?flags:exp)   set flags for exp (non-capturing)
</pre>

Flags are each a single character. For example, `(?x)` sets the flag `x`
and `(?-x)` clears the flag `x`. Multiple flags can be set or cleared at
the same time: `(?xy)` sets both the `x` and `y` flags and `(?x-y)` sets
the `x` flag and clears the `y` flag.

All flags are by default disabled unless stated otherwise. They are:

<pre class="rust">
i     case-insensitive: letters match both upper and lower case
m     multi-line mode: ^ and $ match begin/end of line
s     allow . to match \n
U     swap the meaning of x* and x*?
u     Unicode support (enabled by default)
x     ignore whitespace and allow line comments (starting with `#`)
</pre>

Flags can be toggled within a pattern. Here's an example that matches
case-insensitively for the first part but case-sensitively for the second part:

```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(?i)a+(?-i)b+").unwrap();
let cap = re.captures("AaAaAbbBBBb").unwrap();
assert_eq!(&cap[0], "AaAaAbb");
# }
```

Notice that the `a+` matches either `a` or `A`, but the `b+` only matches
`b`.

Multi-line mode means `^` and `$` no longer match just at the beginning/end of
the input, but at the beginning/end of lines:

```rust
# use regex::Regex;
let re = Regex::new(r"(?m)^line \d+").unwrap();
let m = re.find("line one\nline 2\n").unwrap();
assert_eq!(m.as_str(), "line 2");
```

Note that `^` matches after new lines, even at the end of input:

```rust
# use regex::Regex;
let re = Regex::new(r"(?m)^").unwrap();
let m = re.find_iter("test\n").last().unwrap();
assert_eq!((m.start(), m.end()), (5, 5));
```

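The positions at which `(?m)^` can match in the two examples above can be
reproduced by hand: offset 0, plus the offset just past every `\n`, even when
that offset is the end of the input. The following sketch uses only the
standard library and a hypothetical helper name. It mimics the positions and
is not how the crate implements the assertion:

```rust
/// Returns every byte offset at which a line starts: offset 0, plus the
/// offset immediately after each `\n`.
fn line_start_offsets(text: &str) -> Vec<usize> {
    std::iter::once(0)
        .chain(text.match_indices('\n').map(|(i, _)| i + 1))
        .collect()
}
```
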
Here is an example that uses an ASCII word boundary instead of a Unicode
word boundary:

```rust
# use regex::Regex;
# fn main() {
let re = Regex::new(r"(?-u:\b).+(?-u:\b)").unwrap();
let cap = re.captures("$$abc$$").unwrap();
assert_eq!(&cap[0], "abc");
# }
```

## Escape sequences

<pre class="rust">
\*          literal *, works for any punctuation character: \.+*?()|[]{}^$
\a          bell (\x07)
\f          form feed (\x0C)
\t          horizontal tab
\n          new line
\r          carriage return
\v          vertical tab (\x0B)
\123        octal character code (up to three digits) (when enabled)
\x7F        hex character code (exactly two digits)
\x{10FFFF}  any hex character code corresponding to a Unicode code point
\u007F      hex character code (exactly four digits)
\u{7F}      any hex character code corresponding to a Unicode code point
\U0000007F  hex character code (exactly eight digits)
\U{7F}      any hex character code corresponding to a Unicode code point
</pre>

## Perl character classes (Unicode friendly)

These classes are based on the definitions provided in
[UTS#18](https://www.unicode.org/reports/tr18/#Compatibility_Properties):

<pre class="rust">
\d     digit (\p{Nd})
\D     not digit
\s     whitespace (\p{White_Space})
\S     not whitespace
\w     word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control})
\W     not word character
</pre>

## ASCII character classes

<pre class="rust">
[[:alnum:]]    alphanumeric ([0-9A-Za-z])
[[:alpha:]]    alphabetic ([A-Za-z])
[[:ascii:]]    ASCII ([\x00-\x7F])
[[:blank:]]    blank ([\t ])
[[:cntrl:]]    control ([\x00-\x1F\x7F])
[[:digit:]]    digits ([0-9])
[[:graph:]]    graphical ([!-~])
[[:lower:]]    lower case ([a-z])
[[:print:]]    printable ([ -~])
[[:punct:]]    punctuation ([!-/:-@\[-`{-~])
[[:space:]]    whitespace ([\t\n\v\f\r ])
[[:upper:]]    upper case ([A-Z])
[[:word:]]     word characters ([0-9A-Za-z_])
[[:xdigit:]]   hex digit ([0-9A-Fa-f])
</pre>

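Several of these classes line up with predicates in Rust's standard library,
which gives a quick way to sanity-check the ranges in the table. The sketch
below spells out two of the classes by hand (the function names are
hypothetical). The punctuation ranges agree with `u8::is_ascii_punctuation`
for every byte, whereas `[[:space:]]` includes vertical tab (`\x0B`), which
`u8::is_ascii_whitespace` does not:

```rust
/// Hand-written `[[:punct:]]`: the four ranges listed in the table.
fn is_posix_punct(b: u8) -> bool {
    matches!(b, b'!'..=b'/' | b':'..=b'@' | b'['..=b'`' | b'{'..=b'~')
}

/// Hand-written `[[:space:]]`: tab, new line, vertical tab, form feed,
/// carriage return and space.
fn is_posix_space(b: u8) -> bool {
    matches!(b, b'\t' | b'\n' | 0x0B | 0x0C | b'\r' | b' ')
}
```
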
# Crate features

By default, this crate tries pretty hard to make regex matching both as fast
as possible and as correct as it can be, within reason. This means that there
is a lot of code dedicated to performance, the handling of Unicode data and the
Unicode data itself. Overall, this leads to more dependencies, larger binaries
and longer compile times. This trade off may not be appropriate in all cases,
and indeed, even when all Unicode and performance features are disabled, one
is still left with a perfectly serviceable regex engine that will work well
in many cases.

This crate exposes a number of features for controlling that trade off. Some
of these features are strictly performance oriented, such that disabling them
won't result in a loss of functionality, but may result in worse performance.
Other features, such as the ones controlling the presence or absence of Unicode
data, can result in a loss of functionality. For example, if one disables the
`unicode-case` feature (described below), then compiling the regex `(?i)a`
will fail since Unicode case insensitivity is enabled by default. Callers
must use `(?i-u)a` instead to disable Unicode case folding. Stated
differently, enabling or disabling any of the features below can only add or
subtract from the total set of valid regular expressions. Enabling or disabling
a feature will never modify the match semantics of a regular expression.

All features below are enabled by default.

### Ecosystem features

* **std** -
  When enabled, this will cause `regex` to use the standard library. Currently,
  disabling this feature will always result in a compilation error. It is
  intended to add `alloc`-only support to regex in the future.

### Performance features

* **perf** -
  Enables all performance related features. This feature is enabled by default
  and will always cover all features that improve performance, even if more
  are added in the future.
* **perf-dfa** -
  Enables the use of a lazy DFA for matching. The lazy DFA is used to compile
  portions of a regex to a very fast DFA on an as-needed basis. This can
  result in substantial speedups, usually by an order of magnitude on large
  haystacks. The lazy DFA does not bring in any new dependencies, but it can
  make compile times longer.
* **perf-inline** -
  Enables the use of aggressive inlining inside match routines. This reduces
  the overhead of each match. The aggressive inlining, however, increases
  compile times and binary size.
* **perf-literal** -
  Enables the use of literal optimizations for speeding up matches. In some
  cases, literal optimizations can result in speedups of _several_ orders of
  magnitude. Disabling this drops the `aho-corasick` and `memchr` dependencies.
* **perf-cache** -
  This feature used to enable a faster internal cache at the cost of using
  additional dependencies, but this is no longer an option. A fast internal
  cache is now used unconditionally with no additional dependencies. This may
  change in the future.

### Unicode features

* **unicode** -
  Enables all Unicode features. This feature is enabled by default, and will
  always cover all Unicode features, even if more are added in the future.
* **unicode-age** -
  Provide the data for the
  [Unicode `Age` property](https://www.unicode.org/reports/tr44/tr44-24.html#Character_Age).
  This makes it possible to use classes like `\p{Age:6.0}` to refer to all
  codepoints first introduced in Unicode 6.0.
* **unicode-bool** -
  Provide the data for numerous Unicode boolean properties. The full list
  is not included here, but contains properties like `Alphabetic`, `Emoji`,
  `Lowercase`, `Math`, `Uppercase` and `White_Space`.
* **unicode-case** -
  Provide the data for case insensitive matching using
  [Unicode's "simple loose matches" specification](https://www.unicode.org/reports/tr18/#Simple_Loose_Matches).
* **unicode-gencat** -
  Provide the data for
  [Unicode general categories](https://www.unicode.org/reports/tr44/tr44-24.html#General_Category_Values).
  This includes, but is not limited to, `Decimal_Number`, `Letter`,
  `Math_Symbol`, `Number` and `Punctuation`.
* **unicode-perl** -
  Provide the data for supporting the Unicode-aware Perl character classes,
  corresponding to `\w`, `\s` and `\d`. This is also necessary for using
  Unicode-aware word boundary assertions. Note that if this feature is
  disabled, the `\s` and `\d` character classes are still available if the
  `unicode-bool` and `unicode-gencat` features are enabled, respectively.
* **unicode-script** -
  Provide the data for
  [Unicode scripts and script extensions](https://www.unicode.org/reports/tr24/).
  This includes, but is not limited to, `Arabic`, `Cyrillic`, `Hebrew`,
  `Latin` and `Thai`.
* **unicode-segment** -
  Provide the data necessary to provide the properties used to implement the
  [Unicode text segmentation algorithms](https://www.unicode.org/reports/tr29/).
  This enables using classes like `\p{gcb=Extend}`, `\p{wb=Katakana}` and
  `\p{sb=ATerm}`.

# Untrusted input

This crate can handle both untrusted regular expressions and untrusted
search text.

Untrusted regular expressions are handled by capping the size of a compiled
regular expression.
(See [`RegexBuilder::size_limit`](struct.RegexBuilder.html#method.size_limit).)
Without this, it would be trivial for an attacker to exhaust your system's
memory with expressions like `a{100}{100}{100}`.

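To see why such an expression is dangerous, note that nested counted
repetitions multiply: `a{100}{100}{100}` asks for 100 * 100 * 100 = 1,000,000
copies of `a`, although the pattern itself is only 16 characters long. The
arithmetic can be sketched as follows (the helper is hypothetical and says
nothing about how the crate represents repetitions internally):

```rust
/// Multiplies nested repetition counts, returning `None` on overflow.
fn expanded_copies(counts: &[u64]) -> Option<u64> {
    counts.iter().try_fold(1u64, |acc, &n| acc.checked_mul(n))
}
```
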
Untrusted search text is allowed because the matching engine(s) in this
crate have time complexity `O(mn)` (with `m ~ regex` and `n ~ search
text`), which means there's no way to cause exponential blow-up like with
some other regular expression engines. (We pay for this by disallowing
features like arbitrary look-ahead and backreferences.)

When a DFA is used, pathological cases with exponential state blow-up are
avoided by constructing the DFA lazily or in an "online" manner. Therefore,
at most one new state can be created for each byte of input. This satisfies
our time complexity guarantees, but can lead to memory growth
proportional to the size of the input. As a stopgap, the DFA is only
allowed to store a fixed number of states. When the limit is reached, its
states are wiped and it continues on, possibly duplicating previous work. If
the limit is reached too frequently, it gives up and hands control off to
another matching engine with fixed memory requirements.
(The DFA size limit can also be tweaked. See
[`RegexBuilder::dfa_size_limit`](struct.RegexBuilder.html#method.dfa_size_limit).)
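
The wipe-on-limit policy described above can be illustrated with a toy cache.
Everything here is hypothetical (`BoundedCache`, its fields and its keys are
invented for this sketch) and it is far simpler than the crate's actual lazy
DFA. It only shows how a fixed capacity trades repeated work for bounded
memory:

```rust
use std::collections::HashMap;

/// Toy cache that holds at most `capacity` states and forgets all of them
/// once that limit is reached.
struct BoundedCache {
    capacity: usize,
    states: HashMap<u64, usize>,
    wipes: usize,
}

impl BoundedCache {
    fn new(capacity: usize) -> BoundedCache {
        BoundedCache { capacity, states: HashMap::new(), wipes: 0 }
    }

    /// Returns the id cached for `key`, creating a new state if needed.
    /// When the cache is full, all states are wiped first, so earlier work
    /// may have to be redone later.
    fn get_or_insert(&mut self, key: u64) -> usize {
        if let Some(&id) = self.states.get(&key) {
            return id;
        }
        if self.states.len() >= self.capacity {
            self.states.clear();
            self.wipes += 1;
        }
        let id = self.states.len();
        self.states.insert(key, id);
        id
    }
}
```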
*/

#![deny(missing_docs)]
#![cfg_attr(feature = "pattern", feature(pattern))]
#![warn(missing_debug_implementations)]

#[cfg(not(feature = "std"))]
compile_error!("`std` feature is currently required to build this crate");

// To check README's example
// TODO: Re-enable this once the MSRV is 1.43 or greater.
// See: https://github.com/rust-lang/regex/issues/684
// See: https://github.com/rust-lang/regex/issues/685
// #[cfg(doctest)]
// doc_comment::doctest!("../README.md");

#[cfg(feature = "std")]
pub use crate::error::Error;
#[cfg(feature = "std")]
pub use crate::re_builder::set_unicode::*;
#[cfg(feature = "std")]
pub use crate::re_builder::unicode::*;
#[cfg(feature = "std")]
pub use crate::re_set::unicode::*;
#[cfg(feature = "std")]
pub use crate::re_unicode::{
    escape, CaptureLocations, CaptureMatches, CaptureNames, Captures,
    Locations, Match, Matches, NoExpand, Regex, Replacer, ReplacerRef, Split,
    SplitN, SubCaptureMatches,
};

/**
Match regular expressions on arbitrary bytes.

This module provides a nearly identical API to the one found in the
top-level of this crate. There are two important differences:

1. Matching is done on `&[u8]` instead of `&str`. Additionally, `Vec<u8>`
is used where `String` would have been used.
2. Unicode support can be disabled even when disabling it would result in
matching invalid UTF-8 bytes.

# Example: match null terminated string

This shows how to find all null-terminated strings in a slice of bytes:

```rust
# use regex::bytes::Regex;
let re = Regex::new(r"(?-u)(?P<cstr>[^\x00]+)\x00").unwrap();
let text = b"foo\x00bar\x00baz\x00";

// Extract all of the strings without the null terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> =
    re.captures_iter(text)
      .map(|c| c.name("cstr").unwrap().as_bytes())
      .collect();
assert_eq!(vec![&b"foo"[..], &b"bar"[..], &b"baz"[..]], cstrs);
```

# Example: selectively enable Unicode support

This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded
string (e.g., to extract a title from a Matroska file):

```rust
# use std::str;
# use regex::bytes::Regex;
let re = Regex::new(
    r"(?-u)\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))"
).unwrap();
let text = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65";
let caps = re.captures(text).unwrap();

// Notice that despite the `.*` at the end, it will only match valid UTF-8
// because Unicode mode was enabled with the `u` flag. Without the `u` flag,
// the `.*` would match the rest of the bytes.
let mat = caps.get(1).unwrap();
assert_eq!((7, 10), (mat.start(), mat.end()));

// If there was a match, Unicode mode guarantees that `title` is valid UTF-8.
let title = str::from_utf8(&caps[1]).unwrap();
assert_eq!("☃", title);
```

In general, if the Unicode flag is enabled in a capture group and that capture
is part of the overall match, then the capture is *guaranteed* to be valid
UTF-8.

# Syntax

The supported syntax is pretty much the same as the syntax for Unicode
regular expressions with a few changes that make sense for matching arbitrary
bytes:

1. The `u` flag can be disabled even when disabling it might cause the regex to
match invalid UTF-8. When the `u` flag is disabled, the regex is said to be in
"ASCII compatible" mode.
2. In ASCII compatible mode, neither Unicode scalar values nor Unicode
character classes are allowed.
3. In ASCII compatible mode, Perl character classes (`\w`, `\d` and `\s`)
revert to their typical ASCII definition. `\w` maps to `[[:word:]]`, `\d` maps
to `[[:digit:]]` and `\s` maps to `[[:space:]]`.
4. In ASCII compatible mode, word boundaries use the ASCII compatible `\w` to
determine whether a byte is a word byte or not.
5. Hexadecimal notation can be used to specify arbitrary bytes instead of
Unicode codepoints. For example, in ASCII compatible mode, `\xFF` matches the
literal byte `\xFF`, while in Unicode mode, `\xFF` is a Unicode codepoint that
matches its UTF-8 encoding of `\xC3\xBF`. Similarly for octal notation when
enabled.
6. In ASCII compatible mode, `.` matches any *byte* except for `\n`. When the
`s` flag is additionally enabled, `.` matches any byte.

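Item 5 can be checked with the standard library: encoding the codepoint
U+00FF as UTF-8 yields the two bytes `\xC3\xBF`, which is what `\xFF` matches
in Unicode mode. A minimal sketch, using a hypothetical helper name:

```rust
/// Returns the UTF-8 encoding of a single Unicode scalar value.
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}
```
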
# Performance

In general, one should expect performance on `&[u8]` to be roughly similar to
performance on `&str`.
*/
#[cfg(feature = "std")]
pub mod bytes {
    pub use crate::re_builder::bytes::*;
    pub use crate::re_builder::set_bytes::*;
    pub use crate::re_bytes::*;
    pub use crate::re_set::bytes::*;
}

mod backtrack;
mod compile;
#[cfg(feature = "perf-dfa")]
mod dfa;
mod error;
mod exec;
mod expand;
mod find_byte;
mod input;
mod literal;
#[cfg(feature = "pattern")]
mod pattern;
mod pikevm;
mod pool;
mod prog;
mod re_builder;
mod re_bytes;
mod re_set;
mod re_trait;
mod re_unicode;
mod sparse;
mod utf8;

/// The `internal` module exists to support suspicious activity, such as
/// testing different matching engines and supporting the `regex-debug` CLI
/// utility.
#[doc(hidden)]
#[cfg(feature = "std")]
pub mod internal {
    pub use crate::compile::Compiler;
    pub use crate::exec::{Exec, ExecBuilder};
    pub use crate::input::{Char, CharInput, Input, InputAt};
    pub use crate::literal::LiteralSearcher;
    pub use crate::prog::{EmptyLook, Inst, InstRanges, Program};
}
Chih-Hung Hsiehe42c5052020-04-16 10:44:21 -0700767}