regex
=====
A Rust library for parsing, compiling, and executing regular expressions. Its
syntax is similar to Perl-style regular expressions, but lacks a few features
like look-around and backreferences. In exchange, all searches execute in
linear time with respect to the size of the regular expression and search text.
Much of the syntax and implementation is inspired
by [RE2](https://github.com/google/re2).

### Documentation

[Module documentation with examples](https://docs.rs/regex).
The module documentation also includes a comprehensive description of the
syntax supported.

Documentation with examples for the various matching functions and iterators
can be found on the
[`Regex` type](https://docs.rs/regex/*/regex/struct.Regex.html).

### Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
regex = "1"
```

and this to your crate root (if you're using Rust 2015):

```rust
extern crate regex;
```

Here's a simple example that matches a date in YYYY-MM-DD format and prints the
year, month and day:

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"(?x)
(?P<year>\d{4})  # the year
-
(?P<month>\d{2}) # the month
-
(?P<day>\d{2})   # the day
").unwrap();
    let caps = re.captures("2010-03-14").unwrap();

    assert_eq!("2010", &caps["year"]);
    assert_eq!("03", &caps["month"]);
    assert_eq!("14", &caps["day"]);
}
```

If you have lots of dates in text that you'd like to iterate over, then it's
easy to adapt the above example with an iterator:

```rust
use regex::Regex;

const TO_SEARCH: &str = "
On 2010-03-14, foo happened. On 2014-10-14, bar happened.
";

fn main() {
    let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();

    for caps in re.captures_iter(TO_SEARCH) {
        // Note that all of the unwraps are actually OK for this regex
        // because the only way for the regex to match is if all of the
        // capture groups match. This is not true in general though!
        println!("year: {}, month: {}, day: {}",
                 caps.get(1).unwrap().as_str(),
                 caps.get(2).unwrap().as_str(),
                 caps.get(3).unwrap().as_str());
    }
}
```

This example outputs:

```text
year: 2010, month: 03, day: 14
year: 2014, month: 10, day: 14
```

### Usage: Avoid compiling the same regex in a loop

It is an anti-pattern to compile the same regular expression in a loop since
compilation is typically expensive. (It takes anywhere from a few microseconds
to a few **milliseconds** depending on the size of the regex.) Not only is
compilation itself expensive, but this also prevents optimizations that reuse
allocations internally to the matching engines.

In Rust, it can sometimes be a pain to pass regular expressions around if
they're used from inside a helper function. Instead, we recommend using the
[`lazy_static`](https://crates.io/crates/lazy_static) crate to ensure that
regular expressions are compiled exactly once.

For example:

```rust,ignore
// On Rust 2015, use `#[macro_use] extern crate lazy_static;` instead.
use lazy_static::lazy_static;
use regex::Regex;

fn some_helper_function(text: &str) -> bool {
    lazy_static! {
        // Compiled on first use, then reused by every later call.
        static ref RE: Regex = Regex::new("...").unwrap();
    }
    RE.is_match(text)
}
```

Specifically, in this example, the regex will be compiled when it is used for
the first time. On subsequent uses, it will reuse the previous compilation.

### Usage: match regular expressions on `&[u8]`

The main API of this crate (`regex::Regex`) requires the caller to pass a
`&str` for searching. In Rust, an `&str` is required to be valid UTF-8, which
means the main API can't be used for searching arbitrary bytes.

To match on arbitrary bytes, use the `regex::bytes::Regex` API. The API
is identical to the main API, except that it takes an `&[u8]` to search
on instead of an `&str`. It also permits disabling Unicode mode with `(?-u)`
even when the resulting pattern could match invalid UTF-8. For example,
`(?-u:.)` matches any *byte* except `\n`, while `.` in Unicode mode (the
default in both APIs) matches the UTF-8 encoding of any *Unicode scalar value*
except `\n`.

This example shows how to find all null-terminated strings in a slice of bytes:

```rust
use regex::bytes::Regex;

// `(?-u)` disables Unicode mode so that the class below matches raw bytes.
let re = Regex::new(r"(?-u)(?P<cstr>[^\x00]+)\x00").unwrap();
let text = b"foo\x00bar\x00baz\x00";

// Extract all of the strings without the null terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> =
    re.captures_iter(text)
      .map(|c| c.name("cstr").unwrap().as_bytes())
      .collect();
assert_eq!(vec![&b"foo"[..], &b"bar"[..], &b"baz"[..]], cstrs);
```

Notice here that, with Unicode mode disabled, the `[^\x00]+` will match any
*byte* except for `NUL`, including bytes that are not valid UTF-8. When using
the main API, `[^\x00]+` would instead match any valid UTF-8 sequence except
for `NUL`.

### Usage: match multiple regular expressions simultaneously

This demonstrates how to use a `RegexSet` to match multiple (possibly
overlapping) regular expressions in a single scan of the search text:

```rust
use regex::RegexSet;

let set = RegexSet::new(&[
    r"\w+",
    r"\d+",
    r"\pL+",
    r"foo",
    r"bar",
    r"barfoo",
    r"foobar",
]).unwrap();

// Iterate over and collect all of the matches.
let matches: Vec<_> = set.matches("foobar").into_iter().collect();
assert_eq!(matches, vec![0, 2, 3, 4, 6]);

// You can also test whether a particular regex matched:
let matches = set.matches("foobar");
assert!(!matches.matched(5));
assert!(matches.matched(6));
```

### Usage: enable SIMD optimizations

SIMD optimizations are enabled automatically on Rust stable 1.27 and newer.
For nightly versions of Rust, this requires a recent version with the SIMD
features stabilized.

### Usage: a regular expression parser

This repository contains a crate that provides a well-tested regular expression
parser, abstract syntax and a high-level intermediate representation for
convenient analysis. It provides no facilities for compilation or execution.
This may be useful if you're implementing your own regex engine or otherwise
need to do analysis on the syntax of a regular expression. It is otherwise not
recommended for general use.

[Documentation for `regex-syntax`.](https://docs.rs/regex-syntax)

### Crate features

This crate comes with several features that permit tweaking the trade-off
between binary size, compilation time and runtime performance. Users of this
crate can selectively disable Unicode tables, or choose which of the
optimizations performed by this crate to disable.

When all of these features are disabled, runtime match performance may be much
worse, but if you're matching on short strings, or if high performance isn't
necessary, then such a configuration is perfectly serviceable. To disable
all such features, use the following `Cargo.toml` dependency configuration:

```toml
[dependencies.regex]
version = "1.3"
default-features = false
# regex currently requires the standard library, so you must re-enable it.
features = ["std"]
```

This will reduce the dependency tree of `regex` down to a single crate
(`regex-syntax`).

The full set of features one can disable is listed
[in the "Crate features" section of the documentation](https://docs.rs/regex/*/#crate-features).

### Minimum Rust version policy

This crate's minimum supported `rustc` version is `1.28.0`.

The current **tentative** policy is that the minimum Rust version required
to use this crate can be increased in minor version updates. For example, if
regex 1.0 requires Rust 1.20.0, then regex 1.0.z for all values of `z` will
also require Rust 1.20.0 or newer. However, regex 1.y for `y > 0` may require a
newer minimum version of Rust.

In general, this crate will be conservative with respect to the minimum
supported version of Rust.

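The policy can be restated as a small predicate. This is only an illustrative sketch: the `Version` alias and the function `may_raise_msrv` are hypothetical and are not part of this crate.

```rust
/// Hypothetical (major, minor, patch) triple, e.g. (1, 3, 7) for regex 1.3.7.
type Version = (u64, u64, u64);

/// Returns true if, under the tentative policy, upgrading from `from` to `to`
/// may raise the minimum supported Rust version.
fn may_raise_msrv(from: Version, to: Version) -> bool {
    // Patch releases (same major and minor) keep the same minimum version;
    // only a newer release with a different major or minor may raise it.
    to > from && (to.0, to.1) != (from.0, from.1)
}
```

Under this reading, going from 1.0.0 to 1.0.5 cannot raise the minimum Rust version, while going from 1.0.5 to 1.1.0 may. A project tied to a fixed toolchain could, for example, use a tilde requirement such as `~1.3` in `Cargo.toml` so that only patch releases are picked up.
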
### License

This project is licensed under either of

 * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or
   http://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or
   http://opensource.org/licenses/MIT)

at your option.

The data in `regex-syntax/src/unicode_tables/` is licensed under the Unicode
License Agreement
([LICENSE-UNICODE](http://www.unicode.org/copyright.html#License)).