/*!
This crate provides a robust regular expression parser.

This crate defines two primary types:

* [`Ast`](ast/enum.Ast.html) is the abstract syntax of a regular expression.
  An abstract syntax corresponds to a *structured representation* of the
  concrete syntax of a regular expression, where the concrete syntax is the
  pattern string itself (e.g., `foo(bar)+`). Given some abstract syntax, it
  can be converted back to the original concrete syntax (modulo some details,
  like whitespace). To a first approximation, the abstract syntax is complex
  and difficult to analyze.
* [`Hir`](hir/struct.Hir.html) is the high-level intermediate representation
  ("HIR" or "high-level IR" for short) of a regular expression. It corresponds
  to an intermediate state of a regular expression that sits between the
  abstract syntax and the low-level compiled opcodes that are eventually
  responsible for executing a regular expression search. Given some high-level
  IR, it is not possible to produce the original concrete syntax (an
  equivalent concrete syntax can be produced, but it will likely bear little
  resemblance to the original pattern). To a first approximation, the
  high-level IR is simple and easy to analyze.

These two types come with conversion routines:

* An [`ast::parse::Parser`](ast/parse/struct.Parser.html) converts concrete
  syntax (a `&str`) to an [`Ast`](ast/enum.Ast.html).
* A [`hir::translate::Translator`](hir/translate/struct.Translator.html)
  converts an [`Ast`](ast/enum.Ast.html) to a [`Hir`](hir/struct.Hir.html).

As a convenience, the above two conversion routines are combined into one via
the top-level [`Parser`](struct.Parser.html) type. This `Parser` will first
convert your pattern to an `Ast` and then convert the `Ast` to an `Hir`.


# Example

This example shows how to parse a pattern string into its HIR:

```
use regex_syntax::Parser;
use regex_syntax::hir::{self, Hir};

let hir = Parser::new().parse("a|b").unwrap();
assert_eq!(hir, Hir::alternation(vec![
    Hir::literal(hir::Literal::Unicode('a')),
    Hir::literal(hir::Literal::Unicode('b')),
]));
```

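If you need more control, the same conversion can also be performed in two
explicit steps; a minimal sketch using the `ast::parse::Parser` and
`hir::translate::Translator` types described above:

```
use regex_syntax::Parser;
use regex_syntax::ast::parse::Parser as AstParser;
use regex_syntax::hir::translate::Translator;

// First parse the concrete syntax into an `Ast`, then translate the `Ast`
// into an `Hir`. The translator also takes the original pattern, which is
// used for error reporting.
let ast = AstParser::new().parse("a|b").unwrap();
let hir = Translator::new().translate("a|b", &ast).unwrap();

// The result is equivalent to going through the top-level `Parser`.
assert_eq!(hir, Parser::new().parse("a|b").unwrap());
```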

# Concrete syntax supported

The concrete syntax is documented as part of the public API of the
[`regex` crate](https://docs.rs/regex/%2A/regex/#syntax).


# Input safety

A key feature of this library is that it is safe to use with end-user-facing
input. This plays a significant role in the internal implementation. In
particular:

1. Parsers provide a `nest_limit` option that permits callers to control how
   deeply nested a regular expression is allowed to be. This makes it possible
   to do case analysis over an `Ast` or an `Hir` using recursion without
   worrying about stack overflow. (See the sketch below.)
2. Since relying on a particular stack size is brittle, this crate goes to
   great lengths to ensure that all interactions with both the `Ast` and the
   `Hir` do not use recursion. Namely, they use constant stack space and heap
   space proportional to the size of the original pattern string (in bytes).
   This includes the types' corresponding destructors. (One exception to this
   is literal extraction, but this will eventually get fixed.)

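For instance, a caller accepting untrusted patterns might cap the nesting
depth via the top-level `ParserBuilder`; a minimal sketch (the limit chosen
here is arbitrary):

```
use regex_syntax::ParserBuilder;

// Permit at most 32 levels of nesting; the default limit is larger.
let mut parser = ParserBuilder::new().nest_limit(32).build();

// A modestly nested pattern parses fine...
assert!(parser.parse("((((a))))").is_ok());

// ...but a pathologically nested one is rejected instead of risking a
// stack overflow in later analysis.
let deep = format!("{}a{}", "(".repeat(100), ")".repeat(100));
assert!(parser.parse(&deep).is_err());
```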

# Error reporting

The `Display` implementations on all `Error` types exposed in this library
provide nice human readable errors that are suitable for showing to end users
in a monospace font.

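For example, a syntax error can be rendered directly with `{}`; a small
sketch:

```
use regex_syntax::Parser;

// An unclosed group is a syntax error. The error's `Display` output shows
// the original pattern and points at the offending span.
let err = Parser::new().parse("(foo").unwrap_err();
println!("{}", err);
```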

# Literal extraction

This crate provides limited support for
[literal extraction from `Hir` values](hir/literal/struct.Literals.html).
Be warned that literal extraction currently uses recursion, and therefore uses
stack space proportional to the size of the `Hir`.

The purpose of literal extraction is to speed up searches. That is, if you
know a regular expression must match a prefix or suffix literal, then it is
often quicker to search for instances of that literal, and then confirm or deny
the match using the full regular expression engine. These optimizations are
done automatically in the `regex` crate.

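A small sketch of prefix extraction using the `Literals` type linked above
(the exact literals extracted are an implementation detail):

```
use regex_syntax::Parser;
use regex_syntax::hir::literal::Literals;

// Both alternates share useful prefix literals, so extraction yields a
// non-empty set that a search can scan for before running the full engine.
let hir = Parser::new().parse("foo|fox").unwrap();
let prefixes = Literals::prefixes(&hir);
assert!(!prefixes.literals().is_empty());
```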

# Crate features

An important feature provided by this crate is its Unicode support. This
includes things like case folding, boolean properties, general categories,
scripts and Unicode-aware support for the Perl classes `\w`, `\s` and `\d`.
However, a downside of this support is that it requires bundling several
Unicode data tables that are substantial in size.

A fair number of use cases do not require full Unicode support. For this
reason, this crate exposes a number of features to control which Unicode
data is available.

If a regular expression attempts to use a Unicode feature that is not available
because the corresponding crate feature was disabled, then translating that
regular expression to an `Hir` will return an error. (It is still possible to
construct an `Ast` for such a regular expression, since Unicode data is not
used until translation to an `Hir`.) Stated differently, enabling or disabling
any of the features below can only add or subtract from the total set of valid
regular expressions. Enabling or disabling a feature will never modify the
match semantics of a regular expression.

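For example, a build that only needs the Unicode-aware `\w`, `\s` and `\d`
classes might disable the default features and enable just `unicode-perl`
in its `Cargo.toml` (the version shown below is illustrative):

```toml
[dependencies]
regex-syntax = { version = "0.6", default-features = false, features = ["unicode-perl"] }
```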

The following features are available:

* **unicode** -
  Enables all Unicode features. This feature is enabled by default, and will
  always cover all Unicode features, even if more are added in the future.
* **unicode-age** -
  Provide the data for the
  [Unicode `Age` property](https://www.unicode.org/reports/tr44/tr44-24.html#Character_Age).
  This makes it possible to use classes like `\p{Age:6.0}` to refer to all
  codepoints first introduced in Unicode 6.0.
* **unicode-bool** -
  Provide the data for numerous Unicode boolean properties. The full list
  is not included here, but contains properties like `Alphabetic`, `Emoji`,
  `Lowercase`, `Math`, `Uppercase` and `White_Space`.
* **unicode-case** -
  Provide the data for case insensitive matching using
  [Unicode's "simple loose matches" specification](https://www.unicode.org/reports/tr18/#Simple_Loose_Matches).
* **unicode-gencat** -
  Provide the data for
  [Unicode general categories](https://www.unicode.org/reports/tr44/tr44-24.html#General_Category_Values).
  This includes, but is not limited to, `Decimal_Number`, `Letter`,
  `Math_Symbol`, `Number` and `Punctuation`.
* **unicode-perl** -
  Provide the data for supporting the Unicode-aware Perl character classes,
  corresponding to `\w`, `\s` and `\d`. This is also necessary for using
  Unicode-aware word boundary assertions. Note that if this feature is
  disabled, the `\s` and `\d` character classes are still available if the
  `unicode-bool` and `unicode-gencat` features are enabled, respectively.
* **unicode-script** -
  Provide the data for
  [Unicode scripts and script extensions](https://www.unicode.org/reports/tr24/).
  This includes, but is not limited to, `Arabic`, `Cyrillic`, `Hebrew`,
  `Latin` and `Thai`.
* **unicode-segment** -
  Provide the data for the properties used to implement the
  [Unicode text segmentation algorithms](https://www.unicode.org/reports/tr29/).
  This enables using classes like `\p{gcb=Extend}`, `\p{wb=Katakana}` and
  `\p{sb=ATerm}`.
*/

#![deny(missing_docs)]
#![forbid(unsafe_code)]

pub use error::{Error, Result};
pub use parser::{Parser, ParserBuilder};
pub use unicode::UnicodeWordError;

pub mod ast;
mod either;
mod error;
pub mod hir;
mod parser;
mod unicode;
mod unicode_tables;
pub mod utf8;

/// Escapes all regular expression meta characters in `text`.
///
/// The string returned may be safely used as a literal in a regular
/// expression.
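///
/// # Example
///
/// A short, illustrative example:
///
/// ```
/// assert_eq!(regex_syntax::escape("a.b?"), r"a\.b\?");
/// ```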
pub fn escape(text: &str) -> String {
    let mut quoted = String::new();
    escape_into(text, &mut quoted);
    quoted
}

/// Escapes all meta characters in `text` and writes the result into `buf`.
///
/// This will append escape characters into the given buffer. The characters
/// that are appended are safe to use as a literal in a regular expression.
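///
/// # Example
///
/// A small sketch showing that the escaped text is appended to `buf`:
///
/// ```
/// let mut buf = String::from("^");
/// regex_syntax::escape_into("a+b", &mut buf);
/// assert_eq!(buf, r"^a\+b");
/// ```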
pub fn escape_into(text: &str, buf: &mut String) {
    buf.reserve(text.len());
    for c in text.chars() {
        if is_meta_character(c) {
            buf.push('\\');
        }
        buf.push(c);
    }
}

/// Returns true if the given character has significance in a regex.
///
/// These are the only characters that are allowed to be escaped, with one
/// exception: an ASCII space character may be escaped when extended mode (with
/// the `x` flag) is enabled. In particular, `is_meta_character(' ')` returns
/// `false`.
///
/// Note that the set of characters for which this function returns `true` or
/// `false` is fixed and won't change in a semver compatible release.
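///
/// # Example
///
/// A few illustrative cases, mirroring the set matched below:
///
/// ```
/// assert!(regex_syntax::is_meta_character('?'));
/// assert!(regex_syntax::is_meta_character('-'));
/// assert!(!regex_syntax::is_meta_character('a'));
/// assert!(!regex_syntax::is_meta_character(' '));
/// ```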
pub fn is_meta_character(c: char) -> bool {
    match c {
        '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']' | '{'
        | '}' | '^' | '$' | '#' | '&' | '-' | '~' => true,
        _ => false,
    }
}

/// Returns true if and only if the given character is a Unicode word
/// character.
///
/// A Unicode word character is defined by
/// [UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties).
/// In particular, a character
/// is considered a word character if it is in either of the `Alphabetic` or
/// `Join_Control` properties, or is in one of the `Decimal_Number`, `Mark`
/// or `Connector_Punctuation` general categories.
///
/// # Panics
///
/// If the `unicode-perl` feature is not enabled, then this function panics.
/// For this reason, it is recommended that callers use
/// [`try_is_word_character`](fn.try_is_word_character.html)
/// instead.
pub fn is_word_character(c: char) -> bool {
    try_is_word_character(c).expect("unicode-perl feature must be enabled")
}

/// Returns true if and only if the given character is a Unicode word
/// character.
///
/// A Unicode word character is defined by
/// [UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties).
/// In particular, a character
/// is considered a word character if it is in either of the `Alphabetic` or
/// `Join_Control` properties, or is in one of the `Decimal_Number`, `Mark`
/// or `Connector_Punctuation` general categories.
///
/// # Errors
///
/// If the `unicode-perl` feature is not enabled, then this function always
/// returns an error.
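///
/// # Example
///
/// A small sketch (this assumes the default `unicode-perl` feature is
/// enabled):
///
/// ```
/// assert!(regex_syntax::try_is_word_character('β').unwrap());
/// assert!(!regex_syntax::try_is_word_character('-').unwrap());
/// ```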
pub fn try_is_word_character(
    c: char,
) -> std::result::Result<bool, UnicodeWordError> {
    unicode::is_word_character(c)
}

/// Returns true if and only if the given character is an ASCII word character.
///
/// An ASCII word character is defined by the following character class:
/// `[_0-9a-zA-Z]`.
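///
/// # Example
///
/// A couple of illustrative cases:
///
/// ```
/// assert!(regex_syntax::is_word_byte(b'_'));
/// assert!(!regex_syntax::is_word_byte(b' '));
/// ```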
pub fn is_word_byte(c: u8) -> bool {
    match c {
        b'_' | b'0'..=b'9' | b'a'..=b'z' | b'A'..=b'Z' => true,
        _ => false,
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn escape_meta() {
        assert_eq!(
            escape(r"\.+*?()|[]{}^$#&-~"),
            r"\\\.\+\*\?\(\)\|\[\]\{\}\^\$\#\&\-\~".to_string()
        );
    }

    #[test]
    fn word_byte() {
        assert!(is_word_byte(b'a'));
        assert!(!is_word_byte(b'-'));
    }

    #[test]
    #[cfg(feature = "unicode-perl")]
    fn word_char() {
        assert!(is_word_character('a'), "ASCII");
        assert!(is_word_character('à'), "Latin-1");
        assert!(is_word_character('β'), "Greek");
        assert!(is_word_character('\u{11011}'), "Brahmi (Unicode 6.0)");
        assert!(is_word_character('\u{11611}'), "Modi (Unicode 7.0)");
        assert!(is_word_character('\u{11711}'), "Ahom (Unicode 8.0)");
        assert!(is_word_character('\u{17828}'), "Tangut (Unicode 9.0)");
        assert!(is_word_character('\u{1B1B1}'), "Nushu (Unicode 10.0)");
        assert!(is_word_character('\u{16E40}'), "Medefaidrin (Unicode 11.0)");
        assert!(!is_word_character('-'));
        assert!(!is_word_character('☃'));
    }

    #[test]
    #[should_panic]
    #[cfg(not(feature = "unicode-perl"))]
    fn word_char_disabled_panic() {
        assert!(is_word_character('a'));
    }

    #[test]
    #[cfg(not(feature = "unicode-perl"))]
    fn word_char_disabled_error() {
        assert!(try_is_word_character('a').is_err());
    }
}