/*!
This crate provides a robust regular expression parser.

This crate defines two primary types:

* [`Ast`](ast/enum.Ast.html) is the abstract syntax of a regular expression.
  An abstract syntax corresponds to a *structured representation* of the
  concrete syntax of a regular expression, where the concrete syntax is the
  pattern string itself (e.g., `foo(bar)+`). Given some abstract syntax, it
  can be converted back to the original concrete syntax (modulo some details,
  like whitespace). To a first approximation, the abstract syntax is complex
  and difficult to analyze.
* [`Hir`](hir/struct.Hir.html) is the high-level intermediate representation
  ("HIR" or "high-level IR" for short) of a regular expression. It corresponds
  to an intermediate state of a regular expression that sits between the
  abstract syntax and the low-level compiled opcodes that are eventually
  responsible for executing a regular expression search. Given some high-level
  IR, it is not possible to produce the original concrete syntax (an
  equivalent concrete syntax can be produced, but it will likely bear little
  resemblance to the original pattern). To a first approximation, the
  high-level IR is simple and easy to analyze.

These two types come with conversion routines:

* An [`ast::parse::Parser`](ast/parse/struct.Parser.html) converts concrete
  syntax (a `&str`) to an [`Ast`](ast/enum.Ast.html).
* A [`hir::translate::Translator`](hir/translate/struct.Translator.html)
  converts an [`Ast`](ast/enum.Ast.html) to a [`Hir`](hir/struct.Hir.html).

As a convenience, the above two conversion routines are combined into one via
the top-level [`Parser`](struct.Parser.html) type. This `Parser` will first
convert your pattern to an `Ast` and then convert the `Ast` to an `Hir`.


# Example

This example shows how to parse a pattern string into its HIR:

```
use regex_syntax::Parser;
use regex_syntax::hir::{self, Hir};

let hir = Parser::new().parse("a|b").unwrap();
assert_eq!(hir, Hir::alternation(vec![
    Hir::literal(hir::Literal::Unicode('a')),
    Hir::literal(hir::Literal::Unicode('b')),
]));
```

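If you need more control, the same conversion can also be performed in two
explicit steps; a minimal sketch using the `ast::parse::Parser` and
`hir::translate::Translator` types described above:

```
use regex_syntax::Parser;
use regex_syntax::ast::parse::Parser as AstParser;
use regex_syntax::hir::translate::Translator;

// First parse the concrete syntax into an `Ast`, then translate the `Ast`
// into an `Hir`. The translator also takes the original pattern, which is
// used for error reporting.
let ast = AstParser::new().parse("a|b").unwrap();
let hir = Translator::new().translate("a|b", &ast).unwrap();

// The result is equivalent to going through the top-level `Parser`.
assert_eq!(hir, Parser::new().parse("a|b").unwrap());
```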

# Concrete syntax supported

The concrete syntax is documented as part of the public API of the
[`regex` crate](https://docs.rs/regex/%2A/regex/#syntax).


# Input safety

A key feature of this library is that it is safe to use with end-user-facing
input. This plays a significant role in the internal implementation. In
particular:

1. Parsers provide a `nest_limit` option that permits callers to control how
   deeply nested a regular expression is allowed to be. This makes it possible
   to do case analysis over an `Ast` or an `Hir` using recursion without
   worrying about stack overflow. (See the sketch below.)
2. Since relying on a particular stack size is brittle, this crate goes to
   great lengths to ensure that all interactions with both the `Ast` and the
   `Hir` do not use recursion. Namely, they use constant stack space and heap
   space proportional to the size of the original pattern string (in bytes).
   This includes the types' corresponding destructors. (One exception to this
   is literal extraction, but this will eventually get fixed.)

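For instance, a caller accepting untrusted patterns might cap the nesting
depth via the top-level `ParserBuilder`; a minimal sketch (the limit chosen
here is arbitrary):

```
use regex_syntax::ParserBuilder;

// Permit at most 32 levels of nesting; the default limit is larger.
let mut parser = ParserBuilder::new().nest_limit(32).build();

// A modestly nested pattern parses fine...
assert!(parser.parse("((((a))))").is_ok());

// ...but a pathologically nested one is rejected instead of risking a
// stack overflow in later analysis.
let deep = format!("{}a{}", "(".repeat(100), ")".repeat(100));
assert!(parser.parse(&deep).is_err());
```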

# Error reporting

The `Display` implementations on all `Error` types exposed in this library
provide nice human readable errors that are suitable for showing to end users
in a monospace font.

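For example, a syntax error can be rendered directly with `{}`; a small
sketch:

```
use regex_syntax::Parser;

// An unclosed group is a syntax error. The error's `Display` output shows
// the original pattern and points at the offending span.
let err = Parser::new().parse("(foo").unwrap_err();
println!("{}", err);
```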

# Literal extraction

This crate provides limited support for
[literal extraction from `Hir` values](hir/literal/struct.Literals.html).
Be warned that literal extraction currently uses recursion, and therefore uses
stack space proportional to the size of the `Hir`.

The purpose of literal extraction is to speed up searches. That is, if you
know a regular expression must match a prefix or suffix literal, then it is
often quicker to search for instances of that literal, and then confirm or deny
the match using the full regular expression engine. These optimizations are
done automatically in the `regex` crate.

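A small sketch of prefix extraction using the `Literals` type linked above
(the exact literals extracted are an implementation detail):

```
use regex_syntax::Parser;
use regex_syntax::hir::literal::Literals;

// Both alternates share useful prefix literals, so extraction yields a
// non-empty set that a search can scan for before running the full engine.
let hir = Parser::new().parse("foo|fox").unwrap();
let prefixes = Literals::prefixes(&hir);
assert!(!prefixes.literals().is_empty());
```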

# Crate features

An important feature provided by this crate is its Unicode support. This
includes things like case folding, boolean properties, general categories,
scripts and Unicode-aware support for the Perl classes `\w`, `\s` and `\d`.
However, a downside of this support is that it requires bundling several
Unicode data tables that are substantial in size.

A fair number of use cases do not require full Unicode support. For this
reason, this crate exposes a number of features to control which Unicode
data is available.

If a regular expression attempts to use a Unicode feature that is not available
because the corresponding crate feature was disabled, then translating that
regular expression to an `Hir` will return an error. (It is still possible to
construct an `Ast` for such a regular expression, since Unicode data is not
used until translation to an `Hir`.) Stated differently, enabling or disabling
any of the features below can only add or subtract from the total set of valid
regular expressions. Enabling or disabling a feature will never modify the
match semantics of a regular expression.

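For example, a build that only needs the Unicode-aware `\w`, `\s` and `\d`
classes might disable the default features and enable just `unicode-perl`
in its `Cargo.toml` (the version shown below is illustrative):

```toml
[dependencies]
regex-syntax = { version = "0.6", default-features = false, features = ["unicode-perl"] }
```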

The following features are available:

* **unicode** -
  Enables all Unicode features. This feature is enabled by default, and will
  always cover all Unicode features, even if more are added in the future.
* **unicode-age** -
  Provide the data for the
  [Unicode `Age` property](https://www.unicode.org/reports/tr44/tr44-24.html#Character_Age).
  This makes it possible to use classes like `\p{Age:6.0}` to refer to all
  codepoints first introduced in Unicode 6.0.
* **unicode-bool** -
  Provide the data for numerous Unicode boolean properties. The full list
  is not included here, but contains properties like `Alphabetic`, `Emoji`,
  `Lowercase`, `Math`, `Uppercase` and `White_Space`.
* **unicode-case** -
  Provide the data for case insensitive matching using
  [Unicode's "simple loose matches" specification](https://www.unicode.org/reports/tr18/#Simple_Loose_Matches).
* **unicode-gencat** -
  Provide the data for
  [Unicode general categories](https://www.unicode.org/reports/tr44/tr44-24.html#General_Category_Values).
  This includes, but is not limited to, `Decimal_Number`, `Letter`,
  `Math_Symbol`, `Number` and `Punctuation`.
* **unicode-perl** -
  Provide the data for supporting the Unicode-aware Perl character classes,
  corresponding to `\w`, `\s` and `\d`. This is also necessary for using
  Unicode-aware word boundary assertions. Note that if this feature is
  disabled, the `\s` and `\d` character classes are still available if the
  `unicode-bool` and `unicode-gencat` features are enabled, respectively.
* **unicode-script** -
  Provide the data for
  [Unicode scripts and script extensions](https://www.unicode.org/reports/tr24/).
  This includes, but is not limited to, `Arabic`, `Cyrillic`, `Hebrew`,
  `Latin` and `Thai`.
* **unicode-segment** -
  Provide the data for the properties used to implement the
  [Unicode text segmentation algorithms](https://www.unicode.org/reports/tr29/).
  This enables using classes like `\p{gcb=Extend}`, `\p{wb=Katakana}` and
  `\p{sb=ATerm}`.
*/

#![deny(missing_docs)]
#![forbid(unsafe_code)]

pub use error::{Error, Result};
pub use parser::{Parser, ParserBuilder};
pub use unicode::UnicodeWordError;

pub mod ast;
mod either;
mod error;
pub mod hir;
mod parser;
mod unicode;
mod unicode_tables;
pub mod utf8;

/// Escapes all regular expression meta characters in `text`.
///
/// The string returned may be safely used as a literal in a regular
/// expression.
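///
/// # Example
///
/// A short, illustrative example:
///
/// ```
/// assert_eq!(regex_syntax::escape("a.b?"), r"a\.b\?");
/// ```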
pub fn escape(text: &str) -> String {
    let mut quoted = String::new();
    escape_into(text, &mut quoted);
    quoted
}

/// Escapes all meta characters in `text` and writes the result into `buf`.
///
/// This will append escape characters into the given buffer. The characters
/// that are appended are safe to use as a literal in a regular expression.
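///
/// # Example
///
/// A small sketch showing that the escaped text is appended to `buf`:
///
/// ```
/// let mut buf = String::from("^");
/// regex_syntax::escape_into("a+b", &mut buf);
/// assert_eq!(buf, r"^a\+b");
/// ```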
pub fn escape_into(text: &str, buf: &mut String) {
    buf.reserve(text.len());
    for c in text.chars() {
        if is_meta_character(c) {
            buf.push('\\');
        }
        buf.push(c);
    }
}

/// Returns true if the given character has significance in a regex.
///
/// These are the only characters that are allowed to be escaped, with one
/// exception: an ASCII space character may be escaped when extended mode (with
/// the `x` flag) is enabled. In particular, `is_meta_character(' ')` returns
/// `false`.
///
/// Note that the set of characters for which this function returns `true` or
/// `false` is fixed and won't change in a semver compatible release.
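///
/// # Example
///
/// A few illustrative cases, mirroring the set matched below:
///
/// ```
/// assert!(regex_syntax::is_meta_character('?'));
/// assert!(regex_syntax::is_meta_character('-'));
/// assert!(!regex_syntax::is_meta_character('a'));
/// assert!(!regex_syntax::is_meta_character(' '));
/// ```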
pub fn is_meta_character(c: char) -> bool {
    match c {
        '\\' | '.' | '+' | '*' | '?' | '(' | ')' | '|' | '[' | ']' | '{'
        | '}' | '^' | '$' | '#' | '&' | '-' | '~' => true,
        _ => false,
    }
}

/// Returns true if and only if the given character is a Unicode word
/// character.
///
/// A Unicode word character is defined by
/// [UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties).
/// In particular, a character
/// is considered a word character if it is in either of the `Alphabetic` or
/// `Join_Control` properties, or is in one of the `Decimal_Number`, `Mark`
/// or `Connector_Punctuation` general categories.
///
/// # Panics
///
/// If the `unicode-perl` feature is not enabled, then this function panics.
/// For this reason, it is recommended that callers use
/// [`try_is_word_character`](fn.try_is_word_character.html)
/// instead.
pub fn is_word_character(c: char) -> bool {
    try_is_word_character(c).expect("unicode-perl feature must be enabled")
}

/// Returns true if and only if the given character is a Unicode word
/// character.
///
/// A Unicode word character is defined by
/// [UTS#18 Annex C](http://unicode.org/reports/tr18/#Compatibility_Properties).
/// In particular, a character
/// is considered a word character if it is in either of the `Alphabetic` or
/// `Join_Control` properties, or is in one of the `Decimal_Number`, `Mark`
/// or `Connector_Punctuation` general categories.
///
/// # Errors
///
/// If the `unicode-perl` feature is not enabled, then this function always
/// returns an error.
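///
/// # Example
///
/// A small sketch (this assumes the default `unicode-perl` feature is
/// enabled):
///
/// ```
/// assert!(regex_syntax::try_is_word_character('β').unwrap());
/// assert!(!regex_syntax::try_is_word_character('-').unwrap());
/// ```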
pub fn try_is_word_character(
    c: char,
) -> std::result::Result<bool, UnicodeWordError> {
    unicode::is_word_character(c)
}

/// Returns true if and only if the given character is an ASCII word character.
///
/// An ASCII word character is defined by the following character class:
/// `[_0-9a-zA-Z]`.
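///
/// # Example
///
/// A couple of illustrative cases:
///
/// ```
/// assert!(regex_syntax::is_word_byte(b'_'));
/// assert!(!regex_syntax::is_word_byte(b' '));
/// ```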
pub fn is_word_byte(c: u8) -> bool {
    match c {
        b'_' | b'0'..=b'9' | b'a'..=b'z' | b'A'..=b'Z' => true,
        _ => false,
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn escape_meta() {
        assert_eq!(
            escape(r"\.+*?()|[]{}^$#&-~"),
            r"\\\.\+\*\?\(\)\|\[\]\{\}\^\$\#\&\-\~".to_string()
        );
    }

    #[test]
    fn word_byte() {
        assert!(is_word_byte(b'a'));
        assert!(!is_word_byte(b'-'));
    }

    #[test]
    #[cfg(feature = "unicode-perl")]
    fn word_char() {
        assert!(is_word_character('a'), "ASCII");
        assert!(is_word_character('à'), "Latin-1");
        assert!(is_word_character('β'), "Greek");
        assert!(is_word_character('\u{11011}'), "Brahmi (Unicode 6.0)");
        assert!(is_word_character('\u{11611}'), "Modi (Unicode 7.0)");
        assert!(is_word_character('\u{11711}'), "Ahom (Unicode 8.0)");
        assert!(is_word_character('\u{17828}'), "Tangut (Unicode 9.0)");
        assert!(is_word_character('\u{1B1B1}'), "Nushu (Unicode 10.0)");
        assert!(is_word_character('\u{16E40}'), "Medefaidrin (Unicode 11.0)");
        assert!(!is_word_character('-'));
        assert!(!is_word_character('☃'));
    }

    #[test]
    #[should_panic]
    #[cfg(not(feature = "unicode-perl"))]
    fn word_char_disabled_panic() {
        assert!(is_word_character('a'));
    }

    #[test]
    #[cfg(not(feature = "unicode-perl"))]
    fn word_char_disabled_error() {
        assert!(try_is_word_character('a').is_err());
    }
}