Blame - src/lib.rs - platform/external/rust/crates/regex

blob: e0a0975f5206fbecc66bcc3daa17246827954131

Chih-Hung Hsieh	e42c505	2020-04-16 10:44:21 -0700	[diff] [blame]	1	/*!
				2	This crate provides a library for parsing, compiling, and executing regular
				3	expressions. Its syntax is similar to Perl-style regular expressions, but lacks
				4	a few features like look around and backreferences. In exchange, all searches
				5	execute in linear time with respect to the size of the regular expression and
				6	search text.
				7
				8	This crate's documentation provides some simple examples, describes
				9	[Unicode support](#unicode) and exhaustively lists the
				10	[supported syntax](#syntax).
				11
				12	For more specific details on the API for regular expressions, please see the
				13	documentation for the [`Regex`](struct.Regex.html) type.
				14
				15	# Usage
				16
				17	This crate is [on crates.io](https://crates.io/crates/regex) and can be
				18	used by adding `regex` to your dependencies in your project's `Cargo.toml`.
				19
				20	```toml
				21	[dependencies]
				22	regex = "1"
				23	```
				24
				25	If you're using Rust 2015, then you'll also need to add it to your crate root:
				26
				27	```rust
				28	extern crate regex;
				29	```
				30
				31	# Example: find a date
				32
				33	General use of regular expressions in this package involves compiling an
				34	expression and then using it to search, split or replace text. For example,
				35	to confirm that some text resembles a date:
				36
				37	```rust
				38	use regex::Regex;
				39	let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap();
				40	assert!(re.is_match("2014-01-01"));
				41	```
				42
				43	Notice the use of the `^` and `$` anchors. In this crate, every expression
				44	is executed with an implicit `.*?` at the beginning and end, which allows
				45	it to match anywhere in the text. Anchors can be used to ensure that the
				46	full text matches an expression.
				47
				48	This example also demonstrates the utility of
				49	[raw strings](https://doc.rust-lang.org/stable/reference/tokens.html#raw-string-literals)
				50	in Rust, which
				51	are just like regular strings except they are prefixed with an `r` and do
				52	not process any escape sequences. For example, `"\\d"` is the same
				53	expression as `r"\d"`.
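The equivalence can be checked with the standard library alone. This small sketch uses two hypothetical helper functions that return the same pattern text written both ways:

```rust
/// The pattern written as a regular string literal: the backslash must be
/// doubled so that the string contains a single `\`.
fn escaped_digit_pattern() -> &'static str {
    "\\d{4}"
}

/// The same pattern written as a raw string literal: no escape processing
/// happens, so the backslash is written once.
fn raw_digit_pattern() -> &'static str {
    r"\d{4}"
}
```

Both functions return the same five characters, so either form can be passed to `Regex::new`. The raw form is usually easier to read.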
				54
				55	# Example: avoid compiling the same regex in a loop
				56
				57	It is an anti-pattern to compile the same regular expression in a loop
				58	since compilation is typically expensive. (It takes anywhere from a few
				59	microseconds to a few milliseconds depending on the size of the
				60	regex.) Not only is compilation itself expensive, but this also prevents
				61	optimizations that reuse allocations internally to the matching engines.
				62
				63	In Rust, it can sometimes be a pain to pass regular expressions around if
				64	they're used from inside a helper function. Instead, we recommend using the
				65	[`lazy_static`](https://crates.io/crates/lazy_static) crate to ensure that
				66	regular expressions are compiled exactly once.
				67
				68	For example:
				69
				70	```rust
				71	#[macro_use] extern crate lazy_static;
				72	extern crate regex;
				73
				74	use regex::Regex;
				75
				76	fn some_helper_function(text: &str) -> bool {
				77	    lazy_static! {
				78	        static ref RE: Regex = Regex::new("...").unwrap();
				79	    }
				80	    RE.is_match(text)
				81	}
				82
				83	fn main() {}
				84	```
				85
				86	Specifically, in this example, the regex will be compiled when it is used for
				87	the first time. On subsequent uses, it will reuse the previous compilation.
				88
				89	# Example: iterating over capture groups
				90
				91	This crate provides convenient iterators for matching an expression
				92	repeatedly against a search string to find successive non-overlapping
				93	matches. For example, to find all dates in a string and be able to access
				94	them by their component pieces:
				95
				96	```rust
				97	# extern crate regex; use regex::Regex;
				98	# fn main() {
				99	let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
				100	let text = "2012-03-14, 2013-01-01 and 2014-07-05";
				101	for cap in re.captures_iter(text) {
				102	    println!("Month: {} Day: {} Year: {}", &cap[2], &cap[3], &cap[1]);
				103	}
				104	// Output:
				105	// Month: 03 Day: 14 Year: 2012
				106	// Month: 01 Day: 01 Year: 2013
				107	// Month: 07 Day: 05 Year: 2014
				108	# }
				109	```
				110
				111	Notice that the year is in the capture group indexed at `1`. This is
				112	because the entire match is stored in the capture group at index `0`.
				113
				114	# Example: replacement with named capture groups
				115
				116	Building on the previous example, perhaps we'd like to rearrange the date
				117	formats. This can be done with text replacement. But to make the code
				118	clearer, we can name our capture groups and use those names as variables
				119	in our replacement text:
				120
				121	```rust
				122	# extern crate regex; use regex::Regex;
				123	# fn main() {
				124	let re = Regex::new(r"(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})").unwrap();
				125	let before = "2012-03-14, 2013-01-01 and 2014-07-05";
				126	let after = re.replace_all(before, "$m/$d/$y");
				127	assert_eq!(after, "03/14/2012, 01/01/2013 and 07/05/2014");
				128	# }
				129	```
				130
				131	The `replace` methods are actually polymorphic in the replacement, which
				132	provides more flexibility than is seen here. (See the documentation for
				133	`Regex::replace` for more details.)
				134
				135	Note that if your regex gets complicated, you can use the `x` flag to
				136	enable insignificant whitespace mode, which also lets you write comments:
				137
				138	```rust
				139	# extern crate regex; use regex::Regex;
				140	# fn main() {
				141	let re = Regex::new(r"(?x)
				142	    (?P<y>\d{4}) # the year
				143	    -
				144	    (?P<m>\d{2}) # the month
				145	    -
				146	    (?P<d>\d{2}) # the day
				147	").unwrap();
				148	let before = "2012-03-14, 2013-01-01 and 2014-07-05";
				149	let after = re.replace_all(before, "$m/$d/$y");
				150	assert_eq!(after, "03/14/2012, 01/01/2013 and 07/05/2014");
				151	# }
				152	```
				153
				154	If you wish to match against whitespace in this mode, you can still use `\s`,
Haibo Huang	49cbe5f	2020-05-28 20:14:24 -0700	[diff] [blame^]	155	`\n`, `\t`, etc. For escaping a single space character, you can escape it
				156	directly with `\ `, use its hex character code `\x20` or temporarily disable
				157	the `x` flag, e.g., `(?-x: )`.
Chih-Hung Hsieh	e42c505	2020-04-16 10:44:21 -0700	[diff] [blame]	158
				159	# Example: match multiple regular expressions simultaneously
				160
				161	This demonstrates how to use a `RegexSet` to match multiple (possibly
				162	overlapping) regular expressions in a single scan of the search text:
				163
				164	```rust
				165	use regex::RegexSet;
				166
				167	let set = RegexSet::new(&[
				168	    r"\w+",
				169	    r"\d+",
				170	    r"\pL+",
				171	    r"foo",
				172	    r"bar",
				173	    r"barfoo",
				174	    r"foobar",
				175	]).unwrap();
				176
				177	// Iterate over and collect all of the matches.
				178	let matches: Vec<_> = set.matches("foobar").into_iter().collect();
				179	assert_eq!(matches, vec![0, 2, 3, 4, 6]);
				180
				181	// You can also test whether a particular regex matched:
				182	let matches = set.matches("foobar");
				183	assert!(!matches.matched(5));
				184	assert!(matches.matched(6));
				185	```
				186
				187	# Pay for what you use
				188
				189	With respect to searching text with a regular expression, there are three
				190	questions that can be asked:
				191
				192	1. Does the text match this expression?
				193	2. If so, where does it match?
				194	3. Where did the capturing groups match?
				195
				196	Generally speaking, this crate could provide a function to answer only #3,
				197	which would subsume #1 and #2 automatically. However, it can be significantly
				198	more expensive to compute the location of capturing group matches, so it's best
				199	not to do it if you don't need to.
				200
				201	Therefore, only use what you need. For example, don't use `find` if you
				202	only need to test if an expression matches a string. (Use `is_match`
				203	instead.)
				204
				205	# Unicode
				206
				207	This implementation executes regular expressions only on valid UTF-8
				208	while exposing match locations as byte indices into the search string. (To
				209	relax this restriction, use the [`bytes`](bytes/index.html) sub-module.)
				210
				211	Only simple case folding is supported. Namely, when matching
				212	case-insensitively, the characters are first mapped using the "simple" case
				213	folding rules defined by Unicode.
				214
				215	Regular expressions themselves are only interpreted as a sequence of
				216	Unicode scalar values. This means you can use Unicode characters directly
				217	in your expression:
				218
				219	```rust
				220	# extern crate regex; use regex::Regex;
				221	# fn main() {
				222	let re = Regex::new(r"(?i)Δ+").unwrap();
				223	let mat = re.find("ΔδΔ").unwrap();
				224	assert_eq!((mat.start(), mat.end()), (0, 6));
				225	# }
				226	```
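The end offset in the example above is `6` rather than `3` because match locations are byte indices and each of these Greek letters occupies two bytes in UTF-8. A sketch using only the standard library (the helper name is hypothetical) makes the offsets visible:

```rust
/// Returns each character of `text` together with the byte offset at which
/// its UTF-8 encoding begins.
fn char_byte_offsets(text: &str) -> Vec<(usize, char)> {
    text.char_indices().collect()
}
```

For `"ΔδΔ"` the characters begin at byte offsets 0, 2 and 4, and the string is 6 bytes long in total.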
				227
				228	Most features of the regular expressions in this crate are Unicode aware. Here
				229	are some examples:
				230
				231	* `.` will match any valid UTF-8 encoded Unicode scalar value except for `\n`.
				232	(To also match `\n`, enable the `s` flag, e.g., `(?s:.)`.)
				233	* `\w`, `\d` and `\s` are Unicode aware. For example, `\s` will match all forms
				234	of whitespace categorized by Unicode.
				235	* `\b` matches a Unicode word boundary.
				236	* Negated character classes like `[^a]` match all Unicode scalar values except
				237	for `a`.
				238	* `^` and `$` are not Unicode aware in multi-line mode. Namely, they only
				239	recognize `\n` and not any of the other forms of line terminators defined
				240	by Unicode.
				241
				242	Unicode general categories, scripts, script extensions, ages and a smattering
				243	of boolean properties are available as character classes. For example, you can
				244	match a sequence of numerals, Greek or Cherokee letters:
				245
				246	```rust
				247	# extern crate regex; use regex::Regex;
				248	# fn main() {
				249	let re = Regex::new(r"[\pN\p{Greek}\p{Cherokee}]+").unwrap();
				250	let mat = re.find("abcΔᎠβⅠᏴγδⅡxyz").unwrap();
				251	assert_eq!((mat.start(), mat.end()), (3, 23));
				252	# }
				253	```
				254
				255	For a more detailed breakdown of Unicode support with respect to
				256	[UTS#18](http://unicode.org/reports/tr18/),
				257	please see the
				258	[UNICODE](https://github.com/rust-lang/regex/blob/master/UNICODE.md)
				259	document in the root of the regex repository.
				260
				261	# Opt out of Unicode support
				262
				263	The `bytes` sub-module provides a `Regex` type that can be used to match
				264	on `&[u8]`. By default, text is interpreted as UTF-8 just like it is with
				265	the main `Regex` type. However, this behavior can be disabled by turning
				266	off the `u` flag, even if doing so could result in matching invalid UTF-8.
				267	For example, when the `u` flag is disabled, `.` will match any byte instead
				268	of any Unicode scalar value.
				269
				270	Disabling the `u` flag is also possible with the standard `&str`-based `Regex`
				271	type, but it is only allowed where the UTF-8 invariant is maintained. For
				272	example, `(?-u:\w)` is an ASCII-only `\w` character class and is legal in an
				273	`&str`-based `Regex`, but `(?-u:\xFF)` will attempt to match the raw byte
				274	`\xFF`, which is invalid UTF-8 and therefore is illegal in `&str`-based
				275	regexes.
				276
				277	Finally, since Unicode support requires bundling large Unicode data
				278	tables, this crate exposes knobs to disable the compilation of those
				279	data tables, which can be useful for shrinking binary size and reducing
				280	compilation times. For details on how to do that, see the section on [crate
				281	features](#crate-features).
				282
				283	# Syntax
				284
				285	The syntax supported in this crate is documented below.
				286
				287	Note that the regular expression parser and abstract syntax are exposed in
				288	a separate crate, [`regex-syntax`](https://docs.rs/regex-syntax).
				289
				290	## Matching one character
				291
				292	<pre class="rust">
				293	. any character except new line (includes new line with s flag)
				294	\d digit (\p{Nd})
				295	\D not digit
				296	\pN One-letter name Unicode character class
				297	\p{Greek} Unicode character class (general category or script)
				298	\PN Negated one-letter name Unicode character class
				299	\P{Greek} negated Unicode character class (general category or script)
				300	</pre>
				301
				302	### Character classes
				303
				304	<pre class="rust">
				305	[xyz] A character class matching either x, y or z (union).
				306	[^xyz] A character class matching any character except x, y and z.
				307	[a-z] A character class matching any character in range a-z.
				308	[[:alpha:]] ASCII character class ([A-Za-z])
				309	[[:^alpha:]] Negated ASCII character class ([^A-Za-z])
				310	[x[^xyz]] Nested/grouping character class (matching any character except y and z)
				311	[a-y&&xyz] Intersection (matching x or y)
				312	[0-9&&[^4]] Subtraction using intersection and negation (matching 0-9 except 4)
				313	[0-9--4] Direct subtraction (matching 0-9 except 4)
				314	[a-g~~b-h] Symmetric difference (matching `a` and `h` only)
				315	[\[\]] Escaping in character classes (matching [ or ])
				316	</pre>
				317
				318	Any named character class may appear inside a bracketed `[...]` character
				319	class. For example, `[\p{Greek}[:digit:]]` matches any Greek or ASCII
				320	digit. `[\p{Greek}&&\pL]` matches Greek letters.
				321
				322	Precedence in character classes, from most binding to least:
				323
				324	1. Ranges: `a-cd` == `[a-c]d`
				325	2. Union: `ab&&bc` == `[ab]&&[bc]`
				326	3. Intersection: `^a-z&&b` == `^[a-z&&b]`
				327	4. Negation
				328
				329	## Composites
				330
				331	<pre class="rust">
				332	xy concatenation (x followed by y)
				333	x|y alternation (x or y, prefer x)
				334	</pre>
				335
				336	## Repetitions
				337
				338	<pre class="rust">
				339	x* zero or more of x (greedy)
				340	x+ one or more of x (greedy)
				341	x? zero or one of x (greedy)
				342	x*? zero or more of x (ungreedy/lazy)
				343	x+? one or more of x (ungreedy/lazy)
				344	x?? zero or one of x (ungreedy/lazy)
				345	x{n,m} at least n x and at most m x (greedy)
				346	x{n,} at least n x (greedy)
				347	x{n} exactly n x
				348	x{n,m}? at least n x and at most m x (ungreedy/lazy)
				349	x{n,}? at least n x (ungreedy/lazy)
				350	x{n}? exactly n x
				351	</pre>
				352
				353	## Empty matches
				354
				355	<pre class="rust">
				356	^ the beginning of text (or start-of-line with multi-line mode)
				357	$ the end of text (or end-of-line with multi-line mode)
				358	\A only the beginning of text (even with multi-line mode enabled)
				359	\z only the end of text (even with multi-line mode enabled)
				360	\b a Unicode word boundary (\w on one side and \W, \A, or \z on other)
				361	\B not a Unicode word boundary
				362	</pre>
				363
				364	## Grouping and flags
				365
				366	<pre class="rust">
				367	(exp) numbered capture group (indexed by opening parenthesis)
				368	(?P<name>exp) named (also numbered) capture group (allowed chars: [_0-9a-zA-Z])
				369	(?:exp) non-capturing group
				370	(?flags) set flags within current group
				371	(?flags:exp) set flags for exp (non-capturing)
				372	</pre>
				373
				374	Flags are each a single character. For example, `(?x)` sets the flag `x`
				375	and `(?-x)` clears the flag `x`. Multiple flags can be set or cleared at
				376	the same time: `(?xy)` sets both the `x` and `y` flags and `(?x-y)` sets
				377	the `x` flag and clears the `y` flag.
				378
				379	All flags are by default disabled unless stated otherwise. They are:
				380
				381	<pre class="rust">
				382	i case-insensitive: letters match both upper and lower case
				383	m multi-line mode: ^ and $ match begin/end of line
				384	s allow . to match \n
				385	U swap the meaning of x* and x*?
				386	u Unicode support (enabled by default)
				387	x ignore whitespace and allow line comments (starting with `#`)
				388	</pre>
				389
				390	Flags can be toggled within a pattern. Here's an example that matches
				391	case-insensitively for the first part but case-sensitively for the second part:
				392
				393	```rust
				394	# extern crate regex; use regex::Regex;
				395	# fn main() {
				396	let re = Regex::new(r"(?i)a+(?-i)b+").unwrap();
				397	let cap = re.captures("AaAaAbbBBBb").unwrap();
				398	assert_eq!(&cap[0], "AaAaAbb");
				399	# }
				400	```
				401
				402	Notice that the `a+` matches either `a` or `A`, but the `b+` only matches
				403	`b`.
				404
				405	Multi-line mode means `^` and `$` no longer match just at the beginning/end of
				406	the input, but at the beginning/end of lines:
				407
				408	```
				409	# use regex::Regex;
				410	let re = Regex::new(r"(?m)^line \d+").unwrap();
				411	let m = re.find("line one\nline 2\n").unwrap();
				412	assert_eq!(m.as_str(), "line 2");
				413	```
				414
				415	Note that `^` matches after new lines, even at the end of input:
				416
				417	```
				418	# use regex::Regex;
				419	let re = Regex::new(r"(?m)^").unwrap();
				420	let m = re.find_iter("test\n").last().unwrap();
				421	assert_eq!((m.start(), m.end()), (5, 5));
				422	```
				423
				424	Here is an example that uses an ASCII word boundary instead of a Unicode
				425	word boundary:
				426
				427	```rust
				428	# extern crate regex; use regex::Regex;
				429	# fn main() {
				430	let re = Regex::new(r"(?-u:\b).+(?-u:\b)").unwrap();
				431	let cap = re.captures("$$abc$$").unwrap();
				432	assert_eq!(&cap[0], "abc");
				433	# }
				434	```
				435
				436	## Escape sequences
				437
				438	<pre class="rust">
				439	\* literal *, works for any punctuation character: \.+*?()|[]{}^$
				440	\a bell (\x07)
				441	\f form feed (\x0C)
				442	\t horizontal tab
				443	\n new line
				444	\r carriage return
				445	\v vertical tab (\x0B)
				446	\123 octal character code (up to three digits) (when enabled)
				447	\x7F hex character code (exactly two digits)
				448	\x{10FFFF} any hex character code corresponding to a Unicode code point
				449	\u007F hex character code (exactly four digits)
				450	\u{7F} any hex character code corresponding to a Unicode code point
				451	\U0000007F hex character code (exactly eight digits)
				452	\U{7F} any hex character code corresponding to a Unicode code point
				453	</pre>
				454
				455	## Perl character classes (Unicode friendly)
				456
				457	These classes are based on the definitions provided in
				458	[UTS#18](http://www.unicode.org/reports/tr18/#Compatibility_Properties):
				459
				460	<pre class="rust">
				461	\d digit (\p{Nd})
				462	\D not digit
				463	\s whitespace (\p{White_Space})
				464	\S not whitespace
				465	\w word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control})
				466	\W not word character
				467	</pre>
				468
				469	## ASCII character classes
				470
				471	<pre class="rust">
				472	[[:alnum:]] alphanumeric ([0-9A-Za-z])
				473	[[:alpha:]] alphabetic ([A-Za-z])
				474	[[:ascii:]] ASCII ([\x00-\x7F])
				475	[[:blank:]] blank ([\t ])
				476	[[:cntrl:]] control ([\x00-\x1F\x7F])
				477	[[:digit:]] digits ([0-9])
				478	[[:graph:]] graphical ([!-~])
				479	[[:lower:]] lower case ([a-z])
				480	[[:print:]] printable ([ -~])
				481	[[:punct:]] punctuation ([!-/:-@\[-`{-~])
				482	[[:space:]] whitespace ([\t\n\v\f\r ])
				483	[[:upper:]] upper case ([A-Z])
				484	[[:word:]] word characters ([0-9A-Za-z_])
				485	[[:xdigit:]] hex digit ([0-9A-Fa-f])
				486	</pre>
				487
				488	# Crate features
				489
				490	By default, this crate tries pretty hard to make regex matching both as fast
				491	as possible and as correct as it can be, within reason. This means that there
				492	is a lot of code dedicated to performance, the handling of Unicode data and the
				493	Unicode data itself. Overall, this leads to more dependencies, larger binaries
				494	and longer compile times. This trade off may not be appropriate in all cases,
				495	and indeed, even when all Unicode and performance features are disabled, one
				496	is still left with a perfectly serviceable regex engine that will work well
				497	in many cases.
				498
				499	This crate exposes a number of features for controlling that trade off. Some
				500	of these features are strictly performance oriented, such that disabling them
				501	won't result in a loss of functionality, but may result in worse performance.
				502	Other features, such as the ones controlling the presence or absence of Unicode
				503	data, can result in a loss of functionality. For example, if one disables the
				504	`unicode-case` feature (described below), then compiling the regex `(?i)a`
				505	will fail since Unicode case insensitivity is enabled by default. Instead,
				506	callers must use `(?i-u)a` to disable Unicode case folding. Stated
				507	differently, enabling or disabling any of the features below can only add or
				508	subtract from the total set of valid regular expressions. Enabling or disabling
				509	a feature will never modify the match semantics of a regular expression.
				510
				511	All features below are enabled by default.
				512
				513	### Ecosystem features
				514
				515	* std -
				516	When enabled, this will cause `regex` to use the standard library. Currently,
				517	disabling this feature will always result in a compilation error. It is
				518	intended to add `alloc`-only support to regex in the future.
				519
				520	### Performance features
				521
				522	* perf -
				523	Enables all performance related features. This feature is enabled by default
				524	and will always cover all features that improve performance, even if more
				525	are added in the future.
				526	* perf-cache -
				527	Enables the use of very fast thread safe caching for internal match state.
				528	When this is disabled, caching is still used, but with a slower and simpler
				529	implementation. Disabling this drops the `thread_local` and `lazy_static`
				530	dependencies.
				531	* perf-dfa -
				532	Enables the use of a lazy DFA for matching. The lazy DFA is used to compile
				533	portions of a regex to a very fast DFA on an as-needed basis. This can
				534	result in substantial speedups, usually by an order of magnitude on large
				535	haystacks. The lazy DFA does not bring in any new dependencies, but it can
				536	make compile times longer.
				537	* perf-inline -
				538	Enables the use of aggressive inlining inside match routines. This reduces
				539	the overhead of each match. The aggressive inlining, however, increases
				540	compile times and binary size.
				541	* perf-literal -
				542	Enables the use of literal optimizations for speeding up matches. In some
				543	cases, literal optimizations can result in speedups of _several_ orders of
				544	magnitude. Disabling this drops the `aho-corasick` and `memchr` dependencies.
				545
				546	### Unicode features
				547
				548	* unicode -
				549	Enables all Unicode features. This feature is enabled by default, and will
				550	always cover all Unicode features, even if more are added in the future.
				551	* unicode-age -
				552	Provide the data for the
				553	[Unicode `Age` property](https://www.unicode.org/reports/tr44/tr44-24.html#Character_Age).
				554	This makes it possible to use classes like `\p{Age:6.0}` to refer to all
				555	codepoints first introduced in Unicode 6.0.
				556	* unicode-bool -
				557	Provide the data for numerous Unicode boolean properties. The full list
				558	is not included here, but contains properties like `Alphabetic`, `Emoji`,
				559	`Lowercase`, `Math`, `Uppercase` and `White_Space`.
				560	* unicode-case -
				561	Provide the data for case insensitive matching using
				562	[Unicode's "simple loose matches" specification](https://www.unicode.org/reports/tr18/#Simple_Loose_Matches).
				563	* unicode-gencat -
				564	Provide the data for
				565	[Unicode general categories](https://www.unicode.org/reports/tr44/tr44-24.html#General_Category_Values).
				566	This includes, but is not limited to, `Decimal_Number`, `Letter`,
				567	`Math_Symbol`, `Number` and `Punctuation`.
				568	* unicode-perl -
				569	Provide the data for supporting the Unicode-aware Perl character classes,
				570	corresponding to `\w`, `\s` and `\d`. This is also necessary for using
				571	Unicode-aware word boundary assertions. Note that if this feature is
				572	disabled, the `\s` and `\d` character classes are still available if the
				573	`unicode-bool` and `unicode-gencat` features are enabled, respectively.
				574	* unicode-script -
				575	Provide the data for
				576	[Unicode scripts and script extensions](https://www.unicode.org/reports/tr24/).
				577	This includes, but is not limited to, `Arabic`, `Cyrillic`, `Hebrew`,
				578	`Latin` and `Thai`.
				579	* unicode-segment -
				580	Provide the data necessary to provide the properties used to implement the
				581	[Unicode text segmentation algorithms](https://www.unicode.org/reports/tr29/).
				582	This enables using classes like `\p{gcb=Extend}`, `\p{wb=Katakana}` and
				583	`\p{sb=ATerm}`.
				584
				585
				586	# Untrusted input
				587
				588	This crate can handle both untrusted regular expressions and untrusted
				589	search text.
				590
				591	Untrusted regular expressions are handled by capping the size of a compiled
				592	regular expression.
				593	(See [`RegexBuilder::size_limit`](struct.RegexBuilder.html#method.size_limit).)
				594	Without this, it would be trivial for an attacker to exhaust your system's
				595	memory with expressions like `a{100}{100}{100}`.
				596
				597	Untrusted search text is allowed because the matching engine(s) in this
				598	crate have time complexity `O(mn)` (with `m ~ regex` and `n ~ search
				599	text`), which means there's no way to cause exponential blow-up like with
				600	some other regular expression engines. (We pay for this by disallowing
				601	features like arbitrary look-ahead and backreferences.)
				602
				603	When a DFA is used, pathological cases with exponential state blow-up are
				604	avoided by constructing the DFA lazily or in an "online" manner. Therefore,
				605	at most one new state can be created for each byte of input. This satisfies
				606	our time complexity guarantees, but can lead to memory growth
				607	proportional to the size of the input. As a stopgap, the DFA is only
				608	allowed to store a fixed number of states. When the limit is reached, its
				609	states are wiped and the search continues, possibly duplicating previous work. If
				610	the limit is reached too frequently, it gives up and hands control off to
				611	another matching engine with fixed memory requirements.
				612	(The DFA size limit can also be tweaked. See
				613	[`RegexBuilder::dfa_size_limit`](struct.RegexBuilder.html#method.dfa_size_limit).)
				614	*/
				615
				616	#![deny(missing_docs)]
				617	#![cfg_attr(test, deny(warnings))]
				618	#![cfg_attr(feature = "pattern", feature(pattern))]
				619
				620	#[cfg(not(feature = "std"))]
				621	compile_error!("`std` feature is currently required to build this crate");
				622
				623	#[cfg(feature = "perf-literal")]
				624	extern crate aho_corasick;
Haibo Huang	49cbe5f	2020-05-28 20:14:24 -0700	[diff] [blame^]	625	// #[cfg(doctest)]
				626	// extern crate doc_comment;
Chih-Hung Hsieh	e42c505	2020-04-16 10:44:21 -0700	[diff] [blame]	627	#[cfg(feature = "perf-literal")]
				628	extern crate memchr;
				629	#[cfg(test)]
				630	#[cfg_attr(feature = "perf-literal", macro_use)]
				631	extern crate quickcheck;
				632	extern crate regex_syntax as syntax;
				633	#[cfg(feature = "perf-cache")]
				634	extern crate thread_local;
				635
Haibo Huang	49cbe5f	2020-05-28 20:14:24 -0700	[diff] [blame^]	636	// #[cfg(doctest)]
				637	// doc_comment::doctest!("../README.md");
Chih-Hung Hsieh	e42c505	2020-04-16 10:44:21 -0700	[diff] [blame]	638
				639	#[cfg(feature = "std")]
				640	pub use error::Error;
				641	#[cfg(feature = "std")]
				642	pub use re_builder::set_unicode::*;
				643	#[cfg(feature = "std")]
				644	pub use re_builder::unicode::*;
				645	#[cfg(feature = "std")]
				646	pub use re_set::unicode::*;
				647	#[cfg(feature = "std")]
				649	pub use re_unicode::{
				650	    escape, CaptureLocations, CaptureMatches, CaptureNames, Captures,
				651	    Locations, Match, Matches, NoExpand, Regex, Replacer, ReplacerRef, Split,
				652	    SplitN, SubCaptureMatches,
				653	};
				654
				655	/**
				656	Match regular expressions on arbitrary bytes.
				657
				658	This module provides a nearly identical API to the one found in the
				659	top-level of this crate. There are two important differences:
				660
				661	1. Matching is done on `&[u8]` instead of `&str`. Additionally, `Vec<u8>`
				662	is used where `String` would have been used.
				663	2. Unicode support can be disabled even when disabling it would result in
				664	matching invalid UTF-8 bytes.
				665
				666	# Example: match null terminated string
				667
				668	This shows how to find all null-terminated strings in a slice of bytes:
				669
				670	```rust
				671	# use regex::bytes::Regex;
				672	let re = Regex::new(r"(?-u)(?P<cstr>[^\x00]+)\x00").unwrap();
				673	let text = b"foo\x00bar\x00baz\x00";
				674
				675	// Extract all of the strings without the null terminator from each match.
				676	// The unwrap is OK here since a match requires the `cstr` capture to match.
				677	let cstrs: Vec<&[u8]> =
				678	    re.captures_iter(text)
				679	      .map(|c| c.name("cstr").unwrap().as_bytes())
				680	      .collect();
				681	assert_eq!(vec![&b"foo"[..], &b"bar"[..], &b"baz"[..]], cstrs);
				682	```
				683
				684	# Example: selectively enable Unicode support
				685
				686	This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded
				687	string (e.g., to extract a title from a Matroska file):
				688
				689	```rust
				690	# use std::str;
				691	# use regex::bytes::Regex;
				692	let re = Regex::new(
				693	    r"(?-u)\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))"
				694	).unwrap();
				695	let text = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65";
				696	let caps = re.captures(text).unwrap();
				697
				698	// Notice that despite the `.*` at the end, it will only match valid UTF-8
				699	// because Unicode mode was enabled with the `u` flag. Without the `u` flag,
				700	// the `.*` would match the rest of the bytes.
				701	let mat = caps.get(1).unwrap();
				702	assert_eq!((7, 10), (mat.start(), mat.end()));
				703
				704	// If there was a match, Unicode mode guarantees that `title` is valid UTF-8.
				705	let title = str::from_utf8(&caps[1]).unwrap();
				706	assert_eq!("☃", title);
				707	```
				708
				709	In general, if the Unicode flag is enabled in a capture group and that capture
				710	is part of the overall match, then the capture is guaranteed to be valid
				711	UTF-8.
				712
				713	# Syntax
				714
				715	The supported syntax is pretty much the same as the syntax for Unicode
				716	regular expressions with a few changes that make sense for matching arbitrary
				717	bytes:
				718
				719	1. The `u` flag can be disabled even when disabling it might cause the regex to
				720	match invalid UTF-8. When the `u` flag is disabled, the regex is said to be in
				721	"ASCII compatible" mode.
				722	2. In ASCII compatible mode, neither Unicode scalar values nor Unicode
				723	character classes are allowed.
				724	3. In ASCII compatible mode, Perl character classes (`\w`, `\d` and `\s`)
				725	revert to their typical ASCII definition. `\w` maps to `[[:word:]]`, `\d` maps
				726	to `[[:digit:]]` and `\s` maps to `[[:space:]]`.
				727	4. In ASCII compatible mode, word boundaries use the ASCII compatible `\w` to
				728	determine whether a byte is a word byte or not.
				729	5. Hexadecimal notation can be used to specify arbitrary bytes instead of
				730	Unicode codepoints. For example, in ASCII compatible mode, `\xFF` matches the
				731	literal byte `\xFF`, while in Unicode mode, `\xFF` is a Unicode codepoint that
				732	matches its UTF-8 encoding of `\xC3\xBF`. Similarly for octal notation when
				733	enabled.
				734	6. `.` matches any byte except for `\n` instead of any Unicode scalar value.
				735	When the `s` flag is enabled, `.` matches any byte.
				736
				737	# Performance
				738
				739	In general, one should expect performance on `&[u8]` to be roughly similar to
				740	performance on `&str`.
				741	*/
				742	#[cfg(feature = "std")]
				743	pub mod bytes {
				744	    pub use re_builder::bytes::*;
				745	    pub use re_builder::set_bytes::*;
				746	    pub use re_bytes::*;
				747	    pub use re_set::bytes::*;
				748	}
				749
				750	mod backtrack;
				751	mod cache;
				752	mod compile;
				753	#[cfg(feature = "perf-dfa")]
				754	mod dfa;
				755	mod error;
				756	mod exec;
				757	mod expand;
				758	mod find_byte;
				759	#[cfg(feature = "perf-literal")]
				760	mod freqs;
				761	mod input;
				762	mod literal;
				763	#[cfg(feature = "pattern")]
				764	mod pattern;
				765	mod pikevm;
				766	mod prog;
				767	mod re_builder;
				768	mod re_bytes;
				769	mod re_set;
				770	mod re_trait;
				771	mod re_unicode;
				772	mod sparse;
				773	mod utf8;
				774
				775	/// The `internal` module exists to support suspicious activity, such as
				776	/// testing different matching engines and supporting the `regex-debug` CLI
				777	/// utility.
				778	#[doc(hidden)]
				779	#[cfg(feature = "std")]
				780	pub mod internal {
				781	    pub use compile::Compiler;
				782	    pub use exec::{Exec, ExecBuilder};
				783	    pub use input::{Char, CharInput, Input, InputAt};
				784	    pub use literal::LiteralSearcher;
				785	    pub use prog::{EmptyLook, Inst, InstRanges, Program};
				786	}