Your friendly guide to understanding the performance characteristics of this
crate.

This guide assumes some familiarity with the public API of this crate, which
can be found here: https://docs.rs/regex

## Theory vs. Practice

One of the design goals of this crate is to provide worst case linear time
behavior with respect to the text searched using finite state automata. This
means that, *in theory*, the performance of this crate is much better than most
regex implementations, which typically use backtracking and therefore have
worst case exponential time behavior.

For example, try opening a Python interpreter and typing this:

    >>> import re
    >>> re.search('(a*)*c', 'a' * 30).span()

I'll wait.

At some point, you'll figure out that it won't terminate any time soon. ^C it.

The promise of this crate is that *this pathological behavior can't happen*.
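
To see the contrast, here is a minimal sketch of the same pathological pattern
using this crate (the haystack mirrors the Python example above). The search
returns promptly because it runs in time linear in the haystack:

    use regex::Regex;

    fn main() {
        // The same nested repetition that hangs a backtracking engine.
        let re = Regex::new(r"(a*)*c").unwrap();
        let haystack = "a".repeat(30);
        // Returns promptly: there is no `c`, so there is no match.
        assert!(!re.is_match(&haystack));
    }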

With that said, just because we have protected ourselves against worst case
exponential behavior doesn't mean we are immune from large constant factors
or places where the current regex engine isn't quite optimal. This guide will
detail those cases and provide guidance on how to avoid them, among other
bits of general advice.

## Thou Shalt Not Compile Regular Expressions In A Loop

**Advice**: Use `lazy_static` to amortize the cost of `Regex` compilation.

Don't do it unless you really don't mind paying for it. Compiling a regular
expression in this crate is quite expensive. It is conceivable that it may get
faster some day, but I wouldn't hold out hope for, say, an order of magnitude
improvement. In particular, compilation can take anywhere from a few dozen
microseconds to a few dozen milliseconds. Yes, milliseconds. Unicode character
classes, in particular, have the largest impact on compilation performance. At
the time of writing, for example, `\pL{100}` takes around 44ms to compile. This
is because `\pL` corresponds to every letter in Unicode and compilation must
turn it into a proper automaton that decodes a subset of UTF-8 which
corresponds to those letters. Compilation also spends some cycles shrinking the
size of the automaton.
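
If you'd like to see this cost on your own machine, a quick sketch like the
following measures compile time directly (the exact number will vary by
machine and crate version):

    use std::time::Instant;

    use regex::Regex;

    fn main() {
        let start = Instant::now();
        // A Unicode-heavy pattern: 100 repetitions of "any letter".
        let re = Regex::new(r"\pL{100}").unwrap();
        println!("compiled {} in {:?}", re.as_str(), start.elapsed());
    }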

This means that in order to realize efficient regex matching, one must
*amortize the cost of compilation*. Trivially, if a call to `is_match` is
inside a loop, then make sure your call to `Regex::new` is *outside* that loop.
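
As a minimal sketch of that advice (the helper names and pattern are made up
for illustration):

    use regex::Regex;

    // Bad: compiles the regex once for every line searched.
    fn count_matches_slow(lines: &[&str]) -> usize {
        let mut count = 0;
        for line in lines {
            // Recompiling here costs microseconds to milliseconds per line.
            let re = Regex::new(r"\w+@\w+").unwrap();
            if re.is_match(line) {
                count += 1;
            }
        }
        count
    }

    // Good: compiles the regex once and reuses it for every line.
    fn count_matches_fast(lines: &[&str]) -> usize {
        let re = Regex::new(r"\w+@\w+").unwrap();
        lines.iter().filter(|line| re.is_match(line)).count()
    }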

In many programming languages, regular expressions can be conveniently defined
and compiled in a global scope, and code can reach out and use them as if
they were global static variables. In Rust, there is really no concept of
life-before-main, and therefore, one cannot utter this:

    static MY_REGEX: Regex = Regex::new("...").unwrap();

Unfortunately, this would seem to imply that one must pass `Regex` objects
around to everywhere they are used, which can be especially painful depending
on how your program is structured. Thankfully, the
[`lazy_static`](https://crates.io/crates/lazy_static)
crate provides an answer that works well:

    #[macro_use] extern crate lazy_static;
    extern crate regex;

    use regex::Regex;

    fn some_helper_function(text: &str) -> bool {
        lazy_static! {
            static ref MY_REGEX: Regex = Regex::new("...").unwrap();
        }
        MY_REGEX.is_match(text)
    }

In other words, the `lazy_static!` macro enables us to define a `Regex` *as if*
it were a global static value. What is actually happening under the covers is
that the code inside the macro (i.e., `Regex::new(...)`) is run on *first use*
of `MY_REGEX` via a `Deref` impl. The implementation is admittedly magical, but
it's self-contained and everything works exactly as you expect. In particular,
`MY_REGEX` can be used from multiple threads without wrapping it in an `Arc` or
a `Mutex`. On that note...

## Using a regex from multiple threads

**Advice**: The performance impact from using a `Regex` from multiple threads
is likely negligible. If necessary, clone the `Regex` so that each thread gets
its own copy. Cloning a regex does not incur any more memory overhead than
using a single `Regex` from multiple threads simultaneously does. *Its only
cost is ergonomics.*

It is supported and encouraged to define your regexes using `lazy_static!` as
if they were global static values, and then use them to search text from
multiple threads simultaneously.

One might imagine that this is possible because a `Regex` represents a
*compiled* program, so that any allocation or mutation is already done, and is
therefore read-only. Unfortunately, this is not true. Each type of search
strategy in this crate requires some kind of mutable scratch space to use
*during search*. For example, when executing a DFA, its states are computed
lazily and reused on subsequent searches. Those states go into that mutable
scratch space.

The mutable scratch space is an implementation detail, and in general, its
mutation should not be observable from users of this crate. Therefore, it uses
interior mutability. This implies that `Regex` can either only be used from one
thread, or it must do some sort of synchronization. Either choice is
reasonable, but this crate chooses the latter, in particular because it is
ergonomic and makes use with `lazy_static!` straightforward.

Synchronization implies *some* amount of overhead. When a `Regex` is used from
a single thread, this overhead is negligible. When a `Regex` is used from
multiple threads simultaneously, it is possible for the overhead of
synchronization from contention to impact performance. Specifically, contention
may happen if you are calling any of these methods repeatedly from multiple
threads simultaneously:

* shortest_match
* is_match
* find
* captures

In particular, every invocation of one of these methods must synchronize with
other threads to retrieve its mutable scratch space before searching can start.
If, however, you are using one of these methods:

* find_iter
* captures_iter

Then you may not suffer from contention since the cost of synchronization is
amortized on *construction of the iterator*. That is, the mutable scratch space
is obtained when the iterator is created and retained throughout its lifetime.
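
If contention does show up in a profile, here is a minimal sketch of the
clone-per-thread workaround (the helper name and pattern are made up for
illustration):

    use regex::Regex;

    fn count_words_in_parallel(texts: Vec<String>) -> usize {
        let re = Regex::new(r"\w+").unwrap();
        let handles: Vec<_> = texts
            .into_iter()
            .map(|text| {
                // Cloning is cheap: the compiled program is shared, and the
                // clone merely gives this thread its own handle.
                let re = re.clone();
                std::thread::spawn(move || re.find_iter(&text).count())
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    }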

## Only ask for what you need

**Advice**: Prefer in this order: `is_match`, `find`, `captures`.

There are three primary search methods on a `Regex`:

* is_match
* find
* captures

In general, these are ordered from fastest to slowest.

`is_match` is fastest because it doesn't actually need to find the start or the
end of the leftmost-first match. It can quit immediately after it knows there
is a match. For example, given the regex `a+` and the haystack `aaaaa`, the
search will quit after examining the first byte.

In contrast, `find` must return both the start and end location of the
leftmost-first match. It can use the DFA matcher for this, but must run it
forwards once to find the end of the match *and then run it backwards* to find
the start of the match. The two scans and the cost of finding the real end of
the leftmost-first match make this more expensive than `is_match`.

`captures` is the most expensive of them all because it must do what `find`
does, and then run either the bounded backtracker or the Pike VM to fill in the
capture group locations. Both of these are simulations of an NFA, which must
spend a lot of time shuffling states around. The DFA limits the performance hit
somewhat by restricting the amount of text that must be searched via an NFA
simulation.

One other method not mentioned is `shortest_match`. This method has precisely
the same performance characteristics as `is_match`, except it will return the
end location of the point at which it discovered a match. For example, given
the regex `a+` and the haystack `aaaaa`, `shortest_match` may return `1` as
opposed to `5`, the latter being the correct end location of the leftmost-first
match.
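
Here is a minimal sketch contrasting all four methods on the `a+`/`aaaaa`
example from above:

    use regex::Regex;

    fn main() {
        let re = Regex::new(r"a+").unwrap();
        let haystack = "aaaaa";

        // Cheapest: only answers "is there a match anywhere?"
        assert!(re.is_match(haystack));
        // May stop as soon as a match is known, e.g., after one byte here.
        assert!(re.shortest_match(haystack).is_some());
        // Must find the full leftmost-first match: bytes 0..5.
        let m = re.find(haystack).unwrap();
        assert_eq!((m.start(), m.end()), (0, 5));
        // Most expensive: additionally resolves capture group locations.
        let caps = re.captures(haystack).unwrap();
        assert_eq!(&caps[0], "aaaaa");
    }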

## Literals in your regex may make it faster

**Advice**: Literals can reduce the work that the regex engine needs to do. Use
them if you can, especially as prefixes.

In particular, if your regex starts with a prefix literal, the prefix is
quickly searched before entering the (much slower) regex engine. For example,
given the regex `foo\w+`, the literal `foo` will be searched for using
Boyer-Moore. If there's no match, then no regex engine is ever used. Only when
there's a match is the regex engine invoked at the location of the match, which
effectively permits the regex engine to skip large portions of a haystack.
If a regex is composed entirely of literals (possibly more than one), then
it's possible that the regex engine can be avoided entirely even when there's a
match.
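
As a small illustration of the prefix case (the haystack is made up):

    use regex::Regex;

    fn main() {
        let re = Regex::new(r"foo\w+").unwrap();
        let haystack = "a long run of text without the prefix... then foobar";
        // The literal scan jumps straight to "foo"; the regex engine only
        // runs from that position onward.
        assert_eq!(re.find(haystack).map(|m| m.as_str()), Some("foobar"));
    }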

When one literal is found, Boyer-Moore is used. When multiple literals are
found, then an optimized version of Aho-Corasick is used.

This optimization is extended quite a bit in this crate. Here are a few
examples of regexes that get literal prefixes detected:

* `(foo|bar)` detects `foo` and `bar`
* `(a|b)c` detects `ac` and `bc`
* `[ab]foo[yz]` detects `afooy`, `afooz`, `bfooy` and `bfooz`
* `a?b` detects `a` and `b`
* `a*b` detects `a` and `b`
* `(ab){3,6}` detects `ababab`

Literals in anchored regexes can also be used for detecting non-matches very
quickly. For example, `^foo\w+` and `\w+foo$` may be able to detect a non-match
just by examining the first (or last) three bytes of the haystack.
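
And a quick sketch of the anchored case:

    use regex::Regex;

    fn main() {
        let re = Regex::new(r"^foo\w+").unwrap();
        // A non-match can be ruled out by inspecting the first three bytes.
        assert!(!re.is_match("barbazquux"));
        assert!(re.is_match("foobar"));
    }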

## Unicode word boundaries may prevent the DFA from being used

**Advice**: In most cases, `\b` should work well. If not, use `(?-u:\b)`
instead of `\b` if you care about consistent performance more than correctness.

It's a sad state of the current implementation. At the moment, the DFA will try
to interpret Unicode word boundaries as if they were ASCII word boundaries.
If the DFA comes across any non-ASCII byte, it will quit and fall back to an
alternative matching engine that can handle Unicode word boundaries correctly.
The alternative matching engine is generally quite a bit slower (perhaps by an
order of magnitude). If necessary, this can be ameliorated in two ways.

The first way is to add some number of literal prefixes to your regular
expression. Even though the DFA may not be used, specialized routines will
still kick in to find prefix literals quickly, which limits how much work the
NFA simulation will need to do.

The second way is to give up on Unicode and use an ASCII word boundary instead.
One can use an ASCII word boundary by disabling Unicode support. That is,
instead of using `\b`, use `(?-u:\b)`. Namely, given the regex `\b.+\b`, it
can be transformed into a regex that uses the DFA with `(?-u:\b).+(?-u:\b)`. It
is important to limit the scope of disabling the `u` flag, since it might lead
to a syntax error if the regex could match arbitrary bytes. For example, if one
wrote `(?-u)\b.+\b`, then a syntax error would be returned because `.` matches
any *byte* when the Unicode flag is disabled.

The second way isn't appreciably different from just using a Unicode word
boundary in the first place, since the DFA will speculatively interpret it as
an ASCII word boundary anyway. The key difference is that if an ASCII word
boundary is used explicitly, then the DFA won't quit in the presence of
non-ASCII UTF-8 bytes. This results in giving up correctness in exchange for
more consistent performance.

N.B. When using `bytes::Regex`, Unicode support is disabled by default, so one
can simply write `\b` to get an ASCII word boundary.
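
Here is a minimal sketch of the scoping rule and the `bytes::Regex` note:

    use regex::Regex;

    fn main() {
        // Scoped: only the word boundaries are ASCII; `.` remains Unicode.
        assert!(Regex::new(r"(?-u:\b).+(?-u:\b)").is_ok());
        // Unscoped: `.` could now match arbitrary bytes, which a `Regex`
        // over `&str` must reject at compile time.
        assert!(Regex::new(r"(?-u)\b.+\b").is_err());
        // `bytes::Regex` searches `&[u8]`, so arbitrary bytes are fine.
        assert!(regex::bytes::Regex::new(r"(?-u)\b.+\b").is_ok());
    }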

## Excessive counting can lead to exponential state blow up in the DFA

**Advice**: Don't write regexes that cause DFA state blow up if you care about
match performance.

Wait, didn't I say that this crate guards against exponential worst cases?
Well, it turns out that the process of converting an NFA to a DFA can lead to
an exponential blow up in the number of states. This crate specifically guards
against exponential blow up by doing two things:

1. The DFA is computed lazily. That is, a state in the DFA only exists in
   memory if it is visited. In particular, the lazy DFA guarantees that *at
   most* one state is created for every byte of input. This, on its own,
   guarantees linear time complexity.
2. Of course, creating a new state for *every* byte of input means that search
   will go incredibly slow because of very large constant factors. On top of
   that, creating a state for every byte in a large haystack could result in
   exorbitant memory usage. To ameliorate this, the DFA bounds the number of
   states it can store. Once it reaches its limit, it flushes its cache. This
   prevents reuse of states that it already computed. If the cache is flushed
   too frequently, then the DFA will give up and execution will fall back to
   one of the NFA simulations.

In effect, this crate will detect exponential state blow up and fall back to
a search routine with fixed memory requirements. This does, however, mean that
searching will be much slower than one might expect. Regexes that rely on
counting in particular are strong aggravators of this behavior. A good example
is matching `[01]*1[01]{20}$` against a random sequence of `0`s and `1`s.
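
As a sketch of that aggravator (the haystack generation is made up for
illustration); intuitively, the regex must remember a window of the last 21
bits it has seen, which determinizes to on the order of 2^21 DFA states:

    use regex::Regex;

    fn main() {
        let re = Regex::new(r"[01]*1[01]{20}$").unwrap();
        let haystack: String =
            (0..10_000).map(|i| if i % 7 < 3 { '1' } else { '0' }).collect();
        // Still guaranteed linear time, but the DFA's state cache thrashes,
        // so this search may fall back to a much slower NFA simulation.
        println!("matched: {}", re.is_match(&haystack));
    }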

In the future, it may be possible to increase the bound that the DFA uses,
which would allow the caller to choose how much memory they're willing to
spend.

## Resist the temptation to "optimize" regexes

**Advice**: This ain't a backtracking engine.

An entire book was written on how to optimize Perl-style regular expressions.
Most of those techniques are not applicable to this library. For example,
there is no problem with using non-greedy matching or having lots of
alternations in your regex.