Blame - Documentation/networking/fib_trie.txt - kernel/msm-4.9

blob: 0723db7f8495a1a0466ae06cd311c3e72ce24337 [file] [log] [blame]

Robert Olsson	b2f5710	2005-07-05 16:38:26 -0700	[diff] [blame]	1	LC-trie implementation notes.
				2
				3	Node types
				4	----------
				5	leaf
				6	An end node with data. This has a copy of the relevant key, along
				7	with 'hlist' with routing table entries sorted by prefix length.
				8	See struct leaf and struct leaf_info.
				9
				10	trie node or tnode
				11	An internal node, holding an array of child (leaf or tnode) pointers,
				12	indexed through a subset of the key. See Level Compression.
				13
				14	A few concepts explained
				15	------------------------
				16	Bits (tnode)
				17	The number of bits in the key segment used for indexing into the
				18	child array - the "child index". See Level Compression.
				19
				20	Pos (tnode)
				21	The position (in the key) of the key segment used for indexing into
				22	the child array. See Path Compression.
				23
				24	Path Compression / skipped bits
				25	Any given tnode is linked to from the child array of its parent, using
				26	a segment of the key specified by the parent's "pos" and "bits"
				27	In certain cases, this tnode's own "pos" will not be immediately
				28	adjacent to the parent (pos+bits), but there will be some bits
				29	in the key skipped over because they represent a single path with no
				30	deviations. These "skipped bits" constitute Path Compression.
				31	Note that the search algorithm will simply skip over these bits when
				32	searching, making it necessary to save the keys in the leaves to
				33	verify that they actually do match the key we are searching for.
				34
				35	Level Compression / child arrays
				36	the trie is kept level balanced moving, under certain conditions, the
				37	children of a full child (see "full_children") up one level, so that
				38	instead of a pure binary tree, each internal node ("tnode") may
				39	contain an arbitrarily large array of links to several children.
				40	Conversely, a tnode with a mostly empty child array (see empty_children)
				41	may be "halved", having some of its children moved downwards one level,
				42	in order to avoid ever-increasing child arrays.
				43
				44	empty_children
				45	the number of positions in the child array of a given tnode that are
				46	NULL.
				47
				48	full_children
				49	the number of children of a given tnode that aren't path compressed.
				50	(in other words, they aren't NULL or leaves and their "pos" is equal
				51	to this tnode's "pos"+"bits").
				52
				53	(The word "full" here is used more in the sense of "complete" than
				54	as the opposite of "empty", which might be a tad confusing.)
				55
				56	Comments
				57	---------
				58
				59	We have tried to keep the structure of the code as close to fib_hash as
				60	possible to allow verification and help up reviewing.
				61
				62	fib_find_node()
				63	A good start for understanding this code. This function implements a
				64	straightforward trie lookup.
				65
				66	fib_insert_node()
				67	Inserts a new leaf node in the trie. This is bit more complicated than
				68	fib_find_node(). Inserting a new node means we might have to run the
				69	level compression algorithm on part of the trie.
				70
				71	trie_leaf_remove()
				72	Looks up a key, deletes it and runs the level compression algorithm.
				73
				74	trie_rebalance()
				75	The key function for the dynamic trie after any change in the trie
				76	it is run to optimize and reorganize. Tt will walk the trie upwards
				77	towards the root from a given tnode, doing a resize() at each step
				78	to implement level compression.
				79
				80	resize()
				81	Analyzes a tnode and optimizes the child array size by either inflating
Matt LaPlante	a2ffd27	2006-10-03 22:49:15 +0200	[diff] [blame]	82	or shrinking it repeatedly until it fulfills the criteria for optimal
Robert Olsson	b2f5710	2005-07-05 16:38:26 -0700	[diff] [blame]	83	level compression. This part follows the original paper pretty closely
				84	and there may be some room for experimentation here.
				85
				86	inflate()
				87	Doubles the size of the child array within a tnode. Used by resize().
				88
				89	halve()
				90	Halves the size of the child array within a tnode - the inverse of
				91	inflate(). Used by resize();
				92
				93	fn_trie_insert(), fn_trie_delete(), fn_trie_select_default()
				94	The route manipulation functions. Should conform pretty closely to the
				95	corresponding functions in fib_hash.
				96
				97	fn_trie_flush()
				98	This walks the full trie (using nextleaf()) and searches for empty
				99	leaves which have to be removed.
				100
				101	fn_trie_dump()
				102	Dumps the routing table ordered by prefix length. This is somewhat
				103	slower than the corresponding fib_hash function, as we have to walk the
				104	entire trie for each prefix length. In comparison, fib_hash is organized
				105	as one "zone"/hash per prefix length.
				106
				107	Locking
				108	-------
				109
				110	fib_lock is used for an RW-lock in the same way that this is done in fib_hash.
				111	However, the functions are somewhat separated for other possible locking
				112	scenarios. It might conceivably be possible to run trie_rebalance via RCU
				113	to avoid read_lock in the fn_trie_lookup() function.
				114
				115	Main lookup mechanism
				116	---------------------
				117	fn_trie_lookup() is the main lookup function.
				118
				119	The lookup is in its simplest form just like fib_find_node(). We descend the
				120	trie, key segment by key segment, until we find a leaf. check_leaf() does
				121	the fib_semantic_match in the leaf's sorted prefix hlist.
				122
				123	If we find a match, we are done.
				124
				125	If we don't find a match, we enter prefix matching mode. The prefix length,
				126	starting out at the same as the key length, is reduced one step at a time,
				127	and we backtrack upwards through the trie trying to find a longest matching
				128	prefix. The goal is always to reach a leaf and get a positive result from the
				129	fib_semantic_match mechanism.
				130
				131	Inside each tnode, the search for longest matching prefix consists of searching
				132	through the child array, chopping off (zeroing) the least significant "1" of
				133	the child index until we find a match or the child index consists of nothing but
				134	zeros.
				135
				136	At this point we backtrack (t->stats.backtrack++) up the trie, continuing to
				137	chop off part of the key in order to find the longest matching prefix.
				138
				139	At this point we will repeatedly descend subtries to look for a match, and there
				140	are some optimizations available that can provide us with "shortcuts" to avoid
				141	descending into dead ends. Look for "HL_OPTIMIZE" sections in the code.
				142
				143	To alleviate any doubts about the correctness of the route selection process,
				144	a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which
				145	gives userland access to fib_lookup().