Robert Olsson | b2f5710 | 2005-07-05 16:38:26 -0700 | [diff] [blame] | 1 | LC-trie implementation notes. |
| 2 | |
| 3 | Node types |
| 4 | ---------- |
| 5 | leaf |
| 6 | An end node with data. This has a copy of the relevant key, along |
| 7 | with 'hlist' with routing table entries sorted by prefix length. |
| 8 | See struct leaf and struct leaf_info. |
| 9 | |
| 10 | trie node or tnode |
| 11 | An internal node, holding an array of child (leaf or tnode) pointers, |
| 12 | indexed through a subset of the key. See Level Compression. |
| 13 | |
| 14 | A few concepts explained |
| 15 | ------------------------ |
| 16 | Bits (tnode) |
| 17 | The number of bits in the key segment used for indexing into the |
| 18 | child array - the "child index". See Level Compression. |
| 19 | |
| 20 | Pos (tnode) |
| 21 | The position (in the key) of the key segment used for indexing into |
| 22 | the child array. See Path Compression. |
| 23 | |
| 24 | Path Compression / skipped bits |
| 25 | Any given tnode is linked to from the child array of its parent, using |
| 26 | a segment of the key specified by the parent's "pos" and "bits" |
| 27 | In certain cases, this tnode's own "pos" will not be immediately |
| 28 | adjacent to the parent (pos+bits), but there will be some bits |
| 29 | in the key skipped over because they represent a single path with no |
| 30 | deviations. These "skipped bits" constitute Path Compression. |
| 31 | Note that the search algorithm will simply skip over these bits when |
| 32 | searching, making it necessary to save the keys in the leaves to |
| 33 | verify that they actually do match the key we are searching for. |
| 34 | |
| 35 | Level Compression / child arrays |
| 36 | the trie is kept level balanced moving, under certain conditions, the |
| 37 | children of a full child (see "full_children") up one level, so that |
| 38 | instead of a pure binary tree, each internal node ("tnode") may |
| 39 | contain an arbitrarily large array of links to several children. |
| 40 | Conversely, a tnode with a mostly empty child array (see empty_children) |
| 41 | may be "halved", having some of its children moved downwards one level, |
| 42 | in order to avoid ever-increasing child arrays. |
| 43 | |
| 44 | empty_children |
| 45 | the number of positions in the child array of a given tnode that are |
| 46 | NULL. |
| 47 | |
| 48 | full_children |
| 49 | the number of children of a given tnode that aren't path compressed. |
| 50 | (in other words, they aren't NULL or leaves and their "pos" is equal |
| 51 | to this tnode's "pos"+"bits"). |
| 52 | |
| 53 | (The word "full" here is used more in the sense of "complete" than |
| 54 | as the opposite of "empty", which might be a tad confusing.) |
| 55 | |
| 56 | Comments |
| 57 | --------- |
| 58 | |
| 59 | We have tried to keep the structure of the code as close to fib_hash as |
| 60 | possible to allow verification and help up reviewing. |
| 61 | |
| 62 | fib_find_node() |
| 63 | A good start for understanding this code. This function implements a |
| 64 | straightforward trie lookup. |
| 65 | |
| 66 | fib_insert_node() |
| 67 | Inserts a new leaf node in the trie. This is bit more complicated than |
| 68 | fib_find_node(). Inserting a new node means we might have to run the |
| 69 | level compression algorithm on part of the trie. |
| 70 | |
| 71 | trie_leaf_remove() |
| 72 | Looks up a key, deletes it and runs the level compression algorithm. |
| 73 | |
| 74 | trie_rebalance() |
| 75 | The key function for the dynamic trie after any change in the trie |
| 76 | it is run to optimize and reorganize. Tt will walk the trie upwards |
| 77 | towards the root from a given tnode, doing a resize() at each step |
| 78 | to implement level compression. |
| 79 | |
| 80 | resize() |
| 81 | Analyzes a tnode and optimizes the child array size by either inflating |
Matt LaPlante | a2ffd27 | 2006-10-03 22:49:15 +0200 | [diff] [blame] | 82 | or shrinking it repeatedly until it fulfills the criteria for optimal |
Robert Olsson | b2f5710 | 2005-07-05 16:38:26 -0700 | [diff] [blame] | 83 | level compression. This part follows the original paper pretty closely |
| 84 | and there may be some room for experimentation here. |
| 85 | |
| 86 | inflate() |
| 87 | Doubles the size of the child array within a tnode. Used by resize(). |
| 88 | |
| 89 | halve() |
| 90 | Halves the size of the child array within a tnode - the inverse of |
| 91 | inflate(). Used by resize(); |
| 92 | |
| 93 | fn_trie_insert(), fn_trie_delete(), fn_trie_select_default() |
| 94 | The route manipulation functions. Should conform pretty closely to the |
| 95 | corresponding functions in fib_hash. |
| 96 | |
| 97 | fn_trie_flush() |
| 98 | This walks the full trie (using nextleaf()) and searches for empty |
| 99 | leaves which have to be removed. |
| 100 | |
| 101 | fn_trie_dump() |
| 102 | Dumps the routing table ordered by prefix length. This is somewhat |
| 103 | slower than the corresponding fib_hash function, as we have to walk the |
| 104 | entire trie for each prefix length. In comparison, fib_hash is organized |
| 105 | as one "zone"/hash per prefix length. |
| 106 | |
| 107 | Locking |
| 108 | ------- |
| 109 | |
| 110 | fib_lock is used for an RW-lock in the same way that this is done in fib_hash. |
| 111 | However, the functions are somewhat separated for other possible locking |
| 112 | scenarios. It might conceivably be possible to run trie_rebalance via RCU |
| 113 | to avoid read_lock in the fn_trie_lookup() function. |
| 114 | |
| 115 | Main lookup mechanism |
| 116 | --------------------- |
| 117 | fn_trie_lookup() is the main lookup function. |
| 118 | |
| 119 | The lookup is in its simplest form just like fib_find_node(). We descend the |
| 120 | trie, key segment by key segment, until we find a leaf. check_leaf() does |
| 121 | the fib_semantic_match in the leaf's sorted prefix hlist. |
| 122 | |
| 123 | If we find a match, we are done. |
| 124 | |
| 125 | If we don't find a match, we enter prefix matching mode. The prefix length, |
| 126 | starting out at the same as the key length, is reduced one step at a time, |
| 127 | and we backtrack upwards through the trie trying to find a longest matching |
| 128 | prefix. The goal is always to reach a leaf and get a positive result from the |
| 129 | fib_semantic_match mechanism. |
| 130 | |
| 131 | Inside each tnode, the search for longest matching prefix consists of searching |
| 132 | through the child array, chopping off (zeroing) the least significant "1" of |
| 133 | the child index until we find a match or the child index consists of nothing but |
| 134 | zeros. |
| 135 | |
| 136 | At this point we backtrack (t->stats.backtrack++) up the trie, continuing to |
| 137 | chop off part of the key in order to find the longest matching prefix. |
| 138 | |
| 139 | At this point we will repeatedly descend subtries to look for a match, and there |
| 140 | are some optimizations available that can provide us with "shortcuts" to avoid |
| 141 | descending into dead ends. Look for "HL_OPTIMIZE" sections in the code. |
| 142 | |
| 143 | To alleviate any doubts about the correctness of the route selection process, |
| 144 | a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which |
| 145 | gives userland access to fib_lookup(). |