| Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 1 | NOTES ON OPTIMIZING DICTIONARIES | 
 | 2 | ================================ | 
 | 3 |  | 
 | 4 |  | 
 | 5 | Principal Use Cases for Dictionaries | 
 | 6 | ------------------------------------ | 
 | 7 |  | 
 | 8 | Passing keyword arguments | 
 | 9 |     Typically, one read and one write for 1 to 3 elements. | 
 | 10 |     Occurs frequently in normal python code. | 
 | 11 |  | 
 | 12 | Class method lookup | 
 | 13 |     Dictionaries vary in size with 8 to 16 elements being common. | 
 | 14 |     Usually written once with many lookups. | 
 | 15 |     When base classes are used, there are many failed lookups | 
 | 16 |         followed by a lookup in a base class. | 
 | 17 |  | 
 | 18 | Instance attribute lookup and Global variables | 
 | 19 |     Dictionaries vary in size.  4 to 10 elements are common. | 
 | 20 |     Both reads and writes are common. | 
 | 21 |  | 
 | 22 | Builtins | 
 | 23 |     Frequent reads.  Almost never written. | 
 | 24 |     Size 126 interned strings (as of Py2.3b1). | 
 | 25 |     A few keys are accessed much more frequently than others. | 
 | 26 |  | 
 | 27 | Uniquification | 
 | 28 |     Dictionaries of any size.  Bulk of work is in creation. | 
 | 29 |     Repeated writes to a smaller set of keys. | 
 | 30 |     Single read of each key. | 
| Raymond Hettinger | e509b2a | 2003-05-28 14:10:46 +0000 | [diff] [blame] | 31 |     Some use cases have two consecutive accesses to the same key. | 
| Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 32 |  | 
 | 33 |     * Removing duplicates from a sequence. | 
 | 34 |         dict.fromkeys(seqn).keys() | 
| Raymond Hettinger | e509b2a | 2003-05-28 14:10:46 +0000 | [diff] [blame] | 35 |  | 
| Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 36 |     * Counting elements in a sequence. | 
| Raymond Hettinger | e509b2a | 2003-05-28 14:10:46 +0000 | [diff] [blame] | 37 |         for e in seqn: | 
 | 38 |           d[e] = d.get(e,0) + 1 | 
 | 39 |  | 
 | 40 |     * Accumulating references in a dictionary of lists: | 
 | 41 |  | 
 | 42 |         for pagenumber, page in enumerate(pages): | 
 | 43 |           for word in page: | 
 | 44 |             d.setdefault(word, []).append(pagenumber) | 
 | 45 |  | 
 | 46 |     Note, the second example is a use case characterized by a get and set | 
| Thomas Wouters | 902d6eb | 2007-01-09 23:18:33 +0000 | [diff] [blame] | 47 |     to the same key.  There are similar use cases with a __contains__ | 
| Raymond Hettinger | e509b2a | 2003-05-28 14:10:46 +0000 | [diff] [blame] | 48 |     followed by a get, set, or del to the same key.  Part of the | 
 | 49 |     justification for d.setdefault is combining the two lookups into one. | 
| Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 50 |  | 
 | 51 | Membership Testing | 
 | 52 |     Dictionaries of any size.  Created once and then rarely changes. | 
 | 53 |     Single write to each key. | 
 | 54 |     Many calls to __contains__() or has_key(). | 
 | 55 |     Similar access patterns occur with replacement dictionaries | 
 | 56 |         such as with the % formatting operator. | 
 | 57 |  | 
| Raymond Hettinger | 258dfeb | 2003-05-04 21:25:19 +0000 | [diff] [blame] | 58 | Dynamic Mappings | 
| Raymond Hettinger | e509b2a | 2003-05-28 14:10:46 +0000 | [diff] [blame] | 59 |     Characterized by deletions interspersed with adds and replacements. | 
| Raymond Hettinger | 258dfeb | 2003-05-04 21:25:19 +0000 | [diff] [blame] | 60 |     Performance benefits greatly from the re-use of dummy entries. | 
 | 61 |  | 
| Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 62 |  | 
 | 63 | Data Layout (assuming a 32-bit box with 64 bytes per cache line) | 
 | 64 | ---------------------------------------------------------------- | 
 | 65 |  | 
 | 66 | Smalldicts (8 entries) are attached to the dictobject structure | 
 | 67 | and the whole group nearly fills two consecutive cache lines. | 
 | 68 |  | 
 | 69 | Larger dicts use the first half of the dictobject structure (one cache | 
 | 70 | line) and a separate, continuous block of entries (at 12 bytes each | 
 | 71 | for a total of 5.333 entries per cache line). | 
 | 72 |  | 
 | 73 |  | 
 | 74 | Tunable Dictionary Parameters | 
 | 75 | ----------------------------- | 
 | 76 |  | 
 | 77 | * PyDict_MINSIZE.  Currently set to 8. | 
 | 78 |     Must be a power of two.  New dicts have to zero-out every cell. | 
 | 79 |     Each additional 8 consumes 1.5 cache lines.  Increasing improves | 
 | 80 |     the sparseness of small dictionaries but costs time to read in | 
 | 81 |     the additional cache lines if they are not already in cache. | 
 | 82 |     That case is common when keyword arguments are passed. | 
 | 83 |  | 
 | 84 | * Maximum dictionary load in PyDict_SetItem.  Currently set to 2/3. | 
 | 85 |     Increasing this ratio makes dictionaries more dense resulting | 
 | 86 |     in more collisions.  Decreasing it improves sparseness at the | 
 | 87 |     expense of spreading entries over more cache lines and at the | 
 | 88 |     cost of total memory consumed. | 
 | 89 |  | 
 | 90 |     The load test occurs in highly time sensitive code.  Efforts | 
 | 91 |     to make the test more complex (for example, varying the load | 
 | 92 |     for different sizes) have degraded performance. | 
 | 93 |  | 
 | 94 | * Growth rate upon hitting maximum load.  Currently set to *2. | 
 | 95 |     Raising this to *4 results in half the number of resizes, | 
 | 96 |     less effort to resize, better sparseness for some (but not | 
| Raymond Hettinger | 9d5c443 | 2004-03-15 15:52:22 +0000 | [diff] [blame] | 97 |     all dict sizes), and potentially doubles memory consumption | 
| Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 98 |     depending on the size of the dictionary.  Setting to *4 | 
 | 99 |     eliminates every other resize step. | 
 | 100 |  | 
| Thomas Wouters | cf297e4 | 2007-02-23 15:07:44 +0000 | [diff] [blame] | 101 | * Maximum sparseness (minimum dictionary load).  What percentage | 
 | 102 |     of entries can be unused before the dictionary shrinks to | 
 | 103 |     free up memory and speed up iteration?  (The current CPython | 
 | 104 |     code does not represent this parameter directly.) | 
 | 105 |  | 
 | 106 | * Shrinkage rate upon exceeding maximum sparseness.  The current | 
 | 107 |     CPython code never even checks sparseness when deleting a | 
 | 108 |     key.  When a new key is added, it resizes based on the number | 
 | 109 |     of active keys, so that the addition may trigger shrinkage | 
 | 110 |     rather than growth. | 
 | 111 |  | 
| Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 112 | Tune-ups should be measured across a broad range of applications and | 
 | 113 | use cases.  A change to any parameter will help in some situations and | 
 | 114 | hurt in others.  The key is to find settings that help the most common | 
 | 115 | cases and do the least damage to the less common cases.  Results will | 
 | 116 | vary dramatically depending on the exact number of keys, whether the | 
 | 117 | keys are all strings, whether reads or writes dominate, the exact | 
 | 118 | hash values of the keys (some sets of values have fewer collisions than | 
 | 119 | others).  Any one test or benchmark is likely to prove misleading. | 
 | 120 |  | 
| Raymond Hettinger | 258dfeb | 2003-05-04 21:25:19 +0000 | [diff] [blame] | 121 | While making a dictionary more sparse reduces collisions, it impairs | 
 | 122 | iteration and key listing.  Those methods loop over every potential | 
 | 123 | entry.  Doubling the size of dictionary results in twice as many | 
 | 124 | non-overlapping memory accesses for keys(), items(), values(), | 
 | 125 | __iter__(), iterkeys(), iteritems(), itervalues(), and update(). | 
| Raymond Hettinger | 9d5c443 | 2004-03-15 15:52:22 +0000 | [diff] [blame] | 126 | Also, every dictionary iterates at least twice, once for the memset() | 
 | 127 | when it is created and once by dealloc(). | 
| Raymond Hettinger | 258dfeb | 2003-05-04 21:25:19 +0000 | [diff] [blame] | 128 |  | 
| Thomas Wouters | cf297e4 | 2007-02-23 15:07:44 +0000 | [diff] [blame] | 129 | Dictionary operations involving only a single key can be O(1) unless  | 
 | 130 | resizing is possible.  By checking for a resize only when the  | 
 | 131 | dictionary can grow (and may *require* resizing), other operations | 
 | 132 | remain O(1), and the odds of resize thrashing or memory fragmentation | 
 | 133 | are reduced. In particular, an algorithm that empties a dictionary | 
 | 134 | by repeatedly invoking .pop will see no resizing, which might | 
 | 135 | not be necessary at all because the dictionary is eventually | 
 | 136 | discarded entirely. | 
 | 137 |  | 
| Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 138 |  | 
 | 139 | Results of Cache Locality Experiments | 
 | 140 | ------------------------------------- | 
 | 141 |  | 
 | 142 | When an entry is retrieved from memory, 4.333 adjacent entries are also | 
 | 143 | retrieved into a cache line.  Since accessing items in cache is *much* | 
 | 144 | cheaper than a cache miss, an enticing idea is to probe the adjacent | 
 | 145 | entries as a first step in collision resolution.  Unfortunately, the | 
 | 146 | introduction of any regularity into collision searches results in more | 
 | 147 | collisions than the current random chaining approach. | 
 | 148 |  | 
 | 149 | Exploiting cache locality at the expense of additional collisions fails | 
 | 150 | to payoff when the entries are already loaded in cache (the expense | 
 | 151 | is paid with no compensating benefit).  This occurs in small dictionaries | 
 | 152 | where the whole dictionary fits into a pair of cache lines.  It also | 
 | 153 | occurs frequently in large dictionaries which have a common access pattern | 
 | 154 | where some keys are accessed much more frequently than others.  The | 
 | 155 | more popular entries *and* their collision chains tend to remain in cache. | 
 | 156 |  | 
 | 157 | To exploit cache locality, change the collision resolution section | 
 | 158 | in lookdict() and lookdict_string().  Set i^=1 at the top of the | 
 | 159 | loop and move the  i = (i << 2) + i + perturb + 1 to an unrolled | 
 | 160 | version of the loop. | 
 | 161 |  | 
 | 162 | This optimization strategy can be leveraged in several ways: | 
 | 163 |  | 
 | 164 | * If the dictionary is kept sparse (through the tunable parameters), | 
 | 165 | then the occurrence of additional collisions is lessened. | 
 | 166 |  | 
 | 167 | * If lookdict() and lookdict_string() are specialized for small dicts | 
 | 168 | and for largedicts, then the versions for large_dicts can be given | 
 | 169 | an alternate search strategy without increasing collisions in small dicts | 
 | 170 | which already have the maximum benefit of cache locality. | 
 | 171 |  | 
 | 172 | * If the use case for a dictionary is known to have a random key | 
 | 173 | access pattern (as opposed to a more common pattern with a Zipf's law | 
 | 174 | distribution), then there will be more benefit for large dictionaries | 
 | 175 | because any given key is no more likely than another to already be | 
 | 176 | in cache. | 
 | 177 |  | 
| Raymond Hettinger | e509b2a | 2003-05-28 14:10:46 +0000 | [diff] [blame] | 178 | * In use cases with paired accesses to the same key, the second access | 
 | 179 | is always in cache and gets no benefit from efforts to further improve | 
 | 180 | cache locality. | 
| Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 181 |  | 
 | 182 | Optimizing the Search of Small Dictionaries | 
 | 183 | ------------------------------------------- | 
 | 184 |  | 
 | 185 | If lookdict() and lookdict_string() are specialized for smaller dictionaries, | 
 | 186 | then a custom search approach can be implemented that exploits the small | 
 | 187 | search space and cache locality. | 
 | 188 |  | 
 | 189 | * The simplest example is a linear search of contiguous entries.  This is | 
 | 190 |   simple to implement, guaranteed to terminate rapidly, never searches | 
 | 191 |   the same entry twice, and precludes the need to check for dummy entries. | 
 | 192 |  | 
 | 193 | * A more advanced example is a self-organizing search so that the most | 
 | 194 |   frequently accessed entries get probed first.  The organization | 
 | 195 |   adapts if the access pattern changes over time.  Treaps are ideally | 
 | 196 |   suited for self-organization with the most common entries at the | 
 | 197 |   top of the heap and a rapid binary search pattern.  Most probes and | 
 | 198 |   results are all located at the top of the tree allowing them all to | 
 | 199 |   be located in one or two cache lines. | 
 | 200 |  | 
 | 201 | * Also, small dictionaries may be made more dense, perhaps filling all | 
 | 202 |   eight cells to take the maximum advantage of two cache lines. | 
 | 203 |  | 
 | 204 |  | 
 | 205 | Strategy Pattern | 
 | 206 | ---------------- | 
 | 207 |  | 
 | 208 | Consider allowing the user to set the tunable parameters or to select a | 
 | 209 | particular search method.  Since some dictionary use cases have known | 
 | 210 | sizes and access patterns, the user may be able to provide useful hints. | 
 | 211 |  | 
 | 212 | 1) For example, if membership testing or lookups dominate runtime and memory | 
 | 213 |    is not at a premium, the user may benefit from setting the maximum load | 
 | 214 |    ratio at 5% or 10% instead of the usual 66.7%.  This will sharply | 
| Raymond Hettinger | 258dfeb | 2003-05-04 21:25:19 +0000 | [diff] [blame] | 215 |    curtail the number of collisions but will increase iteration time. | 
| Raymond Hettinger | 9d5c443 | 2004-03-15 15:52:22 +0000 | [diff] [blame] | 216 |    The builtin namespace is a prime example of a dictionary that can | 
 | 217 |    benefit from being highly sparse. | 
| Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 218 |  | 
 | 219 | 2) Dictionary creation time can be shortened in cases where the ultimate | 
 | 220 |    size of the dictionary is known in advance.  The dictionary can be | 
 | 221 |    pre-sized so that no resize operations are required during creation. | 
 | 222 |    Not only does this save resizes, but the key insertion will go | 
 | 223 |    more quickly because the first half of the keys will be inserted into | 
 | 224 |    a more sparse environment than before.  The preconditions for this | 
 | 225 |    strategy arise whenever a dictionary is created from a key or item | 
| Raymond Hettinger | 9d5c443 | 2004-03-15 15:52:22 +0000 | [diff] [blame] | 226 |    sequence and the number of *unique* keys is known. | 
| Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 227 |  | 
 | 228 | 3) If the key space is large and the access pattern is known to be random, | 
 | 229 |    then search strategies exploiting cache locality can be fruitful. | 
 | 230 |    The preconditions for this strategy arise in simulations and | 
 | 231 |    numerical analysis. | 
 | 232 |  | 
 | 233 | 4) If the keys are fixed and the access pattern strongly favors some of | 
 | 234 |    the keys, then the entries can be stored contiguously and accessed | 
 | 235 |    with a linear search or treap.  This exploits knowledge of the data, | 
 | 236 |    cache locality, and a simplified search routine.  It also eliminates | 
 | 237 |    the need to test for dummy entries on each probe.  The preconditions | 
 | 238 |    for this strategy arise in symbol tables and in the builtin dictionary. | 
| Raymond Hettinger | 4887a12 | 2003-05-05 21:31:51 +0000 | [diff] [blame] | 239 |  | 
 | 240 |  | 
 | 241 | Readonly Dictionaries | 
 | 242 | --------------------- | 
 | 243 | Some dictionary use cases pass through a build stage and then move to a | 
 | 244 | more heavily exercised lookup stage with no further changes to the | 
 | 245 | dictionary. | 
 | 246 |  | 
 | 247 | An idea that emerged on python-dev is to be able to convert a dictionary | 
 | 248 | to a read-only state.  This can help prevent programming errors and also | 
 | 249 | provide knowledge that can be exploited for lookup optimization. | 
 | 250 |  | 
 | 251 | The dictionary can be immediately rebuilt (eliminating dummy entries), | 
 | 252 | resized (to an appropriate level of sparseness), and the keys can be | 
 | 253 | jostled (to minimize collisions).  The lookdict() routine can then | 
 | 254 | eliminate the test for dummy entries (saving about 1/4 of the time | 
| Raymond Hettinger | 9d5c443 | 2004-03-15 15:52:22 +0000 | [diff] [blame] | 255 | spent in the collision resolution loop). | 
| Raymond Hettinger | 4887a12 | 2003-05-05 21:31:51 +0000 | [diff] [blame] | 256 |  | 
 | 257 | An additional possibility is to insert links into the empty spaces | 
 | 258 | so that dictionary iteration can proceed in len(d) steps instead of | 
| Raymond Hettinger | 9d5c443 | 2004-03-15 15:52:22 +0000 | [diff] [blame] | 259 | (mp->mask + 1) steps.  Alternatively, a separate tuple of keys can be | 
 | 260 | kept just for iteration. | 
| Raymond Hettinger | e509b2a | 2003-05-28 14:10:46 +0000 | [diff] [blame] | 261 |  | 
 | 262 |  | 
 | 263 | Caching Lookups | 
 | 264 | --------------- | 
 | 265 | The idea is to exploit key access patterns by anticipating future lookups | 
| Thomas Wouters | 0e3f591 | 2006-08-11 14:57:12 +0000 | [diff] [blame] | 266 | based on previous lookups. | 
| Raymond Hettinger | e509b2a | 2003-05-28 14:10:46 +0000 | [diff] [blame] | 267 |  | 
 | 268 | The simplest incarnation is to save the most recently accessed entry. | 
 | 269 | This gives optimal performance for use cases where every get is followed | 
 | 270 | by a set or del to the same key. |