NOTES ON OPTIMIZING DICTIONARIES
================================


Principal Use Cases for Dictionaries
------------------------------------

Passing keyword arguments
    Typically, one read and one write for 1 to 3 elements.
    Occurs frequently in normal Python code.

Class method lookup
    Dictionaries vary in size with 8 to 16 elements being common.
    Usually written once with many lookups.
    When base classes are used, there are many failed lookups
    followed by a lookup in a base class.

Instance attribute lookup and Global variables
    Dictionaries vary in size.  4 to 10 elements are common.
    Both reads and writes are common.

Builtins
    Frequent reads.  Almost never written.
    Size: 126 interned strings (as of Py2.3b1).
    A few keys are accessed much more frequently than others.

Uniquification
    Dictionaries of any size.  Bulk of work is in creation.
    Repeated writes to a smaller set of keys.
    Single read of each key.

    * Removing duplicates from a sequence.
        dict.fromkeys(seqn).keys()
    * Counting elements in a sequence.
        for e in seqn: d[e] = d.get(e, 0) + 1
    * Accumulating items in a dictionary of lists.
        for k, v in itemseqn: d.setdefault(k, []).append(v)

Membership Testing
    Dictionaries of any size.  Created once and then rarely changed.
    Single write to each key.
    Many calls to __contains__() or has_key().
    Similar access patterns occur with replacement dictionaries
    such as with the % formatting operator.
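
    For example (the names d and k are illustrative only):

    * Testing membership.
        if d.has_key(k): ...            # or:  if k in d: ...
    * Filling a template from a replacement dictionary.
        '%(user)s bought %(count)d items' % d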
| 45 | |
Raymond Hettinger | 258dfeb | 2003-05-04 21:25:19 +0000 | [diff] [blame] | 46 | Dynamic Mappings |
| 47 | Characterized by deletions interspersed with adds and replacments. |
| 48 | Performance benefits greatly from the re-use of dummy entries. |


Data Layout (assuming a 32-bit box with 64 bytes per cache line)
----------------------------------------------------------------

Small dicts (8 entries) are attached to the dictobject structure
and the whole group nearly fills two consecutive cache lines.

Larger dicts use the first half of the dictobject structure (one cache
line) and a separate, contiguous block of entries (at 12 bytes each
for a total of 5.333 entries per cache line).
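
Those numbers can be checked directly (a sketch assuming 4-byte hashes
and pointers, as on the 32-bit box above):

    entry_size = 4 + 4 + 4                 # me_hash, me_key, me_value
    cache_line = 64
    print cache_line / float(entry_size)   # 5.333 entries per line
    print 8 * entry_size / 64.0            # 1.5 lines per 8 entries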


Tunable Dictionary Parameters
-----------------------------

* PyDict_MINSIZE.  Currently set to 8.
    Must be a power of two.  New dicts have to zero-out every cell.
    Each additional 8 entries consumes 1.5 cache lines.  Increasing it
    improves the sparseness of small dictionaries but costs time to
    read in the additional cache lines if they are not already in
    cache.  That case is common when keyword arguments are passed.

* Maximum dictionary load in PyDict_SetItem.  Currently set to 2/3.
    Increasing this ratio makes dictionaries more dense, resulting
    in more collisions.  Decreasing it improves sparseness at the
    expense of spreading entries over more cache lines and increasing
    total memory consumption.

    The load test occurs in highly time-sensitive code.  Efforts
    to make the test more complex (for example, varying the load
    for different sizes) have degraded performance.
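
    In outline, the test amounts to the following (a Python model of
    the C check, where fill counts active plus dummy slots and mask+1
    is the table size):

        def needs_resize(fill, mask):
            # True when the table is more than 2/3 full; written
            # with integer arithmetic so the hot path avoids a
            # division.
            return fill * 3 >= (mask + 1) * 2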

* Growth rate upon hitting maximum load.  Currently set to *2.
    Raising this to *4 results in half the number of resizes, less
    effort to resize, better sparseness for some (but not all) dict
    sizes, and potentially double the memory consumption depending
    on the size of the dictionary.  Setting it to *4 eliminates
    every other resize step.
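
    The resize behavior can be modeled directly (a sketch; it reuses
    the 2/3 load test and assumes every insertion adds a new key):

        def count_resizes(n_keys, growth):
            size, resizes = 8, 0            # PyDict_MINSIZE == 8
            for filled in range(1, n_keys + 1):
                if filled * 3 >= size * 2:  # 2/3 maximum load hit
                    size *= growth
                    resizes += 1
            return resizes

        # count_resizes(100000, 2) reports roughly twice as many
        # resizes as count_resizes(100000, 4).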

Tune-ups should be measured across a broad range of applications and
use cases.  A change to any parameter will help in some situations and
hurt in others.  The key is to find settings that help the most common
cases and do the least damage to the less common cases.  Results will
vary dramatically depending on the exact number of keys, whether the
keys are all strings, whether reads or writes dominate, and the exact
hash values of the keys (some sets of values have fewer collisions than
others).  Any one test or benchmark is likely to prove misleading.

While making a dictionary more sparse reduces collisions, it impairs
iteration and key listing.  Those methods loop over every potential
entry.  Doubling the size of a dictionary results in twice as many
non-overlapping memory accesses for keys(), items(), values(),
__iter__(), iterkeys(), iteritems(), itervalues(), and update().


Results of Cache Locality Experiments
-------------------------------------

When an entry is retrieved from memory, 4.333 adjacent entries are also
retrieved into a cache line.  Since accessing items in cache is *much*
cheaper than a cache miss, an enticing idea is to probe the adjacent
entries as a first step in collision resolution.  Unfortunately, the
introduction of any regularity into collision searches results in more
collisions than the current random chaining approach.

Exploiting cache locality at the expense of additional collisions fails
to pay off when the entries are already loaded in cache (the expense
is paid with no compensating benefit).  This occurs in small dictionaries
where the whole dictionary fits into a pair of cache lines.  It also
occurs frequently in large dictionaries which have a common access pattern
where some keys are accessed much more frequently than others.  The
more popular entries *and* their collision chains tend to remain in cache.

To exploit cache locality, change the collision resolution section
in lookdict() and lookdict_string().  Set i^=1 at the top of the
loop and move the update i = (i << 2) + i + perturb + 1 into an
unrolled version of the loop.
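
As a Python model of the two probe orders (the real code is in C in
lookdict(); PERTURB_SHIFT is 5 in dictobject.c, and h is assumed to
be a non-negative hash value):

    def current_probes(h, mask, PERTURB_SHIFT=5):
        # Current random chaining:  i <- 5*i + perturb + 1
        i, perturb = h & mask, h
        while 1:
            yield i & mask
            i = (i << 2) + i + perturb + 1
            perturb >>= PERTURB_SHIFT

    def paired_probes(h, mask, PERTURB_SHIFT=5):
        # Proposed variant:  probe the adjacent slot (i ^ 1) before
        # each random jump so that every jump inspects a pair of
        # entries likely to share a cache line.
        i, perturb = h & mask, h
        while 1:
            yield i & mask
            yield (i ^ 1) & mask
            i = (i << 2) + i + perturb + 1
            perturb >>= PERTURB_SHIFT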

This optimization strategy can be leveraged in several ways:

* If the dictionary is kept sparse (through the tunable parameters),
  then the occurrence of additional collisions is lessened.

* If lookdict() and lookdict_string() are specialized for small dicts
  and for large dicts, then the versions for large dicts can be given
  an alternate search strategy without increasing collisions in small
  dicts which already have the maximum benefit of cache locality.

* If the use case for a dictionary is known to have a random key
  access pattern (as opposed to a more common pattern with a Zipf's
  law distribution), then there will be more benefit for large
  dictionaries because any given key is no more likely than another
  to already be in cache.


Optimizing the Search of Small Dictionaries
-------------------------------------------

If lookdict() and lookdict_string() are specialized for smaller
dictionaries, then a custom search approach can be implemented that
exploits the small search space and cache locality.

* The simplest example is a linear search of contiguous entries (see
  the sketch after this list).  This is easy to implement, guaranteed
  to terminate rapidly, never searches the same entry twice, and
  precludes the need to check for dummy entries.

* A more advanced example is a self-organizing search so that the most
  frequently accessed entries get probed first.  The organization
  adapts if the access pattern changes over time.  Treaps are ideally
  suited for self-organization with the most common entries at the
  top of the heap and a rapid binary search pattern.  Most probes and
  results are located near the top of the tree, allowing them to fit
  in one or two cache lines.

* Also, small dictionaries may be made more dense, perhaps filling all
  eight cells to take the maximum advantage of two cache lines.
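
A Python model of the linear-search idea from the first bullet (the
real version would be a small-dict specialization of lookdict() in C;
the names here are illustrative):

    def small_lookup(entries, key, h):
        # entries is a dense list of (hash, key, value) slots; a
        # linear scan never revisits a slot and needs no dummy test.
        for eh, ek, ev in entries:
            if ek is key or (eh == h and ek == key):
                return ev
        raise KeyError(key)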


Strategy Pattern
----------------

Consider allowing the user to set the tunable parameters or to select a
particular search method.  Since some dictionary use cases have known
sizes and access patterns, the user may be able to provide useful hints.

1) For example, if membership testing or lookups dominate runtime and
   memory is not at a premium, the user may benefit from setting the
   maximum load ratio at 5% or 10% instead of the usual 66.7%.  This
   will sharply curtail the number of collisions but will increase
   iteration time.

2) Dictionary creation time can be shortened in cases where the ultimate
   size of the dictionary is known in advance.  The dictionary can be
   pre-sized so that no resize operations are required during creation.
   Not only does this save resizes, but key insertion goes more quickly
   because the first half of the keys are inserted into a sparser
   environment than would otherwise be the case.  The preconditions for
   this strategy arise whenever a dictionary is created from a key or
   item sequence of known length.
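
   A sketch of the pre-size computation, reusing the 2/3 load test and
   PyDict_MINSIZE from above (presized_slots is a hypothetical helper,
   not an existing API):

       def presized_slots(n):
           # Smallest power-of-two table that accepts n keys without
           # ever tripping the 2/3 maximum-load test.
           size = 8                        # PyDict_MINSIZE
           while n * 3 >= size * 2:
               size <<= 1
           return size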

3) If the key space is large and the access pattern is known to be
   random, then search strategies exploiting cache locality can be
   fruitful.  The preconditions for this strategy arise in simulations
   and numerical analysis.

4) If the keys are fixed and the access pattern strongly favors some of
   the keys, then the entries can be stored contiguously and accessed
   with a linear search or treap.  This exploits knowledge of the data,
   cache locality, and a simplified search routine.  It also eliminates
   the need to test for dummy entries on each probe.  The preconditions
   for this strategy arise in symbol tables and in the builtin dictionary.


Readonly Dictionaries
---------------------

Some dictionary use cases pass through a build stage and then move to a
more heavily exercised lookup stage with no further changes to the
dictionary.

An idea that emerged on python-dev is to be able to convert a dictionary
to a read-only state.  This can help prevent programming errors and also
provide knowledge that can be exploited for lookup optimization.

The dictionary can be immediately rebuilt (eliminating dummy entries),
resized (to an appropriate level of sparseness), and the keys can be
jostled (to minimize collisions).  The lookdict() routine can then
eliminate the test for dummy entries (saving about 1/4 of the time
spent in the collision resolution loop).
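
At the Python level, the rebuild and resize steps can already be
approximated today (a sketch; the read-only state itself remains
hypothetical):

    d = dict(d)     # fresh table: dummy entries dropped, size recomputed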

An additional possibility is to insert links into the empty spaces
so that dictionary iteration can proceed in len(d) steps instead of
(mp->mask + 1) steps.
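
A toy model of that idea (pure illustration, not CPython code):

    def iterate_with_links(table, links):
        # table[i] is None for an empty slot; for each empty slot,
        # links[i] holds the index of the next used slot (or
        # len(table)), so a full pass costs len(d) entry visits plus
        # one jump per run of holes instead of mask+1 inspections.
        i = 0
        while i < len(table):
            if table[i] is None:
                i = links[i]        # skip the whole empty run at once
            else:
                yield table[i]
                i += 1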