Benjamin Peterson | 7d95e40 | 2012-04-23 11:24:50 -0400 | [diff] [blame^] | 1 | NOTES ON DICTIONARIES |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 2 | ================================ |
| 3 | |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 4 | Principal Use Cases for Dictionaries |
| 5 | ------------------------------------ |
| 6 | |
| 7 | Passing keyword arguments |
| 8 | Typically, one read and one write for 1 to 3 elements. |
| 9 | Occurs frequently in normal python code. |
| 10 | |
| 11 | Class method lookup |
| 12 | Dictionaries vary in size with 8 to 16 elements being common. |
| 13 | Usually written once with many lookups. |
| 14 | When base classes are used, there are many failed lookups |
| 15 | followed by a lookup in a base class. |
| 16 | |
| 17 | Instance attribute lookup and Global variables |
| 18 | Dictionaries vary in size. 4 to 10 elements are common. |
| 19 | Both reads and writes are common. |
| 20 | |
| 21 | Builtins |
| 22 | Frequent reads. Almost never written. |
Benjamin Peterson | 7d95e40 | 2012-04-23 11:24:50 -0400 | [diff] [blame^] | 23 | About 150 interned strings (as of Py3.3). |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 24 | A few keys are accessed much more frequently than others. |
| 25 | |
| 26 | Uniquification |
| 27 | Dictionaries of any size. Bulk of work is in creation. |
| 28 | Repeated writes to a smaller set of keys. |
| 29 | Single read of each key. |
Raymond Hettinger | e509b2a | 2003-05-28 14:10:46 +0000 | [diff] [blame] | 30 | Some use cases have two consecutive accesses to the same key. |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 31 | |
| 32 | * Removing duplicates from a sequence. |
| 33 | dict.fromkeys(seqn).keys() |
Raymond Hettinger | e509b2a | 2003-05-28 14:10:46 +0000 | [diff] [blame] | 34 | |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 35 | * Counting elements in a sequence. |
Raymond Hettinger | e509b2a | 2003-05-28 14:10:46 +0000 | [diff] [blame] | 36 | for e in seqn: |
| 37 | d[e] = d.get(e,0) + 1 |
| 38 | |
| 39 | * Accumulating references in a dictionary of lists: |
| 40 | |
| 41 | for pagenumber, page in enumerate(pages): |
| 42 | for word in page: |
| 43 | d.setdefault(word, []).append(pagenumber) |
| 44 | |
| 45 | Note, the second example is a use case characterized by a get and set |
Thomas Wouters | 902d6eb | 2007-01-09 23:18:33 +0000 | [diff] [blame] | 46 | to the same key. There are similar use cases with a __contains__ |
Raymond Hettinger | e509b2a | 2003-05-28 14:10:46 +0000 | [diff] [blame] | 47 | followed by a get, set, or del to the same key. Part of the |
| 48 | justification for d.setdefault is combining the two lookups into one. |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 49 | |
| 50 | Membership Testing |
| 51 | Dictionaries of any size. Created once and then rarely changes. |
| 52 | Single write to each key. |
| 53 | Many calls to __contains__() or has_key(). |
| 54 | Similar access patterns occur with replacement dictionaries |
| 55 | such as with the % formatting operator. |
| 56 | |
Raymond Hettinger | 258dfeb | 2003-05-04 21:25:19 +0000 | [diff] [blame] | 57 | Dynamic Mappings |
Raymond Hettinger | e509b2a | 2003-05-28 14:10:46 +0000 | [diff] [blame] | 58 | Characterized by deletions interspersed with adds and replacements. |
Raymond Hettinger | 258dfeb | 2003-05-04 21:25:19 +0000 | [diff] [blame] | 59 | Performance benefits greatly from the re-use of dummy entries. |
| 60 | |
Benjamin Peterson | 7d95e40 | 2012-04-23 11:24:50 -0400 | [diff] [blame^] | 61 | Data Layout |
| 62 | ----------- |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 63 | |
Benjamin Peterson | 7d95e40 | 2012-04-23 11:24:50 -0400 | [diff] [blame^] | 64 | Dictionaries are composed of 3 components: |
| 65 | The dictobject struct itself |
| 66 | A dict-keys object (keys & hashes) |
| 67 | A values array |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 68 | |
| 69 | |
| 70 | Tunable Dictionary Parameters |
| 71 | ----------------------------- |
| 72 | |
Benjamin Peterson | 7d95e40 | 2012-04-23 11:24:50 -0400 | [diff] [blame^] | 73 | * PyDict_STARTSIZE. Starting size of dict (unless an instance dict). |
| 74 | Currently set to 8. Must be a power of two. |
| 75 | New dicts have to zero-out every cell. |
| 76 | Increasing improves the sparseness of small dictionaries but costs |
| 77 | time to read in the additional cache lines if they are not already |
| 78 | in cache. That case is common when keyword arguments are passed. |
| 79 | Prior to version 3.3, PyDict_MINSIZE was used as the starting size |
| 80 | of a new dict. |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 81 | |
Benjamin Peterson | 7d95e40 | 2012-04-23 11:24:50 -0400 | [diff] [blame^] | 82 | * PyDict_MINSIZE. Minimum size of a dict. |
| 83 | Currently set to 4 (to keep instance dicts small). |
| 84 | Must be a power of two. Prior to version 3.3, PyDict_MINSIZE was |
| 85 | set to 8. |
| 86 | |
| 87 | * USABLE_FRACTION. Maximum dictionary load in PyDict_SetItem. |
| 88 | Currently set to 2/3. Increasing this ratio makes dictionaries more |
| 89 | dense resulting in more collisions. Decreasing it improves sparseness |
| 90 | at the expense of spreading entries over more cache lines and at the |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 91 | cost of total memory consumed. |
| 92 | |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 93 | * Growth rate upon hitting maximum load. Currently set to *2. |
Benjamin Peterson | 7d95e40 | 2012-04-23 11:24:50 -0400 | [diff] [blame^] | 94 | Raising this to *4 results in half the number of resizes, less |
| 95 | effort to resize, better sparseness for some (but not all dict sizes), |
| 96 | and potentially doubles memory consumption depending on the size of |
| 97 | the dictionary. Setting to *4 eliminates every other resize step. |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 98 | |
Thomas Wouters | cf297e4 | 2007-02-23 15:07:44 +0000 | [diff] [blame] | 99 | * Maximum sparseness (minimum dictionary load). What percentage |
| 100 | of entries can be unused before the dictionary shrinks to |
| 101 | free up memory and speed up iteration? (The current CPython |
| 102 | code does not represent this parameter directly.) |
| 103 | |
| 104 | * Shrinkage rate upon exceeding maximum sparseness. The current |
| 105 | CPython code never even checks sparseness when deleting a |
| 106 | key. When a new key is added, it resizes based on the number |
| 107 | of active keys, so that the addition may trigger shrinkage |
| 108 | rather than growth. |
| 109 | |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 110 | Tune-ups should be measured across a broad range of applications and |
| 111 | use cases. A change to any parameter will help in some situations and |
| 112 | hurt in others. The key is to find settings that help the most common |
| 113 | cases and do the least damage to the less common cases. Results will |
| 114 | vary dramatically depending on the exact number of keys, whether the |
| 115 | keys are all strings, whether reads or writes dominate, the exact |
| 116 | hash values of the keys (some sets of values have fewer collisions than |
| 117 | others). Any one test or benchmark is likely to prove misleading. |
| 118 | |
Raymond Hettinger | 258dfeb | 2003-05-04 21:25:19 +0000 | [diff] [blame] | 119 | While making a dictionary more sparse reduces collisions, it impairs |
| 120 | iteration and key listing. Those methods loop over every potential |
| 121 | entry. Doubling the size of dictionary results in twice as many |
| 122 | non-overlapping memory accesses for keys(), items(), values(), |
| 123 | __iter__(), iterkeys(), iteritems(), itervalues(), and update(). |
Raymond Hettinger | 9d5c443 | 2004-03-15 15:52:22 +0000 | [diff] [blame] | 124 | Also, every dictionary iterates at least twice, once for the memset() |
| 125 | when it is created and once by dealloc(). |
Raymond Hettinger | 258dfeb | 2003-05-04 21:25:19 +0000 | [diff] [blame] | 126 | |
Benjamin Peterson | 7d95e40 | 2012-04-23 11:24:50 -0400 | [diff] [blame^] | 127 | Dictionary operations involving only a single key can be O(1) unless |
| 128 | resizing is possible. By checking for a resize only when the |
Thomas Wouters | cf297e4 | 2007-02-23 15:07:44 +0000 | [diff] [blame] | 129 | dictionary can grow (and may *require* resizing), other operations |
| 130 | remain O(1), and the odds of resize thrashing or memory fragmentation |
| 131 | are reduced. In particular, an algorithm that empties a dictionary |
| 132 | by repeatedly invoking .pop will see no resizing, which might |
| 133 | not be necessary at all because the dictionary is eventually |
| 134 | discarded entirely. |
| 135 | |
Benjamin Peterson | 7d95e40 | 2012-04-23 11:24:50 -0400 | [diff] [blame^] | 136 | The key differences between this implementation and earlier versions are: |
| 137 | 1. The table can be split into two parts, the keys and the values. |
| 138 | |
| 139 | 2. There is an additional key-value combination: (key, NULL). |
| 140 | Unlike (<dummy>, NULL) which represents a deleted value, (key, NULL) |
| 141 | represented a yet to be inserted value. This combination can only occur |
| 142 | when the table is split. |
| 143 | |
| 144 | 3. No small table embedded in the dict, |
| 145 | as this would make sharing of key-tables impossible. |
| 146 | |
| 147 | |
| 148 | These changes have the following consequences. |
| 149 | 1. General dictionaries are slightly larger. |
| 150 | |
| 151 | 2. All object dictionaries of a single class can share a single key-table, |
| 152 | saving about 60% memory for such cases. |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 153 | |
| 154 | Results of Cache Locality Experiments |
Benjamin Peterson | 7d95e40 | 2012-04-23 11:24:50 -0400 | [diff] [blame^] | 155 | -------------------------------------- |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 156 | |
Benjamin Peterson | 7d95e40 | 2012-04-23 11:24:50 -0400 | [diff] [blame^] | 157 | Experiments on an earlier design of dictionary, in which all tables were |
| 158 | combined, showed the following: |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 159 | |
Benjamin Peterson | 7d95e40 | 2012-04-23 11:24:50 -0400 | [diff] [blame^] | 160 | When an entry is retrieved from memory, several adjacent entries are also |
| 161 | retrieved into a cache line. Since accessing items in cache is *much* |
| 162 | cheaper than a cache miss, an enticing idea is to probe the adjacent |
| 163 | entries as a first step in collision resolution. Unfortunately, the |
| 164 | introduction of any regularity into collision searches results in more |
| 165 | collisions than the current random chaining approach. |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 166 | |
Benjamin Peterson | 7d95e40 | 2012-04-23 11:24:50 -0400 | [diff] [blame^] | 167 | Exploiting cache locality at the expense of additional collisions fails |
| 168 | to payoff when the entries are already loaded in cache (the expense |
| 169 | is paid with no compensating benefit). This occurs in small dictionaries |
| 170 | where the whole dictionary fits into a pair of cache lines. It also |
| 171 | occurs frequently in large dictionaries which have a common access pattern |
| 172 | where some keys are accessed much more frequently than others. The |
| 173 | more popular entries *and* their collision chains tend to remain in cache. |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 174 | |
Benjamin Peterson | 7d95e40 | 2012-04-23 11:24:50 -0400 | [diff] [blame^] | 175 | To exploit cache locality, change the collision resolution section |
| 176 | in lookdict() and lookdict_string(). Set i^=1 at the top of the |
| 177 | loop and move the i = (i << 2) + i + perturb + 1 to an unrolled |
| 178 | version of the loop. |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 179 | |
Benjamin Peterson | 7d95e40 | 2012-04-23 11:24:50 -0400 | [diff] [blame^] | 180 | For split tables, the above will apply to the keys, but the value will |
| 181 | always be in a different cache line from the key. |
Raymond Hettinger | 5466296 | 2003-05-02 20:11:29 +0000 | [diff] [blame] | 182 | |
| 183 | |