blob: a38b05208255e26e36587abf2bd21fde56b675ca [file] [log] [blame]
Benjamin Peterson7d95e402012-04-23 11:24:50 -04001NOTES ON DICTIONARIES
Raymond Hettinger54662962003-05-02 20:11:29 +00002================================
3
Raymond Hettinger54662962003-05-02 20:11:29 +00004Principal Use Cases for Dictionaries
5------------------------------------
6
7Passing keyword arguments
8 Typically, one read and one write for 1 to 3 elements.
9 Occurs frequently in normal python code.
10
11Class method lookup
12 Dictionaries vary in size with 8 to 16 elements being common.
13 Usually written once with many lookups.
14 When base classes are used, there are many failed lookups
15 followed by a lookup in a base class.
16
17Instance attribute lookup and Global variables
18 Dictionaries vary in size. 4 to 10 elements are common.
19 Both reads and writes are common.
20
21Builtins
22 Frequent reads. Almost never written.
Benjamin Peterson7d95e402012-04-23 11:24:50 -040023 About 150 interned strings (as of Py3.3).
Raymond Hettinger54662962003-05-02 20:11:29 +000024 A few keys are accessed much more frequently than others.
25
26Uniquification
27 Dictionaries of any size. Bulk of work is in creation.
28 Repeated writes to a smaller set of keys.
29 Single read of each key.
Raymond Hettingere509b2a2003-05-28 14:10:46 +000030 Some use cases have two consecutive accesses to the same key.
Raymond Hettinger54662962003-05-02 20:11:29 +000031
32 * Removing duplicates from a sequence.
33 dict.fromkeys(seqn).keys()
Raymond Hettingere509b2a2003-05-28 14:10:46 +000034
Raymond Hettinger54662962003-05-02 20:11:29 +000035 * Counting elements in a sequence.
Raymond Hettingere509b2a2003-05-28 14:10:46 +000036 for e in seqn:
37 d[e] = d.get(e,0) + 1
38
39 * Accumulating references in a dictionary of lists:
40
41 for pagenumber, page in enumerate(pages):
42 for word in page:
43 d.setdefault(word, []).append(pagenumber)
44
45 Note, the second example is a use case characterized by a get and set
Thomas Wouters902d6eb2007-01-09 23:18:33 +000046 to the same key. There are similar use cases with a __contains__
Raymond Hettingere509b2a2003-05-28 14:10:46 +000047 followed by a get, set, or del to the same key. Part of the
48 justification for d.setdefault is combining the two lookups into one.
Raymond Hettinger54662962003-05-02 20:11:29 +000049
50Membership Testing
51 Dictionaries of any size. Created once and then rarely changes.
52 Single write to each key.
53 Many calls to __contains__() or has_key().
54 Similar access patterns occur with replacement dictionaries
55 such as with the % formatting operator.
56
Raymond Hettinger258dfeb2003-05-04 21:25:19 +000057Dynamic Mappings
Raymond Hettingere509b2a2003-05-28 14:10:46 +000058 Characterized by deletions interspersed with adds and replacements.
Raymond Hettinger258dfeb2003-05-04 21:25:19 +000059 Performance benefits greatly from the re-use of dummy entries.
60
Benjamin Peterson7d95e402012-04-23 11:24:50 -040061Data Layout
62-----------
Raymond Hettinger54662962003-05-02 20:11:29 +000063
Benjamin Peterson7d95e402012-04-23 11:24:50 -040064Dictionaries are composed of 3 components:
65The dictobject struct itself
66A dict-keys object (keys & hashes)
67A values array
Raymond Hettinger54662962003-05-02 20:11:29 +000068
69
70Tunable Dictionary Parameters
71-----------------------------
72
Benjamin Peterson7d95e402012-04-23 11:24:50 -040073* PyDict_STARTSIZE. Starting size of dict (unless an instance dict).
74 Currently set to 8. Must be a power of two.
75 New dicts have to zero-out every cell.
76 Increasing improves the sparseness of small dictionaries but costs
77 time to read in the additional cache lines if they are not already
78 in cache. That case is common when keyword arguments are passed.
79 Prior to version 3.3, PyDict_MINSIZE was used as the starting size
80 of a new dict.
Raymond Hettinger54662962003-05-02 20:11:29 +000081
Benjamin Peterson7d95e402012-04-23 11:24:50 -040082* PyDict_MINSIZE. Minimum size of a dict.
83 Currently set to 4 (to keep instance dicts small).
84 Must be a power of two. Prior to version 3.3, PyDict_MINSIZE was
85 set to 8.
86
87* USABLE_FRACTION. Maximum dictionary load in PyDict_SetItem.
88 Currently set to 2/3. Increasing this ratio makes dictionaries more
89 dense resulting in more collisions. Decreasing it improves sparseness
90 at the expense of spreading entries over more cache lines and at the
Raymond Hettinger54662962003-05-02 20:11:29 +000091 cost of total memory consumed.
92
Raymond Hettinger54662962003-05-02 20:11:29 +000093* Growth rate upon hitting maximum load. Currently set to *2.
Benjamin Peterson7d95e402012-04-23 11:24:50 -040094 Raising this to *4 results in half the number of resizes, less
95 effort to resize, better sparseness for some (but not all dict sizes),
96 and potentially doubles memory consumption depending on the size of
97 the dictionary. Setting to *4 eliminates every other resize step.
Raymond Hettinger54662962003-05-02 20:11:29 +000098
Thomas Wouterscf297e42007-02-23 15:07:44 +000099* Maximum sparseness (minimum dictionary load). What percentage
100 of entries can be unused before the dictionary shrinks to
101 free up memory and speed up iteration? (The current CPython
102 code does not represent this parameter directly.)
103
104* Shrinkage rate upon exceeding maximum sparseness. The current
105 CPython code never even checks sparseness when deleting a
106 key. When a new key is added, it resizes based on the number
107 of active keys, so that the addition may trigger shrinkage
108 rather than growth.
109
Raymond Hettinger54662962003-05-02 20:11:29 +0000110Tune-ups should be measured across a broad range of applications and
111use cases. A change to any parameter will help in some situations and
112hurt in others. The key is to find settings that help the most common
113cases and do the least damage to the less common cases. Results will
114vary dramatically depending on the exact number of keys, whether the
115keys are all strings, whether reads or writes dominate, the exact
116hash values of the keys (some sets of values have fewer collisions than
117others). Any one test or benchmark is likely to prove misleading.
118
Raymond Hettinger258dfeb2003-05-04 21:25:19 +0000119While making a dictionary more sparse reduces collisions, it impairs
120iteration and key listing. Those methods loop over every potential
121entry. Doubling the size of dictionary results in twice as many
122non-overlapping memory accesses for keys(), items(), values(),
123__iter__(), iterkeys(), iteritems(), itervalues(), and update().
Raymond Hettinger9d5c4432004-03-15 15:52:22 +0000124Also, every dictionary iterates at least twice, once for the memset()
125when it is created and once by dealloc().
Raymond Hettinger258dfeb2003-05-04 21:25:19 +0000126
Benjamin Peterson7d95e402012-04-23 11:24:50 -0400127Dictionary operations involving only a single key can be O(1) unless
128resizing is possible. By checking for a resize only when the
Thomas Wouterscf297e42007-02-23 15:07:44 +0000129dictionary can grow (and may *require* resizing), other operations
130remain O(1), and the odds of resize thrashing or memory fragmentation
131are reduced. In particular, an algorithm that empties a dictionary
132by repeatedly invoking .pop will see no resizing, which might
133not be necessary at all because the dictionary is eventually
134discarded entirely.
135
Benjamin Peterson7d95e402012-04-23 11:24:50 -0400136The key differences between this implementation and earlier versions are:
137 1. The table can be split into two parts, the keys and the values.
138
139 2. There is an additional key-value combination: (key, NULL).
140 Unlike (<dummy>, NULL) which represents a deleted value, (key, NULL)
141 represented a yet to be inserted value. This combination can only occur
142 when the table is split.
143
144 3. No small table embedded in the dict,
145 as this would make sharing of key-tables impossible.
146
147
148These changes have the following consequences.
149 1. General dictionaries are slightly larger.
150
151 2. All object dictionaries of a single class can share a single key-table,
152 saving about 60% memory for such cases.
Raymond Hettinger54662962003-05-02 20:11:29 +0000153
154Results of Cache Locality Experiments
Benjamin Peterson7d95e402012-04-23 11:24:50 -0400155--------------------------------------
Raymond Hettinger54662962003-05-02 20:11:29 +0000156
Benjamin Peterson7d95e402012-04-23 11:24:50 -0400157Experiments on an earlier design of dictionary, in which all tables were
158combined, showed the following:
Raymond Hettinger54662962003-05-02 20:11:29 +0000159
Benjamin Peterson7d95e402012-04-23 11:24:50 -0400160 When an entry is retrieved from memory, several adjacent entries are also
161 retrieved into a cache line. Since accessing items in cache is *much*
162 cheaper than a cache miss, an enticing idea is to probe the adjacent
163 entries as a first step in collision resolution. Unfortunately, the
164 introduction of any regularity into collision searches results in more
165 collisions than the current random chaining approach.
Raymond Hettinger54662962003-05-02 20:11:29 +0000166
Benjamin Peterson7d95e402012-04-23 11:24:50 -0400167 Exploiting cache locality at the expense of additional collisions fails
168 to payoff when the entries are already loaded in cache (the expense
169 is paid with no compensating benefit). This occurs in small dictionaries
170 where the whole dictionary fits into a pair of cache lines. It also
171 occurs frequently in large dictionaries which have a common access pattern
172 where some keys are accessed much more frequently than others. The
173 more popular entries *and* their collision chains tend to remain in cache.
Raymond Hettinger54662962003-05-02 20:11:29 +0000174
Benjamin Peterson7d95e402012-04-23 11:24:50 -0400175 To exploit cache locality, change the collision resolution section
176 in lookdict() and lookdict_string(). Set i^=1 at the top of the
177 loop and move the i = (i << 2) + i + perturb + 1 to an unrolled
178 version of the loop.
Raymond Hettinger54662962003-05-02 20:11:29 +0000179
Benjamin Peterson7d95e402012-04-23 11:24:50 -0400180For split tables, the above will apply to the keys, but the value will
181always be in a different cache line from the key.
Raymond Hettinger54662962003-05-02 20:11:29 +0000182
183