NOTES ON OPTIMIZING DICTIONARIES
================================


Principal Use Cases for Dictionaries
------------------------------------

Passing keyword arguments
    Typically, one read and one write for 1 to 3 elements.
    Occurs frequently in normal Python code.

Class method lookup
    Dictionaries vary in size with 8 to 16 elements being common.
    Usually written once with many lookups.
    When base classes are used, there are many failed lookups
    followed by a lookup in a base class.

Instance attribute lookup and Global variables
    Dictionaries vary in size. 4 to 10 elements are common.
    Both reads and writes are common.

Builtins
    Frequent reads. Almost never written.
    Size 126 interned strings (as of Py2.3b1).
    A few keys are accessed much more frequently than others.

Uniquification
    Dictionaries of any size. Bulk of work is in creation.
    Repeated writes to a smaller set of keys.
    Single read of each key.

    * Removing duplicates from a sequence.
        dict.fromkeys(seqn).keys()
    * Counting elements in a sequence.
        for e in seqn:  d[e] = d.get(e, 0) + 1
    * Accumulating items in a dictionary of lists.
        for k, v in itemseqn:  d.setdefault(k, []).append(v)
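The three idioms above are runnable as-is; a small self-contained demonstration (the sample sequences are invented for illustration; note that the duplicate-removal idiom preserves order only in modern CPython, where dicts keep insertion order):

```python
# Sample data is illustrative; the idioms are the ones listed above.
seqn = ["a", "b", "a", "c", "b", "a"]
itemseqn = [("x", 1), ("y", 2), ("x", 3)]

# Removing duplicates from a sequence.
unique = list(dict.fromkeys(seqn).keys())

# Counting elements in a sequence.
d = {}
for e in seqn:
    d[e] = d.get(e, 0) + 1

# Accumulating items in a dictionary of lists.
g = {}
for k, v in itemseqn:
    g.setdefault(k, []).append(v)

print(unique, d, g)
```

Each idiom matches the access profile noted above: many writes during creation, then a single read of each key.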

Membership Testing
    Dictionaries of any size. Created once and then rarely changed.
    Single write to each key.
    Many calls to __contains__() or has_key().
    Similar access patterns occur with replacement dictionaries
    such as with the % formatting operator.
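Both access patterns fit in a few lines (sample keys and template are invented for illustration; has_key() is the Py2-era spelling of the `in` test):

```python
# Build once, then probe many times via __contains__.
allowed = dict.fromkeys(["read", "write", "exec"])
assert "read" in allowed
assert "delete" not in allowed

# A replacement dictionary feeding the % formatting operator.
template = "%(user)s has %(count)d messages"
message = template % {"user": "guido", "count": 3}
print(message)
```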

Dynamic Mappings
    Characterized by deletions interspersed with adds and replacements.
    Performance benefits greatly from the re-use of dummy entries.


Data Layout (assuming a 32-bit box with 64 bytes per cache line)
----------------------------------------------------------------

Smalldicts (8 entries) are attached to the dictobject structure
and the whole group nearly fills two consecutive cache lines.

Larger dicts use the first half of the dictobject structure (one cache
line) and a separate, contiguous block of entries (at 12 bytes each
for a total of 5.333 entries per cache line).


Tunable Dictionary Parameters
-----------------------------

* PyDict_MINSIZE. Currently set to 8.
    Must be a power of two. New dicts have to zero-out every cell.
    Each additional 8 consumes 1.5 cache lines. Increasing improves
    the sparseness of small dictionaries but costs time to read in
    the additional cache lines if they are not already in cache.
    That case is common when keyword arguments are passed.

* Maximum dictionary load in PyDict_SetItem. Currently set to 2/3.
    Increasing this ratio makes dictionaries more dense, resulting
    in more collisions. Decreasing it improves sparseness at the
    expense of spreading entries over more cache lines and at the
    cost of total memory consumed.

    The load test occurs in highly time-sensitive code. Efforts
    to make the test more complex (for example, varying the load
    for different sizes) have degraded performance.
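As a concrete check of the 2/3 rule (a toy model, not CPython's code): with the minimum table of 8 slots, the sixth insertion is the first to exceed the load limit.

```python
# Toy model: a resize is triggered once fill exceeds 2/3 of the slots.
size = 8
first_resize_fill = next(f for f in range(1, size + 1) if f * 3 > size * 2)
print(first_resize_fill)
```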

* Growth rate upon hitting maximum load. Currently set to *2.
    Raising this to *4 results in half the number of resizes,
    less effort to resize, better sparseness for some (but not
    all) dict sizes, and potentially double memory consumption
    depending on the size of the dictionary. Setting to *4
    eliminates every other resize step.
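The halving claim can be checked against a toy model of the 2/3-load resize rule (function and parameter names are illustrative, not CPython's):

```python
def count_resizes(n_inserts, growth, size=8):
    """Count table resizes while inserting n_inserts unique keys into a
    table that grows by `growth` when fill exceeds 2/3 of the slots."""
    resizes = 0
    for fill in range(1, n_inserts + 1):
        if fill * 3 > size * 2:
            size *= growth
            resizes += 1
    return resizes

print(count_resizes(1000, growth=2), count_resizes(1000, growth=4))
```

For 1000 insertions the model reports 8 resizes with *2 growth and 4 with *4, matching the "every other resize step" observation.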

Tune-ups should be measured across a broad range of applications and
use cases. A change to any parameter will help in some situations and
hurt in others. The key is to find settings that help the most common
cases and do the least damage to the less common cases. Results will
vary dramatically depending on the exact number of keys, whether the
keys are all strings, whether reads or writes dominate, and the exact
hash values of the keys (some sets of values have fewer collisions than
others). Any one test or benchmark is likely to prove misleading.

While making a dictionary more sparse reduces collisions, it impairs
iteration and key listing. Those methods loop over every potential
entry. Doubling the size of a dictionary results in twice as many
non-overlapping memory accesses for keys(), items(), values(),
__iter__(), iterkeys(), iteritems(), itervalues(), and update().


Results of Cache Locality Experiments
-------------------------------------

When an entry is retrieved from memory, 4.333 adjacent entries are also
retrieved into a cache line. Since accessing items in cache is *much*
cheaper than a cache miss, an enticing idea is to probe the adjacent
entries as a first step in collision resolution. Unfortunately, the
introduction of any regularity into collision searches results in more
collisions than the current random chaining approach.

Exploiting cache locality at the expense of additional collisions fails
to pay off when the entries are already loaded in cache (the expense
is paid with no compensating benefit). This occurs in small dictionaries
where the whole dictionary fits into a pair of cache lines. It also
occurs frequently in large dictionaries which have a common access pattern
where some keys are accessed much more frequently than others. The
more popular entries *and* their collision chains tend to remain in cache.

To exploit cache locality, change the collision resolution section
in lookdict() and lookdict_string(). Set i^=1 at the top of the
loop and move the i = (i << 2) + i + perturb + 1 to an unrolled
version of the loop.
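The recurrence above can be traced in Python; PERTURB_SHIFT is 5 in CPython's dictobject.c, and the cache-friendly i^=1 variant would additionally check the neighboring slot (s ^ 1) at each probe:

```python
PERTURB_SHIFT = 5  # value used in CPython's dictobject.c

def probe_slots(h, mask, n=6):
    """First n slot indices visited for hash h in a table of mask+1 slots."""
    i = h & mask
    perturb = h
    slots = [i]
    for _ in range(n - 1):
        i = (i << 2) + i + perturb + 1   # i.e. i = i*5 + perturb + 1
        perturb >>= PERTURB_SHIFT
        slots.append(i & mask)
    return slots

# Example trace in an 8-slot table (hash value chosen arbitrarily).
print(probe_slots(123456, mask=7))
```

Because perturb mixes in the high hash bits, the visited slots scatter across the table instead of clustering, which is exactly the regularity-avoiding behavior the experiments above found necessary.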

This optimization strategy can be leveraged in several ways:

* If the dictionary is kept sparse (through the tunable parameters),
  then the occurrence of additional collisions is lessened.

* If lookdict() and lookdict_string() are specialized for small dicts
  and for large dicts, then the versions for large dicts can be given
  an alternate search strategy without increasing collisions in small dicts
  which already have the maximum benefit of cache locality.

* If the use case for a dictionary is known to have a random key
  access pattern (as opposed to a more common pattern with a Zipf's law
  distribution), then there will be more benefit for large dictionaries
  because any given key is no more likely than another to already be
  in cache.


Optimizing the Search of Small Dictionaries
-------------------------------------------

If lookdict() and lookdict_string() are specialized for smaller dictionaries,
then a custom search approach can be implemented that exploits the small
search space and cache locality.

* The simplest example is a linear search of contiguous entries. This is
  simple to implement, guaranteed to terminate rapidly, never searches
  the same entry twice, and precludes the need to check for dummy entries.
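A hedged sketch of that idea (class and names invented; a real version would live in C inside lookdict()):

```python
class SmallTable:
    """Toy small-dict: contiguous entries probed by linear search.
    No dummy-entry checks are needed because the entries are dense."""

    def __init__(self):
        self._entries = []               # contiguous (key, value) pairs

    def __setitem__(self, key, value):
        for i, (k, _) in enumerate(self._entries):
            if k == key:
                self._entries[i] = (key, value)
                return
        self._entries.append((key, value))

    def __getitem__(self, key):
        for k, v in self._entries:       # never probes a slot twice
            if k == key:
                return v
        raise KeyError(key)

t = SmallTable()
t["x"] = 1
t["x"] = 2
t["y"] = 3
print(t["x"], t["y"])
```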

* A more advanced example is a self-organizing search so that the most
  frequently accessed entries get probed first. The organization
  adapts if the access pattern changes over time. Treaps are ideally
  suited for self-organization with the most common entries at the
  top of the heap and a rapid binary search pattern. Most probes and
  results are located at the top of the tree, allowing them all to
  be located in one or two cache lines.
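Treaps aside, the self-organizing idea can be shown with a simpler move-to-front scheme (a sketch only, not a proposal for the actual implementation):

```python
class SelfOrganizingTable:
    """Linear search that moves each hit to the front, so frequently
    accessed keys drift toward the earliest (cheapest) probes."""

    def __init__(self, items):
        self._entries = list(items)      # (key, value) pairs

    def lookup(self, key):
        for i, (k, v) in enumerate(self._entries):
            if k == key:
                if i:                    # move the hit to the front
                    self._entries.insert(0, self._entries.pop(i))
                return v
        raise KeyError(key)

t = SelfOrganizingTable([("a", 1), ("b", 2), ("c", 3)])
t.lookup("c")                            # "c" is now probed first
print([k for k, _ in t._entries])
```

If the access pattern shifts, the ordering adapts automatically, which is the property the paragraph above asks for.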

* Also, small dictionaries may be made more dense, perhaps filling all
  eight cells to take the maximum advantage of two cache lines.


Strategy Pattern
----------------

Consider allowing the user to set the tunable parameters or to select a
particular search method. Since some dictionary use cases have known
sizes and access patterns, the user may be able to provide useful hints.

1) For example, if membership testing or lookups dominate runtime and memory
   is not at a premium, the user may benefit from setting the maximum load
   ratio at 5% or 10% instead of the usual 66.7%. This will sharply
   curtail the number of collisions but will increase iteration time.

2) Dictionary creation time can be shortened in cases where the ultimate
   size of the dictionary is known in advance. The dictionary can be
   pre-sized so that no resize operations are required during creation.
   Not only does this save resizes, but the key insertion will go
   more quickly because the first half of the keys will be inserted into
   a more sparse environment than before. The preconditions for this
   strategy arise whenever a dictionary is created from a key or item
   sequence of known length.
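Under a toy model of the 2/3-load doubling rule (not CPython's code), pre-sizing eliminates every build-time resize:

```python
def resizes_during_build(n_keys, size=8):
    """Resizes needed while inserting n_keys unique keys into a table
    that doubles when fill exceeds 2/3 of the slots (model only)."""
    resizes = 0
    for fill in range(1, n_keys + 1):
        if fill * 3 > size * 2:
            size *= 2
            resizes += 1
    return resizes

# Incremental build versus a table pre-sized for 1000 keys.
print(resizes_during_build(1000), resizes_during_build(1000, size=2048))
```

Modern CPython applies this internally in places where the input length is known, for example when dict.fromkeys() is given another dict or set.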

3) If the key space is large and the access pattern is known to be random,
   then search strategies exploiting cache locality can be fruitful.
   The preconditions for this strategy arise in simulations and
   numerical analysis.

4) If the keys are fixed and the access pattern strongly favors some of
   the keys, then the entries can be stored contiguously and accessed
   with a linear search or treap. This exploits knowledge of the data,
   cache locality, and a simplified search routine. It also eliminates
   the need to test for dummy entries on each probe. The preconditions
   for this strategy arise in symbol tables and in the builtin dictionary.


Readonly Dictionaries
---------------------
Some dictionary use cases pass through a build stage and then move to a
more heavily exercised lookup stage with no further changes to the
dictionary.

An idea that emerged on python-dev is to be able to convert a dictionary
to a read-only state. This can help prevent programming errors and also
provide knowledge that can be exploited for lookup optimization.
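The closest thing in today's Python is a read-only view rather than a converted dict; types.MappingProxyType blocks writes, though it performs none of the rebuilding or lookup optimization discussed here:

```python
from types import MappingProxyType

d = {"one": 1, "two": 2}            # build stage
frozen = MappingProxyType(d)        # read-only view for the lookup stage

print(frozen["one"])
try:
    frozen["three"] = 3             # any write attempt fails
except TypeError:
    print("read-only")
```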

The dictionary can be immediately rebuilt (eliminating dummy entries),
resized (to an appropriate level of sparseness), and the keys can be
jostled (to minimize collisions). The lookdict() routine can then
eliminate the test for dummy entries (saving about 1/4 of the time
spent in the collision resolution loop).

An additional possibility is to insert links into the empty spaces
so that dictionary iteration can proceed in len(d) steps instead of
(mp->mask + 1) steps.