Blame - Objects/dictnotes.txt - platform/external/python/cpython3

blob: f89720c9f604e021549b3b3895f5d52cb426dd88 [file] [log] [blame]

Benjamin Peterson	7d95e40	2012-04-23 11:24:50 -0400	[diff] [blame]	1	NOTES ON DICTIONARIES
Raymond Hettinger	5466296	2003-05-02 20:11:29 +0000	[diff] [blame]	2	================================
				3
Raymond Hettinger	5466296	2003-05-02 20:11:29 +0000	[diff] [blame]	4	Principal Use Cases for Dictionaries
				5	------------------------------------
				6
				7	Passing keyword arguments
				8	Typically, one read and one write for 1 to 3 elements.
				9	Occurs frequently in normal python code.
				10
				11	Class method lookup
				12	Dictionaries vary in size with 8 to 16 elements being common.
				13	Usually written once with many lookups.
				14	When base classes are used, there are many failed lookups
				15	followed by a lookup in a base class.
				16
				17	Instance attribute lookup and Global variables
				18	Dictionaries vary in size. 4 to 10 elements are common.
				19	Both reads and writes are common.
				20
				21	Builtins
				22	Frequent reads. Almost never written.
Benjamin Peterson	7d95e40	2012-04-23 11:24:50 -0400	[diff] [blame]	23	About 150 interned strings (as of Py3.3).
Raymond Hettinger	5466296	2003-05-02 20:11:29 +0000	[diff] [blame]	24	A few keys are accessed much more frequently than others.
				25
				26	Uniquification
				27	Dictionaries of any size. Bulk of work is in creation.
				28	Repeated writes to a smaller set of keys.
				29	Single read of each key.
Raymond Hettinger	e509b2a	2003-05-28 14:10:46 +0000	[diff] [blame]	30	Some use cases have two consecutive accesses to the same key.
Raymond Hettinger	5466296	2003-05-02 20:11:29 +0000	[diff] [blame]	31
				32	* Removing duplicates from a sequence.
				33	dict.fromkeys(seqn).keys()
Raymond Hettinger	e509b2a	2003-05-28 14:10:46 +0000	[diff] [blame]	34
Raymond Hettinger	5466296	2003-05-02 20:11:29 +0000	[diff] [blame]	35	* Counting elements in a sequence.
Raymond Hettinger	e509b2a	2003-05-28 14:10:46 +0000	[diff] [blame]	36	for e in seqn:
				37	d[e] = d.get(e,0) + 1
				38
				39	* Accumulating references in a dictionary of lists:
				40
				41	for pagenumber, page in enumerate(pages):
				42	for word in page:
				43	d.setdefault(word, []).append(pagenumber)
				44
				45	Note, the second example is a use case characterized by a get and set
Thomas Wouters	902d6eb	2007-01-09 23:18:33 +0000	[diff] [blame]	46	to the same key. There are similar use cases with a __contains__
Raymond Hettinger	e509b2a	2003-05-28 14:10:46 +0000	[diff] [blame]	47	followed by a get, set, or del to the same key. Part of the
				48	justification for d.setdefault is combining the two lookups into one.
Raymond Hettinger	5466296	2003-05-02 20:11:29 +0000	[diff] [blame]	49
				50	Membership Testing
				51	Dictionaries of any size. Created once and then rarely changes.
				52	Single write to each key.
				53	Many calls to __contains__() or has_key().
				54	Similar access patterns occur with replacement dictionaries
				55	such as with the % formatting operator.
				56
Raymond Hettinger	258dfeb	2003-05-04 21:25:19 +0000	[diff] [blame]	57	Dynamic Mappings
Raymond Hettinger	e509b2a	2003-05-28 14:10:46 +0000	[diff] [blame]	58	Characterized by deletions interspersed with adds and replacements.
Raymond Hettinger	258dfeb	2003-05-04 21:25:19 +0000	[diff] [blame]	59	Performance benefits greatly from the re-use of dummy entries.
				60
Benjamin Peterson	7d95e40	2012-04-23 11:24:50 -0400	[diff] [blame]	61	Data Layout
				62	-----------
Raymond Hettinger	5466296	2003-05-02 20:11:29 +0000	[diff] [blame]	63
Benjamin Peterson	7d95e40	2012-04-23 11:24:50 -0400	[diff] [blame]	64	Dictionaries are composed of 3 components:
				65	The dictobject struct itself
				66	A dict-keys object (keys & hashes)
				67	A values array
Raymond Hettinger	5466296	2003-05-02 20:11:29 +0000	[diff] [blame]	68
				69
				70	Tunable Dictionary Parameters
				71	-----------------------------
				72
Antoine Pitrou	a504a7a	2012-06-24 21:03:45 +0200	[diff] [blame]	73	See comments for PyDict_MINSIZE_SPLIT, PyDict_MINSIZE_COMBINED,
				74	USABLE_FRACTION and GROWTH_RATE in dictobject.c
Thomas Wouters	cf297e4	2007-02-23 15:07:44 +0000	[diff] [blame]	75
Raymond Hettinger	5466296	2003-05-02 20:11:29 +0000	[diff] [blame]	76	Tune-ups should be measured across a broad range of applications and
				77	use cases. A change to any parameter will help in some situations and
				78	hurt in others. The key is to find settings that help the most common
				79	cases and do the least damage to the less common cases. Results will
				80	vary dramatically depending on the exact number of keys, whether the
				81	keys are all strings, whether reads or writes dominate, the exact
				82	hash values of the keys (some sets of values have fewer collisions than
				83	others). Any one test or benchmark is likely to prove misleading.
				84
Raymond Hettinger	258dfeb	2003-05-04 21:25:19 +0000	[diff] [blame]	85	While making a dictionary more sparse reduces collisions, it impairs
				86	iteration and key listing. Those methods loop over every potential
				87	entry. Doubling the size of dictionary results in twice as many
				88	non-overlapping memory accesses for keys(), items(), values(),
				89	__iter__(), iterkeys(), iteritems(), itervalues(), and update().
Raymond Hettinger	9d5c443	2004-03-15 15:52:22 +0000	[diff] [blame]	90	Also, every dictionary iterates at least twice, once for the memset()
				91	when it is created and once by dealloc().
Raymond Hettinger	258dfeb	2003-05-04 21:25:19 +0000	[diff] [blame]	92
Benjamin Peterson	7d95e40	2012-04-23 11:24:50 -0400	[diff] [blame]	93	Dictionary operations involving only a single key can be O(1) unless
				94	resizing is possible. By checking for a resize only when the
Thomas Wouters	cf297e4	2007-02-23 15:07:44 +0000	[diff] [blame]	95	dictionary can grow (and may require resizing), other operations
				96	remain O(1), and the odds of resize thrashing or memory fragmentation
				97	are reduced. In particular, an algorithm that empties a dictionary
				98	by repeatedly invoking .pop will see no resizing, which might
				99	not be necessary at all because the dictionary is eventually
				100	discarded entirely.
				101
Benjamin Peterson	7d95e40	2012-04-23 11:24:50 -0400	[diff] [blame]	102	The key differences between this implementation and earlier versions are:
				103	1. The table can be split into two parts, the keys and the values.
				104
				105	2. There is an additional key-value combination: (key, NULL).
				106	Unlike (<dummy>, NULL) which represents a deleted value, (key, NULL)
				107	represented a yet to be inserted value. This combination can only occur
				108	when the table is split.
				109
				110	3. No small table embedded in the dict,
				111	as this would make sharing of key-tables impossible.
				112
				113
				114	These changes have the following consequences.
				115	1. General dictionaries are slightly larger.
				116
				117	2. All object dictionaries of a single class can share a single key-table,
				118	saving about 60% memory for such cases.
Raymond Hettinger	5466296	2003-05-02 20:11:29 +0000	[diff] [blame]	119
				120	Results of Cache Locality Experiments
Benjamin Peterson	7d95e40	2012-04-23 11:24:50 -0400	[diff] [blame]	121	--------------------------------------
Raymond Hettinger	5466296	2003-05-02 20:11:29 +0000	[diff] [blame]	122
Benjamin Peterson	7d95e40	2012-04-23 11:24:50 -0400	[diff] [blame]	123	Experiments on an earlier design of dictionary, in which all tables were
				124	combined, showed the following:
Raymond Hettinger	5466296	2003-05-02 20:11:29 +0000	[diff] [blame]	125
Benjamin Peterson	7d95e40	2012-04-23 11:24:50 -0400	[diff] [blame]	126	When an entry is retrieved from memory, several adjacent entries are also
				127	retrieved into a cache line. Since accessing items in cache is much
				128	cheaper than a cache miss, an enticing idea is to probe the adjacent
				129	entries as a first step in collision resolution. Unfortunately, the
				130	introduction of any regularity into collision searches results in more
				131	collisions than the current random chaining approach.
Raymond Hettinger	5466296	2003-05-02 20:11:29 +0000	[diff] [blame]	132
Benjamin Peterson	7d95e40	2012-04-23 11:24:50 -0400	[diff] [blame]	133	Exploiting cache locality at the expense of additional collisions fails
				134	to payoff when the entries are already loaded in cache (the expense
				135	is paid with no compensating benefit). This occurs in small dictionaries
				136	where the whole dictionary fits into a pair of cache lines. It also
				137	occurs frequently in large dictionaries which have a common access pattern
				138	where some keys are accessed much more frequently than others. The
				139	more popular entries and their collision chains tend to remain in cache.
Raymond Hettinger	5466296	2003-05-02 20:11:29 +0000	[diff] [blame]	140
Benjamin Peterson	7d95e40	2012-04-23 11:24:50 -0400	[diff] [blame]	141	To exploit cache locality, change the collision resolution section
				142	in lookdict() and lookdict_string(). Set i^=1 at the top of the
				143	loop and move the i = (i << 2) + i + perturb + 1 to an unrolled
				144	version of the loop.
Raymond Hettinger	5466296	2003-05-02 20:11:29 +0000	[diff] [blame]	145
Benjamin Peterson	7d95e40	2012-04-23 11:24:50 -0400	[diff] [blame]	146	For split tables, the above will apply to the keys, but the value will
				147	always be in a different cache line from the key.
Raymond Hettinger	5466296	2003-05-02 20:11:29 +0000	[diff] [blame]	148
				149