Intro
-----
This describes an adaptive, stable, natural mergesort, modestly called
timsort (hey, I earned it <wink>). It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays.

In a nutshell, the main routine marches over the array once, left to right,
alternately identifying the next run, then merging it into the previous
runs "intelligently". Everything else is complication for speed, and some
hard-won measure of memory efficiency.


Comparison with Python's Samplesort Hybrid
------------------------------------------
+ timsort can require a temp array containing as many as N//2 pointers,
  which means as many as 2*N extra bytes on 32-bit boxes. It can be
  expected to require a temp array this large when sorting random data; on
  data with significant structure, it may get away without using any extra
  heap memory. This appears to be the strongest argument against it, but
  compared to the size of an object, 2 temp bytes worst-case (also expected-
  case for random data) doesn't scare me much.

  It turns out that Perl is moving to a stable mergesort, and the code for
  that appears always to require a temp array with room for at least N
  pointers. (Note that I wouldn't want to do that even if space weren't an
  issue; I believe its efforts at memory frugality also save timsort
  significant pointer-copying costs, and allow it to have a smaller working
  set.)

+ Across about four hours of generating random arrays, and sorting them
  under both methods, samplesort required about 1.5% more comparisons
  (the program is at the end of this file).

+ In real life, this may be faster or slower on random arrays than
  samplesort was, depending on platform quirks. Since it does fewer
  comparisons on average, it can be expected to do better the more
  expensive a comparison function is. OTOH, it does more data movement
  (pointer copying) than samplesort, and that may negate its small
  comparison advantage (depending on platform quirks) unless comparison
  is very expensive.

+ On arrays with many kinds of pre-existing order, this blows samplesort out
  of the water. It's significantly faster than samplesort even on some
  cases samplesort was special-casing the snot out of. I believe that lists
  very often do have exploitable partial order in real life, and this is the
  strongest argument in favor of timsort (indeed, samplesort's special cases
  for extreme partial order are appreciated by real users, and timsort goes
  much deeper than those, in particular naturally covering every case where
  someone has suggested "and it would be cool if list.sort() had a special
  case for this too ... and for that ...").

+ Here are exact comparison counts across all the tests in sortperf.py,
  when run with arguments "15 20 1".

  Column Key:
      *sort: random data
      \sort: descending data
      /sort: ascending data
      3sort: ascending, then 3 random exchanges
      +sort: ascending, then 10 random at the end
      %sort: ascending, then randomly replace 1% of elements w/ random values
      ~sort: many duplicates
      =sort: all equal
      !sort: worst case scenario

  First the trivial cases, trivial for samplesort because it special-cased
  them, and trivial for timsort because it naturally works on runs. Within
  an "n" block, the first line gives the # of compares done by samplesort,
  the second line by timsort, and the third line is the percentage by
  which the samplesort count exceeds the timsort count:

      n    \sort    /sort    =sort
-------  -------  -------  -------
  32768    32768    32767    32767  samplesort
           32767    32767    32767  timsort
           0.00%    0.00%    0.00%  (samplesort - timsort) / timsort

  65536    65536    65535    65535
           65535    65535    65535
           0.00%    0.00%    0.00%

 131072   131072   131071   131071
          131071   131071   131071
           0.00%    0.00%    0.00%

 262144   262144   262143   262143
          262143   262143   262143
           0.00%    0.00%    0.00%

 524288   524288   524287   524287
          524287   524287   524287
           0.00%    0.00%    0.00%

1048576  1048576  1048575  1048575
         1048575  1048575  1048575
           0.00%    0.00%    0.00%

  The algorithms are effectively identical in these cases, except that
  timsort does one less compare in \sort.

  Now for the more interesting cases. Where lg(x) is the logarithm of x to
  the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for
  the best any comparison-based sorting algorithm can do on average (across
  all permutations). When a method gets significantly below that, it's
  either astronomically lucky, or is finding exploitable structure in the
  data.


      n    lg(n!)     *sort     3sort     +sort     %sort     ~sort     !sort
-------  --------  --------  --------  --------  --------  --------  --------
  32768    444255    453096    453614     32908    452871    130491    469141  old
                     448885     33016     33007     50426    182083     65534  new
                      0.94%  1273.92%    -0.30%   798.09%   -28.33%   615.87%  %ch from new

  65536    954037    972699    981940     65686    973104    260029   1004607
                     962991     65821     65808    101667    364341    131070
                      1.01%  1391.83%    -0.19%   857.15%   -28.63%   666.47%

 131072   2039137   2101881   2091491    131232   2092894    554790   2161379
                    2057533    131410    131361    206193    728871    262142
                      2.16%  1491.58%    -0.10%   915.02%   -23.88%   724.51%

 262144   4340409   4464460   4403233    262314   4445884   1107842   4584560
                    4377402    262437    262459    416347   1457945    524286
                      1.99%  1577.82%    -0.06%   967.83%   -24.01%   774.44%

 524288   9205096   9453356   9408463    524468   9441930   2218577   9692015
                    9278734    524580    524633    837947   2916107   1048574
                      1.88%  1693.52%    -0.03%  1026.79%   -23.92%   824.30%

1048576  19458756  19950272  19838588   1048766  19912134   4430649  20434212
                   19606028   1048958   1048941   1694896   5832445   2097150
                      1.76%  1791.27%    -0.02%  1074.83%   -24.03%   874.38%

  Discussion of cases:

  *sort: There's no structure in random data to exploit, so the theoretical
  limit is lg(n!). Both methods get close to that, and timsort is hugging
  it (indeed, in a *marginal* sense, it's a spectacular improvement --
  there's only about 1% left before hitting the wall, and timsort knows
  darned well it's doing compares that won't pay on random data -- but so
  does the samplesort hybrid). For contrast, Hoare's original random-pivot
  quicksort does about 39% more compares than the limit, and the median-of-3
  variant about 19% more.

  3sort, %sort, and !sort: No contest; there's structure in this data, but
  not of the specific kinds samplesort special-cases. Note that structure
  in !sort wasn't put there on purpose -- it was crafted as a worst case for
  a previous quicksort implementation. That timsort nails it came as a
  surprise to me (although it's obvious in retrospect).

  +sort: samplesort special-cases this data, and does a few less compares
  than timsort. However, timsort runs this case significantly faster on all
  boxes we have timings for, because timsort is in the business of merging
  runs efficiently, while samplesort does much more data movement in this
  (for it) special case.

  ~sort: samplesort's special cases for large masses of equal elements are
  extremely effective on ~sort's specific data pattern, and timsort just
  isn't going to get close to that, despite that it's clearly getting a
  great deal of benefit out of the duplicates (the # of compares is much less
  than lg(n!)). ~sort has a perfectly uniform distribution of just 4
  distinct values, and as the distribution gets more skewed, samplesort's
  equal-element gimmicks become less effective, while timsort's adaptive
  strategies find more to exploit; in a database supplied by Kevin Altis, a
  sort on its highly skewed "on which stock exchange does this company's
  stock trade?" field ran over twice as fast under timsort.

  However, despite that timsort does many more comparisons on ~sort, and
  that on several platforms ~sort runs highly significantly slower under
  timsort, on other platforms ~sort runs highly significantly faster under
  timsort. No other kind of data has shown this wild x-platform behavior,
  and we don't have an explanation for it. The only thing I can think of
  that could transform what "should be" highly significant slowdowns into
  highly significant speedups on some boxes are catastrophic cache effects
  in samplesort.

  But timsort "should be" slower than samplesort on ~sort, so it's hard
  to count that it isn't on some boxes as a strike against it <wink>.

+ Here's the highwater mark for the number of heap-based temp slots (4
  bytes each on this box) needed by each test, again with arguments
  "15 20 1":

   2**i   *sort   \sort   /sort   3sort   +sort   %sort   ~sort   =sort   !sort
  32768   16384       0       0    6256       0   10821   12288       0   16383
  65536   32766       0       0   21652       0   31276   24576       0   32767
 131072   65534       0       0   17258       0   58112   49152       0   65535
 262144  131072       0       0   35660       0  123561   98304       0  131071
 524288  262142       0       0   31302       0  212057  196608       0  262143
1048576  524286       0       0  312438       0  484942  393216       0  524287

  Discussion: The tests that end up doing (close to) perfectly balanced
  merges (*sort, !sort) need all N//2 temp slots (or almost all). ~sort
  also ends up doing balanced merges, but systematically benefits a lot from
  the preliminary pre-merge searches described under "Merge Memory" later.
  %sort approaches having a balanced merge at the end because the random
  selection of elements to replace is expected to produce an out-of-order
  element near the midpoint. \sort, /sort, =sort are the trivial one-run
  cases, needing no merging at all. +sort ends up having one very long run
  and one very short, and so gets all the temp space it needs from the small
  temparray member of the MergeState struct (note that the same would be
  true if the new random elements were prefixed to the sorted list instead,
  but not if they appeared "in the middle"). 3sort approaches N//3 temp
  slots twice, but the run lengths that remain after 3 random exchanges
  clearly have very high variance.


A detailed description of timsort follows.

Runs
----
count_run() returns the # of elements in the next run. A run is either
"ascending", which means non-decreasing:

    a0 <= a1 <= a2 <= ...

or "descending", which means strictly decreasing:

    a0 > a1 > a2 > ...

Note that a run is always at least 2 long, unless we start at the array's
last element.

The definition of descending is strict, because the main routine reverses
a descending run in-place, transforming a descending run into an ascending
run. Reversal is done via the obvious fast "swap elements starting at each
end, and converge at the middle" method, and that can violate stability if
the slice contains any equal elements. Using a strict definition of
descending ensures that a descending run contains distinct elements.
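
For concreteness, here is a rough Python sketch of run detection as just
described; the function name matches the C code's, but the (list, lo, hi)
interface and the body are illustrative, not CPython's:

def count_run(a, lo, hi):
    # Return (length, descending) for the run beginning at a[lo].
    # "descending" means strictly decreasing, per the rule above.
    if lo + 1 == hi:
        return 1, False
    n = lo + 1
    if a[n] < a[lo]:            # strictly descending run
        while n + 1 < hi and a[n + 1] < a[n]:
            n += 1
        return n - lo + 1, True
    else:                       # ascending (non-decreasing) run
        while n + 1 < hi and a[n + 1] >= a[n]:
            n += 1
        return n - lo + 1, False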

If an array is random, it's very unlikely we'll see long runs. If a natural
run contains less than minrun elements (see next section), the main loop
artificially boosts it to minrun elements, via a stable binary insertion sort
applied to the right number of array elements following the short natural
run. In a random array, *all* runs are likely to be minrun long as a
result. This has two primary good effects:

1. Random data strongly tends then toward perfectly balanced (both runs have
   the same length) merges, which is the most efficient way to proceed when
   data is random.

2. Because runs are never very short, the rest of the code doesn't make
   heroic efforts to shave a few cycles off per-merge overheads. For
   example, reasonable use of function calls is made, rather than trying to
   inline everything. Since there are no more than N/minrun runs to begin
   with, a few "extra" function calls per merge is barely measurable.


Computing minrun
----------------
If N < 64, minrun is N. IOW, binary insertion sort is used for the whole
array then; it's hard to beat that given the overheads of trying something
fancier (see note BINSORT).

When N is a power of 2, testing on random data showed that minrun values of
16, 32, 64 and 128 worked about equally well. At 256 the data-movement cost
in binary insertion sort clearly hurt, and at 8 the increase in the number
of function calls clearly hurt. Picking *some* power of 2 is important
here, so that the merges end up perfectly balanced (see next section). We
pick 32 as a good value in the sweet range; picking a value at the low end
allows the adaptive gimmicks more opportunity to exploit shorter natural
runs.

Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case! Consider N=2112:

>>> divmod(2112, 32)
(66, 0)
>>>

If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32. The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end. The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and-- mergesort's
bugaboo relative to samplesort --a lot more data movement (O(N) copies just
to get 64 elements into place).

If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced. Better!

What we want to avoid is picking minrun such that in

    q, r = divmod(N, minrun)

q is a power of 2 and r>0 (then the last merge only gets r elements into
place, and r < minrun is small compared to N), or q a little larger than a
power of 2 regardless of r (then we've got a case similar to "2112", again
leaving too little work for the last merge to do).

Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
power of 2, or if that isn't possible, is close to, but strictly less than,
a power of 2. This is easier to do than it may sound: take the first 6
bits of N, and add 1 if any of the remaining bits are set. In fact, that
rule covers every case in this section, including small N and exact powers
of 2; merge_compute_minrun() is a deceptively simple function.
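
In Python, the rule reads roughly like this (a sketch of the idea behind
merge_compute_minrun, not the C code itself):

def merge_compute_minrun(n):
    # Keep the 6 most significant bits of n, adding 1 if any of the
    # bits shifted off were set.
    r = 0
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r

For the 2112 example above, merge_compute_minrun(2112) == 33, and
divmod(2112, 33) == (64, 0): 64 runs of length 33, so all merges are
perfectly balanced.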


The Merge Pattern
-----------------
In order to exploit regularities in the data, we're merging on natural
run lengths, and they can become wildly unbalanced. That's a Good Thing
for this sort! It means we have to find a way to manage an assortment of
potentially very different run lengths, though.

Stability constrains permissible merging patterns. For example, if we have
3 consecutive runs of lengths

    A:10000  B:20000  C:10000

we dare not merge A with C first, because if A, B and C happen to contain
a common element, it would get out of order wrt its occurrence(s) in B. The
merging must be done as (A+B)+C or A+(B+C) instead.

So merging is always done on two consecutive runs at a time, and in-place,
although this may require some temp memory (more on that later).

When a run is identified, its base address and length are pushed on a stack
in the MergeState struct. merge_collapse() is then called to see whether it
should merge it with preceding run(s). We would like to delay merging as
long as possible in order to exploit patterns that may come up later, but we
like even more to do merging as soon as possible to exploit that the run just
found is still high in the memory hierarchy. We also can't delay merging
"too long" because it consumes memory to remember the runs that are still
unmerged, and the stack has a fixed size.

What turned out to be a good compromise maintains two invariants on the
stack entries, where A, B and C are the lengths of the three rightmost not-yet
merged slices:

1. A > B+C
2. B > C

Note that, by induction, #2 implies the lengths of pending runs form a
decreasing sequence. #1 implies that, reading the lengths right to left,
the pending-run lengths grow at least as fast as the Fibonacci numbers.
Therefore the stack can never grow larger than about log_base_phi(N) entries,
where phi = (1+sqrt(5))/2 ~= 1.618. Thus a small # of stack slots suffice
for very large arrays.

If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
e.g., if the last 3 entries are

    A:30  B:20  C:10

then B is merged with C, leaving

    A:30  BC:30

on the stack. Or if they were

    A:500  B:400  C:1000

then A is merged with B, leaving

    AB:900  C:1000

on the stack.

In both examples, the stack configuration after the merge still violates
invariant #2, and merge_collapse() goes on to continue merging runs until
both invariants are satisfied. As an extreme case, suppose we didn't do the
minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
and 2. Nothing would get merged until the final 2 was seen, and that would
trigger 7 perfectly balanced merges.
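
A Python sketch of that invariant-restoring loop; the list of (base, length)
pairs and the merge_at() callback stand in for the MergeState machinery, so
this illustrates the rule rather than reproducing the C code:

def merge_collapse(pending, merge_at):
    # pending is a list of (base, length) run entries, oldest first;
    # merge_at(i) merges pending[i] with pending[i+1] into one entry.
    while len(pending) > 1:
        n = len(pending) - 2
        if n > 0 and pending[n-1][1] <= pending[n][1] + pending[n+1][1]:
            # invariant 1 (A > B+C) is violated: merge B with the
            # smaller of A and C (ties favor C)
            if pending[n-1][1] < pending[n+1][1]:
                n -= 1
            merge_at(n)
        elif pending[n][1] <= pending[n+1][1]:
            # invariant 2 (B > C) is violated
            merge_at(n)
        else:
            break               # both invariants hold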

The thrust of these rules when they trigger merging is to balance the run
lengths as closely as possible, while keeping a low bound on the number of
runs we have to remember. This is maximally effective for random data,
where all runs are likely to be of (artificially forced) length minrun, and
then we get a sequence of perfectly balanced merges (with, perhaps, some
oddballs at the end).

OTOH, one reason this sort is so good for partly ordered data has to do
with wildly unbalanced run lengths.


Merge Memory
------------
Merging adjacent runs of lengths A and B in-place, and in linear time, is
difficult. Theoretical constructions are known that can do it, but they're
too difficult and slow for practical use. But if we have temp memory equal
to min(A, B), it's easy.

If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
and then we can do the obvious merge algorithm left to right, from the temp
area and B, starting the stores into where A used to live. There's always a
free area in the original area comprising a number of elements equal to the
number not yet merged from the temp array (trivially true at the start;
proceed by induction). The only tricky bit is that if a comparison raises an
exception, we have to remember to copy the remaining elements back in from
the temp area, lest the array end up with duplicate entries from B. But
that's exactly the same thing we need to do if we reach the end of B first,
so the exit code is pleasantly common to both the normal and error cases.

If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
much the same, except that we need to merge right to left, copying B into a
temp array and starting the stores at the right end of where B used to live.
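
A much-simplified Python sketch of the merge_lo idea (no galloping, no error
handling; the function name and argument convention here are made up for
illustration):

def merge_lo_sketch(a, base_a, len_a, len_b):
    # Merge the adjacent runs a[base_a : base_a+len_a] (run A) and the
    # len_b elements that follow it (run B); A is the shorter run.
    tmp = a[base_a:base_a + len_a]        # copy A to temp memory
    base_b = base_a + len_a
    i = j = 0
    dest = base_a                         # stores go into where A lived
    while i < len(tmp) and j < len_b:
        if a[base_b + j] < tmp[i]:        # B wins only on strict <, so ties
            a[dest] = a[base_b + j]       # go to A and stability is preserved
            j += 1
        else:
            a[dest] = tmp[i]
            i += 1
        dest += 1
    a[dest:dest + len(tmp) - i] = tmp[i:]     # whatever remains of A
    # if A ran out first, the rest of B is already in its final place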

A refinement: When we're about to merge adjacent runs A and B, we first do
a form of binary search (more on that later) to see where B[0] should end up
in A. Elements in A preceding that point are already in their final
positions, effectively shrinking the size of A. Likewise we also search to
see where A[-1] should end up in B, and elements of B after that point can
also be ignored. This cuts the amount of temp memory needed by the same
amount.

These preliminary searches may not pay off, and can be expected *not* to
repay their cost if the data is random. But they can win huge in all of
time, copying, and memory savings when they do pay, so this is one of the
"per-merge overheads" mentioned above that we're happy to endure because
there is at most one very short run. It's generally true in this algorithm
that we're willing to gamble a little to win a lot, even though the net
expectation is negative for random data.


Merge Algorithms
----------------
merge_lo() and merge_hi() are where the bulk of the time is spent. merge_lo
deals with runs where A <= B, and merge_hi where A > B. They don't know
whether the data is clustered or uniform, but a lovely thing about merging
is that many kinds of clustering "reveal themselves" by how many times in a
row the winning merge element comes from the same run. We'll only discuss
merge_lo here; merge_hi is exactly analogous.

Merging begins in the usual, obvious way, comparing the first element of A
to the first of B, and moving B[0] to the merge area if it's less than A[0],
else moving A[0] to the merge area. Call that the "one pair at a time"
mode. The only twist here is keeping track of how many times in a row "the
winner" comes from the same run.

If that count reaches MIN_GALLOP, we switch to "galloping mode". Here
we *search* B for where A[0] belongs, and move over all the B's before
that point in one chunk to the merge area, then move A[0] to the merge
area. Then we search A for where B[0] belongs, and similarly move a
slice of A in one chunk. Then back to searching B for where A[0] belongs,
etc. We stay in galloping mode until both searches find slices to copy
less than MIN_GALLOP elements long, at which point we go back to one-pair-
at-a-time mode.

A refinement: The MergeState struct contains the value of min_gallop that
controls when we enter galloping mode, initialized to MIN_GALLOP.
merge_lo() and merge_hi() adjust this higher when galloping isn't paying
off, and lower when it is.


Galloping
---------
Still without loss of generality, assume A is the shorter run. In galloping
mode, we first look for A[0] in B. We do this via "galloping", comparing
A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1]. This takes at most
roughly lg(B) comparisons, and, unlike a straight binary search, favors
finding the right spot early in B (more on that later).

After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
consecutive elements, and a straight binary search requires exactly k-1
additional comparisons to nail it (see note REGION OF UNCERTAINTY). Then we
copy all the B's up to that point in one chunk, and then copy A[0]. Note
that no matter where A[0] belongs in B, the combination of galloping + binary
search finds it in no more than about 2*lg(B) comparisons.
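
A Python sketch of the gallop-then-binary-search idea, with bisect_left
semantics and a hint of 0; the real gallop_left()/gallop_right() also gallop
leftward from a nonzero hint, so treat this as an illustration only:

from bisect import bisect_left

def gallop_left_sketch(key, b):
    # Return the leftmost index at which key could be inserted into b.
    n = len(b)
    lastofs, ofs = 0, 1
    while ofs < n and b[ofs - 1] < key:
        lastofs = ofs
        ofs = ofs * 2 + 1         # probe B[0], B[1], B[3], B[7], ...
    ofs = min(ofs, n)
    # key belongs somewhere in b[lastofs:ofs]; finish with a binary search
    return bisect_left(b, key, lastofs, ofs)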

If we did a straight binary search, we could find it in no more than
ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
comparisons no matter where A[0] belongs. Straight binary search thus loses
to galloping unless the run is quite long, and we simply can't guess
whether it is in advance.

If data is random and runs have the same length, A[0] belongs at B[0] half
the time, at B[1] a quarter of the time, and so on: a consecutive winning
sub-run in B of length k occurs with probability 1/2**(k+1). So long
winning sub-runs are extremely unlikely in random data, and guessing that a
winning sub-run is going to be long is a dangerous game.

OTOH, if data is lopsided or lumpy or contains many duplicates, long
stretches of winning sub-runs are very likely, and cutting the number of
comparisons needed to find one from O(B) to O(log B) is a huge win.

Galloping compromises by getting out fast if there isn't a long winning
sub-run, yet finding such very efficiently when they exist.

I first learned about the galloping strategy in a related context; see:

    "Adaptive Set Intersections, Unions, and Differences" (2000)
    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

and its followup(s). An earlier paper called the same strategy
"exponential search":

    "Optimistic Sorting and Information Theoretic Complexity"
    Peter McIlroy
    SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
    467-474, Austin, Texas, 25-27 January 1993.

and it probably dates back to an earlier paper by Bentley and Yao. The
McIlroy paper in particular has good analysis of a mergesort that's
probably strongly related to this one in its galloping strategy.


Galloping with a Broken Leg
---------------------------
So why don't we always gallop? Because it can lose, on two counts:

1. While we're willing to endure small per-merge overheads, per-comparison
   overheads are a different story. Calling Yet Another Function per
   comparison is expensive, and gallop_left() and gallop_right() are
   too long-winded for sane inlining.

2. Galloping can-- alas --require more comparisons than linear one-at-a-time
   search, depending on the data.

#2 requires details. If A[0] belongs before B[0], galloping requires 1
compare to determine that, same as linear search, except it costs more
to call the gallop function. If A[0] belongs right before B[1], galloping
requires 2 compares, again same as linear search. On the third compare,
galloping checks A[0] against B[3], and if it's <=, requires one more
compare to determine whether A[0] belongs at B[2] or B[3]. That's a total
of 4 compares, but if A[0] does belong at B[2], linear search would have
discovered that in only 3 compares, and that's a huge loss! Really. It's
an increase of 33% in the number of compares needed, and comparisons are
expensive in Python.

index in B where    # compares linear  # gallop  # binary  gallop
A[0] belongs        search needs       compares  compares  total
----------------    -----------------  --------  --------  ------
               0                    1         1         0       1

               1                    2         2         0       2

               2                    3         3         1       4
               3                    4         3         1       4

               4                    5         4         2       6
               5                    6         4         2       6
               6                    7         4         2       6
               7                    8         4         2       6

               8                    9         5         3       8
               9                   10         5         3       8
              10                   11         5         3       8
              11                   12         5         3       8
              ...

In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
The advantage of galloping is unbounded as i grows, but it doesn't win at
all until i=6. Before then, it loses twice (at i=2 and i=4), and ties
at the other values. At and after i=6, galloping always wins.
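
The cost columns above are easy to regenerate; here's a quick sanity check
in Python (these formulas just restate the text, they're not part of the
sorter):

from math import log2

def linear_cost(i):       # compares for linear one-at-a-time search
    return i + 1

def gallop_cost(i):       # gallop probes plus the binary-search finish
    return 1 if i == 0 else 2 * int(log2(i)) + 2

for i in range(12):
    print(i, linear_cost(i), gallop_cost(i))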

We can't guess in advance when it's going to win, though, so we do one pair
at a time until the evidence seems strong that galloping may pay. MIN_GALLOP
is 7, and that's pretty strong evidence. However, if the data is random, it
simply will trigger galloping mode purely by luck every now and again, and
it's quite likely to hit one of the losing cases next. On the other hand,
in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it
"should be" then. So the MergeState struct keeps a min_gallop variable
that merge_lo and merge_hi adjust: the longer we stay in galloping mode,
the smaller min_gallop gets, making it easier to transition back to
galloping mode (if we ever leave it in the current merge, and at the
start of the next merge). But whenever the gallop loop doesn't pay,
min_gallop is increased by one, making it harder to transition back
to galloping mode (and again both within a merge and across merges). For
random data, this all but eliminates the gallop penalty: min_gallop grows
large enough that we almost never get into galloping mode. And for cases
like ~sort, min_gallop can fall to as low as 1. This seems to work well,
but in all it's a minor improvement over using a fixed MIN_GALLOP value.
Tim Peters92f81f22002-08-01 00:59:42 +0000571
572
573Galloping Complication
574----------------------
575The description above was for merge_lo. merge_hi has to merge "from the
576other end", and really needs to gallop starting at the last element in a run
577instead of the first. Galloping from the first still works, but does more
Tim Petersec8147b2013-08-24 15:15:19 -0500578comparisons than it should (this is significant -- I timed it both ways). For
579this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT)
580functions have a "hint" argument, which is the index at which galloping
581should begin. So galloping can actually start at any index, and proceed at
582offsets of 1, 3, 7, 15, ... or -1, -3, -7, -15, ... from the starting index.
Tim Peters92f81f22002-08-01 00:59:42 +0000583
584In the code as I type it's always called with either 0 or n-1 (where n is
585the # of elements in a run). It's tempting to try to do something fancier,
586melding galloping with some form of interpolation search; for example, if
587we're merging a run of length 1 with a run of length 10000, index 5000 is
588probably a better guess at the final result than either 0 or 9999. But
589it's unclear how to generalize that intuition usefully, and merging of
590wildly unbalanced runs already enjoys excellent performance.
591
Tim Peters3ddb8562002-08-10 07:04:01 +0000592~sort is a good example of when balanced runs could benefit from a better
593hint value: to the extent possible, this would like to use a starting
594offset equal to the previous value of acount/bcount. Doing so saves about
59510% of the compares in ~sort. However, doing so is also a mixed bag,
596hurting other cases.


Comparing Average # of Compares on Random Arrays
------------------------------------------------
[NOTE: This was done when the new algorithm used about 0.1% more compares
 on random data than does its current incarnation.]

Here list.sort() is samplesort, and list.msort() this sort:

"""
import random
from time import clock as now

def fill(n):
    from random import random
    return [random() for i in range(n)]

def mycmp(x, y):
    global ncmp
    ncmp += 1
    return cmp(x, y)

def timeit(values, method):
    global ncmp
    X = values[:]
    bound = getattr(X, method)
    ncmp = 0
    t1 = now()
    bound(mycmp)
    t2 = now()
    return t2-t1, ncmp

format = "%5s %9.2f %11d"
f2 = "%5s %9.2f %11.2f"

def drive():
    count = sst = sscmp = mst = mscmp = nelts = 0
    while True:
        n = random.randrange(100000)
        nelts += n
        x = fill(n)

        t, c = timeit(x, 'sort')
        sst += t
        sscmp += c

        t, c = timeit(x, 'msort')
        mst += t
        mscmp += c

        count += 1
        if count % 10:
            continue

        print "count", count, "nelts", nelts
        print format % ("sort", sst, sscmp)
        print format % ("msort", mst, mscmp)
        print f2 % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)

drive()
"""

I ran this on Windows and kept using the computer lightly while it was
running. time.clock() is wall-clock time on Windows, with better than
microsecond resolution. samplesort started with a 1.52% #-of-comparisons
disadvantage, fell quickly to 1.48%, and then fluctuated within that small
range. Here's the last chunk of output before I killed the job:

count 2630 nelts 130906543
 sort   6110.80  1937887573
msort   6002.78  1909389381
          1.80        1.49

We've done nearly 2 billion comparisons apiece at Python speed there, and
that's enough <wink>.

For random arrays of size 2 (yes, there are only 2 interesting ones),
samplesort has a 50%(!) comparison disadvantage. This is a consequence of
samplesort special-casing at most one ascending run at the start, then
falling back to the general case if it doesn't find an ascending run
immediately. The consequence is that it ends up using two compares to sort
[2, 1]. Gratifyingly, timsort doesn't do any special-casing, so had to be
taught how to deal with mixtures of ascending and descending runs
efficiently in all cases.


NOTES
-----

BINSORT
A "binary insertion sort" is just like a textbook insertion sort, but instead
of locating the correct position of the next item via linear (one at a time)
search, an equivalent to Python's bisect.bisect_right is used to find the
correct position in logarithmic time. Most texts don't mention this
variation, and those that do usually say it's not worth the bother: insertion
sort remains quadratic (expected and worst cases) either way. Speeding the
search doesn't reduce the quadratic data movement costs.

But in CPython's case, comparisons are extraordinarily expensive compared to
moving data, and the details matter. Moving objects is just copying
pointers. Comparisons can be arbitrarily expensive (can invoke arbitrary
user-supplied Python code), but even in simple cases (like 3 < 4) _all_
decisions are made at runtime: what's the type of the left comparand? the
type of the right? do they need to be coerced to a common type? where's the
code to compare these types? And so on. Even the simplest Python comparison
triggers a large pile of C-level pointer dereferences, conditionals, and
function calls.

So cutting the number of compares is almost always measurably helpful in
CPython, and the savings swamp the quadratic-time data movement costs for
reasonable minrun values.
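
A Python sketch of the variation, leaning on the bisect module for the
logarithmic search (the function name and (a, lo, hi, start) interface are
mine, for illustration):

from bisect import bisect_right

def binary_insertion_sort(a, lo, hi, start):
    # a[lo:start] is already sorted; extend the sorted region to a[lo:hi].
    for i in range(start, hi):
        pivot = a[i]
        pos = bisect_right(a, pivot, lo, i)   # rightmost slot => stable
        a[pos + 1:i + 1] = a[pos:i]           # shift the tail right by one
        a[pos] = pivot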


LEFT OR RIGHT
gallop_left() and gallop_right() are akin to the Python bisect module's
bisect_left() and bisect_right(): they're the same unless the slice they're
searching contains a (at least one) value equal to the value being searched
for. In that case, gallop_left() returns the position immediately before the
leftmost equal value, and gallop_right() the position immediately after the
rightmost equal value. The distinction is needed to preserve stability. In
general, when merging adjacent runs A and B, gallop_left is used to search
thru B for where an element from A belongs, and gallop_right to search thru A
for where an element from B belongs.
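
For reference, the bisect analogy in a concrete case (this is the stdlib
bisect module, not the gallop code):

>>> from bisect import bisect_left, bisect_right
>>> b = [1, 2, 2, 2, 3]
>>> bisect_left(b, 2)      # just before the leftmost 2
1
>>> bisect_right(b, 2)     # just after the rightmost 2
4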


REGION OF UNCERTAINTY
Two kinds of confusion seem to be common about the claim that after finding
a k such that

    B[2**(k-1) - 1] < A[0] <= B[2**k - 1]

then a binary search requires exactly k-1 tries to find A[0]'s proper
location. For concreteness, say k=3, so B[3] < A[0] <= B[7].

The first confusion takes the form "OK, then the region of uncertainty is at
indices 3, 4, 5, 6 and 7: that's 5 elements, not the claimed 2**(k-1) - 1 =
3"; or the region is viewed as a Python slice and the objection is "but that's
the slice B[3:7], so has 7-3 = 4 elements". Resolution: we've already
compared A[0] against B[3] and against B[7], so A[0]'s correct location is
already known wrt _both_ endpoints. What remains is to find A[0]'s correct
location wrt B[4], B[5] and B[6], which spans 3 elements. Or in general, the
slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2**k-1)-1
inclusive = 2**(k-1) through (2**k-1)-1 inclusive, which has
    (2**k-1)-1 - 2**(k-1) + 1 =
    2**k-1 - 2**(k-1) =
    2*2**(k-1)-1 - 2**(k-1) =
    (2-1)*2**(k-1) - 1 =
    2**(k-1) - 1
elements.

The second confusion: "k-1 = 2 binary searches can find the correct location
among 2**(k-1) = 4 elements, but you're only applying it to 3 elements: we
could make this more efficient by arranging for the region of uncertainty to
span 2**(k-1) elements." Resolution: that confuses "elements" with
"locations". In a slice with N elements, there are N+1 _locations_. In the
example, with the region of uncertainty B[4], B[5], B[6], there are 4
locations: before B[4], between B[4] and B[5], between B[5] and B[6], and
after B[6]. In general, across 2**(k-1)-1 elements, there are 2**(k-1)
locations. That's why k-1 binary searches are necessary and sufficient.