Tim Peters, 2002-08-01 (commit 92f81f2)

Intro
-----
This describes an adaptive, stable, natural mergesort, modestly called
timsort (hey, I earned it <wink>).  It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays.

In a nutshell, the main routine marches over the array once, left to right,
alternately identifying the next run, then merging it into the previous
runs "intelligently".  Everything else is complication for speed, and some
hard-won measure of memory efficiency.


Comparison with Python's Samplesort Hybrid
------------------------------------------
+ timsort can require a temp array containing as many as N//2 pointers,
  which means as many as 2*N extra bytes on 32-bit boxes.  It can be
  expected to require a temp array this large when sorting random data; on
  data with significant structure, it may get away without using any extra
  heap memory.  This appears to be the strongest argument against it, but
  compared to the size of an object, 2 temp bytes per element worst-case
  (also expected-case for random data) doesn't scare me much.

  It turns out that Perl is moving to a stable mergesort, and the code for
  that appears always to require a temp array with room for at least N
  pointers.  (Note that I wouldn't want to do that even if space weren't an
  issue; I believe timsort's efforts at memory frugality also save it
  significant pointer-copying costs, and allow it to have a smaller working
  set.)

+ Across about four hours of generating random arrays, and sorting them
  under both methods, samplesort required about 1.5% more comparisons
  (the program is at the end of this file).

+ In real life, this may be faster or slower on random arrays than
  samplesort was, depending on platform quirks.  Since it does fewer
  comparisons on average, it can be expected to do better the more
  expensive a comparison function is.  OTOH, it does more data movement
  (pointer copying) than samplesort, and that may negate its small
  comparison advantage (depending on platform quirks) unless comparison
  is very expensive.

+ On arrays with many kinds of pre-existing order, this blows samplesort out
  of the water.  It's significantly faster than samplesort even on some
  cases samplesort was special-casing the snot out of.  I believe that lists
  very often do have exploitable partial order in real life, and this is the
  strongest argument in favor of timsort (indeed, samplesort's special cases
  for extreme partial order are appreciated by real users, and timsort goes
  much deeper than those, in particular naturally covering every case where
  someone has suggested "and it would be cool if list.sort() had a special
  case for this too ... and for that ...").

+ Here are exact comparison counts across all the tests in sortperf.py,
  when run with arguments "15 20 1".

  First the trivial cases, trivial for samplesort because it special-cased
  them, and trivial for timsort because it naturally works on runs.  Within
  an "n" block, the first line gives the # of compares done by samplesort,
  the second line by timsort, and the third line is the percentage by
  which the samplesort count exceeds the timsort count:

        n    \sort    /sort    =sort
  -------  -------  -------  -------
    32768    32768    32767    32767  samplesort
             32767    32767    32767  timsort
             0.00%    0.00%    0.00%  (samplesort - timsort) / timsort

    65536    65536    65535    65535
             65535    65535    65535
             0.00%    0.00%    0.00%

   131072   131072   131071   131071
            131071   131071   131071
             0.00%    0.00%    0.00%

   262144   262144   262143   262143
            262143   262143   262143
             0.00%    0.00%    0.00%

   524288   524288   524287   524287
            524287   524287   524287
             0.00%    0.00%    0.00%

  1048576  1048576  1048575  1048575
           1048575  1048575  1048575
             0.00%    0.00%    0.00%

  The algorithms are effectively identical in these cases, except that
  timsort does one less compare in \sort.

  Now for the more interesting cases.  lg(n!) is the information-theoretic
  limit for the best any comparison-based sorting algorithm can do on
  average (across all permutations).  When a method gets significantly
  below that, it's either astronomically lucky, or is finding exploitable
  structure in the data.
| 97 | |
Tim Peters | e05f65a | 2002-08-10 05:21:15 +0000 | [diff] [blame] | 98 | n lg(n!) *sort 3sort +sort %sort ~sort !sort |
| 99 | ------- ------- ------ ------- ------- ------ ------- -------- |
| 100 | 32768 444255 453096 453614 32908 452871 130491 469141 old |
| 101 | 448885 33016 33007 50426 182083 65534 new |
| 102 | 0.94% 1273.92% -0.30% 798.09% -28.33% 615.87% %ch from new |
Tim Peters | 92f81f2 | 2002-08-01 00:59:42 +0000 | [diff] [blame] | 103 | |
Tim Peters | b80595f | 2002-08-10 03:04:33 +0000 | [diff] [blame] | 104 | 65536 954037 972699 981940 65686 973104 260029 1004607 |
Tim Peters | e05f65a | 2002-08-10 05:21:15 +0000 | [diff] [blame] | 105 | 962991 65821 65808 101667 364341 131070 |
| 106 | 1.01% 1391.83% -0.19% 857.15% -28.63% 666.47% |
Tim Peters | 92f81f2 | 2002-08-01 00:59:42 +0000 | [diff] [blame] | 107 | |
Tim Peters | b80595f | 2002-08-10 03:04:33 +0000 | [diff] [blame] | 108 | 131072 2039137 2101881 2091491 131232 2092894 554790 2161379 |
Tim Peters | e05f65a | 2002-08-10 05:21:15 +0000 | [diff] [blame] | 109 | 2057533 131410 131361 206193 728871 262142 |
| 110 | 2.16% 1491.58% -0.10% 915.02% -23.88% 724.51% |
Tim Peters | 92f81f2 | 2002-08-01 00:59:42 +0000 | [diff] [blame] | 111 | |
Tim Peters | b80595f | 2002-08-10 03:04:33 +0000 | [diff] [blame] | 112 | 262144 4340409 4464460 4403233 262314 4445884 1107842 4584560 |
Tim Peters | e05f65a | 2002-08-10 05:21:15 +0000 | [diff] [blame] | 113 | 4377402 262437 262459 416347 1457945 524286 |
| 114 | 1.99% 1577.82% -0.06% 967.83% -24.01% 774.44% |
Tim Peters | 92f81f2 | 2002-08-01 00:59:42 +0000 | [diff] [blame] | 115 | |
Tim Peters | 3ddb856 | 2002-08-10 07:04:01 +0000 | [diff] [blame] | 116 | 524288 9205096 9453356 9408463 524468 9441930 2218577 9692015 |
| 117 | 9278734 524580 524633 837947 2916107 1048574 |
Tim Peters | e05f65a | 2002-08-10 05:21:15 +0000 | [diff] [blame] | 118 | 1.88% 1693.52% -0.03% 1026.79% -23.92% 824.30% |
Tim Peters | 92f81f2 | 2002-08-01 00:59:42 +0000 | [diff] [blame] | 119 | |
Tim Peters | b80595f | 2002-08-10 03:04:33 +0000 | [diff] [blame] | 120 | 1048576 19458756 19950272 19838588 1048766 19912134 4430649 20434212 |
Tim Peters | e05f65a | 2002-08-10 05:21:15 +0000 | [diff] [blame] | 121 | 19606028 1048958 1048941 1694896 5832445 2097150 |
| 122 | 1.76% 1791.27% -0.02% 1074.83% -24.03% 874.38% |

  Discussion of cases:

  *sort:  There's no structure in random data to exploit, so the theoretical
  limit is lg(n!).  Both methods get close to that, and timsort is hugging
  it (indeed, in a *marginal* sense, it's a spectacular improvement --
  there's only about 1% left before hitting the wall, and timsort knows
  darned well it's doing compares that won't pay on random data -- but so
  does the samplesort hybrid).  For contrast, Hoare's original random-pivot
  quicksort does about 39% more compares than the limit, and the median-of-3
  variant about 19% more.

  3sort, %sort, and !sort:  No contest; there's structure in this data, but
  not of the specific kinds samplesort special-cases.  Note that structure
  in !sort wasn't put there on purpose -- it was crafted as a worst case for
  a previous quicksort implementation.  That timsort nails it came as a
  surprise to me (although it's obvious in retrospect).

  +sort:  samplesort special-cases this data, and does a few fewer compares
  than timsort.  However, timsort runs this case significantly faster on all
  boxes we have timings for, because timsort is in the business of merging
  runs efficiently, while samplesort does much more data movement in this
  (for it) special case.

  ~sort:  samplesort's special cases for large masses of equal elements are
  extremely effective on ~sort's specific data pattern, and timsort just
  isn't going to get close to that, despite that it's clearly getting a
  great deal of benefit out of the duplicates (the # of compares is much less
  than lg(n!)).  ~sort has a perfectly uniform distribution of just 4
  distinct values, and as the distribution gets more skewed, samplesort's
  equal-element gimmicks become less effective, while timsort's adaptive
  strategies find more to exploit; in a database supplied by Kevin Altis, a
  sort on its highly skewed "on which stock exchange does this company's
  stock trade?" field ran over twice as fast under timsort.

  However, despite that timsort does many more comparisons on ~sort, and
  that on several platforms ~sort runs highly significantly slower under
  timsort, on other platforms ~sort runs highly significantly faster under
  timsort.  No other kind of data has shown this wild x-platform behavior,
  and we don't have an explanation for it.  The only thing I can think of
  that could transform what "should be" highly significant slowdowns into
  highly significant speedups on some boxes are catastrophic cache effects
  in samplesort.

  But timsort "should be" slower than samplesort on ~sort, so it's hard
  to count that it isn't on some boxes as a strike against it <wink>.

+ Here's the highwater mark for the number of heap-based temp slots (4
  bytes each on this box) needed by each test, again with arguments
  "15 20 1":

     2**i   *sort  \sort  /sort   3sort  +sort   %sort   ~sort  =sort   !sort
    32768   16384      0      0    6256      0   10821   12288      0   16383
    65536   32766      0      0   21652      0   31276   24576      0   32767
   131072   65534      0      0   17258      0   58112   49152      0   65535
   262144  131072      0      0   35660      0  123561   98304      0  131071
   524288  262142      0      0   31302      0  212057  196608      0  262143
  1048576  524286      0      0  312438      0  484942  393216      0  524287

  Discussion:  The tests that end up doing (close to) perfectly balanced
  merges (*sort, !sort) need all N//2 temp slots (or almost all).  ~sort
  also ends up doing balanced merges, but systematically benefits a lot from
  the preliminary pre-merge searches described under "Merge Memory" later.
  %sort approaches having a balanced merge at the end because the random
  selection of elements to replace is expected to produce an out-of-order
  element near the midpoint.  \sort, /sort, =sort are the trivial one-run
  cases, needing no merging at all.  +sort ends up having one very long run
  and one very short, and so gets all the temp space it needs from the small
  temparray member of the MergeState struct (note that the same would be
  true if the new random elements were prefixed to the sorted list instead,
  but not if they appeared "in the middle").  3sort approaches N//3 temp
  slots twice, but the run lengths that remain after 3 random exchanges
  clearly have very high variance.


A detailed description of timsort follows.

Runs
----
count_run() returns the # of elements in the next run.  A run is either
"ascending", which means non-decreasing:

    a0 <= a1 <= a2 <= ...

or "descending", which means strictly decreasing:

    a0 > a1 > a2 > ...

Note that a run is always at least 2 long, unless we start at the array's
last element.

The definition of descending is strict, because the main routine reverses
a descending run in-place, transforming a descending run into an ascending
run.  Reversal is done via the obvious fast "swap elements starting at each
end, and converge at the middle" method, and that can violate stability if
the slice contains any equal elements.  Using a strict definition of
descending ensures that a descending run contains distinct elements.
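
The run-detection rule can be sketched in pure Python (a hypothetical
helper for plain lists; the real count_run is C code in listobject.c and
goes through the full rich-comparison machinery):

```python
def count_run(a, lo):
    """Return the length of the run beginning at a[lo], reversing a
    descending run in place so the run is always left ascending."""
    n = len(a)
    if lo >= n - 1:                  # at the last element: run of length 1
        return 1
    run = 2
    if a[lo + 1] < a[lo]:            # strictly descending run
        while lo + run < n and a[lo + run] < a[lo + run - 1]:
            run += 1
        # Strictness means the run's elements are pairwise distinct,
        # so reversing it cannot violate stability.
        a[lo:lo + run] = a[lo:lo + run][::-1]
    else:                            # "ascending" run: non-decreasing
        while lo + run < n and not (a[lo + run] < a[lo + run - 1]):
            run += 1
    return run
```

For example, count_run([3, 2, 1, 5], 0) reports a run of length 3 and
leaves the list as [1, 2, 3, 5].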

If an array is random, it's very unlikely we'll see long runs.  If a natural
run contains less than minrun elements (see next section), the main loop
artificially boosts it to minrun elements, via a stable binary insertion sort
applied to the right number of array elements following the short natural
run.  In a random array, *all* runs are likely to be minrun long as a
result.  This has two primary good effects:

1. Random data strongly tends then toward perfectly balanced (both runs have
   the same length) merges, which is the most efficient way to proceed when
   data is random.

2. Because runs are never very short, the rest of the code doesn't make
   heroic efforts to shave a few cycles off per-merge overheads.  For
   example, reasonable use of function calls is made, rather than trying to
   inline everything.  Since there are no more than N/minrun runs to begin
   with, a few "extra" function calls per merge is barely measurable.
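
The boost step can be sketched as a stable binary insertion sort (a
hypothetical pure-Python version; `start` marks how much of a[lo:hi] is
already sorted, i.e. the end of the short natural run):

```python
from bisect import bisect_right

def binary_insertion_sort(a, lo, hi, start):
    """Stably sort a[lo:hi] in place, assuming a[lo:start] is sorted."""
    for i in range(start, hi):
        pivot = a[i]
        # bisect_right puts pivot after any equal elements already in
        # place, which is exactly what stability requires.
        pos = bisect_right(a, pivot, lo, i)
        a[pos + 1:i + 1] = a[pos:i]   # shift the tail right one slot
        a[pos] = pivot
```

The binary search spends O(lg n) compares per element; it's the O(n)
data movement per insertion that makes large minrun values hurt.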


Computing minrun
----------------
If N < 64, minrun is N.  IOW, binary insertion sort is used for the whole
array then; it's hard to beat that given the overheads of trying something
fancier.

When N is a power of 2, testing on random data showed that minrun values of
16, 32, 64 and 128 worked about equally well.  At 256 the data-movement cost
in binary insertion sort clearly hurt, and at 8 the increase in the number
of function calls clearly hurt.  Picking *some* power of 2 is important
here, so that the merges end up perfectly balanced (see next section).  We
pick 32 as a good value in the sweet range; picking a value at the low end
allows the adaptive gimmicks more opportunity to exploit shorter natural
runs.

Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case!  Consider N=2112:

>>> divmod(2112, 32)
(66, 0)
>>>

If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32.  The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end.  The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and-- mergesort's
bugaboo relative to samplesort --a lot more data movement (O(N) copies just
to get 64 elements into place).

If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced.  Better!

What we want to avoid is picking minrun such that in

    q, r = divmod(N, minrun)

q is a power of 2 and r>0 (then the last merge only gets r elements into
place, and r < minrun is small compared to N), or q a little larger than a
power of 2 regardless of r (then we've got a case similar to "2112", again
leaving too little work for the last merge to do).

Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
power of 2, or if that isn't possible, is close to, but strictly less than,
a power of 2.  This is easier to do than it may sound:  take the first 6
bits of N, and add 1 if any of the remaining bits are set.  In fact, that
rule covers every case in this section, including small N and exact powers
of 2; merge_compute_minrun() is a deceptively simple function.
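
The six-bit rule translates almost directly into code; this is essentially
merge_compute_minrun() transcribed into Python:

```python
def merge_compute_minrun(n):
    """Compute a good value for minrun: take the first 6 bits of n,
    and add 1 if any of the remaining bits are set."""
    r = 0                 # becomes 1 if any bit shifted out was set
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r
```

Note that the "2112" example above now comes out perfectly:
merge_compute_minrun(2112) is 33, and 2112 == 64 * 33.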


The Merge Pattern
-----------------
In order to exploit regularities in the data, we're merging on natural
run lengths, and they can become wildly unbalanced.  That's a Good Thing
for this sort!  It means we have to find a way to manage an assortment of
potentially very different run lengths, though.

Stability constrains permissible merging patterns.  For example, if we have
3 consecutive runs of lengths

    A:10000  B:20000  C:10000

we dare not merge A with C first, because if A, B and C happen to contain
a common element, it would get out of order wrt its occurrence(s) in B.  The
merging must be done as (A+B)+C or A+(B+C) instead.

So merging is always done on two consecutive runs at a time, and in-place,
although this may require some temp memory (more on that later).

When a run is identified, its base address and length are pushed on a stack
in the MergeState struct.  merge_collapse() is then called to see whether it
should merge it with preceding run(s).  We would like to delay merging as
long as possible in order to exploit patterns that may come up later, but we
like even more to do merging as soon as possible to exploit that the run just
found is still high in the memory hierarchy.  We also can't delay merging
"too long" because it consumes memory to remember the runs that are still
unmerged, and the stack has a fixed size.

What turned out to be a good compromise maintains two invariants on the
stack entries, where A, B and C are the lengths of the three rightmost
not-yet merged slices:

1.  A > B+C
2.  B > C

Note that, by induction, #2 implies the lengths of pending runs form a
decreasing sequence.  #1 implies that, reading the lengths right to left,
the pending-run lengths grow at least as fast as the Fibonacci numbers.
Therefore the stack can never grow larger than about log_base_phi(N) entries,
where phi = (1+sqrt(5))/2 ~= 1.618.  Thus a small # of stack slots suffice
for very large arrays.

If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
e.g., if the last 3 entries are

    A:30  B:20  C:10

then B is merged with C, leaving

    A:30  BC:30

on the stack.  Or if they were

    A:500  B:400  C:1000

then A is merged with B, leaving

    AB:900  C:1000

on the stack.

In both examples, the stack configuration after the merge still violates
invariant #2, and merge_collapse() goes on to continue merging runs until
both invariants are satisfied.  As an extreme case, suppose we didn't do the
minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
and 2.  Nothing would get merged until the final 2 was seen, and that would
trigger 7 perfectly balanced merges.
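
The invariant maintenance can be sketched on run lengths alone (a toy
model of merge_collapse(): summing two adjacent lengths stands in for
actually merging the runs; the real code works on the MergeState stack of
base/length pairs):

```python
def merge_collapse(runs):
    """runs: pending run lengths, rightmost = most recently found.
    Collapse adjacent entries until A > B+C and B > C both hold."""
    while len(runs) > 1:
        n = len(runs) - 2            # index of B in the ...A, B, C suffix
        if n > 0 and runs[n - 1] <= runs[n] + runs[n + 1]:
            # A <= B+C: merge B with the smaller of A and C; on a tie,
            # favor C (freshness in cache).
            if runs[n - 1] < runs[n + 1]:
                n -= 1               # merge A with B instead
            runs[n:n + 2] = [runs[n] + runs[n + 1]]
        elif runs[n] <= runs[n + 1]:
            runs[n:n + 2] = [runs[n] + runs[n + 1]]   # B <= C
        else:
            break                    # both invariants hold
    return runs
```

Replaying the examples above: [30, 20, 10] and [500, 400, 1000] both keep
collapsing until a single run remains, and the extreme case
[128, 64, 32, 16, 8, 4, 2, 2] triggers the cascade of 7 balanced merges.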

The thrust of these rules when they trigger merging is to balance the run
lengths as closely as possible, while keeping a low bound on the number of
runs we have to remember.  This is maximally effective for random data,
where all runs are likely to be of (artificially forced) length minrun, and
then we get a sequence of perfectly balanced merges (with, perhaps, some
oddballs at the end).

OTOH, one reason this sort is so good for partly ordered data has to do
with wildly unbalanced run lengths.


Merge Memory
------------
Merging adjacent runs of lengths A and B in-place is very difficult.
Theoretical constructions are known that can do it, but they're too difficult
and slow for practical use.  But if we have temp memory equal to min(A, B),
it's easy.

If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
and then we can do the obvious merge algorithm left to right, from the temp
area and B, starting the stores into where A used to live.  There's always a
free area in the original area comprising a number of elements equal to the
number not yet merged from the temp array (trivially true at the start;
proceed by induction).  The only tricky bit is that if a comparison raises an
exception, we have to remember to copy the remaining elements back in from
the temp area, lest the array end up with duplicate entries from B.  But
that's exactly the same thing we need to do if we reach the end of B first,
so the exit code is pleasantly common to both the normal and error cases.
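
Stripped of galloping and error handling, the core of that merge can be
sketched as (a hypothetical pure-Python reduction; the real merge_lo is C
and far more careful):

```python
def merge_lo(a, lo, mid, hi):
    """Merge the adjacent sorted runs a[lo:mid] and a[mid:hi] in place,
    where the left run is the shorter one."""
    tmp = a[lo:mid]                 # copy of the shorter run A
    i, j, dest = 0, mid, lo
    while i < len(tmp) and j < hi:
        if a[j] < tmp[i]:           # strict <: ties go to A, so equal
            a[dest] = a[j]          # elements keep their order (stability)
            j += 1
        else:
            a[dest] = tmp[i]
            i += 1
        dest += 1                   # dest never catches up to j, so B's
                                    # unmerged tail is never overwritten
    a[dest:dest + len(tmp) - i] = tmp[i:]   # flush A's leftovers
```

The final flush is the common exit code: whether we got there by
exhausting B or (in the real version) by a comparison raising, copying
the rest of the temp area back is exactly what's needed.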

If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
much the same, except that we need to merge right to left, copying B into a
temp array and starting the stores at the right end of where B used to live.

A refinement:  When we're about to merge adjacent runs A and B, we first do
a form of binary search (more on that later) to see where B[0] should end up
in A.  Elements in A preceding that point are already in their final
positions, effectively shrinking the size of A.  Likewise we also search to
see where A[-1] should end up in B, and elements of B after that point can
also be ignored.  This cuts the amount of temp memory needed by the same
amount.

These preliminary searches may not pay off, and can be expected *not* to
repay their cost if the data is random.  But they can win huge in all of
time, copying, and memory savings when they do pay, so this is one of the
"per-merge overheads" mentioned above that we're happy to endure because
there is at most one very short run.  It's generally true in this algorithm
that we're willing to gamble a little to win a lot, even though the net
expectation is negative for random data.


Merge Algorithms
----------------
merge_lo() and merge_hi() are where the bulk of the time is spent.  merge_lo
deals with runs where A <= B, and merge_hi where A > B.  They don't know
whether the data is clustered or uniform, but a lovely thing about merging
is that many kinds of clustering "reveal themselves" by how many times in a
row the winning merge element comes from the same run.  We'll only discuss
merge_lo here; merge_hi is exactly analogous.

Merging begins in the usual, obvious way, comparing the first element of A
to the first of B, and moving B[0] to the merge area if it's less than A[0],
else moving A[0] to the merge area.  Call that the "one pair at a time"
mode.  The only twist here is keeping track of how many times in a row "the
winner" comes from the same run.

If that count reaches MIN_GALLOP, we switch to "galloping mode".  Here
we *search* B for where A[0] belongs, and move over all the B's before
that point in one chunk to the merge area, then move A[0] to the merge
area.  Then we search A for where B[0] belongs, and similarly move a
slice of A in one chunk.  Then back to searching B for where A[0] belongs,
etc.  We stay in galloping mode until both searches find slices to copy
less than MIN_GALLOP elements long, at which point we go back to one-pair-
at-a-time mode.

A refinement:  The MergeState struct contains the value of min_gallop that
controls when we enter galloping mode, initialized to MIN_GALLOP.
merge_lo() and merge_hi() adjust this higher when galloping isn't paying
off, and lower when it is.


Galloping
---------
Still without loss of generality, assume A is the shorter run.  In galloping
mode, we first look for A[0] in B.  We do this via "galloping", comparing
A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1].  This takes at most
roughly lg(B) comparisons, and, unlike a straight binary search, favors
finding the right spot early in B (more on that later).

After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
consecutive elements, and a straight binary search requires exactly k-1
additional comparisons to nail it.  Then we copy all the B's up to that
point in one chunk, and then copy A[0].  Note that no matter where A[0]
belongs in B, the combination of galloping + binary search finds it in no
more than about 2*lg(B) comparisons.
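
The two-phase search can be sketched for plain Python lists (hypothetical;
the real gallop_left in listobject.c also takes a hint index so it can
gallop outward from a likely spot rather than always from B's left end):

```python
from bisect import bisect_left

def gallop_left(key, b):
    """Return i such that all of b[:i] < key <= all of b[i:], galloping
    through b[0], b[1], b[3], b[7], ... then binary-searching the
    remaining region of uncertainty."""
    if not b or key <= b[0]:
        return 0
    last, ofs = 0, 1                 # invariant: b[last] < key
    while ofs < len(b) and b[ofs] < key:
        last = ofs
        ofs = 2 * ofs + 1            # visits indices 1, 3, 7, ... 2**k - 1
    hi = min(ofs, len(b))
    # Now b[last] < key, and key <= b[hi] if hi is in range: finish with
    # a binary search over the uncertain slice b[last+1:hi].
    return bisect_left(b, key, last + 1, hi)
```

Using strictly-less here means B elements equal to A[0] stay to its right,
preserving stability when A is the left run.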

If we did a straight binary search, we could find it in no more than
ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
comparisons no matter where A[0] belongs.  Straight binary search thus loses
to galloping unless the run is quite long, and we simply can't guess
whether it is in advance.

If data is random and runs have the same length, A[0] belongs at B[0] half
the time, at B[1] a quarter of the time, and so on:  a consecutive winning
sub-run in B of length k occurs with probability 1/2**(k+1).  So long
winning sub-runs are extremely unlikely in random data, and guessing that a
winning sub-run is going to be long is a dangerous game.

OTOH, if data is lopsided or lumpy or contains many duplicates, long
stretches of winning sub-runs are very likely, and cutting the number of
comparisons needed to find one from O(B) to O(log B) is a huge win.

Galloping compromises by getting out fast if there isn't a long winning
sub-run, yet finding such very efficiently when they exist.

I first learned about the galloping strategy in a related context; see:

    "Adaptive Set Intersections, Unions, and Differences" (2000)
    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

and its followup(s).  An earlier paper called the same strategy
"exponential search":

    "Optimistic Sorting and Information Theoretic Complexity"
    Peter McIlroy
    SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
    467-474, Austin, Texas, 25-27 January 1993.

and it probably dates back to an earlier paper by Bentley and Yao.  The
McIlroy paper in particular has good analysis of a mergesort that's
probably strongly related to this one in its galloping strategy.


Galloping with a Broken Leg
---------------------------
So why don't we always gallop?  Because it can lose, on two counts:

1. While we're willing to endure small per-merge overheads, per-comparison
   overheads are a different story.  Calling Yet Another Function per
   comparison is expensive, and gallop_left() and gallop_right() are
   too long-winded for sane inlining.

2. Galloping can-- alas --require more comparisons than linear one-at-a-time
   search, depending on the data.

#2 requires details.  If A[0] belongs before B[0], galloping requires 1
compare to determine that, same as linear search, except it costs more
to call the gallop function.  If A[0] belongs right before B[1], galloping
requires 2 compares, again same as linear search.  On the third compare,
galloping checks A[0] against B[3], and if it's <=, requires one more
compare to determine whether A[0] belongs at B[2] or B[3].  That's a total
of 4 compares, but if A[0] does belong at B[2], linear search would have
discovered that in only 3 compares, and that's a huge loss!  Really.  It's
an increase of 33% in the number of compares needed, and comparisons are
expensive in Python.

    index in B where  # compares linear  # gallop  # binary  gallop
    A[0] belongs      search needs       compares  compares  total
    ----------------  -----------------  --------  --------  ------
                   0                  1         1         0       1

                   1                  2         2         0       2

                   2                  3         3         1       4
                   3                  4         3         1       4

                   4                  5         4         2       6
                   5                  6         4         2       6
                   6                  7         4         2       6
                   7                  8         4         2       6

                   8                  9         5         3       8
                   9                 10         5         3       8
                  10                 11         5         3       8
                  11                 12         5         3       8
    ...

In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
The advantage of galloping is unbounded as i grows, but it doesn't win at
all until i=6.  Before then, it loses twice (at i=2 and i=4), and ties
at the other values.  At and after i=6, galloping always wins.
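
The table's "gallop total" column and the break-even point at i=6 follow
directly from the two closed forms:

```python
def linear_compares(i):
    """One-pair-at-a-time search: i+1 compares to learn A[0] goes at B[i]."""
    return i + 1

def gallop_compares(i):
    """Gallop + binary search: 2*floor(lg(i)) + 2 compares for i >= 1,
    and a single compare for i == 0 (floor(lg(i)) == i.bit_length() - 1)."""
    return 1 if i == 0 else 2 * (i.bit_length() - 1) + 2

# Reproduce the gallop-total column for i = 0..11, and find the first
# index where galloping strictly beats linear search.
totals = [gallop_compares(i) for i in range(12)]
first_win = min(i for i in range(1, 100)
                if gallop_compares(i) < linear_compares(i))
```

Here totals comes out [1, 2, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8], matching the
table, and first_win is 6.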

We can't guess in advance when it's going to win, though, so we do one pair
at a time until the evidence seems strong that galloping may pay. MIN_GALLOP
is 7, and that's pretty strong evidence. However, if the data is random, it
simply will trigger galloping mode purely by luck every now and again, and
it's quite likely to hit one of the losing cases next. On the other hand,
in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it
"should be" then. So the MergeState struct keeps a min_gallop variable
that merge_lo and merge_hi adjust: the longer we stay in galloping mode,
the smaller min_gallop gets, making it easier to transition back to
galloping mode (if we ever leave it in the current merge, and at the
start of the next merge). But whenever the gallop loop doesn't pay,
min_gallop is increased by one, making it harder to transition back
to galloping mode (and again both within a merge and across merges). For
random data, this all but eliminates the gallop penalty: min_gallop grows
large enough that we almost never get into galloping mode. And for cases
like ~sort, min_gallop can fall to as low as 1. This seems to work well,
but in all it's a minor improvement over using a fixed MIN_GALLOP value.
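The adaptive scheme can be sketched like this -- a simplified Python
rendering of the merge_lo bookkeeping, not the C source line-for-line.
Plain binary search (bisect) stands in for the exponential gallop probe:

```python
import bisect

def adaptive_merge(a, b, state):
    """Stable merge of sorted lists a and b (ties favor a).

    state['min_gallop'] plays the role of the MergeState field: it
    shrinks while galloping pays and grows when it doesn't, and the
    adjustment persists across calls.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        acount = bcount = 0
        # One-pair-at-a-time mode: count consecutive wins per side.
        while i < len(a) and j < len(b):
            if b[j] < a[i]:
                out.append(b[j]); j += 1
                bcount += 1; acount = 0
            else:
                out.append(a[i]); i += 1
                acount += 1; bcount = 0
            if max(acount, bcount) >= state['min_gallop']:
                break
        # Galloping mode: copy whole winning stretches in one slice.
        while i < len(a) and j < len(b):
            k = bisect.bisect_right(a, b[j], i)   # a[i:k] belong before b[j]
            acount = k - i
            out.extend(a[i:k]); i = k
            if i == len(a):
                break
            k = bisect.bisect_left(b, a[i], j)    # b[j:k] belong before a[i]
            bcount = k - j
            out.extend(b[j:k]); j = k
            if acount < state['min_gallop'] and bcount < state['min_gallop']:
                state['min_gallop'] += 1          # galloping stopped paying
                break
            state['min_gallop'] = max(1, state['min_gallop'] - 1)  # it paid
    out.extend(a[i:]); out.extend(b[j:])
    return out
```

Called repeatedly, the threshold drifts as described: random inputs push
min_gallop up, while inputs whose runs interleave in long stretches pull it
down toward 1.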


Galloping Complication
----------------------
The description above was for merge_lo. merge_hi has to merge "from the
other end", and really needs to gallop starting at the last element in a run
instead of the first. Galloping from the first still works, but does more
comparisons than it should (this is significant -- I timed it both ways).
For this reason, the gallop_left() and gallop_right() functions have a
"hint" argument, which is the index at which galloping should begin. So
galloping can actually start at any index, and proceed at offsets of 1, 3,
7, 15, ... or -1, -3, -7, -15, ... from the starting index.
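A hinted gallop can be sketched as follows. This shows only the forward
(+1, +3, +7, ...) direction and falls back to plain binary search for the
backward case; the C version gallops backward from the hint symmetrically:

```python
import bisect

def gallop_right(key, a, hint):
    """Return the index where key belongs in sorted list a, galloping
    forward from a[hint]; equal elements end up left of the result.
    Simplified sketch, not the CPython C function."""
    if key < a[hint]:
        # Backward case: the real code gallops at -1, -3, -7, ...;
        # here a plain binary search over a[0:hint] stands in.
        return bisect.bisect_right(a, key, 0, hint)
    # Probe at offsets 1, 3, 7, 15, ... from hint until we overshoot.
    ofs, last_ofs = 1, 0
    maxofs = len(a) - hint
    while ofs < maxofs and not (key < a[hint + ofs]):
        last_ofs = ofs
        ofs = (ofs << 1) + 1          # 1, 3, 7, 15, ...
    ofs = min(ofs, maxofs)
    # key now belongs in a(hint+last_ofs, hint+ofs]; binary-search the gap.
    return bisect.bisect_right(a, key, hint + last_ofs + 1, hint + ofs)
```

merge_lo would call this with hint 0 and merge_hi with hint n-1, per the
paragraph above.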

In the code as I type it's always called with either 0 or n-1 (where n is
the # of elements in a run). It's tempting to try to do something fancier,
melding galloping with some form of interpolation search; for example, if
we're merging a run of length 1 with a run of length 10000, index 5000 is
probably a better guess at the final result than either 0 or 9999. But
it's unclear how to generalize that intuition usefully, and merging of
wildly unbalanced runs already enjoys excellent performance.
| 578 | |
Tim Peters | 3ddb856 | 2002-08-10 07:04:01 +0000 | [diff] [blame] | 579 | ~sort is a good example of when balanced runs could benefit from a better |
| 580 | hint value: to the extent possible, this would like to use a starting |
| 581 | offset equal to the previous value of acount/bcount. Doing so saves about |
| 582 | 10% of the compares in ~sort. However, doing so is also a mixed bag, |
| 583 | hurting other cases. |
| 584 | |

Comparing Average # of Compares on Random Arrays
------------------------------------------------
[NOTE: This was done when the new algorithm used about 0.1% more compares
on random data than does its current incarnation.]

Here list.sort() is samplesort, and list.msort() this sort:

"""
import random
from time import clock as now

def fill(n):
    from random import random
    return [random() for i in xrange(n)]

def mycmp(x, y):
    global ncmp
    ncmp += 1
    return cmp(x, y)

def timeit(values, method):
    global ncmp
    X = values[:]
    bound = getattr(X, method)
    ncmp = 0
    t1 = now()
    bound(mycmp)
    t2 = now()
    return t2-t1, ncmp

format = "%5s %9.2f %11d"
f2     = "%5s %9.2f %11.2f"

def drive():
    count = sst = sscmp = mst = mscmp = nelts = 0
    while True:
        n = random.randrange(100000)
        nelts += n
        x = fill(n)

        t, c = timeit(x, 'sort')
        sst += t
        sscmp += c

        t, c = timeit(x, 'msort')
        mst += t
        mscmp += c

        count += 1
        if count % 10:
            continue

        print "count", count, "nelts", nelts
        print format % ("sort", sst, sscmp)
        print format % ("msort", mst, mscmp)
        print f2 % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)

drive()
"""
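That script is Python 2 circa 2002 (xrange, print statements, the
comparison-function argument to sort, and the experimental msort method,
none of which survive in Python 3). A comparable compare-counting harness
in modern Python -- counting only timsort's compares, since samplesort is
long gone -- might look like:

```python
import random

class Counted:
    """Wrap a value so every < comparison is tallied."""
    ncmp = 0           # class-wide counter, shared by all instances
    __slots__ = ('v',)
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        Counted.ncmp += 1
        return self.v < other.v

def count_compares(values):
    # sorted() only needs __lt__, so this counts every compare it makes.
    Counted.ncmp = 0
    sorted(Counted(v) for v in values)
    return Counted.ncmp

random.seed(0)
print(count_compares([random.random() for _ in range(10000)]))
# An already-ascending run costs exactly n-1 compares under timsort:
print(count_compares(list(range(1000))))   # -> 999
```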

I ran this on Windows and kept using the computer lightly while it was
running. time.clock() is wall-clock time on Windows, with better than
microsecond resolution. samplesort started with a 1.52% #-of-comparisons
disadvantage, fell quickly to 1.48%, and then fluctuated within that small
range. Here's the last chunk of output before I killed the job:

count 2630 nelts 130906543
 sort   6110.80  1937887573
msort   6002.78  1909389381
           1.80        1.49

We've done nearly 2 billion comparisons apiece at Python speed there, and
that's enough <wink>.

For random arrays of size 2 (yes, there are only 2 interesting ones),
samplesort has a 50%(!) comparison disadvantage. This is a consequence of
samplesort special-casing at most one ascending run at the start, then
falling back to the general case if it doesn't find an ascending run
immediately. The consequence is that it ends up using two compares to sort
[2, 1]. Gratifyingly, timsort doesn't do any special-casing, so had to be
taught how to deal with mixtures of ascending and descending runs
efficiently in all cases.