Intro
-----
This describes an adaptive, stable, natural mergesort, modestly called
timsort (hey, I earned it <wink>). It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays.

In a nutshell, the main routine marches over the array once, left to right,
alternately identifying the next run, then merging it into the previous
runs "intelligently". Everything else is complication for speed, and some
hard-won measure of memory efficiency.


Comparison with Python's Samplesort Hybrid
------------------------------------------
+ timsort can require a temp array containing as many as N//2 pointers,
  which means as many as 2*N extra bytes on 32-bit boxes. It can be
  expected to require a temp array this large when sorting random data; on
  data with significant structure, it may get away without using any extra
  heap memory. This appears to be the strongest argument against it, but
  compared to the size of an object, 2 temp bytes worst-case (also expected-
  case for random data) doesn't scare me much.

  It turns out that Perl is moving to a stable mergesort, and the code for
  that appears always to require a temp array with room for at least N
  pointers. (Note that I wouldn't want to do that even if space weren't an
  issue; I believe its efforts at memory frugality also save timsort
  significant pointer-copying costs, and allow it to have a smaller working
  set.)

+ Across about four hours of generating random arrays, and sorting them
  under both methods, samplesort required about 1.5% more comparisons
  (the program is at the end of this file).

+ In real life, this may be faster or slower on random arrays than
  samplesort was, depending on platform quirks. Since it does fewer
  comparisons on average, it can be expected to do better the more
  expensive a comparison function is. OTOH, it does more data movement
  (pointer copying) than samplesort, and that may negate its small
  comparison advantage (depending on platform quirks) unless comparison
  is very expensive.

+ On arrays with many kinds of pre-existing order, this blows samplesort out
  of the water. It's significantly faster than samplesort even on some
  cases samplesort was special-casing the snot out of. I believe that lists
  very often do have exploitable partial order in real life, and this is the
  strongest argument in favor of timsort (indeed, samplesort's special cases
  for extreme partial order are appreciated by real users, and timsort goes
  much deeper than those, in particular naturally covering every case where
  someone has suggested "and it would be cool if list.sort() had a special
  case for this too ... and for that ...").

+ Here are exact comparison counts across all the tests in sortperf.py,
  when run with arguments "15 20 1".

  Column Key:
      *sort: random data
      \sort: descending data
      /sort: ascending data
      3sort: ascending, then 3 random exchanges
      +sort: ascending, then 10 random at the end
      %sort: ascending, then randomly replace 1% of elements w/ random values
      ~sort: many duplicates
      =sort: all equal
      !sort: worst case scenario

  First the trivial cases, trivial for samplesort because it special-cased
  them, and trivial for timsort because it naturally works on runs. Within
  an "n" block, the first line gives the # of compares done by samplesort,
  the second line by timsort, and the third line is the percentage by
  which the samplesort count exceeds the timsort count:

      n    \sort    /sort    =sort
-------  -------  -------  -------
  32768    32768    32767    32767  samplesort
           32767    32767    32767  timsort
           0.00%    0.00%    0.00%  (samplesort - timsort) / timsort

  65536    65536    65535    65535
           65535    65535    65535
           0.00%    0.00%    0.00%

 131072   131072   131071   131071
          131071   131071   131071
           0.00%    0.00%    0.00%

 262144   262144   262143   262143
          262143   262143   262143
           0.00%    0.00%    0.00%

 524288   524288   524287   524287
          524287   524287   524287
           0.00%    0.00%    0.00%

1048576  1048576  1048575  1048575
         1048575  1048575  1048575
           0.00%    0.00%    0.00%

  The algorithms are effectively identical in these cases, except that
  timsort does one less compare in \sort.

  Now for the more interesting cases. Where lg(x) is the logarithm of x to
  the base 2 (e.g., lg(8)=3), lg(n!) is the information-theoretic limit for
  the best any comparison-based sorting algorithm can do on average (across
  all permutations). When a method gets significantly below that, it's
  either astronomically lucky, or is finding exploitable structure in the
  data.


      n    lg(n!)     *sort     3sort     +sort     %sort     ~sort     !sort
-------  --------  --------  --------  --------  --------  --------  --------
  32768    444255    453096    453614     32908    452871    130491    469141  old
                     448885     33016     33007     50426    182083     65534  new
                      0.94%  1273.92%    -0.30%   798.09%   -28.33%   615.87%  %ch from new

  65536    954037    972699    981940     65686    973104    260029   1004607
                     962991     65821     65808    101667    364341    131070
                      1.01%  1391.83%    -0.19%   857.15%   -28.63%   666.47%

 131072   2039137   2101881   2091491    131232   2092894    554790   2161379
                    2057533    131410    131361    206193    728871    262142
                      2.16%  1491.58%    -0.10%   915.02%   -23.88%   724.51%

 262144   4340409   4464460   4403233    262314   4445884   1107842   4584560
                    4377402    262437    262459    416347   1457945    524286
                      1.99%  1577.82%    -0.06%   967.83%   -24.01%   774.44%

 524288   9205096   9453356   9408463    524468   9441930   2218577   9692015
                    9278734    524580    524633    837947   2916107   1048574
                      1.88%  1693.52%    -0.03%  1026.79%   -23.92%   824.30%

1048576  19458756  19950272  19838588   1048766  19912134   4430649  20434212
                   19606028   1048958   1048941   1694896   5832445   2097150
                      1.76%  1791.27%    -0.02%  1074.83%   -24.03%   874.38%

  Discussion of cases:

  *sort: There's no structure in random data to exploit, so the theoretical
  limit is lg(n!). Both methods get close to that, and timsort is hugging
  it (indeed, in a *marginal* sense, it's a spectacular improvement --
  there's only about 1% left before hitting the wall, and timsort knows
  darned well it's doing compares that won't pay on random data -- but so
  does the samplesort hybrid). For contrast, Hoare's original random-pivot
  quicksort does about 39% more compares than the limit, and the median-of-3
  variant about 19% more.

  3sort, %sort, and !sort: No contest; there's structure in this data, but
  not of the specific kinds samplesort special-cases. Note that structure
  in !sort wasn't put there on purpose -- it was crafted as a worst case for
  a previous quicksort implementation. That timsort nails it came as a
  surprise to me (although it's obvious in retrospect).

  +sort: samplesort special-cases this data, and does a few less compares
  than timsort. However, timsort runs this case significantly faster on all
  boxes we have timings for, because timsort is in the business of merging
  runs efficiently, while samplesort does much more data movement in this
  (for it) special case.

  ~sort: samplesort's special cases for large masses of equal elements are
  extremely effective on ~sort's specific data pattern, and timsort just
  isn't going to get close to that, despite that it's clearly getting a
  great deal of benefit out of the duplicates (the # of compares is much less
  than lg(n!)). ~sort has a perfectly uniform distribution of just 4
  distinct values, and as the distribution gets more skewed, samplesort's
  equal-element gimmicks become less effective, while timsort's adaptive
  strategies find more to exploit; in a database supplied by Kevin Altis, a
  sort on its highly skewed "on which stock exchange does this company's
  stock trade?" field ran over twice as fast under timsort.

  However, despite that timsort does many more comparisons on ~sort, and
  that on several platforms ~sort runs highly significantly slower under
  timsort, on other platforms ~sort runs highly significantly faster under
  timsort. No other kind of data has shown this wild x-platform behavior,
  and we don't have an explanation for it. The only thing I can think of
  that could transform what "should be" highly significant slowdowns into
  highly significant speedups on some boxes are catastrophic cache effects
  in samplesort.

  But timsort "should be" slower than samplesort on ~sort, so it's hard
  to count that it isn't on some boxes as a strike against it <wink>.

+ Here's the highwater mark for the number of heap-based temp slots (4
  bytes each on this box) needed by each test, again with arguments
  "15 20 1":

   2**i   *sort   \sort   /sort   3sort   +sort   %sort   ~sort   =sort   !sort
  32768   16384       0       0    6256       0   10821   12288       0   16383
  65536   32766       0       0   21652       0   31276   24576       0   32767
 131072   65534       0       0   17258       0   58112   49152       0   65535
 262144  131072       0       0   35660       0  123561   98304       0  131071
 524288  262142       0       0   31302       0  212057  196608       0  262143
1048576  524286       0       0  312438       0  484942  393216       0  524287

  Discussion: The tests that end up doing (close to) perfectly balanced
  merges (*sort, !sort) need all N//2 temp slots (or almost all). ~sort
  also ends up doing balanced merges, but systematically benefits a lot from
  the preliminary pre-merge searches described under "Merge Memory" later.
  %sort approaches having a balanced merge at the end because the random
  selection of elements to replace is expected to produce an out-of-order
  element near the midpoint. \sort, /sort, =sort are the trivial one-run
  cases, needing no merging at all. +sort ends up having one very long run
  and one very short, and so gets all the temp space it needs from the small
  temparray member of the MergeState struct (note that the same would be
  true if the new random elements were prefixed to the sorted list instead,
  but not if they appeared "in the middle"). 3sort approaches N//3 temp
  slots twice, but the run lengths that remain after 3 random exchanges
  clearly have very high variance.


A detailed description of timsort follows.

Runs
----
count_run() returns the # of elements in the next run. A run is either
"ascending", which means non-decreasing:

    a0 <= a1 <= a2 <= ...

or "descending", which means strictly decreasing:

    a0 > a1 > a2 > ...

Note that a run is always at least 2 long, unless we start at the array's
last element.

The definition of descending is strict, because the main routine reverses
a descending run in-place, transforming a descending run into an ascending
run. Reversal is done via the obvious fast "swap elements starting at each
end, and converge at the middle" method, and that can violate stability if
the slice contains any equal elements. Using a strict definition of
descending ensures that a descending run contains distinct elements.
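
For concreteness, here is a rough Python sketch of run detection as just
described; the function name matches the C code's, but the (list, lo, hi)
interface and the body are illustrative, not CPython's:

def count_run(a, lo, hi):
    # Return (length, descending) for the run beginning at a[lo].
    # "descending" means strictly decreasing, per the rule above.
    if lo + 1 == hi:
        return 1, False
    n = lo + 1
    if a[n] < a[lo]:            # strictly descending run
        while n + 1 < hi and a[n + 1] < a[n]:
            n += 1
        return n - lo + 1, True
    else:                       # ascending (non-decreasing) run
        while n + 1 < hi and a[n + 1] >= a[n]:
            n += 1
        return n - lo + 1, False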

If an array is random, it's very unlikely we'll see long runs. If a natural
run contains less than minrun elements (see next section), the main loop
artificially boosts it to minrun elements, via a stable binary insertion sort
applied to the right number of array elements following the short natural
run. In a random array, *all* runs are likely to be minrun long as a
result. This has two primary good effects:

1. Random data strongly tends then toward perfectly balanced (both runs have
   the same length) merges, which is the most efficient way to proceed when
   data is random.

2. Because runs are never very short, the rest of the code doesn't make
   heroic efforts to shave a few cycles off per-merge overheads. For
   example, reasonable use of function calls is made, rather than trying to
   inline everything. Since there are no more than N/minrun runs to begin
   with, a few "extra" function calls per merge is barely measurable.


Computing minrun
----------------
If N < 64, minrun is N. IOW, binary insertion sort is used for the whole
array then; it's hard to beat that given the overheads of trying something
fancier (see note BINSORT).

When N is a power of 2, testing on random data showed that minrun values of
16, 32, 64 and 128 worked about equally well. At 256 the data-movement cost
in binary insertion sort clearly hurt, and at 8 the increase in the number
of function calls clearly hurt. Picking *some* power of 2 is important
here, so that the merges end up perfectly balanced (see next section). We
pick 32 as a good value in the sweet range; picking a value at the low end
allows the adaptive gimmicks more opportunity to exploit shorter natural
runs.

Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case! Consider N=2112:

>>> divmod(2112, 32)
(66, 0)
>>>

If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32. The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end. The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and-- mergesort's
bugaboo relative to samplesort --a lot more data movement (O(N) copies just
to get 64 elements into place).

If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced. Better!

What we want to avoid is picking minrun such that in

    q, r = divmod(N, minrun)

q is a power of 2 and r>0 (then the last merge only gets r elements into
place, and r < minrun is small compared to N), or q a little larger than a
power of 2 regardless of r (then we've got a case similar to "2112", again
leaving too little work for the last merge to do).

Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
power of 2, or if that isn't possible, is close to, but strictly less than,
a power of 2. This is easier to do than it may sound: take the first 6
bits of N, and add 1 if any of the remaining bits are set. In fact, that
rule covers every case in this section, including small N and exact powers
of 2; merge_compute_minrun() is a deceptively simple function.
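
In Python, the rule reads roughly like this (a sketch of the idea behind
merge_compute_minrun, not the C code itself):

def merge_compute_minrun(n):
    # Keep the 6 most significant bits of n, adding 1 if any of the
    # bits shifted off were set.
    r = 0
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r

For the 2112 example above, merge_compute_minrun(2112) == 33, and
divmod(2112, 33) == (64, 0): 64 runs of length 33, so all merges are
perfectly balanced.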


The Merge Pattern
-----------------
In order to exploit regularities in the data, we're merging on natural
run lengths, and they can become wildly unbalanced. That's a Good Thing
for this sort! It means we have to find a way to manage an assortment of
potentially very different run lengths, though.

Stability constrains permissible merging patterns. For example, if we have
3 consecutive runs of lengths

    A:10000  B:20000  C:10000

we dare not merge A with C first, because if A, B and C happen to contain
a common element, it would get out of order wrt its occurrence(s) in B. The
merging must be done as (A+B)+C or A+(B+C) instead.

So merging is always done on two consecutive runs at a time, and in-place,
although this may require some temp memory (more on that later).

When a run is identified, its base address and length are pushed on a stack
in the MergeState struct. merge_collapse() is then called to see whether it
should merge it with preceding run(s). We would like to delay merging as
long as possible in order to exploit patterns that may come up later, but we
like even more to do merging as soon as possible to exploit that the run just
found is still high in the memory hierarchy. We also can't delay merging
"too long" because it consumes memory to remember the runs that are still
unmerged, and the stack has a fixed size.

What turned out to be a good compromise maintains two invariants on the
stack entries, where A, B and C are the lengths of the three rightmost not-yet
merged slices:

1. A > B+C
2. B > C

Note that, by induction, #2 implies the lengths of pending runs form a
decreasing sequence. #1 implies that, reading the lengths right to left,
the pending-run lengths grow at least as fast as the Fibonacci numbers.
Therefore the stack can never grow larger than about log_base_phi(N) entries,
where phi = (1+sqrt(5))/2 ~= 1.618. Thus a small # of stack slots suffice
for very large arrays.

If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
e.g., if the last 3 entries are

    A:30  B:20  C:10

then B is merged with C, leaving

    A:30  BC:30

on the stack. Or if they were

    A:500  B:400  C:1000

then A is merged with B, leaving

    AB:900  C:1000

on the stack.

In both examples, the stack configuration after the merge still violates
invariant #2, and merge_collapse() goes on to continue merging runs until
both invariants are satisfied. As an extreme case, suppose we didn't do the
minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
and 2. Nothing would get merged until the final 2 was seen, and that would
trigger 7 perfectly balanced merges.
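
A Python sketch of that invariant-restoring loop; the list of (base, length)
pairs and the merge_at() callback stand in for the MergeState machinery, so
this illustrates the rule rather than reproducing the C code:

def merge_collapse(pending, merge_at):
    # pending is a list of (base, length) run entries, oldest first;
    # merge_at(i) merges pending[i] with pending[i+1] into one entry.
    while len(pending) > 1:
        n = len(pending) - 2
        if n > 0 and pending[n-1][1] <= pending[n][1] + pending[n+1][1]:
            # invariant 1 (A > B+C) is violated: merge B with the
            # smaller of A and C (ties favor C)
            if pending[n-1][1] < pending[n+1][1]:
                n -= 1
            merge_at(n)
        elif pending[n][1] <= pending[n+1][1]:
            # invariant 2 (B > C) is violated
            merge_at(n)
        else:
            break               # both invariants hold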

The thrust of these rules when they trigger merging is to balance the run
lengths as closely as possible, while keeping a low bound on the number of
runs we have to remember. This is maximally effective for random data,
where all runs are likely to be of (artificially forced) length minrun, and
then we get a sequence of perfectly balanced merges (with, perhaps, some
oddballs at the end).

OTOH, one reason this sort is so good for partly ordered data has to do
with wildly unbalanced run lengths.


Merge Memory
------------
Merging adjacent runs of lengths A and B in-place, and in linear time, is
difficult. Theoretical constructions are known that can do it, but they're
too difficult and slow for practical use. But if we have temp memory equal
to min(A, B), it's easy.

If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
and then we can do the obvious merge algorithm left to right, from the temp
area and B, starting the stores into where A used to live. There's always a
free area in the original area comprising a number of elements equal to the
number not yet merged from the temp array (trivially true at the start;
proceed by induction). The only tricky bit is that if a comparison raises an
exception, we have to remember to copy the remaining elements back in from
the temp area, lest the array end up with duplicate entries from B. But
that's exactly the same thing we need to do if we reach the end of B first,
so the exit code is pleasantly common to both the normal and error cases.

If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
much the same, except that we need to merge right to left, copying B into a
temp array and starting the stores at the right end of where B used to live.
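
A much-simplified Python sketch of the merge_lo idea (no galloping, no error
handling; the function name and argument convention here are made up for
illustration):

def merge_lo_sketch(a, base_a, len_a, len_b):
    # Merge the adjacent runs a[base_a : base_a+len_a] (run A) and the
    # len_b elements that follow it (run B); A is the shorter run.
    tmp = a[base_a:base_a + len_a]        # copy A to temp memory
    base_b = base_a + len_a
    i = j = 0
    dest = base_a                         # stores go into where A lived
    while i < len(tmp) and j < len_b:
        if a[base_b + j] < tmp[i]:        # B wins only on strict <, so ties
            a[dest] = a[base_b + j]       # go to A and stability is preserved
            j += 1
        else:
            a[dest] = tmp[i]
            i += 1
        dest += 1
    a[dest:dest + len(tmp) - i] = tmp[i:]     # whatever remains of A
    # if A ran out first, the rest of B is already in its final place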

A refinement: When we're about to merge adjacent runs A and B, we first do
a form of binary search (more on that later) to see where B[0] should end up
in A. Elements in A preceding that point are already in their final
positions, effectively shrinking the size of A. Likewise we also search to
see where A[-1] should end up in B, and elements of B after that point can
also be ignored. This cuts the amount of temp memory needed by the same
amount.

These preliminary searches may not pay off, and can be expected *not* to
repay their cost if the data is random. But they can win huge in all of
time, copying, and memory savings when they do pay, so this is one of the
"per-merge overheads" mentioned above that we're happy to endure because
there is at most one very short run. It's generally true in this algorithm
that we're willing to gamble a little to win a lot, even though the net
expectation is negative for random data.


Merge Algorithms
----------------
merge_lo() and merge_hi() are where the bulk of the time is spent. merge_lo
deals with runs where A <= B, and merge_hi where A > B. They don't know
whether the data is clustered or uniform, but a lovely thing about merging
is that many kinds of clustering "reveal themselves" by how many times in a
row the winning merge element comes from the same run. We'll only discuss
merge_lo here; merge_hi is exactly analogous.

Merging begins in the usual, obvious way, comparing the first element of A
to the first of B, and moving B[0] to the merge area if it's less than A[0],
else moving A[0] to the merge area. Call that the "one pair at a time"
mode. The only twist here is keeping track of how many times in a row "the
winner" comes from the same run.

If that count reaches MIN_GALLOP, we switch to "galloping mode". Here
we *search* B for where A[0] belongs, and move over all the B's before
that point in one chunk to the merge area, then move A[0] to the merge
area. Then we search A for where B[0] belongs, and similarly move a
slice of A in one chunk. Then back to searching B for where A[0] belongs,
etc. We stay in galloping mode until both searches find slices to copy
less than MIN_GALLOP elements long, at which point we go back to one-pair-
at-a-time mode.

A refinement: The MergeState struct contains the value of min_gallop that
controls when we enter galloping mode, initialized to MIN_GALLOP.
merge_lo() and merge_hi() adjust this higher when galloping isn't paying
off, and lower when it is.


Galloping
---------
Still without loss of generality, assume A is the shorter run. In galloping
mode, we first look for A[0] in B. We do this via "galloping", comparing
A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1]. This takes at most
roughly lg(B) comparisons, and, unlike a straight binary search, favors
finding the right spot early in B (more on that later).

After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
consecutive elements, and a straight binary search requires exactly k-1
additional comparisons to nail it (see note REGION OF UNCERTAINTY). Then we
copy all the B's up to that point in one chunk, and then copy A[0]. Note
that no matter where A[0] belongs in B, the combination of galloping + binary
search finds it in no more than about 2*lg(B) comparisons.
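
A Python sketch of the gallop-then-binary-search idea, with bisect_left
semantics and a hint of 0; the real gallop_left()/gallop_right() also gallop
leftward from a nonzero hint, so treat this as an illustration only:

from bisect import bisect_left

def gallop_left_sketch(key, b):
    # Return the leftmost index at which key could be inserted into b.
    n = len(b)
    lastofs, ofs = 0, 1
    while ofs < n and b[ofs - 1] < key:
        lastofs = ofs
        ofs = ofs * 2 + 1         # probe B[0], B[1], B[3], B[7], ...
    ofs = min(ofs, n)
    # key belongs somewhere in b[lastofs:ofs]; finish with a binary search
    return bisect_left(b, key, lastofs, ofs)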

If we did a straight binary search, we could find it in no more than
ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
comparisons no matter where A[0] belongs. Straight binary search thus loses
to galloping unless the run is quite long, and we simply can't guess
whether it is in advance.

If data is random and runs have the same length, A[0] belongs at B[0] half
the time, at B[1] a quarter of the time, and so on: a consecutive winning
sub-run in B of length k occurs with probability 1/2**(k+1). So long
winning sub-runs are extremely unlikely in random data, and guessing that a
winning sub-run is going to be long is a dangerous game.

OTOH, if data is lopsided or lumpy or contains many duplicates, long
stretches of winning sub-runs are very likely, and cutting the number of
comparisons needed to find one from O(B) to O(log B) is a huge win.

Galloping compromises by getting out fast if there isn't a long winning
sub-run, yet finding such very efficiently when they exist.

I first learned about the galloping strategy in a related context; see:

    "Adaptive Set Intersections, Unions, and Differences" (2000)
    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

and its followup(s). An earlier paper called the same strategy
"exponential search":

    "Optimistic Sorting and Information Theoretic Complexity"
    Peter McIlroy
    SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
    467-474, Austin, Texas, 25-27 January 1993.

and it probably dates back to an earlier paper by Bentley and Yao. The
McIlroy paper in particular has good analysis of a mergesort that's
probably strongly related to this one in its galloping strategy.


Galloping with a Broken Leg
---------------------------
So why don't we always gallop? Because it can lose, on two counts:

1. While we're willing to endure small per-merge overheads, per-comparison
   overheads are a different story. Calling Yet Another Function per
   comparison is expensive, and gallop_left() and gallop_right() are
   too long-winded for sane inlining.

2. Galloping can-- alas --require more comparisons than linear one-at-a-time
   search, depending on the data.

#2 requires details. If A[0] belongs before B[0], galloping requires 1
compare to determine that, same as linear search, except it costs more
to call the gallop function. If A[0] belongs right before B[1], galloping
requires 2 compares, again same as linear search. On the third compare,
galloping checks A[0] against B[3], and if it's <=, requires one more
compare to determine whether A[0] belongs at B[2] or B[3]. That's a total
of 4 compares, but if A[0] does belong at B[2], linear search would have
discovered that in only 3 compares, and that's a huge loss! Really. It's
an increase of 33% in the number of compares needed, and comparisons are
expensive in Python.

index in B where    # compares linear  # gallop  # binary  gallop
A[0] belongs        search needs       compares  compares  total
----------------    -----------------  --------  --------  ------
               0                    1         1         0       1

               1                    2         2         0       2

               2                    3         3         1       4
               3                    4         3         1       4

               4                    5         4         2       6
               5                    6         4         2       6
               6                    7         4         2       6
               7                    8         4         2       6

               8                    9         5         3       8
               9                   10         5         3       8
              10                   11         5         3       8
              11                   12         5         3       8
              ...

In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
The advantage of galloping is unbounded as i grows, but it doesn't win at
all until i=6. Before then, it loses twice (at i=2 and i=4), and ties
at the other values. At and after i=6, galloping always wins.
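
The cost columns above are easy to regenerate; here's a quick sanity check
in Python (these formulas just restate the text, they're not part of the
sorter):

from math import log2

def linear_cost(i):       # compares for linear one-at-a-time search
    return i + 1

def gallop_cost(i):       # gallop probes plus the binary-search finish
    return 1 if i == 0 else 2 * int(log2(i)) + 2

for i in range(12):
    print(i, linear_cost(i), gallop_cost(i))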

We can't guess in advance when it's going to win, though, so we do one pair
at a time until the evidence seems strong that galloping may pay. MIN_GALLOP
is 7, and that's pretty strong evidence. However, if the data is random, it
simply will trigger galloping mode purely by luck every now and again, and
it's quite likely to hit one of the losing cases next. On the other hand,
in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it
"should be" then. So the MergeState struct keeps a min_gallop variable
that merge_lo and merge_hi adjust: the longer we stay in galloping mode,
the smaller min_gallop gets, making it easier to transition back to
galloping mode (if we ever leave it in the current merge, and at the
start of the next merge). But whenever the gallop loop doesn't pay,
min_gallop is increased by one, making it harder to transition back
to galloping mode (and again both within a merge and across merges). For
random data, this all but eliminates the gallop penalty: min_gallop grows
large enough that we almost never get into galloping mode. And for cases
like ~sort, min_gallop can fall to as low as 1. This seems to work well,
but in all it's a minor improvement over using a fixed MIN_GALLOP value.
Tim Peters92f81f22002-08-01 00:59:42 +0000571
572
573Galloping Complication
574----------------------
575The description above was for merge_lo. merge_hi has to merge "from the
576other end", and really needs to gallop starting at the last element in a run
577instead of the first. Galloping from the first still works, but does more
Tim Petersec8147b2013-08-24 15:15:19 -0500578comparisons than it should (this is significant -- I timed it both ways). For
579this reason, the gallop_left() and gallop_right() (see note LEFT OR RIGHT)
580functions have a "hint" argument, which is the index at which galloping
581should begin. So galloping can actually start at any index, and proceed at
582offsets of 1, 3, 7, 15, ... or -1, -3, -7, -15, ... from the starting index.
Tim Peters92f81f22002-08-01 00:59:42 +0000583
584In the code as I type it's always called with either 0 or n-1 (where n is
585the # of elements in a run). It's tempting to try to do something fancier,
586melding galloping with some form of interpolation search; for example, if
587we're merging a run of length 1 with a run of length 10000, index 5000 is
588probably a better guess at the final result than either 0 or 9999. But
589it's unclear how to generalize that intuition usefully, and merging of
590wildly unbalanced runs already enjoys excellent performance.
591
Tim Peters3ddb8562002-08-10 07:04:01 +0000592~sort is a good example of when balanced runs could benefit from a better
593hint value: to the extent possible, this would like to use a starting
594offset equal to the previous value of acount/bcount. Doing so saves about
59510% of the compares in ~sort. However, doing so is also a mixed bag,
596hurting other cases.


Comparing Average # of Compares on Random Arrays
------------------------------------------------
[NOTE: This was done when the new algorithm used about 0.1% more compares
 on random data than does its current incarnation.]

Here list.sort() is samplesort, and list.msort() this sort:

"""
import random
from time import clock as now

def fill(n):
    from random import random
    return [random() for i in range(n)]

def mycmp(x, y):
    global ncmp
    ncmp += 1
    return cmp(x, y)

def timeit(values, method):
    global ncmp
    X = values[:]
    bound = getattr(X, method)
    ncmp = 0
    t1 = now()
    bound(mycmp)
    t2 = now()
    return t2-t1, ncmp

format = "%5s %9.2f %11d"
f2 = "%5s %9.2f %11.2f"

def drive():
    count = sst = sscmp = mst = mscmp = nelts = 0
    while True:
        n = random.randrange(100000)
        nelts += n
        x = fill(n)

        t, c = timeit(x, 'sort')
        sst += t
        sscmp += c

        t, c = timeit(x, 'msort')
        mst += t
        mscmp += c

        count += 1
        if count % 10:
            continue

        print "count", count, "nelts", nelts
        print format % ("sort", sst, sscmp)
        print format % ("msort", mst, mscmp)
        print f2 % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)

drive()
"""

I ran this on Windows and kept using the computer lightly while it was
running. time.clock() is wall-clock time on Windows, with better than
microsecond resolution. samplesort started with a 1.52% #-of-comparisons
disadvantage, fell quickly to 1.48%, and then fluctuated within that small
range. Here's the last chunk of output before I killed the job:

count 2630 nelts 130906543
 sort   6110.80  1937887573
msort   6002.78  1909389381
          1.80        1.49

We've done nearly 2 billion comparisons apiece at Python speed there, and
that's enough <wink>.

For random arrays of size 2 (yes, there are only 2 interesting ones),
samplesort has a 50%(!) comparison disadvantage. This is a consequence of
samplesort special-casing at most one ascending run at the start, then
falling back to the general case if it doesn't find an ascending run
immediately. The consequence is that it ends up using two compares to sort
[2, 1]. Gratifyingly, timsort doesn't do any special-casing, so had to be
taught how to deal with mixtures of ascending and descending runs
efficiently in all cases.


NOTES
-----

BINSORT
A "binary insertion sort" is just like a textbook insertion sort, but instead
of locating the correct position of the next item via linear (one at a time)
search, an equivalent to Python's bisect.bisect_right is used to find the
correct position in logarithmic time. Most texts don't mention this
variation, and those that do usually say it's not worth the bother: insertion
sort remains quadratic (expected and worst cases) either way. Speeding the
search doesn't reduce the quadratic data movement costs.

But in CPython's case, comparisons are extraordinarily expensive compared to
moving data, and the details matter. Moving objects is just copying
pointers. Comparisons can be arbitrarily expensive (can invoke arbitrary
user-supplied Python code), but even in simple cases (like 3 < 4) _all_
decisions are made at runtime: what's the type of the left comparand? the
type of the right? do they need to be coerced to a common type? where's the
code to compare these types? And so on. Even the simplest Python comparison
triggers a large pile of C-level pointer dereferences, conditionals, and
function calls.

So cutting the number of compares is almost always measurably helpful in
CPython, and the savings swamp the quadratic-time data movement costs for
reasonable minrun values.
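
A Python sketch of the variation, leaning on the bisect module for the
logarithmic search (the function name and (a, lo, hi, start) interface are
mine, for illustration):

from bisect import bisect_right

def binary_insertion_sort(a, lo, hi, start):
    # a[lo:start] is already sorted; extend the sorted region to a[lo:hi].
    for i in range(start, hi):
        pivot = a[i]
        pos = bisect_right(a, pivot, lo, i)   # rightmost slot => stable
        a[pos + 1:i + 1] = a[pos:i]           # shift the tail right by one
        a[pos] = pivot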


LEFT OR RIGHT
gallop_left() and gallop_right() are akin to the Python bisect module's
bisect_left() and bisect_right(): they're the same unless the slice they're
searching contains a (at least one) value equal to the value being searched
for. In that case, gallop_left() returns the position immediately before the
leftmost equal value, and gallop_right() the position immediately after the
rightmost equal value. The distinction is needed to preserve stability. In
general, when merging adjacent runs A and B, gallop_left is used to search
thru B for where an element from A belongs, and gallop_right to search thru A
for where an element from B belongs.
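
For reference, the bisect analogy in a concrete case (this is the stdlib
bisect module, not the gallop code):

>>> from bisect import bisect_left, bisect_right
>>> b = [1, 2, 2, 2, 3]
>>> bisect_left(b, 2)      # just before the leftmost 2
1
>>> bisect_right(b, 2)     # just after the rightmost 2
4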


REGION OF UNCERTAINTY
Two kinds of confusion seem to be common about the claim that after finding
a k such that

    B[2**(k-1) - 1] < A[0] <= B[2**k - 1]

then a binary search requires exactly k-1 tries to find A[0]'s proper
location. For concreteness, say k=3, so B[3] < A[0] <= B[7].

The first confusion takes the form "OK, then the region of uncertainty is at
indices 3, 4, 5, 6 and 7: that's 5 elements, not the claimed 2**(k-1) - 1 =
3"; or the region is viewed as a Python slice and the objection is "but that's
the slice B[3:7], so has 7-3 = 4 elements". Resolution: we've already
compared A[0] against B[3] and against B[7], so A[0]'s correct location is
already known wrt _both_ endpoints. What remains is to find A[0]'s correct
location wrt B[4], B[5] and B[6], which spans 3 elements. Or in general, the
slice (leaving off both endpoints) (2**(k-1)-1)+1 through (2**k-1)-1
inclusive = 2**(k-1) through (2**k-1)-1 inclusive, which has
    (2**k-1)-1 - 2**(k-1) + 1 =
    2**k-1 - 2**(k-1) =
    2*2**(k-1)-1 - 2**(k-1) =
    (2-1)*2**(k-1) - 1 =
    2**(k-1) - 1
elements.

The second confusion: "k-1 = 2 binary searches can find the correct location
among 2**(k-1) = 4 elements, but you're only applying it to 3 elements: we
could make this more efficient by arranging for the region of uncertainty to
span 2**(k-1) elements." Resolution: that confuses "elements" with
"locations". In a slice with N elements, there are N+1 _locations_. In the
example, with the region of uncertainty B[4], B[5], B[6], there are 4
locations: before B[4], between B[4] and B[5], between B[5] and B[6], and
after B[6]. In general, across 2**(k-1)-1 elements, there are 2**(k-1)
locations. That's why k-1 binary searches are necessary and sufficient.