Intro
-----
This describes an adaptive, stable, natural mergesort, modestly called
timsort (hey, I earned it <wink>). It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays.

In a nutshell, the main routine marches over the array once, left to right,
alternately identifying the next run, then merging it into the previous
runs "intelligently". Everything else is complication for speed, and some
hard-won measure of memory efficiency.


Comparison with Python's Samplesort Hybrid
------------------------------------------
+ timsort can require a temp array containing as many as N//2 pointers,
  which means as many as 2*N extra bytes on 32-bit boxes. It can be
  expected to require a temp array this large when sorting random data; on
  data with significant structure, it may get away without using any extra
  heap memory. This appears to be the strongest argument against it, but
  compared to the size of an object, 2 temp bytes worst-case (also expected-
  case for random data) doesn't scare me much.

  It turns out that Perl is moving to a stable mergesort, and the code for
  that appears always to require a temp array with room for at least N
  pointers. (Note that I wouldn't want to do that even if space weren't an
  issue; I believe its efforts at memory frugality also save timsort
  significant pointer-copying costs, and allow it to have a smaller working
  set.)

+ Across about four hours of generating random arrays, and sorting them
  under both methods, samplesort required about 1.5% more comparisons
  (the program is at the end of this file).

+ In real life, this may be faster or slower on random arrays than
  samplesort was, depending on platform quirks. Since it does fewer
  comparisons on average, it can be expected to do better the more
  expensive a comparison function is. OTOH, it does more data movement
  (pointer copying) than samplesort, and that may negate its small
  comparison advantage (depending on platform quirks) unless comparison
  is very expensive.

+ On arrays with many kinds of pre-existing order, this blows samplesort out
  of the water. It's significantly faster than samplesort even on some
  cases samplesort was special-casing the snot out of. I believe that lists
  very often do have exploitable partial order in real life, and this is the
  strongest argument in favor of timsort (indeed, samplesort's special cases
  for extreme partial order are appreciated by real users, and timsort goes
  much deeper than those, in particular naturally covering every case where
  someone has suggested "and it would be cool if list.sort() had a special
  case for this too ... and for that ...").

+ Here are exact comparison counts across all the tests in sortperf.py,
  when run with arguments "15 20 1".

  First the trivial cases, trivial for samplesort because it special-cased
  them, and trivial for timsort because it naturally works on runs. Within
  an "n" block, the first line gives the # of compares done by samplesort,
  the second line by timsort, and the third line is the percentage by
  which the samplesort count exceeds the timsort count:

      n   \sort   /sort   =sort
-------  ------  ------  ------
  32768   32768   32767   32767  samplesort
          32767   32767   32767  timsort
          0.00%   0.00%   0.00%  (samplesort - timsort) / timsort

  65536   65536   65535   65535
          65535   65535   65535
          0.00%   0.00%   0.00%

 131072  131072  131071  131071
         131071  131071  131071
          0.00%   0.00%   0.00%

 262144  262144  262143  262143
         262143  262143  262143
          0.00%   0.00%   0.00%

 524288  524288  524287  524287
         524287  524287  524287
          0.00%   0.00%   0.00%

1048576 1048576 1048575 1048575
        1048575 1048575 1048575
          0.00%   0.00%   0.00%

  The algorithms are effectively identical in these cases, except that
  timsort does one less compare in \sort.

  Now for the more interesting cases. lg(n!) is the information-theoretic
  limit for the best any comparison-based sorting algorithm can do on
  average (across all permutations). When a method gets significantly
  below that, it's either astronomically lucky, or is finding exploitable
  structure in the data.
      n   lg(n!)    *sort     3sort    +sort     %sort    ~sort     !sort
-------  -------   ------   -------  -------   ------   -------  --------
  32768   444255   453096    453614    32908   452871    130491    469141  old
                   448885     33016    33007    50426    182083     65534  new
                    0.94%  1273.92%   -0.30%  798.09%   -28.33%   615.87%  %ch from new

  65536   954037   972699    981940    65686   973104    260029   1004607
                   962991     65821    65808   101667    364341    131070
                    1.01%  1391.83%   -0.19%  857.15%   -28.63%   666.47%

 131072  2039137  2101881   2091491   131232  2092894    554790   2161379
                  2057533    131410   131361   206193    728871    262142
                    2.16%  1491.58%   -0.10%  915.02%   -23.88%   724.51%

 262144  4340409  4464460   4403233   262314  4445884   1107842   4584560
                  4377402    262437   262459   416347   1457945    524286
                    1.99%  1577.82%   -0.06%  967.83%   -24.01%   774.44%

 524288  9205096  9453356   9408463   524468  9441930   2218577   9692015
                  9278734    524580   524633   837947   2916107   1048574
                    1.88%  1693.52%   -0.03% 1026.79%   -23.92%   824.30%

1048576 19458756 19950272  19838588  1048766 19912134   4430649  20434212
                 19606028   1048958  1048941  1694896   5832445   2097150
                    1.76%  1791.27%   -0.02% 1074.83%   -24.03%   874.38%

  Discussion of cases:

  *sort: There's no structure in random data to exploit, so the theoretical
  limit is lg(n!). Both methods get close to that, and timsort is hugging
  it (indeed, in a *marginal* sense, it's a spectacular improvement --
  there's only about 1% left before hitting the wall, and timsort knows
  darned well it's doing compares that won't pay on random data -- but so
  does the samplesort hybrid). For contrast, Hoare's original random-pivot
  quicksort does about 39% more compares than the limit, and the median-of-3
  variant about 19% more.

  3sort, %sort, and !sort: No contest; there's structure in this data, but
  not of the specific kinds samplesort special-cases. Note that structure
  in !sort wasn't put there on purpose -- it was crafted as a worst case for
  a previous quicksort implementation. That timsort nails it came as a
  surprise to me (although it's obvious in retrospect).

  +sort: samplesort special-cases this data, and does a few less compares
  than timsort. However, timsort runs this case significantly faster on all
  boxes we have timings for, because timsort is in the business of merging
  runs efficiently, while samplesort does much more data movement in this
  (for it) special case.

  ~sort: samplesort's special cases for large masses of equal elements are
  extremely effective on ~sort's specific data pattern, and timsort just
  isn't going to get close to that, despite that it's clearly getting a
  great deal of benefit out of the duplicates (the # of compares is much less
  than lg(n!)). ~sort has a perfectly uniform distribution of just 4
  distinct values, and as the distribution gets more skewed, samplesort's
  equal-element gimmicks become less effective, while timsort's adaptive
  strategies find more to exploit; in a database supplied by Kevin Altis, a
  sort on its highly skewed "on which stock exchange does this company's
  stock trade?" field ran over twice as fast under timsort.

  However, despite that timsort does many more comparisons on ~sort, and
  that on several platforms ~sort runs highly significantly slower under
  timsort, on other platforms ~sort runs highly significantly faster under
  timsort. No other kind of data has shown this wild x-platform behavior,
  and we don't have an explanation for it. The only thing I can think of
  that could transform what "should be" highly significant slowdowns into
  highly significant speedups on some boxes are catastrophic cache effects
  in samplesort.

  But timsort "should be" slower than samplesort on ~sort, so it's hard
  to count that it isn't on some boxes as a strike against it <wink>.

+ Here's the highwater mark for the number of heap-based temp slots (4
  bytes each on this box) needed by each test, again with arguments
  "15 20 1":

   2**i  *sort  \sort  /sort   3sort  +sort   %sort   ~sort  =sort   !sort
  32768  16384      0      0    6256      0   10821   12288      0   16383
  65536  32766      0      0   21652      0   31276   24576      0   32767
 131072  65534      0      0   17258      0   58112   49152      0   65535
 262144 131072      0      0   35660      0  123561   98304      0  131071
 524288 262142      0      0   31302      0  212057  196608      0  262143
1048576 524286      0      0  312438      0  484942  393216      0  524287

  Discussion: The tests that end up doing (close to) perfectly balanced
  merges (*sort, !sort) need all N//2 temp slots (or almost all). ~sort
  also ends up doing balanced merges, but systematically benefits a lot from
  the preliminary pre-merge searches described under "Merge Memory" later.
  %sort approaches having a balanced merge at the end because the random
  selection of elements to replace is expected to produce an out-of-order
  element near the midpoint. \sort, /sort, =sort are the trivial one-run
  cases, needing no merging at all. +sort ends up having one very long run
  and one very short, and so gets all the temp space it needs from the small
  temparray member of the MergeState struct (note that the same would be
  true if the new random elements were prefixed to the sorted list instead,
  but not if they appeared "in the middle"). 3sort approaches N//3 temp
  slots twice, but the run lengths that remain after 3 random exchanges
  clearly have very high variance.


A detailed description of timsort follows.

Runs
----
count_run() returns the # of elements in the next run. A run is either
"ascending", which means non-decreasing:

    a0 <= a1 <= a2 <= ...

or "descending", which means strictly decreasing:

    a0 > a1 > a2 > ...

Note that a run is always at least 2 long, unless we start at the array's
last element.

The definition of descending is strict, because the main routine reverses
a descending run in-place, transforming a descending run into an ascending
run. Reversal is done via the obvious fast "swap elements starting at each
end, and converge at the middle" method, and that can violate stability if
the slice contains any equal elements. Using a strict definition of
descending ensures that a descending run contains distinct elements.
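
For concreteness, here's a minimal Python sketch of what count_run()
computes (the real function is C code over the list's element pointers;
the list-plus-bounds signature here is illustrative only):

    def count_run(a, lo, hi):
        # Return (length, descending) for the run beginning at a[lo].
        # Assumes lo < hi. Descending uses strict >, per the definition
        # above, so a descending run contains no equal elements.
        n = lo + 1
        if n == hi:                     # run of length 1 at the array's end
            return 1, False
        if a[n] < a[n - 1]:             # strictly decreasing: descending
            n += 1
            while n < hi and a[n] < a[n - 1]:
                n += 1
            return n - lo, True
        else:                           # non-decreasing: ascending
            n += 1
            while n < hi and a[n] >= a[n - 1]:
                n += 1
            return n - lo, False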

If an array is random, it's very unlikely we'll see long runs. If a natural
run contains less than minrun elements (see next section), the main loop
artificially boosts it to minrun elements, via a stable binary insertion sort
applied to the right number of array elements following the short natural
run. In a random array, *all* runs are likely to be minrun long as a
result. This has two primary good effects:

1. Random data strongly tends then toward perfectly balanced (both runs have
   the same length) merges, which is the most efficient way to proceed when
   data is random.

2. Because runs are never very short, the rest of the code doesn't make
   heroic efforts to shave a few cycles off per-merge overheads. For
   example, reasonable use of function calls is made, rather than trying to
   inline everything. Since there are no more than N/minrun runs to begin
   with, a few "extra" function calls per merge is barely measurable.
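
As a sketch, here's the kind of stable binary insertion sort used for that
boosting, with bisect standing in for the hand-rolled binary search in the
C code (a[lo:start] is the natural run already in order):

    import bisect

    def binary_insertion_sort(a, lo, hi, start):
        # Extend the sorted run a[lo:start] until it fills a[lo:hi].
        # bisect_right keeps equal elements in their original order,
        # which is exactly what stability requires.
        for i in range(start, hi):
            pivot = a[i]
            pos = bisect.bisect_right(a, pivot, lo, i)
            a[pos + 1:i + 1] = a[pos:i]     # shift the tail right by one
            a[pos] = pivot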


Computing minrun
----------------
If N < 64, minrun is N. IOW, binary insertion sort is used for the whole
array then; it's hard to beat that given the overheads of trying something
fancier.

When N is a power of 2, testing on random data showed that minrun values of
16, 32, 64 and 128 worked about equally well. At 256 the data-movement cost
in binary insertion sort clearly hurt, and at 8 the increase in the number
of function calls clearly hurt. Picking *some* power of 2 is important
here, so that the merges end up perfectly balanced (see next section). We
pick 32 as a good value in the sweet range; picking a value at the low end
allows the adaptive gimmicks more opportunity to exploit shorter natural
runs.

Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case! Consider N=2112:

>>> divmod(2112, 32)
(66, 0)
>>>

If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32. The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end. The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and-- mergesort's
bugaboo relative to samplesort --a lot more data movement (O(N) copies just
to get 64 elements into place).

If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced. Better!

What we want to avoid is picking minrun such that in

    q, r = divmod(N, minrun)

q is a power of 2 and r>0 (then the last merge only gets r elements into
place, and r < minrun is small compared to N), or q a little larger than a
power of 2 regardless of r (then we've got a case similar to "2112", again
leaving too little work for the last merge to do).

Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
power of 2, or if that isn't possible, is close to, but strictly less than,
a power of 2. This is easier to do than it may sound: take the first 6
bits of N, and add 1 if any of the remaining bits are set. In fact, that
rule covers every case in this section, including small N and exact powers
of 2; merge_compute_minrun() is a deceptively simple function.
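
Here's that rule as a Python sketch of merge_compute_minrun():

    def merge_compute_minrun(n):
        # Take the first 6 bits of n, and add 1 if any of the remaining
        # bits are set. For n < 64 the loop never runs and n itself is
        # returned, covering the small-N case above.
        r = 0               # becomes 1 if any low-order bits are set
        while n >= 64:
            r |= n & 1
            n >>= 1
        return n + r

For the N=2112 example, merge_compute_minrun(2112) returns 33, giving 64
runs of length 33 and a perfectly balanced merge pattern.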


The Merge Pattern
-----------------
In order to exploit regularities in the data, we're merging on natural
run lengths, and they can become wildly unbalanced. That's a Good Thing
for this sort! It means we have to find a way to manage an assortment of
potentially very different run lengths, though.

Stability constrains permissible merging patterns. For example, if we have
3 consecutive runs of lengths

    A:10000 B:20000 C:10000

we dare not merge A with C first, because if A, B and C happen to contain
a common element, it would get out of order wrt its occurrence(s) in B. The
merging must be done as (A+B)+C or A+(B+C) instead.

So merging is always done on two consecutive runs at a time, and in-place,
although this may require some temp memory (more on that later).

When a run is identified, its base address and length are pushed on a stack
in the MergeState struct. merge_collapse() is then called to see whether it
should merge it with preceding run(s). We would like to delay merging as
long as possible in order to exploit patterns that may come up later, but we
like even more to do merging as soon as possible to exploit that the run just
found is still high in the memory hierarchy. We also can't delay merging
"too long" because it consumes memory to remember the runs that are still
unmerged, and the stack has a fixed size.

What turned out to be a good compromise maintains two invariants on the
stack entries, where A, B and C are the lengths of the three rightmost
not-yet-merged slices:

1. A > B+C
2. B > C

Note that, by induction, #2 implies the lengths of pending runs form a
decreasing sequence. #1 implies that, reading the lengths right to left,
the pending-run lengths grow at least as fast as the Fibonacci numbers.
Therefore the stack can never grow larger than about log_base_phi(N) entries,
where phi = (1+sqrt(5))/2 ~= 1.618. Thus a small # of stack slots suffice
for very large arrays.
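
As a back-of-the-envelope check of that bound (this computation isn't in
the implementation, which just reserves a fixed, generous number of stack
slots at compile time):

    from math import log, sqrt

    phi = (1 + sqrt(5)) / 2

    def stack_bound(n):
        # Approximate max number of pending runs for an array of
        # length n, per the Fibonacci-growth argument above.
        return int(log(n, phi)) + 1

    # stack_bound(2**64) == 93, so a few dozen slots cover any array
    # that could exist on a real machine.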

If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
e.g., if the last 3 entries are

    A:30 B:20 C:10

then B is merged with C, leaving

    A:30 BC:30

on the stack. Or if they were

    A:500 B:400 C:1000

then A is merged with B, leaving

    AB:900 C:1000

on the stack.

In both examples, the stack configuration after the merge still violates
invariant #2, and merge_collapse() goes on to continue merging runs until
both invariants are satisfied. As an extreme case, suppose we didn't do the
minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
and 2. Nothing would get merged until the final 2 was seen, and that would
trigger 7 perfectly balanced merges.
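
Here's merge_collapse()'s logic as a Python sketch over run lengths alone
(the real function tracks (base address, length) pairs and performs the
actual merges; merge_at() below is a toy stand-in):

    def merge_collapse(runs):
        # Merge adjacent runs until the two invariants hold again.
        # runs is a stack of pending run lengths; runs[-1] is newest.
        while len(runs) > 1:
            n = len(runs) - 2               # index of B in ... A B C
            if n > 0 and runs[n - 1] <= runs[n] + runs[n + 1]:
                if runs[n - 1] < runs[n + 1]:
                    n -= 1                  # A < C: merge A with B
                merge_at(runs, n)           # else merge B with C
            elif runs[n] <= runs[n + 1]:
                merge_at(runs, n)           # B <= C: merge B with C
            else:
                break                       # both invariants hold

    def merge_at(runs, i):
        # Toy stand-in: replace runs i and i+1 with their merge.
        runs[i] += runs[i + 1]
        del runs[i + 1]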

The thrust of these rules when they trigger merging is to balance the run
lengths as closely as possible, while keeping a low bound on the number of
runs we have to remember. This is maximally effective for random data,
where all runs are likely to be of (artificially forced) length minrun, and
then we get a sequence of perfectly balanced merges (with, perhaps, some
oddballs at the end).

OTOH, one reason this sort is so good for partly ordered data has to do
with wildly unbalanced run lengths.


Merge Memory
------------
Merging adjacent runs of lengths A and B in-place is very difficult.
Theoretical constructions are known that can do it, but they're too difficult
and slow for practical use. But if we have temp memory equal to min(A, B),
it's easy.

If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
and then we can do the obvious merge algorithm left to right, from the temp
area and B, starting the stores into where A used to live. There's always a
free area in the original area comprising a number of elements equal to the
number not yet merged from the temp array (trivially true at the start;
proceed by induction). The only tricky bit is that if a comparison raises an
exception, we have to remember to copy the remaining elements back in from
the temp area, lest the array end up with duplicate entries from B. But
that's exactly the same thing we need to do if we reach the end of B first,
so the exit code is pleasantly common to both the normal and error cases.
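
Stripped of galloping and error handling, merge_lo's core looks like this
in Python (the real code works with element pointers in place; the slicing
here stands in for that pointer arithmetic):

    def merge_lo(a, lo, mid, hi):
        # Merge sorted runs a[lo:mid] (A) and a[mid:hi] (B), where A is
        # the shorter run. A is copied aside; stores go into the space
        # its elements free up, so B is never overwritten prematurely.
        temp = a[lo:mid]
        i, j, dest = 0, mid, lo
        while i < len(temp) and j < hi:
            if a[j] < temp[i]:      # strict <, so ties go to A: stable
                a[dest] = a[j]
                j += 1
            else:
                a[dest] = temp[i]
                i += 1
            dest += 1
        # If B ran out first, copy A's leftovers back into place; in the
        # other case this is a no-op, since B's leftovers are already
        # where they belong.
        a[dest:dest + len(temp) - i] = temp[i:]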

If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
much the same, except that we need to merge right to left, copying B into a
temp array and starting the stores at the right end of where B used to live.

A refinement: When we're about to merge adjacent runs A and B, we first do
a form of binary search (more on that later) to see where B[0] should end up
in A. Elements in A preceding that point are already in their final
positions, effectively shrinking the size of A. Likewise we also search to
see where A[-1] should end up in B, and elements of B after that point can
also be ignored. This cuts the amount of temp memory needed by the same
amount.
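
In Python terms, the trimming amounts to the following, with bisect
standing in for the galloping search the real code uses (a and b are the
two runs):

    import bisect

    def trim(a, b):
        # Return the parts of runs a and b that still need merging.
        # a's elements before k1 are <= b[0] and, since ties go to a,
        # already final; b's elements from k2 on are >= a[-1] and
        # likewise already final.
        k1 = bisect.bisect_right(a, b[0])   # where b[0] belongs in a
        k2 = bisect.bisect_left(b, a[-1])   # where a[-1] belongs in b
        return a[k1:], b[:k2]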

These preliminary searches may not pay off, and can be expected *not* to
repay their cost if the data is random. But they can win huge in all of
time, copying, and memory savings when they do pay, so this is one of the
"per-merge overheads" mentioned above that we're happy to endure because
there is at most one very short run. It's generally true in this algorithm
that we're willing to gamble a little to win a lot, even though the net
expectation is negative for random data.


Merge Algorithms
----------------
merge_lo() and merge_hi() are where the bulk of the time is spent. merge_lo
deals with runs where A <= B, and merge_hi where A > B. They don't know
whether the data is clustered or uniform, but a lovely thing about merging
is that many kinds of clustering "reveal themselves" by how many times in a
row the winning merge element comes from the same run. We'll only discuss
merge_lo here; merge_hi is exactly analogous.

Merging begins in the usual, obvious way, comparing the first element of A
to the first of B, and moving B[0] to the merge area if it's less than A[0],
else moving A[0] to the merge area. Call that the "one pair at a time"
mode. The only twist here is keeping track of how many times in a row "the
winner" comes from the same run.

If that count reaches MIN_GALLOP, we switch to "galloping mode". Here
we *search* B for where A[0] belongs, and move over all the B's before
that point in one chunk to the merge area, then move A[0] to the merge
area. Then we search A for where B[0] belongs, and similarly move a
slice of A in one chunk. Then back to searching B for where A[0] belongs,
etc. We stay in galloping mode until both searches find slices to copy
less than MIN_GALLOP elements long, at which point we go back to one-pair-
at-a-time mode.
A refinement: The MergeState struct contains the value of min_gallop that
controls when we enter galloping mode, initialized to MIN_GALLOP.
merge_lo() and merge_hi() adjust this higher when galloping isn't paying
off, and lower when it is.


Galloping
---------
Still without loss of generality, assume A is the shorter run. In galloping
mode, we first look for A[0] in B. We do this via "galloping", comparing
A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1]. This takes at most
roughly lg(B) comparisons, and, unlike a straight binary search, favors
finding the right spot early in B (more on that later).

After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
consecutive elements, and a straight binary search requires exactly k-1
additional comparisons to nail it. Then we copy all the B's up to that
point in one chunk, and then copy A[0]. Note that no matter where A[0]
belongs in B, the combination of galloping + binary search finds it in no
more than about 2*lg(B) comparisons.
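
Here's the search as a Python sketch, with the hint fixed at 0 for brevity
(the real gallop_left()/gallop_right() pair also takes a hint argument and
distinguishes leftmost from rightmost positions for stability):

    import bisect

    def gallop(key, b):
        # Find where key belongs in sorted list b by probing b[0], b[1],
        # b[3], b[7], ..., b[2**j - 1], then binary-searching the
        # remaining interval of uncertainty.
        n = len(b)
        if n == 0 or key <= b[0]:
            return 0
        last, ofs = 0, 1                  # invariant: b[last] < key
        while ofs < n and b[ofs] < key:
            last, ofs = ofs, 2 * ofs + 1  # offsets 1, 3, 7, 15, ...
        return bisect.bisect_left(b, key, last + 1, min(ofs, n - 1) + 1)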

If we did a straight binary search, we could find it in no more than
ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
comparisons no matter where A[0] belongs. Straight binary search thus loses
to galloping unless the run is quite long, and we simply can't guess
whether it is in advance.

If data is random and runs have the same length, A[0] belongs at B[0] half
the time, at B[1] a quarter of the time, and so on: a consecutive winning
sub-run in B of length k occurs with probability 1/2**(k+1). So long
winning sub-runs are extremely unlikely in random data, and guessing that a
winning sub-run is going to be long is a dangerous game.

OTOH, if data is lopsided or lumpy or contains many duplicates, long
stretches of winning sub-runs are very likely, and cutting the number of
comparisons needed to find one from O(B) to O(log B) is a huge win.

Galloping compromises by getting out fast if there isn't a long winning
sub-run, yet finding such very efficiently when they exist.

I first learned about the galloping strategy in a related context; see:

    "Adaptive Set Intersections, Unions, and Differences" (2000)
    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

and its followup(s). An earlier paper called the same strategy
"exponential search":

    "Optimistic Sorting and Information Theoretic Complexity"
    Peter McIlroy
    SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
    467-474, Austin, Texas, 25-27 January 1993.

and it probably dates back to an earlier paper by Bentley and Yao. The
McIlroy paper in particular has good analysis of a mergesort that's
probably strongly related to this one in its galloping strategy.


Galloping with a Broken Leg
---------------------------
So why don't we always gallop? Because it can lose, on two counts:

1. While we're willing to endure small per-merge overheads, per-comparison
   overheads are a different story. Calling Yet Another Function per
   comparison is expensive, and gallop_left() and gallop_right() are
   too long-winded for sane inlining.

2. Galloping can-- alas --require more comparisons than linear one-at-a-time
   search, depending on the data.

#2 requires details. If A[0] belongs before B[0], galloping requires 1
compare to determine that, same as linear search, except it costs more
to call the gallop function. If A[0] belongs right before B[1], galloping
requires 2 compares, again same as linear search. On the third compare,
galloping checks A[0] against B[3], and if it's <=, requires one more
compare to determine whether A[0] belongs at B[2] or B[3]. That's a total
of 4 compares, but if A[0] does belong at B[2], linear search would have
discovered that in only 3 compares, and that's a huge loss! Really. It's
an increase of 33% in the number of compares needed, and comparisons are
expensive in Python.

index in B where    # compares linear  # gallop  # binary  gallop
A[0] belongs        search needs       compares  compares  total
----------------    -----------------  --------  --------  ------
               0                    1         1         0       1

               1                    2         2         0       2

               2                    3         3         1       4
               3                    4         3         1       4

               4                    5         4         2       6
               5                    6         4         2       6
               6                    7         4         2       6
               7                    8         4         2       6

               8                    9         5         3       8
               9                   10         5         3       8
              10                   11         5         3       8
              11                   12         5         3       8
              ...

In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
The advantage of galloping is unbounded as i grows, but it doesn't win at
all until i=6. Before then, it loses twice (at i=2 and i=4), and ties
at the other values. At and after i=6, galloping always wins.
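
The formula is easy to check against the table; i.bit_length() - 1 is
exactly floor(lg(i)) for i >= 1:

    def gallop_total(i):
        # Total compares for galloping to find index i >= 1:
        # 2*floor(lg(i)) + 2, matching the table's last column.
        return 2 * (i.bit_length() - 1) + 2

    def linear_total(i):
        return i + 1    # linear search compares to find index i

    # gallop_total(6) == 6 < linear_total(6) == 7: the first win.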

We can't guess in advance when it's going to win, though, so we do one pair
at a time until the evidence seems strong that galloping may pay. MIN_GALLOP
is 7, and that's pretty strong evidence. However, if the data is random, it
simply will trigger galloping mode purely by luck every now and again, and
it's quite likely to hit one of the losing cases next. On the other hand,
in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it
"should be" then. So the MergeState struct keeps a min_gallop variable
that merge_lo and merge_hi adjust: the longer we stay in galloping mode,
the smaller min_gallop gets, making it easier to transition back to
galloping mode (if we ever leave it in the current merge, and at the
start of the next merge). But whenever the gallop loop doesn't pay,
min_gallop is increased by one, making it harder to transition back
to galloping mode (and again both within a merge and across merges). For
random data, this all but eliminates the gallop penalty: min_gallop grows
large enough that we almost never get into galloping mode. And for cases
like ~sort, min_gallop can fall to as low as 1. This seems to work well,
but in all it's a minor improvement over using a fixed MIN_GALLOP value.
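
Boiled down to a toy model, the adjustment rule looks like this (this is
an illustration only; CPython folds the bookkeeping into merge_lo's and
merge_hi's inner loops):

    MIN_GALLOP = 7

    class MergeState:
        def __init__(self):
            self.min_gallop = MIN_GALLOP

        def after_gallop_pass(self, acount, bcount):
            # acount and bcount are the slice lengths the two gallops
            # just found. Long slices mean galloping is paying, so
            # lower the bar (floor of 1); otherwise raise it by one.
            if acount >= MIN_GALLOP or bcount >= MIN_GALLOP:
                self.min_gallop = max(1, self.min_gallop - 1)
            else:
                self.min_gallop += 1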


Galloping Complication
----------------------
The description above was for merge_lo. merge_hi has to merge "from the
other end", and really needs to gallop starting at the last element in a run
instead of the first. Galloping from the first still works, but does more
comparisons than it should (this is significant -- I timed it both ways).
For this reason, the gallop_left() and gallop_right() functions have a
"hint" argument, which is the index at which galloping should begin. So
galloping can actually start at any index, and proceed at offsets of 1, 3,
7, 15, ... or -1, -3, -7, -15, ... from the starting index.

In the code as I type it's always called with either 0 or n-1 (where n is
the # of elements in a run). It's tempting to try to do something fancier,
melding galloping with some form of interpolation search; for example, if
we're merging a run of length 1 with a run of length 10000, index 5000 is
probably a better guess at the final result than either 0 or 9999. But
it's unclear how to generalize that intuition usefully, and merging of
wildly unbalanced runs already enjoys excellent performance.

~sort is a good example of when balanced runs could benefit from a better
hint value: to the extent possible, this would like to use a starting
offset equal to the previous value of acount/bcount. Doing so saves about
10% of the compares in ~sort. However, doing so is also a mixed bag,
hurting other cases.


Comparing Average # of Compares on Random Arrays
------------------------------------------------
[NOTE: This was done when the new algorithm used about 0.1% more compares
 on random data than does its current incarnation.]

Here list.sort() is samplesort, and list.msort() this sort:

"""
import random
from time import clock as now

def fill(n):
    from random import random
    return [random() for i in xrange(n)]

def mycmp(x, y):
    global ncmp
    ncmp += 1
    return cmp(x, y)

def timeit(values, method):
    global ncmp
    X = values[:]
    bound = getattr(X, method)
    ncmp = 0
    t1 = now()
    bound(mycmp)
    t2 = now()
    return t2-t1, ncmp

format = "%5s %9.2f %11d"
f2 = "%5s %9.2f %11.2f"

def drive():
    count = sst = sscmp = mst = mscmp = nelts = 0
    while True:
        n = random.randrange(100000)
        nelts += n
        x = fill(n)

        t, c = timeit(x, 'sort')
        sst += t
        sscmp += c

        t, c = timeit(x, 'msort')
        mst += t
        mscmp += c

        count += 1
        if count % 10:
            continue

        print "count", count, "nelts", nelts
        print format % ("sort",  sst, sscmp)
        print format % ("msort", mst, mscmp)
        print f2 % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)

drive()
"""

I ran this on Windows and kept using the computer lightly while it was
running. time.clock() is wall-clock time on Windows, with better than
microsecond resolution. samplesort started with a 1.52% #-of-comparisons
disadvantage, fell quickly to 1.48%, and then fluctuated within that small
range. Here's the last chunk of output before I killed the job:

count 2630 nelts 130906543
 sort   6110.80  1937887573
msort   6002.78  1909389381
           1.80        1.49

We've done nearly 2 billion comparisons apiece at Python speed there, and
that's enough <wink>.

For random arrays of size 2 (yes, there are only 2 interesting ones),
samplesort has a 50%(!) comparison disadvantage. This is a consequence of
samplesort special-casing at most one ascending run at the start, then
falling back to the general case if it doesn't find an ascending run
immediately. The consequence is that it ends up using two compares to sort
[2, 1]. Gratifyingly, timsort doesn't do any special-casing, so had to be
taught how to deal with mixtures of ascending and descending runs
efficiently in all cases.