Tim Peters, 2002-08-01 (commit 92f81f2)

Intro
-----
This describes an adaptive, stable, natural mergesort, modestly called
timsort (hey, I earned it <wink>).  It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays.

In a nutshell, the main routine marches over the array once, left to right,
alternately identifying the next run, then merging it into the previous
runs "intelligently".  Everything else is complication for speed, and some
hard-won measure of memory efficiency.


Comparison with Python's Samplesort Hybrid
------------------------------------------
+ timsort can require a temp array containing as many as N//2 pointers,
  which means as many as 2*N extra bytes on 32-bit boxes.  It can be
  expected to require a temp array this large when sorting random data; on
  data with significant structure, it may get away without using any extra
  heap memory.  This appears to be the strongest argument against it, but
  compared to the size of an object, 2 temp bytes per element worst-case
  (also expected-case for random data) doesn't scare me much.

  It turns out that Perl is moving to a stable mergesort, and the code for
  that appears always to require a temp array with room for at least N
  pointers.  (Note that I wouldn't want to do that even if space weren't an
  issue; I believe timsort's efforts at memory frugality also save it
  significant pointer-copying costs, and allow it to have a smaller working
  set.)

+ Across about four hours of generating random arrays, and sorting them
  under both methods, samplesort required about 1.5% more comparisons
  (the program is at the end of this file).

+ In real life, this may be faster or slower on random arrays than
  samplesort was, depending on platform quirks.  Since it does fewer
  comparisons on average, it can be expected to do better the more
  expensive a comparison function is.  OTOH, it does more data movement
  (pointer copying) than samplesort, and that may negate its small
  comparison advantage (depending on platform quirks) unless comparison
  is very expensive.

+ On arrays with many kinds of pre-existing order, this blows samplesort out
  of the water.  It's significantly faster than samplesort even on some
  cases samplesort was special-casing the snot out of.  I believe that lists
  very often do have exploitable partial order in real life, and this is the
  strongest argument in favor of timsort (indeed, samplesort's special cases
  for extreme partial order are appreciated by real users, and timsort goes
  much deeper than those, in particular naturally covering every case where
  someone has suggested "and it would be cool if list.sort() had a special
  case for this too ... and for that ...").

+ Here are exact comparison counts across all the tests in sortperf.py,
  when run with arguments "15 20 1".

  First the trivial cases, trivial for samplesort because it special-cased
  them, and trivial for timsort because it naturally works on runs.  Within
  an "n" block, the first line gives the # of compares done by samplesort,
  the second line by timsort, and the third line is the percentage by
  which the samplesort count exceeds the timsort count:

        n    \sort    /sort    =sort
  -------  -------  -------  -------
    32768    32768    32767    32767  samplesort
             32767    32767    32767  timsort
             0.00%    0.00%    0.00%  (samplesort - timsort) / timsort

    65536    65536    65535    65535
             65535    65535    65535
             0.00%    0.00%    0.00%

   131072   131072   131071   131071
            131071   131071   131071
             0.00%    0.00%    0.00%

   262144   262144   262143   262143
            262143   262143   262143
             0.00%    0.00%    0.00%

   524288   524288   524287   524287
            524287   524287   524287
             0.00%    0.00%    0.00%

  1048576  1048576  1048575  1048575
           1048575  1048575  1048575
             0.00%    0.00%    0.00%

  The algorithms are effectively identical in these cases, except that
  timsort does one less compare in \sort.

  Now for the more interesting cases.  lg(n!) is the information-theoretic
  limit for the best any comparison-based sorting algorithm can do on
  average (across all permutations).  When a method gets significantly
  below that, it's either astronomically lucky, or is finding exploitable
  structure in the data.
| 97 | |
Tim Peters | e05f65a | 2002-08-10 05:21:15 +0000 | [diff] [blame] | 98 | n lg(n!) *sort 3sort +sort %sort ~sort !sort |
| 99 | ------- ------- ------ ------- ------- ------ ------- -------- |
| 100 | 32768 444255 453096 453614 32908 452871 130491 469141 old |
| 101 | 448885 33016 33007 50426 182083 65534 new |
| 102 | 0.94% 1273.92% -0.30% 798.09% -28.33% 615.87% %ch from new |
Tim Peters | 92f81f2 | 2002-08-01 00:59:42 +0000 | [diff] [blame] | 103 | |
Tim Peters | b80595f | 2002-08-10 03:04:33 +0000 | [diff] [blame] | 104 | 65536 954037 972699 981940 65686 973104 260029 1004607 |
Tim Peters | e05f65a | 2002-08-10 05:21:15 +0000 | [diff] [blame] | 105 | 962991 65821 65808 101667 364341 131070 |
| 106 | 1.01% 1391.83% -0.19% 857.15% -28.63% 666.47% |
Tim Peters | 92f81f2 | 2002-08-01 00:59:42 +0000 | [diff] [blame] | 107 | |
Tim Peters | b80595f | 2002-08-10 03:04:33 +0000 | [diff] [blame] | 108 | 131072 2039137 2101881 2091491 131232 2092894 554790 2161379 |
Tim Peters | e05f65a | 2002-08-10 05:21:15 +0000 | [diff] [blame] | 109 | 2057533 131410 131361 206193 728871 262142 |
| 110 | 2.16% 1491.58% -0.10% 915.02% -23.88% 724.51% |
Tim Peters | 92f81f2 | 2002-08-01 00:59:42 +0000 | [diff] [blame] | 111 | |
Tim Peters | b80595f | 2002-08-10 03:04:33 +0000 | [diff] [blame] | 112 | 262144 4340409 4464460 4403233 262314 4445884 1107842 4584560 |
Tim Peters | e05f65a | 2002-08-10 05:21:15 +0000 | [diff] [blame] | 113 | 4377402 262437 262459 416347 1457945 524286 |
| 114 | 1.99% 1577.82% -0.06% 967.83% -24.01% 774.44% |
Tim Peters | 92f81f2 | 2002-08-01 00:59:42 +0000 | [diff] [blame] | 115 | |
Tim Peters | 3ddb856 | 2002-08-10 07:04:01 +0000 | [diff] [blame] | 116 | 524288 9205096 9453356 9408463 524468 9441930 2218577 9692015 |
| 117 | 9278734 524580 524633 837947 2916107 1048574 |
Tim Peters | e05f65a | 2002-08-10 05:21:15 +0000 | [diff] [blame] | 118 | 1.88% 1693.52% -0.03% 1026.79% -23.92% 824.30% |
Tim Peters | 92f81f2 | 2002-08-01 00:59:42 +0000 | [diff] [blame] | 119 | |
Tim Peters | b80595f | 2002-08-10 03:04:33 +0000 | [diff] [blame] | 120 | 1048576 19458756 19950272 19838588 1048766 19912134 4430649 20434212 |
Tim Peters | e05f65a | 2002-08-10 05:21:15 +0000 | [diff] [blame] | 121 | 19606028 1048958 1048941 1694896 5832445 2097150 |
| 122 | 1.76% 1791.27% -0.02% 1074.83% -24.03% 874.38% |

  Discussion of cases:

  *sort:  There's no structure in random data to exploit, so the theoretical
  limit is lg(n!).  Both methods get close to that, and timsort is hugging
  it (indeed, in a *marginal* sense, it's a spectacular improvement --
  there's only about 1% left before hitting the wall, and timsort knows
  darned well it's doing compares that won't pay on random data -- but so
  does the samplesort hybrid).  For contrast, Hoare's original random-pivot
  quicksort does about 39% more compares than the limit, and the median-of-3
  variant about 19% more.

  3sort, %sort, and !sort:  No contest; there's structure in this data, but
  not of the specific kinds samplesort special-cases.  Note that structure
  in !sort wasn't put there on purpose -- it was crafted as a worst case for
  a previous quicksort implementation.  That timsort nails it came as a
  surprise to me (although it's obvious in retrospect).

  +sort:  samplesort special-cases this data, and does a few fewer compares
  than timsort.  However, timsort runs this case significantly faster on all
  boxes we have timings for, because timsort is in the business of merging
  runs efficiently, while samplesort does much more data movement in this
  (for it) special case.

  ~sort:  samplesort's special cases for large masses of equal elements are
  extremely effective on ~sort's specific data pattern, and timsort just
  isn't going to get close to that, despite that it's clearly getting a
  great deal of benefit out of the duplicates (the # of compares is much less
  than lg(n!)).  ~sort has a perfectly uniform distribution of just 4
  distinct values, and as the distribution gets more skewed, samplesort's
  equal-element gimmicks become less effective, while timsort's adaptive
  strategies find more to exploit; in a database supplied by Kevin Altis, a
  sort on its highly skewed "on which stock exchange does this company's
  stock trade?" field ran over twice as fast under timsort.

  However, despite that timsort does many more comparisons on ~sort, and
  that on several platforms ~sort runs highly significantly slower under
  timsort, on other platforms ~sort runs highly significantly faster under
  timsort.  No other kind of data has shown this wild x-platform behavior,
  and we don't have an explanation for it.  The only thing I can think of
  that could transform what "should be" highly significant slowdowns into
  highly significant speedups on some boxes are catastrophic cache effects
  in samplesort.

  But timsort "should be" slower than samplesort on ~sort, so it's hard
  to count that it isn't on some boxes as a strike against it <wink>.

+ Here's the highwater mark for the number of heap-based temp slots (4
  bytes each on this box) needed by each test, again with arguments
  "15 20 1":

     2**i   *sort  \sort  /sort   3sort  +sort   %sort   ~sort  =sort   !sort
    32768   16384      0      0    6256      0   10821   12288      0   16383
    65536   32766      0      0   21652      0   31276   24576      0   32767
   131072   65534      0      0   17258      0   58112   49152      0   65535
   262144  131072      0      0   35660      0  123561   98304      0  131071
   524288  262142      0      0   31302      0  212057  196608      0  262143
  1048576  524286      0      0  312438      0  484942  393216      0  524287

  Discussion:  The tests that end up doing (close to) perfectly balanced
  merges (*sort, !sort) need all N//2 temp slots (or almost all).  ~sort
  also ends up doing balanced merges, but systematically benefits a lot from
  the preliminary pre-merge searches described under "Merge Memory" later.
  %sort approaches having a balanced merge at the end because the random
  selection of elements to replace is expected to produce an out-of-order
  element near the midpoint.  \sort, /sort, =sort are the trivial one-run
  cases, needing no merging at all.  +sort ends up having one very long run
  and one very short, and so gets all the temp space it needs from the small
  temparray member of the MergeState struct (note that the same would be
  true if the new random elements were prefixed to the sorted list instead,
  but not if they appeared "in the middle").  3sort approaches N//3 temp
  slots twice, but the run lengths that remain after 3 random exchanges
  clearly have very high variance.


A detailed description of timsort follows.

Runs
----
count_run() returns the # of elements in the next run.  A run is either
"ascending", which means non-decreasing:

    a0 <= a1 <= a2 <= ...

or "descending", which means strictly decreasing:

    a0 > a1 > a2 > ...

Note that a run is always at least 2 long, unless we start at the array's
last element.

The definition of descending is strict, because the main routine reverses
a descending run in-place, transforming a descending run into an ascending
run.  Reversal is done via the obvious fast "swap elements starting at each
end, and converge at the middle" method, and that can violate stability if
the slice contains any equal elements.  Using a strict definition of
descending ensures that a descending run contains distinct elements.
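
The run-detection rule can be sketched in pure Python (a hypothetical
helper for plain lists; the real count_run is C code in listobject.c and
goes through the full rich-comparison machinery):

```python
def count_run(a, lo):
    """Return the length of the run beginning at a[lo], reversing a
    descending run in place so the run is always left ascending."""
    n = len(a)
    if lo >= n - 1:                  # at the last element: run of length 1
        return 1
    run = 2
    if a[lo + 1] < a[lo]:            # strictly descending run
        while lo + run < n and a[lo + run] < a[lo + run - 1]:
            run += 1
        # Strictness means the run's elements are pairwise distinct,
        # so reversing it cannot violate stability.
        a[lo:lo + run] = a[lo:lo + run][::-1]
    else:                            # "ascending" run: non-decreasing
        while lo + run < n and not (a[lo + run] < a[lo + run - 1]):
            run += 1
    return run
```

For example, count_run([3, 2, 1, 5], 0) reports a run of length 3 and
leaves the list as [1, 2, 3, 5].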

If an array is random, it's very unlikely we'll see long runs.  If a natural
run contains less than minrun elements (see next section), the main loop
artificially boosts it to minrun elements, via a stable binary insertion sort
applied to the right number of array elements following the short natural
run.  In a random array, *all* runs are likely to be minrun long as a
result.  This has two primary good effects:

1. Random data strongly tends then toward perfectly balanced (both runs have
   the same length) merges, which is the most efficient way to proceed when
   data is random.

2. Because runs are never very short, the rest of the code doesn't make
   heroic efforts to shave a few cycles off per-merge overheads.  For
   example, reasonable use of function calls is made, rather than trying to
   inline everything.  Since there are no more than N/minrun runs to begin
   with, a few "extra" function calls per merge is barely measurable.
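
The boost step can be sketched as a stable binary insertion sort (a
hypothetical pure-Python version; `start` marks how much of a[lo:hi] is
already sorted, i.e. the end of the short natural run):

```python
from bisect import bisect_right

def binary_insertion_sort(a, lo, hi, start):
    """Stably sort a[lo:hi] in place, assuming a[lo:start] is sorted."""
    for i in range(start, hi):
        pivot = a[i]
        # bisect_right puts pivot after any equal elements already in
        # place, which is exactly what stability requires.
        pos = bisect_right(a, pivot, lo, i)
        a[pos + 1:i + 1] = a[pos:i]   # shift the tail right one slot
        a[pos] = pivot
```

The binary search spends O(lg n) compares per element; it's the O(n)
data movement per insertion that makes large minrun values hurt.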


Computing minrun
----------------
If N < 64, minrun is N.  IOW, binary insertion sort is used for the whole
array then; it's hard to beat that given the overheads of trying something
fancier.

When N is a power of 2, testing on random data showed that minrun values of
16, 32, 64 and 128 worked about equally well.  At 256 the data-movement cost
in binary insertion sort clearly hurt, and at 8 the increase in the number
of function calls clearly hurt.  Picking *some* power of 2 is important
here, so that the merges end up perfectly balanced (see next section).  We
pick 32 as a good value in the sweet range; picking a value at the low end
allows the adaptive gimmicks more opportunity to exploit shorter natural
runs.

Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case!  Consider N=2112:

>>> divmod(2112, 32)
(66, 0)
>>>

If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32.  The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end.  The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and-- mergesort's
bugaboo relative to samplesort --a lot more data movement (O(N) copies just
to get 64 elements into place).

If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced.  Better!

What we want to avoid is picking minrun such that in

    q, r = divmod(N, minrun)

q is a power of 2 and r>0 (then the last merge only gets r elements into
place, and r < minrun is small compared to N), or q a little larger than a
power of 2 regardless of r (then we've got a case similar to "2112", again
leaving too little work for the last merge to do).

Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
power of 2, or if that isn't possible, is close to, but strictly less than,
a power of 2.  This is easier to do than it may sound:  take the first 6
bits of N, and add 1 if any of the remaining bits are set.  In fact, that
rule covers every case in this section, including small N and exact powers
of 2; merge_compute_minrun() is a deceptively simple function.
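
The six-bit rule translates almost directly into code; this is essentially
merge_compute_minrun() transcribed into Python:

```python
def merge_compute_minrun(n):
    """Compute a good value for minrun: take the first 6 bits of n,
    and add 1 if any of the remaining bits are set."""
    r = 0                 # becomes 1 if any bit shifted out was set
    while n >= 64:
        r |= n & 1
        n >>= 1
    return n + r
```

Note that the "2112" example above now comes out perfectly:
merge_compute_minrun(2112) is 33, and 2112 == 64 * 33.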


The Merge Pattern
-----------------
In order to exploit regularities in the data, we're merging on natural
run lengths, and they can become wildly unbalanced.  That's a Good Thing
for this sort!  It means we have to find a way to manage an assortment of
potentially very different run lengths, though.

Stability constrains permissible merging patterns.  For example, if we have
3 consecutive runs of lengths

    A:10000  B:20000  C:10000

we dare not merge A with C first, because if A, B and C happen to contain
a common element, it would get out of order wrt its occurrence(s) in B.  The
merging must be done as (A+B)+C or A+(B+C) instead.

So merging is always done on two consecutive runs at a time, and in-place,
although this may require some temp memory (more on that later).

When a run is identified, its base address and length are pushed on a stack
in the MergeState struct.  merge_collapse() is then called to see whether it
should merge it with preceding run(s).  We would like to delay merging as
long as possible in order to exploit patterns that may come up later, but we
like even more to do merging as soon as possible to exploit that the run just
found is still high in the memory hierarchy.  We also can't delay merging
"too long" because it consumes memory to remember the runs that are still
unmerged, and the stack has a fixed size.

What turned out to be a good compromise maintains two invariants on the
stack entries, where A, B and C are the lengths of the three rightmost
not-yet merged slices:

1.  A > B+C
2.  B > C

Note that, by induction, #2 implies the lengths of pending runs form a
decreasing sequence.  #1 implies that, reading the lengths right to left,
the pending-run lengths grow at least as fast as the Fibonacci numbers.
Therefore the stack can never grow larger than about log_base_phi(N) entries,
where phi = (1+sqrt(5))/2 ~= 1.618.  Thus a small # of stack slots suffice
for very large arrays.

If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
e.g., if the last 3 entries are

    A:30  B:20  C:10

then B is merged with C, leaving

    A:30  BC:30

on the stack.  Or if they were

    A:500  B:400  C:1000

then A is merged with B, leaving

    AB:900  C:1000

on the stack.

In both examples, the stack configuration after the merge still violates
invariant #2, and merge_collapse() goes on to continue merging runs until
both invariants are satisfied.  As an extreme case, suppose we didn't do the
minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
and 2.  Nothing would get merged until the final 2 was seen, and that would
trigger 7 perfectly balanced merges.
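
The invariant maintenance can be sketched on run lengths alone (a toy
model of merge_collapse(): summing two adjacent lengths stands in for
actually merging the runs; the real code works on the MergeState stack of
base/length pairs):

```python
def merge_collapse(runs):
    """runs: pending run lengths, rightmost = most recently found.
    Collapse adjacent entries until A > B+C and B > C both hold."""
    while len(runs) > 1:
        n = len(runs) - 2            # index of B in the ...A, B, C suffix
        if n > 0 and runs[n - 1] <= runs[n] + runs[n + 1]:
            # A <= B+C: merge B with the smaller of A and C; on a tie,
            # favor C (freshness in cache).
            if runs[n - 1] < runs[n + 1]:
                n -= 1               # merge A with B instead
            runs[n:n + 2] = [runs[n] + runs[n + 1]]
        elif runs[n] <= runs[n + 1]:
            runs[n:n + 2] = [runs[n] + runs[n + 1]]   # B <= C
        else:
            break                    # both invariants hold
    return runs
```

Replaying the examples above: [30, 20, 10] and [500, 400, 1000] both keep
collapsing until a single run remains, and the extreme case
[128, 64, 32, 16, 8, 4, 2, 2] triggers the cascade of 7 balanced merges.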

The thrust of these rules when they trigger merging is to balance the run
lengths as closely as possible, while keeping a low bound on the number of
runs we have to remember.  This is maximally effective for random data,
where all runs are likely to be of (artificially forced) length minrun, and
then we get a sequence of perfectly balanced merges (with, perhaps, some
oddballs at the end).

OTOH, one reason this sort is so good for partly ordered data has to do
with wildly unbalanced run lengths.


Merge Memory
------------
Merging adjacent runs of lengths A and B in-place is very difficult.
Theoretical constructions are known that can do it, but they're too difficult
and slow for practical use.  But if we have temp memory equal to min(A, B),
it's easy.

If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
and then we can do the obvious merge algorithm left to right, from the temp
area and B, starting the stores into where A used to live.  There's always a
free area in the original area comprising a number of elements equal to the
number not yet merged from the temp array (trivially true at the start;
proceed by induction).  The only tricky bit is that if a comparison raises an
exception, we have to remember to copy the remaining elements back in from
the temp area, lest the array end up with duplicate entries from B.  But
that's exactly the same thing we need to do if we reach the end of B first,
so the exit code is pleasantly common to both the normal and error cases.
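
Stripped of galloping and error handling, the core of that merge can be
sketched as (a hypothetical pure-Python reduction; the real merge_lo is C
and far more careful):

```python
def merge_lo(a, lo, mid, hi):
    """Merge the adjacent sorted runs a[lo:mid] and a[mid:hi] in place,
    where the left run is the shorter one."""
    tmp = a[lo:mid]                 # copy of the shorter run A
    i, j, dest = 0, mid, lo
    while i < len(tmp) and j < hi:
        if a[j] < tmp[i]:           # strict <: ties go to A, so equal
            a[dest] = a[j]          # elements keep their order (stability)
            j += 1
        else:
            a[dest] = tmp[i]
            i += 1
        dest += 1                   # dest never catches up to j, so B's
                                    # unmerged tail is never overwritten
    a[dest:dest + len(tmp) - i] = tmp[i:]   # flush A's leftovers
```

The final flush is the common exit code: whether we got there by
exhausting B or (in the real version) by a comparison raising, copying
the rest of the temp area back is exactly what's needed.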

If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
much the same, except that we need to merge right to left, copying B into a
temp array and starting the stores at the right end of where B used to live.

A refinement:  When we're about to merge adjacent runs A and B, we first do
a form of binary search (more on that later) to see where B[0] should end up
in A.  Elements in A preceding that point are already in their final
positions, effectively shrinking the size of A.  Likewise we also search to
see where A[-1] should end up in B, and elements of B after that point can
also be ignored.  This cuts the amount of temp memory needed by the same
amount.

These preliminary searches may not pay off, and can be expected *not* to
repay their cost if the data is random.  But they can win huge in all of
time, copying, and memory savings when they do pay, so this is one of the
"per-merge overheads" mentioned above that we're happy to endure because
there is at most one very short run.  It's generally true in this algorithm
that we're willing to gamble a little to win a lot, even though the net
expectation is negative for random data.


Merge Algorithms
----------------
merge_lo() and merge_hi() are where the bulk of the time is spent.  merge_lo
deals with runs where A <= B, and merge_hi where A > B.  They don't know
whether the data is clustered or uniform, but a lovely thing about merging
is that many kinds of clustering "reveal themselves" by how many times in a
row the winning merge element comes from the same run.  We'll only discuss
merge_lo here; merge_hi is exactly analogous.

Merging begins in the usual, obvious way, comparing the first element of A
to the first of B, and moving B[0] to the merge area if it's less than A[0],
else moving A[0] to the merge area.  Call that the "one pair at a time"
mode.  The only twist here is keeping track of how many times in a row "the
winner" comes from the same run.

If that count reaches MIN_GALLOP, we switch to "galloping mode".  Here
we *search* B for where A[0] belongs, and move over all the B's before
that point in one chunk to the merge area, then move A[0] to the merge
area.  Then we search A for where B[0] belongs, and similarly move a
slice of A in one chunk.  Then back to searching B for where A[0] belongs,
etc.  We stay in galloping mode until both searches find slices to copy
less than MIN_GALLOP elements long, at which point we go back to one-pair-
at-a-time mode.

A refinement:  The MergeState struct contains the value of min_gallop that
controls when we enter galloping mode, initialized to MIN_GALLOP.
merge_lo() and merge_hi() adjust this higher when galloping isn't paying
off, and lower when it is.


Galloping
---------
Still without loss of generality, assume A is the shorter run.  In galloping
mode, we first look for A[0] in B.  We do this via "galloping", comparing
A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1].  This takes at most
roughly lg(B) comparisons, and, unlike a straight binary search, favors
finding the right spot early in B (more on that later).

After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
consecutive elements, and a straight binary search requires exactly k-1
additional comparisons to nail it.  Then we copy all the B's up to that
point in one chunk, and then copy A[0].  Note that no matter where A[0]
belongs in B, the combination of galloping + binary search finds it in no
more than about 2*lg(B) comparisons.
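
The two-phase search can be sketched for plain Python lists (hypothetical;
the real gallop_left in listobject.c also takes a hint index so it can
gallop outward from a likely spot rather than always from B's left end):

```python
from bisect import bisect_left

def gallop_left(key, b):
    """Return i such that all of b[:i] < key <= all of b[i:], galloping
    through b[0], b[1], b[3], b[7], ... then binary-searching the
    remaining region of uncertainty."""
    if not b or key <= b[0]:
        return 0
    last, ofs = 0, 1                 # invariant: b[last] < key
    while ofs < len(b) and b[ofs] < key:
        last = ofs
        ofs = 2 * ofs + 1            # visits indices 1, 3, 7, ... 2**k - 1
    hi = min(ofs, len(b))
    # Now b[last] < key, and key <= b[hi] if hi is in range: finish with
    # a binary search over the uncertain slice b[last+1:hi].
    return bisect_left(b, key, last + 1, hi)
```

Using strictly-less here means B elements equal to A[0] stay to its right,
preserving stability when A is the left run.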

If we did a straight binary search, we could find it in no more than
ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
comparisons no matter where A[0] belongs.  Straight binary search thus loses
to galloping unless the run is quite long, and we simply can't guess
whether it is in advance.

If data is random and runs have the same length, A[0] belongs at B[0] half
the time, at B[1] a quarter of the time, and so on:  a consecutive winning
sub-run in B of length k occurs with probability 1/2**(k+1).  So long
winning sub-runs are extremely unlikely in random data, and guessing that a
winning sub-run is going to be long is a dangerous game.

OTOH, if data is lopsided or lumpy or contains many duplicates, long
stretches of winning sub-runs are very likely, and cutting the number of
comparisons needed to find one from O(B) to O(log B) is a huge win.

Galloping compromises by getting out fast if there isn't a long winning
sub-run, yet finding such very efficiently when they exist.

I first learned about the galloping strategy in a related context; see:

    "Adaptive Set Intersections, Unions, and Differences" (2000)
    Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro

and its followup(s).  An earlier paper called the same strategy
"exponential search":

    "Optimistic Sorting and Information Theoretic Complexity"
    Peter McIlroy
    SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
    467-474, Austin, Texas, 25-27 January 1993.

and it probably dates back to an earlier paper by Bentley and Yao.  The
McIlroy paper in particular has good analysis of a mergesort that's
probably strongly related to this one in its galloping strategy.


Galloping with a Broken Leg
---------------------------
So why don't we always gallop?  Because it can lose, on two counts:

1. While we're willing to endure small per-merge overheads, per-comparison
   overheads are a different story.  Calling Yet Another Function per
   comparison is expensive, and gallop_left() and gallop_right() are
   too long-winded for sane inlining.

2. Galloping can-- alas --require more comparisons than linear one-at-a-time
   search, depending on the data.

#2 requires details.  If A[0] belongs before B[0], galloping requires 1
compare to determine that, same as linear search, except it costs more
to call the gallop function.  If A[0] belongs right before B[1], galloping
requires 2 compares, again same as linear search.  On the third compare,
galloping checks A[0] against B[3], and if it's <=, requires one more
compare to determine whether A[0] belongs at B[2] or B[3].  That's a total
of 4 compares, but if A[0] does belong at B[2], linear search would have
discovered that in only 3 compares, and that's a huge loss!  Really.  It's
an increase of 33% in the number of compares needed, and comparisons are
expensive in Python.

    index in B where  # compares linear  # gallop  # binary  gallop
    A[0] belongs      search needs       compares  compares  total
    ----------------  -----------------  --------  --------  ------
                   0                  1         1         0       1

                   1                  2         2         0       2

                   2                  3         3         1       4
                   3                  4         3         1       4

                   4                  5         4         2       6
                   5                  6         4         2       6
                   6                  7         4         2       6
                   7                  8         4         2       6

                   8                  9         5         3       8
                   9                 10         5         3       8
                  10                 11         5         3       8
                  11                 12         5         3       8
    ...

In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
The advantage of galloping is unbounded as i grows, but it doesn't win at
all until i=6.  Before then, it loses twice (at i=2 and i=4), and ties
at the other values.  At and after i=6, galloping always wins.
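
The table's "gallop total" column and the break-even point at i=6 follow
directly from the two closed forms:

```python
def linear_compares(i):
    """One-pair-at-a-time search: i+1 compares to learn A[0] goes at B[i]."""
    return i + 1

def gallop_compares(i):
    """Gallop + binary search: 2*floor(lg(i)) + 2 compares for i >= 1,
    and a single compare for i == 0 (floor(lg(i)) == i.bit_length() - 1)."""
    return 1 if i == 0 else 2 * (i.bit_length() - 1) + 2

# Reproduce the gallop-total column for i = 0..11, and find the first
# index where galloping strictly beats linear search.
totals = [gallop_compares(i) for i in range(12)]
first_win = min(i for i in range(1, 100)
                if gallop_compares(i) < linear_compares(i))
```

Here totals comes out [1, 2, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8], matching the
table, and first_win is 6.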

We can't guess in advance when it's going to win, though, so we do one pair
at a time until the evidence seems strong that galloping may pay. MIN_GALLOP
is 7, and that's pretty strong evidence. However, if the data is random, it
simply will trigger galloping mode purely by luck every now and again, and
it's quite likely to hit one of the losing cases next. On the other hand,
in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it
"should be" then. So the MergeState struct keeps a min_gallop variable
that merge_lo and merge_hi adjust: the longer we stay in galloping mode,
the smaller min_gallop gets, making it easier to transition back to
galloping mode (if we ever leave it in the current merge, and at the
start of the next merge). But whenever the gallop loop doesn't pay,
min_gallop is increased by one, making it harder to transition back
to galloping mode (and again both within a merge and across merges). For
random data, this all but eliminates the gallop penalty: min_gallop grows
large enough that we almost never get into galloping mode. And for cases
like ~sort, min_gallop can fall to as low as 1. This seems to work well,
but in all it's a minor improvement over using a fixed MIN_GALLOP value.
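The adaptive scheme can be sketched like this -- a simplified Python
rendering of the merge_lo bookkeeping, not the C source line-for-line.
Plain binary search (bisect) stands in for the exponential gallop probe:

```python
import bisect

def adaptive_merge(a, b, state):
    """Stable merge of sorted lists a and b (ties favor a).

    state['min_gallop'] plays the role of the MergeState field: it
    shrinks while galloping pays and grows when it doesn't, and the
    adjustment persists across calls.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        acount = bcount = 0
        # One-pair-at-a-time mode: count consecutive wins per side.
        while i < len(a) and j < len(b):
            if b[j] < a[i]:
                out.append(b[j]); j += 1
                bcount += 1; acount = 0
            else:
                out.append(a[i]); i += 1
                acount += 1; bcount = 0
            if max(acount, bcount) >= state['min_gallop']:
                break
        # Galloping mode: copy whole winning stretches in one slice.
        while i < len(a) and j < len(b):
            k = bisect.bisect_right(a, b[j], i)   # a[i:k] belong before b[j]
            acount = k - i
            out.extend(a[i:k]); i = k
            if i == len(a):
                break
            k = bisect.bisect_left(b, a[i], j)    # b[j:k] belong before a[i]
            bcount = k - j
            out.extend(b[j:k]); j = k
            if acount < state['min_gallop'] and bcount < state['min_gallop']:
                state['min_gallop'] += 1          # galloping stopped paying
                break
            state['min_gallop'] = max(1, state['min_gallop'] - 1)  # it paid
    out.extend(a[i:]); out.extend(b[j:])
    return out
```

Called repeatedly, the threshold drifts as described: random inputs push
min_gallop up, while inputs whose runs interleave in long stretches pull it
down toward 1.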


Galloping Complication
----------------------
The description above was for merge_lo. merge_hi has to merge "from the
other end", and really needs to gallop starting at the last element in a run
instead of the first. Galloping from the first still works, but does more
comparisons than it should (this is significant -- I timed it both ways).
For this reason, the gallop_left() and gallop_right() functions have a
"hint" argument, which is the index at which galloping should begin. So
galloping can actually start at any index, and proceed at offsets of 1, 3,
7, 15, ... or -1, -3, -7, -15, ... from the starting index.
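A hinted gallop can be sketched as follows. This shows only the forward
(+1, +3, +7, ...) direction and falls back to plain binary search for the
backward case; the C version gallops backward from the hint symmetrically:

```python
import bisect

def gallop_right(key, a, hint):
    """Return the index where key belongs in sorted list a, galloping
    forward from a[hint]; equal elements end up left of the result.
    Simplified sketch, not the CPython C function."""
    if key < a[hint]:
        # Backward case: the real code gallops at -1, -3, -7, ...;
        # here a plain binary search over a[0:hint] stands in.
        return bisect.bisect_right(a, key, 0, hint)
    # Probe at offsets 1, 3, 7, 15, ... from hint until we overshoot.
    ofs, last_ofs = 1, 0
    maxofs = len(a) - hint
    while ofs < maxofs and not (key < a[hint + ofs]):
        last_ofs = ofs
        ofs = (ofs << 1) + 1          # 1, 3, 7, 15, ...
    ofs = min(ofs, maxofs)
    # key now belongs in a(hint+last_ofs, hint+ofs]; binary-search the gap.
    return bisect.bisect_right(a, key, hint + last_ofs + 1, hint + ofs)
```

merge_lo would call this with hint 0 and merge_hi with hint n-1, per the
paragraph above.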

In the code as I type it's always called with either 0 or n-1 (where n is
the # of elements in a run). It's tempting to try to do something fancier,
melding galloping with some form of interpolation search; for example, if
we're merging a run of length 1 with a run of length 10000, index 5000 is
probably a better guess at the final result than either 0 or 9999. But
it's unclear how to generalize that intuition usefully, and merging of
wildly unbalanced runs already enjoys excellent performance.
| 578 | |
Tim Peters | 3ddb856 | 2002-08-10 07:04:01 +0000 | [diff] [blame] | 579 | ~sort is a good example of when balanced runs could benefit from a better |
| 580 | hint value: to the extent possible, this would like to use a starting |
| 581 | offset equal to the previous value of acount/bcount. Doing so saves about |
| 582 | 10% of the compares in ~sort. However, doing so is also a mixed bag, |
| 583 | hurting other cases. |
| 584 | |

Comparing Average # of Compares on Random Arrays
------------------------------------------------
[NOTE: This was done when the new algorithm used about 0.1% more compares
on random data than does its current incarnation.]

Here list.sort() is samplesort, and list.msort() this sort:

"""
import random
from time import clock as now

def fill(n):
    from random import random
    return [random() for i in xrange(n)]

def mycmp(x, y):
    global ncmp
    ncmp += 1
    return cmp(x, y)

def timeit(values, method):
    global ncmp
    X = values[:]
    bound = getattr(X, method)
    ncmp = 0
    t1 = now()
    bound(mycmp)
    t2 = now()
    return t2-t1, ncmp

format = "%5s %9.2f %11d"
f2     = "%5s %9.2f %11.2f"

def drive():
    count = sst = sscmp = mst = mscmp = nelts = 0
    while True:
        n = random.randrange(100000)
        nelts += n
        x = fill(n)

        t, c = timeit(x, 'sort')
        sst += t
        sscmp += c

        t, c = timeit(x, 'msort')
        mst += t
        mscmp += c

        count += 1
        if count % 10:
            continue

        print "count", count, "nelts", nelts
        print format % ("sort", sst, sscmp)
        print format % ("msort", mst, mscmp)
        print f2 % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)

drive()
"""
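That script is Python 2 circa 2002 (xrange, print statements, the
comparison-function argument to sort, and the experimental msort method,
none of which survive in Python 3). A comparable compare-counting harness
in modern Python -- counting only timsort's compares, since samplesort is
long gone -- might look like:

```python
import random

class Counted:
    """Wrap a value so every < comparison is tallied."""
    ncmp = 0           # class-wide counter, shared by all instances
    __slots__ = ('v',)
    def __init__(self, v):
        self.v = v
    def __lt__(self, other):
        Counted.ncmp += 1
        return self.v < other.v

def count_compares(values):
    # sorted() only needs __lt__, so this counts every compare it makes.
    Counted.ncmp = 0
    sorted(Counted(v) for v in values)
    return Counted.ncmp

random.seed(0)
print(count_compares([random.random() for _ in range(10000)]))
# An already-ascending run costs exactly n-1 compares under timsort:
print(count_compares(list(range(1000))))   # -> 999
```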

I ran this on Windows and kept using the computer lightly while it was
running. time.clock() is wall-clock time on Windows, with better than
microsecond resolution. samplesort started with a 1.52% #-of-comparisons
disadvantage, fell quickly to 1.48%, and then fluctuated within that small
range. Here's the last chunk of output before I killed the job:

count 2630 nelts 130906543
 sort   6110.80  1937887573
msort   6002.78  1909389381
           1.80        1.49

We've done nearly 2 billion comparisons apiece at Python speed there, and
that's enough <wink>.

For random arrays of size 2 (yes, there are only 2 interesting ones),
samplesort has a 50%(!) comparison disadvantage. This is a consequence of
samplesort special-casing at most one ascending run at the start, then
falling back to the general case if it doesn't find an ascending run
immediately. The consequence is that it ends up using two compares to sort
[2, 1]. Gratifyingly, timsort doesn't do any special-casing, so had to be
taught how to deal with mixtures of ascending and descending runs
efficiently in all cases.