lib: more scalable list_sort()
XFS and UBIFS can pass long lists to list_sort(); this alternative
implementation scales better, reaching a ~3x performance gain once the
list no longer fits in the L2 cache.
Timings are from a stand-alone program run on a Core 2 Duo (L1=32KB,
L2=4MB), built with gcc-4.4 using flags extracted from an Ubuntu kernel
build. Object size is 581 bytes, compared to 455 for Mark J. Roberts' code.
The worst case for both implementations is a list length just over a power
of two, and both degrade to roughly the same degree, so here are timing
results for a range of 2^N+1 lengths. List elements were 16 bytes each,
including malloc overhead; initial order was random.

                           time (msec)
                           Tatham-Roberts
                           |       generic-Mullis-v2
loop_count    length       |       |   ratio
   4000000         2     206     294   1.427
   2000000         3     176     227   1.289
   1000000         5     199     172   0.864
    500000         9     235     178   0.757
    250000        17     243     182   0.748
    125000        33     261     196   0.750
     62500        65     277     209   0.754
     31250       129     292     219   0.750
     15625       257     317     235   0.741
      7812       513     340     252   0.741
      3906      1025     362     267   0.737
      1953      2049     388     283   0.729  ~ L1 size
       976      4097     556     323   0.580
       488      8193     678     361   0.532
       244     16385     773     395   0.510
       122     32769     844     418   0.495
        61     65537     917     454   0.495
        30    131073    1128     543   0.481
        15    262145    2355     869   0.369  ~ L2 size
         7    524289    5597    1714   0.306
         3   1048577    6218    2022   0.325

Mark's code does not actually implement the usual or generic mergesort,
but rather a variant from Simon Tatham described here:
http://www.chiark.greenend.org.uk/~sgtatham/algorithms/listsort.html
Simon's algorithm performs O(log N) passes over the entire input list,
doing merges of sublists that double in size on each pass. The generic
algorithm instead merges pairs of equal length lists as early as possible,
in recursive order. For either algorithm, the elements that extend the
list beyond a power-of-two length are a special case, handled as nearly as
possible as a "rounding-up" to a full power of two.
Some intuition for the locality-of-reference implications of merge order
can be gained by watching this animation:
http://www.sorting-algorithms.com/merge-sort
Simon's algorithm requires only O(1) extra space rather than the generic
algorithm's O(log N), but in my non-recursive implementation the actual
O(log N) data is merely a vector of ~20 pointers, which I've put on the
stack.
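
As a rough user-space illustration of the merge order described above (a
sketch under stated assumptions, not the code in this patch): the names
struct node, merge(), sort_list() and MAX_BITS are hypothetical, the list is
singly linked, and keys are compared directly instead of through a cmp()
callback. Slot k of the pending vector holds a sorted sublist of 2^k
elements, so equal-length sublists get merged as soon as two of them exist,
much like carry propagation in a binary counter.

```c
#include <stddef.h>

#define MAX_BITS 20	/* assumed size of the pending-sublist vector */

struct node {
	int key;
	int serial;		/* original position, to observe stability */
	struct node *next;
};

/* Merge two sorted, NULL-terminated lists; ties are taken from a first. */
static struct node *merge(struct node *a, struct node *b)
{
	struct node head, *tail = &head;

	while (a && b) {
		if (a->key <= b->key) {
			tail->next = a;
			a = a->next;
		} else {
			tail->next = b;
			b = b->next;
		}
		tail = tail->next;
	}
	tail->next = a ? a : b;
	return head.next;
}

/*
 * pending[k] is either NULL or a sorted sublist of 2^k elements; the top
 * slot absorbs everything else if the input is longer than 2^MAX_BITS.
 */
static struct node *sort_list(struct node *list)
{
	struct node *pending[MAX_BITS + 1] = { NULL };
	struct node *result = NULL;
	int k;

	while (list) {
		struct node *cur = list;

		list = list->next;
		cur->next = NULL;

		/* Merge equal-length sublists as early as possible. */
		for (k = 0; k < MAX_BITS && pending[k]; k++) {
			cur = merge(pending[k], cur);
			pending[k] = NULL;
		}
		if (k == MAX_BITS && pending[k])
			cur = merge(pending[k], cur);
		pending[k] = cur;
	}

	/* Fold the leftovers, smallest (most recent) sublist first. */
	for (k = 0; k <= MAX_BITS; k++)
		if (pending[k])
			result = merge(pending[k], result);
	return result;
}
```

In this sketch the older sublist is always passed as the first argument to
merge(), which is what keeps equal keys in their original order. The final
fold is where a non-power-of-two length gets handled.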
Long-running list_sort() calls: If the list passed in may be long, or the
client's cmp() callback function is slow, the client's cmp() may
periodically invoke cond_resched() to voluntarily yield the CPU. All
inner loops of list_sort() call back to cmp().
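
A client callback along these lines might look like the following
user-space sketch. Everything here is an assumption made for illustration:
struct item, struct cmp_state, item_cmp() and YIELD_INTERVAL are
hypothetical names, the callback takes the items directly rather than
struct list_head pointers, and POSIX sched_yield() stands in for
cond_resched(), which is only available inside the kernel.

```c
#include <sched.h>	/* sched_yield(): user-space stand-in for cond_resched() */

#define YIELD_INTERVAL 1024	/* hypothetical: yield once per 1024 comparisons */

struct cmp_state {
	unsigned long calls;	/* comparisons seen so far */
	unsigned long yields;	/* times the CPU was offered up */
};

struct item {
	int key;
};

/* Returns <0, 0 or >0; priv carries the per-sort counters. */
static int item_cmp(void *priv, const struct item *a, const struct item *b)
{
	struct cmp_state *st = priv;

	if (++st->calls % YIELD_INTERVAL == 0) {
		sched_yield();	/* kernel code would call cond_resched() here */
		st->yields++;
	}
	return (a->key > b->key) - (a->key < b->key);
}
```

Because every inner loop of the sort calls back into cmp(), counting calls
in the callback bounds the amount of work done between yields without any
change to the sort itself.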
Stability of the sort: distinct elements that compare equal emerge from
the sort in the same order as with Mark's code, for simple test cases. A
boot-time test is provided to verify this and other correctness
requirements.
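
One way to check stability, sketched here in user space with hypothetical
names (struct el, sorted_and_stable()) and not taken from the boot-time test
itself: tag each element with a serial number recording its original
position, sort, then verify that values are non-decreasing and that equal
values still appear in ascending serial order.

```c
#include <stddef.h>

struct el {
	int value;		/* the sort key */
	unsigned int serial;	/* position in the unsorted input */
	struct el *next;
};

/*
 * Return 1 if the list is sorted by value and elements with equal values
 * keep their original relative order, 0 otherwise.
 */
static int sorted_and_stable(const struct el *head)
{
	for (; head && head->next; head = head->next) {
		const struct el *n = head->next;

		if (head->value > n->value)
			return 0;	/* out of order */
		if (head->value == n->value && head->serial >= n->serial)
			return 0;	/* equal elements were reordered */
	}
	return 1;
}
```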
A kernel that uses drm.ko appears to run normally with this change; I have
no suitable hardware to similarly test the use by UBIFS.
[akpm@linux-foundation.org: style tweaks, fix comment, make list_sort_test __init]
Signed-off-by: Don Mullis <don.mullis@gmail.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>