Georg Brandl | c275e15 | 2010-11-05 07:10:41 +0000 | [diff] [blame] | 1 | .. _sortinghowto: |
| 2 | |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 3 | Sorting HOW TO |
| 4 | ************** |
| 5 | |
| 6 | :Author: Andrew Dalke and Raymond Hettinger |
Raymond Hettinger | b436b6c | 2011-01-12 01:16:57 +0000 | [diff] [blame] | 7 | :Release: 0.1 |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 8 | |
| 9 | |
| 10 | Python lists have a built-in :meth:`list.sort` method that modifies the list |
Benjamin Peterson | 1efb8dc | 2011-01-12 04:44:41 +0000 | [diff] [blame] | 11 | in-place. There is also a :func:`sorted` built-in function that builds a new |
| 12 | sorted list from an iterable. |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 13 | |
| 14 | In this document, we explore the various techniques for sorting data using Python. |
| 15 | |
| 16 | |
| 17 | Sorting Basics |
| 18 | ============== |
| 19 | |
| 20 | A simple ascending sort is very easy: just call the :func:`sorted` function. It |
Raymond Hettinger | b436b6c | 2011-01-12 01:16:57 +0000 | [diff] [blame] | 21 | returns a new sorted list:: |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 22 | |
| 23 | >>> sorted([5, 2, 3, 1, 4]) |
| 24 | [1, 2, 3, 4, 5] |
| 25 | |
Raymond Hettinger | 810cd34 | 2011-02-06 06:11:29 +0000 | [diff] [blame] | 26 | You can also use the :meth:`list.sort` method. It modifies the list |
Serhiy Storchaka | ecf41da | 2016-10-19 16:29:26 +0300 | [diff] [blame] | 27 | in-place (and returns ``None`` to avoid confusion). Usually it's less convenient |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 28 | than :func:`sorted` - but if you don't need the original list, it's slightly |
| 29 | more efficient. |
| 30 | |
| 31 | >>> a = [5, 2, 3, 1, 4] |
| 32 | >>> a.sort() |
| 33 | >>> a |
| 34 | [1, 2, 3, 4, 5] |
| 35 | |
| 36 | Another difference is that the :meth:`list.sort` method is only defined for |
| 37 | lists. In contrast, the :func:`sorted` function accepts any iterable. |
| 38 | |
| 39 | >>> sorted({1: 'D', 2: 'B', 3: 'B', 4: 'E', 5: 'A'}) |
| 40 | [1, 2, 3, 4, 5] |
| 41 | |
| 42 | Key Functions |
| 43 | ============= |
| 44 | |
Raymond Hettinger | 99a5638 | 2012-04-29 09:32:30 -0700 | [diff] [blame] | 45 | Both :meth:`list.sort` and :func:`sorted` have a *key* parameter to specify a |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 46 | function to be called on each list element prior to making comparisons. |
| 47 | |
| 48 | For example, here's a case-insensitive string comparison: |
| 49 | |
| 50 | >>> sorted("This is a test string from Andrew".split(), key=str.lower) |
| 51 | ['a', 'Andrew', 'from', 'is', 'string', 'test', 'This'] |
| 52 | |
| 53 | The value of the *key* parameter should be a function that takes a single argument |
| 54 | and returns a key to use for sorting purposes. This technique is fast because |
| 55 | the key function is called exactly once for each input record. |
| 56 | |
| 57 | A common pattern is to sort complex objects using some of the object's indices |
| 58 | as keys. For example: |
| 59 | |
| 60 | >>> student_tuples = [ |
Zachary Ware | 378a1d7 | 2016-08-09 16:47:04 -0500 | [diff] [blame] | 61 | ... ('john', 'A', 15), |
| 62 | ... ('jane', 'B', 12), |
| 63 | ... ('dave', 'B', 10), |
| 64 | ... ] |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 65 | >>> sorted(student_tuples, key=lambda student: student[2]) # sort by age |
| 66 | [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)] |
| 67 | |
| 68 | The same technique works for objects with named attributes. For example: |
| 69 | |
| 70 | >>> class Student: |
Zachary Ware | 378a1d7 | 2016-08-09 16:47:04 -0500 | [diff] [blame] | 71 | ... def __init__(self, name, grade, age): |
| 72 | ... self.name = name |
| 73 | ... self.grade = grade |
| 74 | ... self.age = age |
| 75 | ... def __repr__(self): |
| 76 | ... return repr((self.name, self.grade, self.age)) |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 77 | |
| 78 | >>> student_objects = [ |
Zachary Ware | 378a1d7 | 2016-08-09 16:47:04 -0500 | [diff] [blame] | 79 | ... Student('john', 'A', 15), |
| 80 | ... Student('jane', 'B', 12), |
| 81 | ... Student('dave', 'B', 10), |
| 82 | ... ] |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 83 | >>> sorted(student_objects, key=lambda student: student.age) # sort by age |
| 84 | [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)] |
| 85 | |
| 86 | Operator Module Functions |
| 87 | ========================= |
| 88 | |
| 89 | The key-function patterns shown above are very common, so Python provides |
Raymond Hettinger | 810cd34 | 2011-02-06 06:11:29 +0000 | [diff] [blame] | 90 | convenience functions to make accessor functions easier and faster. The |
| 91 | :mod:`operator` module has :func:`~operator.itemgetter`, |
Raymond Hettinger | 99a5638 | 2012-04-29 09:32:30 -0700 | [diff] [blame] | 92 | :func:`~operator.attrgetter`, and a :func:`~operator.methodcaller` function. |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 93 | |
| 94 | Using those functions, the above examples become simpler and faster: |
| 95 | |
| 96 | >>> from operator import itemgetter, attrgetter |
| 97 | |
| 98 | >>> sorted(student_tuples, key=itemgetter(2)) |
| 99 | [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)] |
| 100 | |
| 101 | >>> sorted(student_objects, key=attrgetter('age')) |
| 102 | [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)] |
| 103 | |
| 104 | The operator module functions allow multiple levels of sorting. For example, to |
| 105 | sort by *grade* then by *age*: |
| 106 | |
| 107 | >>> sorted(student_tuples, key=itemgetter(1,2)) |
| 108 | [('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)] |
| 109 | |
| 110 | >>> sorted(student_objects, key=attrgetter('grade', 'age')) |
| 111 | [('john', 'A', 15), ('dave', 'B', 10), ('jane', 'B', 12)] |
| 112 | |
| 113 | Ascending and Descending |
| 114 | ======================== |
| 115 | |
| 116 | Both :meth:`list.sort` and :func:`sorted` accept a *reverse* parameter with a |
Raymond Hettinger | 99a5638 | 2012-04-29 09:32:30 -0700 | [diff] [blame] | 117 | boolean value. This is used to flag descending sorts. For example, to get the |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 118 | student data in reverse *age* order: |
| 119 | |
| 120 | >>> sorted(student_tuples, key=itemgetter(2), reverse=True) |
| 121 | [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)] |
| 122 | |
| 123 | >>> sorted(student_objects, key=attrgetter('age'), reverse=True) |
| 124 | [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)] |
| 125 | |
| 126 | Sort Stability and Complex Sorts |
| 127 | ================================ |
| 128 | |
| 129 | Sorts are guaranteed to be `stable |
Georg Brandl | 5d94134 | 2016-02-26 19:37:12 +0100 | [diff] [blame] | 130 | <https://en.wikipedia.org/wiki/Sorting_algorithm#Stability>`_\. That means that |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 131 | when multiple records have the same key, their original order is preserved. |
| 132 | |
| 133 | >>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)] |
| 134 | >>> sorted(data, key=itemgetter(0)) |
| 135 | [('blue', 1), ('blue', 2), ('red', 1), ('red', 2)] |
| 136 | |
| 137 | Notice how the two records for *blue* retain their original order so that |
| 138 | ``('blue', 1)`` is guaranteed to precede ``('blue', 2)``. |
| 139 | |
| 140 | This wonderful property lets you build complex sorts in a series of sorting |
| 141 | steps. For example, to sort the student data by descending *grade* and then |
| 142 | ascending *age*, do the *age* sort first and then sort again using *grade*: |
| 143 | |
| 144 | >>> s = sorted(student_objects, key=attrgetter('age')) # sort on secondary key |
| 145 | >>> sorted(s, key=attrgetter('grade'), reverse=True) # now sort on primary key, descending |
| 146 | [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)] |
| 147 | |
Georg Brandl | 5d94134 | 2016-02-26 19:37:12 +0100 | [diff] [blame] | 148 | The `Timsort <https://en.wikipedia.org/wiki/Timsort>`_ algorithm used in Python |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 149 | does multiple sorts efficiently because it can take advantage of any ordering |
| 150 | already present in a dataset. |
| 151 | |
| 152 | The Old Way Using Decorate-Sort-Undecorate |
| 153 | ========================================== |
| 154 | |
| 155 | This idiom is called Decorate-Sort-Undecorate after its three steps: |
| 156 | |
| 157 | * First, the initial list is decorated with new values that control the sort order. |
| 158 | |
| 159 | * Second, the decorated list is sorted. |
| 160 | |
| 161 | * Finally, the decorations are removed, creating a list that contains only the |
| 162 | initial values in the new order. |
| 163 | |
| 164 | For example, to sort the student data by *grade* using the DSU approach: |
| 165 | |
| 166 | >>> decorated = [(student.grade, i, student) for i, student in enumerate(student_objects)] |
| 167 | >>> decorated.sort() |
| 168 | >>> [student for grade, i, student in decorated] # undecorate |
| 169 | [('john', 'A', 15), ('jane', 'B', 12), ('dave', 'B', 10)] |
| 170 | |
| 171 | This idiom works because tuples are compared lexicographically; the first items |
| 172 | are compared; if they are the same then the second items are compared, and so |
| 173 | on. |
| 174 | |
| 175 | It is not strictly necessary in all cases to include the index *i* in the |
| 176 | decorated list, but including it gives two benefits: |
| 177 | |
| 178 | * The sort is stable -- if two items have the same key, their order will be |
| 179 | preserved in the sorted list. |
| 180 | |
| 181 | * The original items do not have to be comparable because the ordering of the |
| 182 | decorated tuples will be determined by at most the first two items. So for |
| 183 | example the original list could contain complex numbers which cannot be sorted |
| 184 | directly. |
| 185 | |
| 186 | Another name for this idiom is |
Georg Brandl | 5d94134 | 2016-02-26 19:37:12 +0100 | [diff] [blame] | 187 | `Schwartzian transform <https://en.wikipedia.org/wiki/Schwartzian_transform>`_\, |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 188 | after Randal L. Schwartz, who popularized it among Perl programmers. |
| 189 | |
| 190 | Now that Python sorting provides key-functions, this technique is not often needed. |
| 191 | |
| 192 | |
| 193 | The Old Way Using the *cmp* Parameter |
| 194 | ===================================== |
| 195 | |
| 196 | Many constructs given in this HOWTO assume Python 2.4 or later. Before that, |
| 197 | there was no :func:`sorted` builtin and :meth:`list.sort` took no keyword |
| 198 | arguments. Instead, all of the Py2.x versions supported a *cmp* parameter to |
| 199 | handle user specified comparison functions. |
| 200 | |
| 201 | In Py3.0, the *cmp* parameter was removed entirely (as part of a larger effort to |
| 202 | simplify and unify the language, eliminating the conflict between rich |
| 203 | comparisons and the :meth:`__cmp__` magic method). |
| 204 | |
| 205 | In Py2.x, sort allowed an optional function which can be called for doing the |
| 206 | comparisons. That function should take two arguments to be compared and then |
| 207 | return a negative value for less-than, return zero if they are equal, or return |
| 208 | a positive value for greater-than. For example, we can do: |
| 209 | |
| 210 | >>> def numeric_compare(x, y): |
Zachary Ware | 378a1d7 | 2016-08-09 16:47:04 -0500 | [diff] [blame] | 211 | ... return x - y |
| 212 | >>> sorted([5, 2, 4, 1, 3], cmp=numeric_compare) # doctest: +SKIP |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 213 | [1, 2, 3, 4, 5] |
| 214 | |
| 215 | Or you can reverse the order of comparison with: |
| 216 | |
| 217 | >>> def reverse_numeric(x, y): |
Zachary Ware | 378a1d7 | 2016-08-09 16:47:04 -0500 | [diff] [blame] | 218 | ... return y - x |
| 219 | >>> sorted([5, 2, 4, 1, 3], cmp=reverse_numeric) # doctest: +SKIP |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 220 | [5, 4, 3, 2, 1] |
| 221 | |
| 222 | When porting code from Python 2.x to 3.x, the situation can arise when you have |
| 223 | the user supplying a comparison function and you need to convert that to a key |
Raymond Hettinger | b436b6c | 2011-01-12 01:16:57 +0000 | [diff] [blame] | 224 | function. The following wrapper makes that easy to do:: |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 225 | |
Raymond Hettinger | b436b6c | 2011-01-12 01:16:57 +0000 | [diff] [blame] | 226 | def cmp_to_key(mycmp): |
| 227 | 'Convert a cmp= function into a key= function' |
Ezio Melotti | af8838f | 2013-03-11 09:30:21 +0200 | [diff] [blame] | 228 | class K: |
Raymond Hettinger | b436b6c | 2011-01-12 01:16:57 +0000 | [diff] [blame] | 229 | def __init__(self, obj, *args): |
| 230 | self.obj = obj |
| 231 | def __lt__(self, other): |
| 232 | return mycmp(self.obj, other.obj) < 0 |
| 233 | def __gt__(self, other): |
| 234 | return mycmp(self.obj, other.obj) > 0 |
| 235 | def __eq__(self, other): |
| 236 | return mycmp(self.obj, other.obj) == 0 |
| 237 | def __le__(self, other): |
| 238 | return mycmp(self.obj, other.obj) <= 0 |
| 239 | def __ge__(self, other): |
| 240 | return mycmp(self.obj, other.obj) >= 0 |
| 241 | def __ne__(self, other): |
| 242 | return mycmp(self.obj, other.obj) != 0 |
| 243 | return K |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 244 | |
| 245 | To convert to a key function, just wrap the old comparison function: |
| 246 | |
Zachary Ware | 378a1d7 | 2016-08-09 16:47:04 -0500 | [diff] [blame] | 247 | .. testsetup:: |
| 248 | |
| 249 | from functools import cmp_to_key |
| 250 | |
| 251 | .. doctest:: |
| 252 | |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 253 | >>> sorted([5, 2, 4, 1, 3], key=cmp_to_key(reverse_numeric)) |
| 254 | [5, 4, 3, 2, 1] |
| 255 | |
| 256 | In Python 3.2, the :func:`functools.cmp_to_key` function was added to the |
Raymond Hettinger | 810cd34 | 2011-02-06 06:11:29 +0000 | [diff] [blame] | 257 | :mod:`functools` module in the standard library. |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 258 | |
| 259 | Odd and Ends |
| 260 | ============ |
| 261 | |
| 262 | * For locale aware sorting, use :func:`locale.strxfrm` for a key function or |
| 263 | :func:`locale.strcoll` for a comparison function. |
| 264 | |
Raymond Hettinger | 810cd34 | 2011-02-06 06:11:29 +0000 | [diff] [blame] | 265 | * The *reverse* parameter still maintains sort stability (so that records with |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 266 | equal keys retain the original order). Interestingly, that effect can be |
| 267 | simulated without the parameter by using the builtin :func:`reversed` function |
| 268 | twice: |
| 269 | |
| 270 | >>> data = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)] |
Raymond Hettinger | b9531bc | 2016-04-26 01:11:10 -0700 | [diff] [blame] | 271 | >>> standard_way = sorted(data, key=itemgetter(0), reverse=True) |
| 272 | >>> double_reversed = list(reversed(sorted(reversed(data), key=itemgetter(0)))) |
| 273 | >>> assert standard_way == double_reversed |
| 274 | >>> standard_way |
| 275 | [('red', 1), ('red', 2), ('blue', 1), ('blue', 2)] |
Raymond Hettinger | 53c58f8 | 2010-09-01 09:15:42 +0000 | [diff] [blame] | 276 | |
| 277 | * The sort routines are guaranteed to use :meth:`__lt__` when making comparisons |
| 278 | between two objects. So, it is easy to add a standard sort order to a class by |
| 279 | defining an :meth:`__lt__` method:: |
| 280 | |
| 281 | >>> Student.__lt__ = lambda self, other: self.age < other.age |
| 282 | >>> sorted(student_objects) |
| 283 | [('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)] |
| 284 | |
| 285 | * Key functions need not depend directly on the objects being sorted. A key |
| 286 | function can also access external resources. For instance, if the student grades |
| 287 | are stored in a dictionary, they can be used to sort a separate list of student |
| 288 | names: |
| 289 | |
| 290 | >>> students = ['dave', 'john', 'jane'] |
| 291 | >>> newgrades = {'john': 'F', 'jane':'A', 'dave': 'C'} |
| 292 | >>> sorted(students, key=newgrades.__getitem__) |
| 293 | ['jane', 'dave', 'john'] |