blob: f6b14ca6d91d92adcef07102c80fc90e23179904 [file] [log] [blame]
Georg Brandl116aa622007-08-15 14:28:22 +00001:mod:`heapq` --- Heap queue algorithm
2=====================================
3
4.. module:: heapq
5 :synopsis: Heap queue algorithm (a.k.a. priority queue).
6.. moduleauthor:: Kevin O'Connor
7.. sectionauthor:: Guido van Rossum <guido@python.org>
8.. sectionauthor:: François Pinard
Raymond Hettinger0e833c32010-08-07 23:31:27 +00009.. sectionauthor:: Raymond Hettinger
Georg Brandl116aa622007-08-15 14:28:22 +000010
Georg Brandl116aa622007-08-15 14:28:22 +000011This module provides an implementation of the heap queue algorithm, also known
12as the priority queue algorithm.
13
Éric Araujo6e6cb8e2010-11-16 19:13:50 +000014.. seealso::
15
16 Latest version of the :source:`heapq Python source code
17 <Lib/heapq.py>`
18
Georg Brandl57410c12010-11-23 08:37:54 +000019Heaps are binary trees for which every parent node has a value less than or
20equal to any of its children. This implementation uses arrays for which
21``heap[k] <= heap[2*k+1]`` and ``heap[k] <= heap[2*k+2]`` for all *k*, counting
22elements from zero. For the sake of comparison, non-existing elements are
23considered to be infinite. The interesting property of a heap is that its
24smallest element is always the root, ``heap[0]``.
Georg Brandl116aa622007-08-15 14:28:22 +000025
26The API below differs from textbook heap algorithms in two aspects: (a) We use
27zero-based indexing. This makes the relationship between the index for a node
28and the indexes for its children slightly less obvious, but is more suitable
29since Python uses zero-based indexing. (b) Our pop method returns the smallest
30item, not the largest (called a "min heap" in textbooks; a "max heap" is more
31common in texts because of its suitability for in-place sorting).
32
33These two make it possible to view the heap as a regular Python list without
34surprises: ``heap[0]`` is the smallest item, and ``heap.sort()`` maintains the
35heap invariant!
36
37To create a heap, use a list initialized to ``[]``, or you can transform a
38populated list into a heap via function :func:`heapify`.
39
40The following functions are provided:
41
42
43.. function:: heappush(heap, item)
44
45 Push the value *item* onto the *heap*, maintaining the heap invariant.
46
47
48.. function:: heappop(heap)
49
50 Pop and return the smallest item from the *heap*, maintaining the heap
51 invariant. If the heap is empty, :exc:`IndexError` is raised.
52
Benjamin Peterson35e8c462008-04-24 02:34:53 +000053
Christian Heimesdd15f6c2008-03-16 00:07:10 +000054.. function:: heappushpop(heap, item)
55
56 Push *item* on the heap, then pop and return the smallest item from the
57 *heap*. The combined action runs more efficiently than :func:`heappush`
58 followed by a separate call to :func:`heappop`.
59
Georg Brandl116aa622007-08-15 14:28:22 +000060
61.. function:: heapify(x)
62
63 Transform list *x* into a heap, in-place, in linear time.
64
65
66.. function:: heapreplace(heap, item)
67
68 Pop and return the smallest item from the *heap*, and also push the new *item*.
69 The heap size doesn't change. If the heap is empty, :exc:`IndexError` is raised.
Georg Brandl116aa622007-08-15 14:28:22 +000070
Raymond Hettinger6f80b4c2010-09-01 21:27:31 +000071 This one step operation is more efficient than a :func:`heappop` followed by
72 :func:`heappush` and can be more appropriate when using a fixed-size heap.
73 The pop/push combination always returns an element from the heap and replaces
74 it with *item*.
Georg Brandl116aa622007-08-15 14:28:22 +000075
Raymond Hettinger6f80b4c2010-09-01 21:27:31 +000076 The value returned may be larger than the *item* added. If that isn't
77 desired, consider using :func:`heappushpop` instead. Its push/pop
78 combination returns the smaller of the two values, leaving the larger value
79 on the heap.
Georg Brandlaf265f42008-12-07 15:06:20 +000080
Georg Brandl48310cd2009-01-03 21:18:54 +000081
Georg Brandl116aa622007-08-15 14:28:22 +000082The module also offers three general purpose functions based on heaps.
83
84
85.. function:: merge(*iterables)
86
87 Merge multiple sorted inputs into a single sorted output (for example, merge
Georg Brandl9afde1c2007-11-01 20:32:30 +000088 timestamped entries from multiple log files). Returns an :term:`iterator`
Benjamin Peterson206e3072008-10-19 14:07:49 +000089 over the sorted values.
Georg Brandl116aa622007-08-15 14:28:22 +000090
91 Similar to ``sorted(itertools.chain(*iterables))`` but returns an iterable, does
92 not pull the data into memory all at once, and assumes that each of the input
93 streams is already sorted (smallest to largest).
94
Georg Brandl116aa622007-08-15 14:28:22 +000095
Georg Brandl036490d2009-05-17 13:00:36 +000096.. function:: nlargest(n, iterable, key=None)
Georg Brandl116aa622007-08-15 14:28:22 +000097
98 Return a list with the *n* largest elements from the dataset defined by
99 *iterable*. *key*, if provided, specifies a function of one argument that is
100 used to extract a comparison key from each element in the iterable:
101 ``key=str.lower`` Equivalent to: ``sorted(iterable, key=key,
102 reverse=True)[:n]``
103
Georg Brandl116aa622007-08-15 14:28:22 +0000104
Georg Brandl036490d2009-05-17 13:00:36 +0000105.. function:: nsmallest(n, iterable, key=None)
Georg Brandl116aa622007-08-15 14:28:22 +0000106
107 Return a list with the *n* smallest elements from the dataset defined by
108 *iterable*. *key*, if provided, specifies a function of one argument that is
109 used to extract a comparison key from each element in the iterable:
110 ``key=str.lower`` Equivalent to: ``sorted(iterable, key=key)[:n]``
111
Georg Brandl116aa622007-08-15 14:28:22 +0000112
113The latter two functions perform best for smaller values of *n*. For larger
114values, it is more efficient to use the :func:`sorted` function. Also, when
Georg Brandl22b34312009-07-26 14:54:51 +0000115``n==1``, it is more efficient to use the built-in :func:`min` and :func:`max`
Georg Brandl116aa622007-08-15 14:28:22 +0000116functions.
117
118
Raymond Hettinger6f80b4c2010-09-01 21:27:31 +0000119Basic Examples
120--------------
121
122A `heapsort <http://en.wikipedia.org/wiki/Heapsort>`_ can be implemented by
123pushing all values onto a heap and then popping off the smallest values one at a
124time::
125
126 >>> def heapsort(iterable):
127 ... 'Equivalent to sorted(iterable)'
128 ... h = []
129 ... for value in iterable:
130 ... heappush(h, value)
131 ... return [heappop(h) for i in range(len(h))]
132 ...
133 >>> heapsort([1, 3, 5, 7, 9, 2, 4, 6, 8, 0])
134 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
135
136Heap elements can be tuples. This is useful for assigning comparison values
137(such as task priorities) alongside the main record being tracked::
138
139 >>> h = []
140 >>> heappush(h, (5, 'write code'))
141 >>> heappush(h, (7, 'release product'))
142 >>> heappush(h, (1, 'write spec'))
143 >>> heappush(h, (3, 'create tests'))
144 >>> heappop(h)
145 (1, 'write spec')
146
147
Raymond Hettinger0e833c32010-08-07 23:31:27 +0000148Priority Queue Implementation Notes
149-----------------------------------
150
151A `priority queue <http://en.wikipedia.org/wiki/Priority_queue>`_ is common use
152for a heap, and it presents several implementation challenges:
153
154* Sort stability: how do you get two tasks with equal priorities to be returned
155 in the order they were originally added?
156
157* Tuple comparison breaks for (priority, task) pairs if the priorities are equal
158 and the tasks do not have a default comparison order.
159
Raymond Hettinger648e7252010-08-07 23:37:37 +0000160* If the priority of a task changes, how do you move it to a new position in
Raymond Hettinger0e833c32010-08-07 23:31:27 +0000161 the heap?
162
163* Or if a pending task needs to be deleted, how do you find it and remove it
164 from the queue?
165
166A solution to the first two challenges is to store entries as 3-element list
167including the priority, an entry count, and the task. The entry count serves as
168a tie-breaker so that two tasks with the same priority are returned in the order
169they were added. And since no two entry counts are the same, the tuple
170comparison will never attempt to directly compare two tasks.
171
172The remaining challenges revolve around finding a pending task and making
173changes to its priority or removing it entirely. Finding a task can be done
174with a dictionary pointing to an entry in the queue.
175
176Removing the entry or changing its priority is more difficult because it would
177break the heap structure invariants. So, a possible solution is to mark an
178entry as invalid and optionally add a new entry with the revised priority::
179
180 pq = [] # the priority queue list
181 counter = itertools.count(1) # unique sequence count
182 task_finder = {} # mapping of tasks to entries
183 INVALID = 0 # mark an entry as deleted
184
185 def add_task(priority, task, count=None):
186 if count is None:
187 count = next(counter)
188 entry = [priority, count, task]
189 task_finder[task] = entry
190 heappush(pq, entry)
191
192 def get_top_priority():
193 while True:
194 priority, count, task = heappop(pq)
195 del task_finder[task]
196 if count is not INVALID:
197 return task
198
199 def delete_task(task):
200 entry = task_finder[task]
201 entry[1] = INVALID
202
203 def reprioritize(priority, task):
204 entry = task_finder[task]
205 add_task(priority, task, entry[1])
206 entry[1] = INVALID
207
208
Georg Brandl116aa622007-08-15 14:28:22 +0000209Theory
210------
211
Georg Brandl116aa622007-08-15 14:28:22 +0000212Heaps are arrays for which ``a[k] <= a[2*k+1]`` and ``a[k] <= a[2*k+2]`` for all
213*k*, counting elements from 0. For the sake of comparison, non-existing
214elements are considered to be infinite. The interesting property of a heap is
215that ``a[0]`` is always its smallest element.
216
217The strange invariant above is meant to be an efficient memory representation
218for a tournament. The numbers below are *k*, not ``a[k]``::
219
220 0
221
222 1 2
223
224 3 4 5 6
225
226 7 8 9 10 11 12 13 14
227
228 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
229
230In the tree above, each cell *k* is topping ``2*k+1`` and ``2*k+2``. In an usual
231binary tournament we see in sports, each cell is the winner over the two cells
232it tops, and we can trace the winner down the tree to see all opponents s/he
233had. However, in many computer applications of such tournaments, we do not need
234to trace the history of a winner. To be more memory efficient, when a winner is
235promoted, we try to replace it by something else at a lower level, and the rule
236becomes that a cell and the two cells it tops contain three different items, but
237the top cell "wins" over the two topped cells.
238
239If this heap invariant is protected at all time, index 0 is clearly the overall
240winner. The simplest algorithmic way to remove it and find the "next" winner is
241to move some loser (let's say cell 30 in the diagram above) into the 0 position,
242and then percolate this new 0 down the tree, exchanging values, until the
243invariant is re-established. This is clearly logarithmic on the total number of
244items in the tree. By iterating over all items, you get an O(n log n) sort.
245
246A nice feature of this sort is that you can efficiently insert new items while
247the sort is going on, provided that the inserted items are not "better" than the
248last 0'th element you extracted. This is especially useful in simulation
249contexts, where the tree holds all incoming events, and the "win" condition
250means the smallest scheduled time. When an event schedule other events for
251execution, they are scheduled into the future, so they can easily go into the
252heap. So, a heap is a good structure for implementing schedulers (this is what
253I used for my MIDI sequencer :-).
254
255Various structures for implementing schedulers have been extensively studied,
256and heaps are good for this, as they are reasonably speedy, the speed is almost
257constant, and the worst case is not much different than the average case.
258However, there are other representations which are more efficient overall, yet
259the worst cases might be terrible.
260
261Heaps are also very useful in big disk sorts. You most probably all know that a
262big sort implies producing "runs" (which are pre-sorted sequences, which size is
263usually related to the amount of CPU memory), followed by a merging passes for
264these runs, which merging is often very cleverly organised [#]_. It is very
265important that the initial sort produces the longest runs possible. Tournaments
266are a good way to that. If, using all the memory available to hold a
267tournament, you replace and percolate items that happen to fit the current run,
268you'll produce runs which are twice the size of the memory for random input, and
269much better for input fuzzily ordered.
270
271Moreover, if you output the 0'th item on disk and get an input which may not fit
272in the current tournament (because the value "wins" over the last output value),
273it cannot fit in the heap, so the size of the heap decreases. The freed memory
274could be cleverly reused immediately for progressively building a second heap,
275which grows at exactly the same rate the first heap is melting. When the first
276heap completely vanishes, you switch heaps and start a new run. Clever and
277quite effective!
278
279In a word, heaps are useful memory structures to know. I use them in a few
280applications, and I think it is good to keep a 'heap' module around. :-)
281
282.. rubric:: Footnotes
283
284.. [#] The disk balancing algorithms which are current, nowadays, are more annoying
285 than clever, and this is a consequence of the seeking capabilities of the disks.
286 On devices which cannot seek, like big tape drives, the story was quite
287 different, and one had to be very clever to ensure (far in advance) that each
288 tape movement will be the most effective possible (that is, will best
289 participate at "progressing" the merge). Some tapes were even able to read
290 backwards, and this was also used to avoid the rewinding time. Believe me, real
291 good tape sorts were quite spectacular to watch! From all times, sorting has
292 always been a Great Art! :-)
293