# -*- coding: Latin-1 -*-

"""Heap queue algorithm (a.k.a. priority queue).

Heaps are arrays for which a[k] <= a[2*k+1] and a[k] <= a[2*k+2] for
all k, counting elements from 0.  For the sake of comparison,
non-existing elements are considered to be infinite.  The interesting
property of a heap is that a[0] is always its smallest element.

Usage:

heap = []            # creates an empty heap
heappush(heap, item) # pushes a new item on the heap
item = heappop(heap) # pops the smallest item from the heap
item = heap[0]       # smallest item on the heap without popping it
heapify(x)           # transforms list into a heap, in-place, in linear time

Our API differs from textbook heap algorithms as follows:

- We use 0-based indexing.  This makes the relationship between the
  index for a node and the indexes for its children slightly less
  obvious, but is more suitable since Python uses 0-based indexing.

- Our heappop() method returns the smallest item, not the largest.

These two make it possible to view the heap as a regular Python list
without surprises: heap[0] is the smallest item, and heap.sort()
maintains the heap invariant!
"""

# Original code by Kevin O'Connor, augmented by Tim Peters

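# A direct check of the invariant stated in the docstring above: an
# illustrative sketch only, not part of the module's API, and the helper
# name _check_invariant is hypothetical.  Children beyond the end of the
# array are the "non-existing elements considered to be infinite", so
# they are simply skipped.
def _check_invariant(heap):
    for k in xrange(len(heap)):
        for child in (2*k + 1, 2*k + 2):
            if child < len(heap) and heap[k] > heap[child]:
                return False
    return True
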
__about__ = """Heap queues

[explanation by François Pinard]

Heaps are arrays for which a[k] <= a[2*k+1] and a[k] <= a[2*k+2] for
all k, counting elements from 0.  For the sake of comparison,
non-existing elements are considered to be infinite.  The interesting
property of a heap is that a[0] is always its smallest element.

The strange invariant above is meant to be an efficient memory
representation for a tournament.  The numbers below are `k', not a[k]:

                                   0

                  1                                 2

          3               4                5               6

      7       8       9       10      11      12      13      14

    15 16   17 18   19 20   21 22   23 24   25 26   27 28   29 30

In the tree above, each cell `k' is topping `2*k+1' and `2*k+2'.  In
a usual binary tournament we see in sports, each cell is the winner
over the two cells it tops, and we can trace the winner down the tree
to see all opponents s/he had.  However, in many computer applications
of such tournaments, we do not need to trace the history of a winner.
To be more memory efficient, when a winner is promoted, we try to
replace it by something else at a lower level, and the rule becomes
that a cell and the two cells it tops contain three different items,
but the top cell "wins" over the two topped cells.

If this heap invariant is protected at all times, index 0 is clearly
the overall winner.  The simplest algorithmic way to remove it and
find the "next" winner is to move some loser (let's say cell 30 in the
diagram above) into the 0 position, and then percolate this new 0 down
the tree, exchanging values, until the invariant is re-established.
This is clearly logarithmic in the total number of items in the tree.
By iterating over all items, you get an O(n ln n) sort.

A nice feature of this sort is that you can efficiently insert new
items while the sort is going on, provided that the inserted items are
not "better" than the last 0'th element you extracted.  This is
especially useful in simulation contexts, where the tree holds all
incoming events, and the "win" condition means the smallest scheduled
time.  When an event schedules other events for execution, they are
scheduled into the future, so they can easily go into the heap.  So, a
heap is a good structure for implementing schedulers (this is what I
used for my MIDI sequencer :-).

Various structures for implementing schedulers have been extensively
studied, and heaps are good for this, as they are reasonably speedy,
the speed is almost constant, and the worst case is not much different
from the average case.  However, there are other representations which
are more efficient overall, yet the worst cases might be terrible.

Heaps are also very useful in big disk sorts.  You most probably all
know that a big sort implies producing "runs" (which are pre-sorted
sequences, whose size is usually related to the amount of CPU memory),
followed by merging passes for these runs, which merging is often
very cleverly organised[1].  It is very important that the initial
sort produces the longest runs possible.  Tournaments are a good way
to achieve that.  If, using all the memory available to hold a
tournament, you replace and percolate items that happen to fit the
current run, you'll produce runs which are twice the size of the
memory for random input, and much better for fuzzily ordered input.

Moreover, if you output the 0'th item on disk and get an input which
may not fit in the current tournament (because the value "wins" over
the last output value), it cannot fit in the heap, so the size of the
heap decreases.  The freed memory could be cleverly reused immediately
for progressively building a second heap, which grows at exactly the
same rate the first heap is melting.  When the first heap completely
vanishes, you switch heaps and start a new run.  Clever and quite
effective!

In a word, heaps are useful memory structures to know.  I use them in
a few applications, and I think it is good to keep a `heap' module
around. :-)

--------------------
[1] The disk balancing algorithms which are current, nowadays, are
more annoying than clever, and this is a consequence of the seeking
capabilities of the disks.  On devices which cannot seek, like big
tape drives, the story was quite different, and one had to be very
clever to ensure (far in advance) that each tape movement will be the
most effective possible (that is, will best participate in
"progressing" the merge).  Some tapes were even able to read
backwards, and this was also used to avoid the rewinding time.
Believe me, real good tape sorts were quite spectacular to watch!
From all times, sorting has always been a Great Art! :-)
"""

def heappush(heap, item):
    """Push item onto heap, maintaining the heap invariant."""
    heap.append(item)
    _siftdown(heap, 0, len(heap)-1)

def heappop(heap):
    """Pop the smallest item off the heap, maintaining the heap invariant."""
    lastelt = heap.pop()    # raises appropriate IndexError if heap is empty
    if heap:
        returnitem = heap[0]
        heap[0] = lastelt
        _siftup(heap, 0)
    else:
        returnitem = lastelt
    return returnitem

def heapify(x):
    """Transform list into a heap, in-place, in O(len(x)) time."""
    n = len(x)
    # Transform bottom-up.  The largest index there's any point to looking at
    # is the largest with a child index in-range, so must have 2*i + 1 < n,
    # or i < (n-1)/2.  If n is even = 2*j, this is (2*j-1)/2 = j-1/2 so
    # j-1 is the largest, which is n//2 - 1.  If n is odd = 2*j+1, this is
    # (2*j+1-1)/2 = j so j-1 is the largest, and that's again n//2-1.
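    # For example, with n = 10 the last parent is index 4 (its only in-range
    # child is index 9); with n = 11 it is also index 4 (children 9 and 10).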
    for i in xrange(n//2 - 1, -1, -1):
        _siftup(x, i)

# 'heap' is a heap at all indices >= startpos, except possibly for pos.  pos
# is the index of a leaf with a possibly out-of-order value.  Restore the
# heap invariant.
def _siftdown(heap, startpos, pos):
    newitem = heap[pos]
    # Follow the path to the root, moving parents down until finding a place
    # newitem fits.
    while pos > startpos:
        parentpos = (pos - 1) >> 1
        parent = heap[parentpos]
        if parent <= newitem:
            break
        heap[pos] = parent
        pos = parentpos
    heap[pos] = newitem

# The child indices of heap index pos are already heaps, and we want to make
# a heap at index pos too.  We do this by bubbling the smaller child of
# pos up (and so on with that child's children, etc) until hitting a leaf,
# then using _siftdown to move the oddball originally at index pos into place.
#
# We *could* break out of the loop as soon as we find a pos where newitem <=
# both its children, but it turns out that's not a good idea, even though
# many books write the algorithm that way.  During a heap pop, the last array
# element is sifted in, and that tends to be large, so that comparing it
# against values starting from the root usually doesn't pay (= usually doesn't
# get us out of the loop early).  See Knuth, Volume 3, where this is
# explained and quantified in an exercise.
#
# Cutting the number of comparisons is important, since these routines have no
# way to extract "the priority" from an array element, so that intelligence
# is likely to be hiding in custom __cmp__ methods, or in array elements
# storing (priority, record) tuples (see the sketch after _siftup below).
# Comparisons are thus potentially expensive.
#
# On random arrays of length 1000, making this change cut the number of
# comparisons made by heapify() a little, and those made by exhaustive
# heappop() a lot, in accord with theory.  Here are typical results from 3
# runs (3 just to demonstrate how small the variance is):
#
# Compares needed by heapify     Compares needed by 1000 heappops
# --------------------------     --------------------------------
#         1837 cut to 1663               14996 cut to 8680
#         1855 cut to 1659               14966 cut to 8678
#         1847 cut to 1660               15024 cut to 8703
#
# Building the heap by using heappush() 1000 times instead required
# 2198, 2148, and 2219 compares:  heapify() is more efficient, when
# you can use it.
#
# The total compares needed by list.sort() on the same lists were 8627,
# 8627, and 8632 (this should be compared to the sum of heapify() and
# heappop() compares):  list.sort() is (unsurprisingly!) more efficient
# for sorting.

def _siftup(heap, pos):
    endpos = len(heap)
    startpos = pos
    newitem = heap[pos]
    # Bubble up the smaller child until hitting a leaf.
    childpos = 2*pos + 1    # leftmost child position
    while childpos < endpos:
        # Set childpos to index of smaller child.
        rightpos = childpos + 1
        if rightpos < endpos and heap[rightpos] < heap[childpos]:
            childpos = rightpos
        # Move the smaller child up.
        heap[pos] = heap[childpos]
        pos = childpos
        childpos = 2*pos + 1
    # The leaf at pos is empty now.  Put newitem there, and bubble it up
    # to its final resting place (by sifting its parents down).
    heap[pos] = newitem
    _siftdown(heap, startpos, pos)
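
# An illustrative sketch of the (priority, record) idiom mentioned above.
# Tuples compare lexicographically, so the priority in slot 0 drives every
# comparison and the record just tags along.  The helper name is
# hypothetical, for exposition only, and not part of the module's API.
def _priority_queue_example():
    events = []
    heappush(events, (5, "write the report"))
    heappush(events, (1, "boot the machine"))
    heappush(events, (3, "make coffee"))
    while events:
        priority, record = heappop(events)
        print priority, record    # pops in priority order: 1, 3, 5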

if __name__ == "__main__":
    # Simple sanity test
    heap = []
    data = [1, 3, 5, 7, 9, 2, 4, 6, 8, 0]
    for item in data:
        heappush(heap, item)
    sort = []
    while heap:
        sort.append(heappop(heap))
    print sort
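
    # A second sanity test (an illustrative addition): build the heap with
    # heapify() instead of repeated heappush() calls; the pops should come
    # out in the same sorted order.
    heap = data[:]
    heapify(heap)
    sort = []
    while heap:
        sort.append(heappop(heap))
    print sort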