#! /usr/bin/env python

"""
Module difflib -- helpers for computing deltas between objects.

Function get_close_matches(word, possibilities, n=3, cutoff=0.6):
    Use SequenceMatcher to return list of the best "good enough" matches.

Function ndiff(a, b):
    Return a delta: the difference between `a` and `b` (lists of strings).

Function restore(delta, which):
    Return one of the two sequences that generated an ndiff delta.

Class SequenceMatcher:
    A flexible class for comparing pairs of sequences of any type.

Class Differ:
    For producing human-readable deltas from sequences of lines of text.
"""

__all__ = ['get_close_matches', 'ndiff', 'restore', 'SequenceMatcher',
           'Differ']
class SequenceMatcher:

    """
    SequenceMatcher is a flexible class for comparing pairs of sequences of
    any type, so long as the sequence elements are hashable.  The basic
    algorithm predates, and is a little fancier than, an algorithm
    published in the late 1980's by Ratcliff and Obershelp under the
    hyperbolic name "gestalt pattern matching".  The basic idea is to find
    the longest contiguous matching subsequence that contains no "junk"
    elements (R-O doesn't address junk).  The same idea is then applied
    recursively to the pieces of the sequences to the left and to the right
    of the matching subsequence.  This does not yield minimal edit
    sequences, but does tend to yield matches that "look right" to people.

    SequenceMatcher tries to compute a "human-friendly diff" between two
    sequences.  Unlike e.g. UNIX(tm) diff, the fundamental notion is the
    longest *contiguous* & junk-free matching subsequence.  That's what
    catches people's eyes.  The Windows(tm) windiff has another interesting
    notion, pairing up elements that appear uniquely in each sequence.
    That, and the method here, appear to yield more intuitive difference
    reports than does diff.  This method appears to be the least vulnerable
    to synching up on blocks of "junk lines", though (like blank lines in
    ordinary text files, or maybe "<P>" lines in HTML files).  That may be
    because this is the only method of the 3 that has a *concept* of
    "junk" <wink>.

    Example, comparing two strings, and considering blanks to be "junk":

    >>> s = SequenceMatcher(lambda x: x == " ",
    ...                     "private Thread currentThread;",
    ...                     "private volatile Thread currentThread;")
    >>>

    .ratio() returns a float in [0, 1], measuring the "similarity" of the
    sequences.  As a rule of thumb, a .ratio() value over 0.6 means the
    sequences are close matches:

    >>> print round(s.ratio(), 3)
    0.866
    >>>

    If you're only interested in where the sequences match,
    .get_matching_blocks() is handy:

    >>> for block in s.get_matching_blocks():
    ...     print "a[%d] and b[%d] match for %d elements" % block
    a[0] and b[0] match for 8 elements
    a[8] and b[17] match for 6 elements
    a[14] and b[23] match for 15 elements
    a[29] and b[38] match for 0 elements

    Note that the last tuple returned by .get_matching_blocks() is always a
    dummy, (len(a), len(b), 0), and this is the only case in which the last
    tuple element (number of elements matched) is 0.

    If you want to know how to change the first sequence into the second,
    use .get_opcodes():

    >>> for opcode in s.get_opcodes():
    ...     print "%6s a[%d:%d] b[%d:%d]" % opcode
     equal a[0:8] b[0:8]
    insert a[8:8] b[8:17]
     equal a[8:14] b[17:23]
     equal a[14:29] b[23:38]

    See the Differ class for a fancy human-friendly file differencer, which
    uses SequenceMatcher both to compare sequences of lines, and to compare
    sequences of characters within similar (near-matching) lines.

    See also function get_close_matches() in this module, which shows how
    simple code building on SequenceMatcher can be used to do useful work.

    Timing:  Basic R-O is cubic time worst case and quadratic time expected
    case.  SequenceMatcher is quadratic time for the worst case and has
    expected-case behavior dependent in a complicated way on how many
    elements the sequences have in common; best case time is linear.

    Methods:

    __init__(isjunk=None, a='', b='')
        Construct a SequenceMatcher.

    set_seqs(a, b)
        Set the two sequences to be compared.

    set_seq1(a)
        Set the first sequence to be compared.

    set_seq2(b)
        Set the second sequence to be compared.

    find_longest_match(alo, ahi, blo, bhi)
        Find longest matching block in a[alo:ahi] and b[blo:bhi].

    get_matching_blocks()
        Return list of triples describing matching subsequences.

    get_opcodes()
        Return list of 5-tuples describing how to turn a into b.

    ratio()
        Return a measure of the sequences' similarity (float in [0,1]).

    quick_ratio()
        Return an upper bound on .ratio() relatively quickly.

    real_quick_ratio()
        Return an upper bound on ratio() very quickly.
    """

    def __init__(self, isjunk=None, a='', b=''):
        """Construct a SequenceMatcher.

        Optional arg isjunk is None (the default), or a one-argument
        function that takes a sequence element and returns true iff the
        element is junk.  None is equivalent to passing "lambda x: 0", i.e.
        no elements are considered to be junk.  For example, pass
            lambda x: x in " \\t"
        if you're comparing lines as sequences of characters, and don't
        want to synch up on blanks or hard tabs.

        Optional arg a is the first of two sequences to be compared.  By
        default, an empty string.  The elements of a must be hashable.  See
        also .set_seqs() and .set_seq1().

        Optional arg b is the second of two sequences to be compared.  By
        default, an empty string.  The elements of b must be hashable.  See
        also .set_seqs() and .set_seq2().
        """

        # Members:
        # a
        #      first sequence
        # b
        #      second sequence; differences are computed as "what do
        #      we need to do to 'a' to change it into 'b'?"
        # b2j
        #      for x in b, b2j[x] is a list of the indices (into b)
        #      at which x appears; junk elements do not appear
        # b2jhas
        #      b2j.has_key
        # fullbcount
        #      for x in b, fullbcount[x] == the number of times x
        #      appears in b; only materialized if really needed (used
        #      only for computing quick_ratio())
        # matching_blocks
        #      a list of (i, j, k) triples, where a[i:i+k] == b[j:j+k];
        #      ascending & non-overlapping in i and in j; terminated by
        #      a dummy (len(a), len(b), 0) sentinel
        # opcodes
        #      a list of (tag, i1, i2, j1, j2) tuples, where tag is
        #      one of
        #          'replace'   a[i1:i2] should be replaced by b[j1:j2]
        #          'delete'    a[i1:i2] should be deleted
        #          'insert'    b[j1:j2] should be inserted
        #          'equal'     a[i1:i2] == b[j1:j2]
        # isjunk
        #      a user-supplied function taking a sequence element and
        #      returning true iff the element is "junk" -- this has
        #      subtle but helpful effects on the algorithm, which I'll
        #      get around to writing up someday <0.9 wink>.
        #      DON'T USE!  Only __chain_b uses this.  Use isbjunk.
        # isbjunk
        #      for x in b, isbjunk(x) == isjunk(x) but much faster;
        #      it's really the has_key method of a hidden dict.
        #      DOES NOT WORK for x in a!

        self.isjunk = isjunk
        self.a = self.b = None
        self.set_seqs(a, b)

    def set_seqs(self, a, b):
        """Set the two sequences to be compared.

        >>> s = SequenceMatcher()
        >>> s.set_seqs("abcd", "bcde")
        >>> s.ratio()
        0.75
        """

        self.set_seq1(a)
        self.set_seq2(b)

    def set_seq1(self, a):
        """Set the first sequence to be compared.

        The second sequence to be compared is not changed.

        >>> s = SequenceMatcher(None, "abcd", "bcde")
        >>> s.ratio()
        0.75
        >>> s.set_seq1("bcde")
        >>> s.ratio()
        1.0
        >>>

        SequenceMatcher computes and caches detailed information about the
        second sequence, so if you want to compare one sequence S against
        many sequences, use .set_seq2(S) once and call .set_seq1(x)
        repeatedly for each of the other sequences.

        See also set_seqs() and set_seq2().
        """

        if a is self.a:
            return
        self.a = a
        self.matching_blocks = self.opcodes = None

    def set_seq2(self, b):
        """Set the second sequence to be compared.

        The first sequence to be compared is not changed.

        >>> s = SequenceMatcher(None, "abcd", "bcde")
        >>> s.ratio()
        0.75
        >>> s.set_seq2("abcd")
        >>> s.ratio()
        1.0
        >>>

        SequenceMatcher computes and caches detailed information about the
        second sequence, so if you want to compare one sequence S against
        many sequences, use .set_seq2(S) once and call .set_seq1(x)
        repeatedly for each of the other sequences.

        See also set_seqs() and set_seq1().
        """

        if b is self.b:
            return
        self.b = b
        self.matching_blocks = self.opcodes = None
        self.fullbcount = None
        self.__chain_b()

    # For each element x in b, set b2j[x] to a list of the indices in
    # b where x appears; the indices are in increasing order; note that
    # the number of times x appears in b is len(b2j[x]) ...
    # when self.isjunk is defined, junk elements don't show up in this
    # map at all, which stops the central find_longest_match method
    # from starting any matching block at a junk element ...
    # also creates the fast isbjunk function ...
    # note that this is only called when b changes; so for cross-product
    # kinds of matches, it's best to call set_seq2 once, then set_seq1
    # repeatedly

    def __chain_b(self):
        # Because isjunk is a user-defined (not C) function, and we test
        # for junk a LOT, it's important to minimize the number of calls.
        # Before the tricks described here, __chain_b was by far the most
        # time-consuming routine in the whole module!  If anyone sees
        # Jim Roskind, thank him again for profile.py -- I never would
        # have guessed that.
        # The first trick is to build b2j ignoring the possibility
        # of junk.  I.e., we don't call isjunk at all yet.  Throwing
        # out the junk later is much cheaper than building b2j "right"
        # from the start.
        b = self.b
        self.b2j = b2j = {}
        self.b2jhas = b2jhas = b2j.has_key
        for i in xrange(len(b)):
            elt = b[i]
            if b2jhas(elt):
                b2j[elt].append(i)
            else:
                b2j[elt] = [i]

        # Now b2j.keys() contains elements uniquely, and especially when
        # the sequence is a string, that's usually a good deal smaller
        # than len(string).  The difference is the number of isjunk calls
        # saved.
        isjunk, junkdict = self.isjunk, {}
        if isjunk:
            for elt in b2j.keys():
                if isjunk(elt):
                    junkdict[elt] = 1   # value irrelevant; it's a set
                    del b2j[elt]

        # Now for x in b, isjunk(x) == junkdict.has_key(x), but the
        # latter is much faster.  Note too that while there may be a
        # lot of junk in the sequence, the number of *unique* junk
        # elements is probably small.  So the memory burden of keeping
        # this dict alive is likely trivial compared to the size of b2j.
        self.isbjunk = junkdict.has_key

    def find_longest_match(self, alo, ahi, blo, bhi):
        """Find longest matching block in a[alo:ahi] and b[blo:bhi].

        If isjunk is not defined:

        Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
            alo <= i <= i+k <= ahi
            blo <= j <= j+k <= bhi
        and for all (i',j',k') meeting those conditions,
            k >= k'
            i <= i'
            and if i == i', j <= j'

        In other words, of all maximal matching blocks, return one that
        starts earliest in a, and of all those maximal matching blocks that
        start earliest in a, return the one that starts earliest in b.

        >>> s = SequenceMatcher(None, " abcd", "abcd abcd")
        >>> s.find_longest_match(0, 5, 0, 9)
        (0, 4, 5)

        If isjunk is defined, first the longest matching block is
        determined as above, but with the additional restriction that no
        junk element appears in the block.  Then that block is extended as
        far as possible by matching (only) junk elements on both sides.  So
        the resulting block never matches on junk except as identical junk
        happens to be adjacent to an "interesting" match.

        Here's the same example as before, but considering blanks to be
        junk.  That prevents " abcd" from matching the " abcd" at the tail
        end of the second sequence directly.  Instead only the "abcd" can
        match, and matches the leftmost "abcd" in the second sequence:

        >>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
        >>> s.find_longest_match(0, 5, 0, 9)
        (1, 0, 4)

        If no blocks match, return (alo, blo, 0).

        >>> s = SequenceMatcher(None, "ab", "c")
        >>> s.find_longest_match(0, 2, 0, 1)
        (0, 0, 0)
        """

        # CAUTION:  stripping common prefix or suffix would be incorrect.
        # E.g.,
        #    ab
        #    acab
        # Longest matching block is "ab", but if common prefix is
        # stripped, it's "a" (tied with "b").  UNIX(tm) diff does so
        # strip, so ends up claiming that ab is changed to acab by
        # inserting "ca" in the middle.  That's minimal but unintuitive:
        # "it's obvious" that someone inserted "ac" at the front.
        # Windiff ends up at the same place as diff, but by pairing up
        # the unique 'b's and then matching the first two 'a's.

        a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.isbjunk
        besti, bestj, bestsize = alo, blo, 0
        # find longest junk-free match
        # during an iteration of the loop, j2len[j] = length of longest
        # junk-free match ending with a[i-1] and b[j]
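        # illustrative walk-through: for a = "xab", b = "yab", processing
        # a[1] == 'a' leaves j2len == {1: 1}; then for a[2] == 'b',
        # newj2len[2] == j2lenget(1, 0) + 1 == 2, and the best block
        # becomes (besti, bestj, bestsize) == (1, 1, 2)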
        j2len = {}
        nothing = []
        for i in xrange(alo, ahi):
            # look at all instances of a[i] in b; note that because
            # b2j has no junk keys, the loop is skipped if a[i] is junk
            j2lenget = j2len.get
            newj2len = {}
            for j in b2j.get(a[i], nothing):
                # a[i] matches b[j]
                if j < blo:
                    continue
                if j >= bhi:
                    break
                k = newj2len[j] = j2lenget(j-1, 0) + 1
                if k > bestsize:
                    besti, bestj, bestsize = i-k+1, j-k+1, k
            j2len = newj2len

        # Now that we have a wholly interesting match (albeit possibly
        # empty!), we may as well suck up the matching junk on each
        # side of it too.  Can't think of a good reason not to, and it
        # saves post-processing the (possibly considerable) expense of
        # figuring out what to do with it.  In the case of an empty
        # interesting match, this is clearly the right thing to do,
        # because no other kind of match is possible in the regions.
        while besti > alo and bestj > blo and \
              isbjunk(b[bestj-1]) and \
              a[besti-1] == b[bestj-1]:
            besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
        while besti+bestsize < ahi and bestj+bestsize < bhi and \
              isbjunk(b[bestj+bestsize]) and \
              a[besti+bestsize] == b[bestj+bestsize]:
            bestsize = bestsize + 1

        return besti, bestj, bestsize

    def get_matching_blocks(self):
        """Return list of triples describing matching subsequences.

        Each triple is of the form (i, j, n), and means that
        a[i:i+n] == b[j:j+n].  The triples are monotonically increasing in
        i and in j.

        The last triple is a dummy, (len(a), len(b), 0), and is the only
        triple with n==0.

        >>> s = SequenceMatcher(None, "abxcd", "abcd")
        >>> s.get_matching_blocks()
        [(0, 0, 2), (3, 2, 2), (5, 4, 0)]
        """

        if self.matching_blocks is not None:
            return self.matching_blocks
        self.matching_blocks = []
        la, lb = len(self.a), len(self.b)
        self.__helper(0, la, 0, lb, self.matching_blocks)
        self.matching_blocks.append( (la, lb, 0) )
        return self.matching_blocks

    # builds list of matching blocks covering a[alo:ahi] and
    # b[blo:bhi], appending them in increasing order to answer
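    # illustrative trace for the docstring's a = "abxcd", b = "abcd":
    # the top-level call finds (0, 0, 2) ("ab"); recursing to the right
    # of it (a[2:5] vs b[2:4]) finds (3, 2, 2) ("cd"); nothing is left
    # to recurse on, so answer is [(0, 0, 2), (3, 2, 2)] before the
    # (5, 4, 0) sentinel is appended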

    def __helper(self, alo, ahi, blo, bhi, answer):
        i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
        # a[alo:i] vs b[blo:j] unknown
        # a[i:i+k] same as b[j:j+k]
        # a[i+k:ahi] vs b[j+k:bhi] unknown
        if k:
            if alo < i and blo < j:
                self.__helper(alo, i, blo, j, answer)
            answer.append(x)
            if i+k < ahi and j+k < bhi:
                self.__helper(i+k, ahi, j+k, bhi, answer)

    def get_opcodes(self):
        """Return list of 5-tuples describing how to turn a into b.

        Each tuple is of the form (tag, i1, i2, j1, j2).  The first tuple
        has i1 == j1 == 0, and remaining tuples have i1 == the i2 from the
        tuple preceding it, and likewise for j1 == the previous j2.

        The tags are strings, with these meanings:

        'replace':  a[i1:i2] should be replaced by b[j1:j2]
        'delete':   a[i1:i2] should be deleted.
                    Note that j1==j2 in this case.
        'insert':   b[j1:j2] should be inserted at a[i1:i1].
                    Note that i1==i2 in this case.
        'equal':    a[i1:i2] == b[j1:j2]

        >>> a = "qabxcd"
        >>> b = "abycdf"
        >>> s = SequenceMatcher(None, a, b)
        >>> for tag, i1, i2, j1, j2 in s.get_opcodes():
        ...    print ("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
        ...           (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))
         delete a[0:1] (q) b[0:0] ()
          equal a[1:3] (ab) b[0:2] (ab)
        replace a[3:4] (x) b[2:3] (y)
          equal a[4:6] (cd) b[3:5] (cd)
         insert a[6:6] () b[5:6] (f)
        """

        if self.opcodes is not None:
            return self.opcodes
        i = j = 0
        self.opcodes = answer = []
        for ai, bj, size in self.get_matching_blocks():
            # invariant:  we've pumped out correct diffs to change
            # a[:i] into b[:j], and the next matching block is
            # a[ai:ai+size] == b[bj:bj+size].  So we need to pump
            # out a diff to change a[i:ai] into b[j:bj], pump out
            # the matching block, and move (i,j) beyond the match
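            # illustrative: with the docstring's a = "qabxcd", b = "abycdf",
            # the first matching block (1, 0, 2) arrives with (i, j) ==
            # (0, 0), so a 'delete' for a[0:1] is pumped out before the
            # 'equal' for a[1:3] == b[0:2], leaving (i, j) == (3, 2)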
            tag = ''
            if i < ai and j < bj:
                tag = 'replace'
            elif i < ai:
                tag = 'delete'
            elif j < bj:
                tag = 'insert'
            if tag:
                answer.append( (tag, i, ai, j, bj) )
            i, j = ai+size, bj+size
            # the list of matching blocks is terminated by a
            # sentinel with size 0
            if size:
                answer.append( ('equal', ai, i, bj, j) )
        return answer

    def ratio(self):
        """Return a measure of the sequences' similarity (float in [0,1]).

        Where T is the total number of elements in both sequences, and
        M is the number of matches, this is 2.0*M / T.
        Note that this is 1 if the sequences are identical, and 0 if
        they have nothing in common.

        .ratio() is expensive to compute if you haven't already computed
        .get_matching_blocks() or .get_opcodes(), in which case you may
        want to try .quick_ratio() or .real_quick_ratio() first to get an
        upper bound.

        >>> s = SequenceMatcher(None, "abcd", "bcde")
        >>> s.ratio()
        0.75
        >>> s.quick_ratio()
        0.75
        >>> s.real_quick_ratio()
        1.0
        """

        matches = reduce(lambda sum, triple: sum + triple[-1],
                         self.get_matching_blocks(), 0)
        return 2.0 * matches / (len(self.a) + len(self.b))

    def quick_ratio(self):
        """Return an upper bound on ratio() relatively quickly.

        This isn't defined beyond that it is an upper bound on .ratio(), and
        is faster to compute.
        """

        # viewing a and b as multisets, set matches to the cardinality
        # of their intersection; this counts the number of matches
        # without regard to order, so is clearly an upper bound
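        # illustrative: for a = "abbc", b = "bcd" the multiset
        # intersection is {'b': 1, 'c': 1}, so matches == 2 and the
        # bound is 2.0 * 2 / (4 + 3), about 0.57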
        if self.fullbcount is None:
            self.fullbcount = fullbcount = {}
            for elt in self.b:
                fullbcount[elt] = fullbcount.get(elt, 0) + 1
        fullbcount = self.fullbcount
        # avail[x] is the number of times x appears in 'b' less the
        # number of times we've seen it in 'a' so far ... kinda
        avail = {}
        availhas, matches = avail.has_key, 0
        for elt in self.a:
            if availhas(elt):
                numb = avail[elt]
            else:
                numb = fullbcount.get(elt, 0)
            avail[elt] = numb - 1
            if numb > 0:
                matches = matches + 1
        return 2.0 * matches / (len(self.a) + len(self.b))

    def real_quick_ratio(self):
        """Return an upper bound on ratio() very quickly.

        This isn't defined beyond that it is an upper bound on .ratio(), and
        is faster to compute than either .ratio() or .quick_ratio().
        """

        la, lb = len(self.a), len(self.b)
        # can't have more matches than the number of elements in the
        # shorter sequence
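        # (e.g. la == 4, lb == 2 can match at most 2 elements, for a
        # bound of 2.0 * 2 / 6)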
        return 2.0 * min(la, lb) / (la + lb)

def get_close_matches(word, possibilities, n=3, cutoff=0.6):
    """Use SequenceMatcher to return list of the best "good enough" matches.

    word is a sequence for which close matches are desired (typically a
    string).

    possibilities is a list of sequences against which to match word
    (typically a list of strings).

    Optional arg n (default 3) is the maximum number of close matches to
    return.  n must be > 0.

    Optional arg cutoff (default 0.6) is a float in [0, 1].  Possibilities
    that don't score at least that similar to word are ignored.

    The best (no more than n) matches among the possibilities are returned
    in a list, sorted by similarity score, most similar first.

    >>> get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
    ['apple', 'ape']
    >>> import keyword as _keyword
    >>> get_close_matches("wheel", _keyword.kwlist)
    ['while']
    >>> get_close_matches("apple", _keyword.kwlist)
    []
    >>> get_close_matches("accept", _keyword.kwlist)
    ['except']
    """

    if not n > 0:
        raise ValueError("n must be > 0: " + `n`)
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be in [0.0, 1.0]: " + `cutoff`)
    result = []
    s = SequenceMatcher()
    s.set_seq2(word)
    for x in possibilities:
        s.set_seq1(x)
        if s.real_quick_ratio() >= cutoff and \
           s.quick_ratio() >= cutoff and \
           s.ratio() >= cutoff:
            result.append((s.ratio(), x))
    # Sort by score.
    result.sort()
    # Retain only the best n.
    result = result[-n:]
    # Move best-scorer to head of list.
    result.reverse()
    # Strip scores.
    return [x for score, x in result]


def _count_leading(line, ch):
    """
    Return number of `ch` characters at the start of `line`.

    Example:

    >>> _count_leading('   abc', ' ')
    3
    """

    i, n = 0, len(line)
    while i < n and line[i] == ch:
        i += 1
    return i
class Differ:
    r"""
    Differ is a class for comparing sequences of lines of text, and
    producing human-readable differences or deltas.  Differ uses
    SequenceMatcher both to compare sequences of lines, and to compare
    sequences of characters within similar (near-matching) lines.

    Each line of a Differ delta begins with a two-letter code:

        '- '    line unique to sequence 1
        '+ '    line unique to sequence 2
        '  '    line common to both sequences
        '? '    line not present in either input sequence

    Lines beginning with '? ' attempt to guide the eye to intraline
    differences, and were not present in either input sequence.  These
    lines can be confusing if the sequences contain tab characters.

    Note that Differ makes no claim to produce a *minimal* diff.  To the
    contrary, minimal diffs are often counter-intuitive, because they
    synch up anywhere possible, sometimes on accidental matches 100 pages
    apart.  Restricting synch points to contiguous matches preserves some
    notion of locality, at the occasional cost of producing a longer diff.

    Example: Comparing two texts.

    First we set up the texts, sequences of individual single-line strings
    ending with newlines (such sequences can also be obtained from the
    `readlines()` method of file-like objects):

    >>> text1 = '''  1. Beautiful is better than ugly.
    ...   2. Explicit is better than implicit.
    ...   3. Simple is better than complex.
    ...   4. Complex is better than complicated.
    ... '''.splitlines(1)
    >>> len(text1)
    4
    >>> text1[0][-1]
    '\n'
    >>> text2 = '''  1. Beautiful is better than ugly.
    ...   3.   Simple is better than complex.
    ...   4. Complicated is better than complex.
    ...   5. Flat is better than nested.
    ... '''.splitlines(1)

    Next we instantiate a Differ object:

    >>> d = Differ()

    Note that when instantiating a Differ object we may pass functions to
    filter out line and character 'junk'.  See Differ.__init__ for details.

    Finally, we compare the two:

    >>> result = list(d.compare(text1, text2))

    'result' is a list of strings, so let's pretty-print it:

    >>> from pprint import pprint as _pprint
    >>> _pprint(result)
    ['    1. Beautiful is better than ugly.\n',
     '-   2. Explicit is better than implicit.\n',
     '-   3. Simple is better than complex.\n',
     '+   3.   Simple is better than complex.\n',
     '?     ++\n',
     '-   4. Complex is better than complicated.\n',
     '?            ^                     ---- ^\n',
     '+   4. Complicated is better than complex.\n',
     '?           ++++ ^                      ^\n',
     '+   5. Flat is better than nested.\n']

    As a single multi-line string it looks like this:

    >>> print ''.join(result),
        1. Beautiful is better than ugly.
    -   2. Explicit is better than implicit.
    -   3. Simple is better than complex.
    +   3.   Simple is better than complex.
    ?     ++
    -   4. Complex is better than complicated.
    ?            ^                     ---- ^
    +   4. Complicated is better than complex.
    ?           ++++ ^                      ^
    +   5. Flat is better than nested.

    Methods:

    __init__(linejunk=None, charjunk=None)
        Construct a text differencer, with optional filters.

    compare(a, b)
        Compare two sequences of lines; generate the resulting delta.
    """

    def __init__(self, linejunk=None, charjunk=None):
        """
        Construct a text differencer, with optional filters.

        The two optional keyword parameters are for filter functions:

        - `linejunk`: A function that should accept a single string argument,
          and return true iff the string is junk.  The module-level function
          `IS_LINE_JUNK` may be used to filter out lines without visible
          characters, except for at most one splat ('#').

        - `charjunk`: A function that should accept a string of length 1.
          The module-level function `IS_CHARACTER_JUNK` may be used to
          filter out whitespace characters (a blank or tab; **note**: bad
          idea to include newline in this!).
        """

        self.linejunk = linejunk
        self.charjunk = charjunk

    def compare(self, a, b):
        r"""
        Compare two sequences of lines; generate the resulting delta.

        Each sequence must contain individual single-line strings ending
        with newlines.  Such sequences can be obtained from the
        `readlines()` method of file-like objects.  The delta generated
        also consists of newline-terminated strings, ready to be printed
        as-is via the writelines() method of a file-like object.

        Example:

        >>> print ''.join(Differ().compare('one\ntwo\nthree\n'.splitlines(1),
        ...                                'ore\ntree\nemu\n'.splitlines(1))),
        - one
        ?  ^
        + ore
        ?  ^
        - two
        - three
        ?  -
        + tree
        + emu
        """

        cruncher = SequenceMatcher(self.linejunk, a, b)
        for tag, alo, ahi, blo, bhi in cruncher.get_opcodes():
            if tag == 'replace':
                g = self._fancy_replace(a, alo, ahi, b, blo, bhi)
            elif tag == 'delete':
                g = self._dump('-', a, alo, ahi)
            elif tag == 'insert':
                g = self._dump('+', b, blo, bhi)
            elif tag == 'equal':
                g = self._dump(' ', a, alo, ahi)
            else:
                raise ValueError, 'unknown tag ' + `tag`

            for line in g:
                yield line

    def _dump(self, tag, x, lo, hi):
        """Generate comparison results for a same-tagged range."""
        for i in xrange(lo, hi):
            yield '%s %s' % (tag, x[i])

    def _plain_replace(self, a, alo, ahi, b, blo, bhi):
        assert alo < ahi and blo < bhi
        # dump the shorter block first -- reduces the burden on short-term
        # memory if the blocks are of very different sizes
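        # (e.g. if one line was replaced by ten, the single '-' line
        # comes out before the ten '+' lines)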
        if bhi - blo < ahi - alo:
            first  = self._dump('+', b, blo, bhi)
            second = self._dump('-', a, alo, ahi)
        else:
            first  = self._dump('-', a, alo, ahi)
            second = self._dump('+', b, blo, bhi)

        for g in first, second:
            for line in g:
                yield line

    def _fancy_replace(self, a, alo, ahi, b, blo, bhi):
        r"""
        When replacing one block of lines with another, search the blocks
        for *similar* lines; the best-matching pair (if any) is used as a
        synch point, and intraline difference marking is done on the
        similar pair.  Lots of work, but often worth it.

        Example:

        >>> d = Differ()
        >>> results = d._fancy_replace(['abcDefghiJkl\n'], 0, 1,
        ...                            ['abcdefGhijkl\n'], 0, 1)
        >>> print ''.join(results),
        - abcDefghiJkl
        ?    ^  ^  ^
        + abcdefGhijkl
        ?    ^  ^  ^
        """

        # don't synch up unless the lines have a similarity score of at
        # least cutoff; best_ratio tracks the best score seen so far
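        # (0.74 rather than 0.75, apparently so a pair scoring exactly
        # cutoff can still win the strict '>' comparisons below)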
        best_ratio, cutoff = 0.74, 0.75
        cruncher = SequenceMatcher(self.charjunk)
        eqi, eqj = None, None   # 1st indices of equal lines (if any)

        # search for the pair that matches best without being identical
        # (identical lines must be junk lines, & we don't want to synch up
        # on junk -- unless we have to)
        for j in xrange(blo, bhi):
            bj = b[j]
            cruncher.set_seq2(bj)
            for i in xrange(alo, ahi):
                ai = a[i]
                if ai == bj:
                    if eqi is None:
                        eqi, eqj = i, j
                    continue
                cruncher.set_seq1(ai)
                # computing similarity is expensive, so use the quick
                # upper bounds first -- have seen this speed up messy
                # compares by a factor of 3.
                # note that ratio() is only expensive to compute the first
                # time it's called on a sequence pair; the expensive part
                # of the computation is cached by cruncher
                if cruncher.real_quick_ratio() > best_ratio and \
                      cruncher.quick_ratio() > best_ratio and \
                      cruncher.ratio() > best_ratio:
                    best_ratio, best_i, best_j = cruncher.ratio(), i, j
        if best_ratio < cutoff:
            # no non-identical "pretty close" pair
            if eqi is None:
                # no identical pair either -- treat it as a straight replace
                for line in self._plain_replace(a, alo, ahi, b, blo, bhi):
                    yield line
                return
            # no close pair, but an identical pair -- synch up on that
            best_i, best_j, best_ratio = eqi, eqj, 1.0
        else:
            # there's a close pair, so forget the identical pair (if any)
            eqi = None

        # a[best_i] very similar to b[best_j]; eqi is None iff they're not
        # identical

        # pump out diffs from before the synch point
        for line in self._fancy_helper(a, alo, best_i, b, blo, best_j):
            yield line

        # do intraline marking on the synch pair
        aelt, belt = a[best_i], b[best_j]
        if eqi is None:
            # pump out a '-', '?', '+', '?' quad for the synched lines
            atags = btags = ""
            cruncher.set_seqs(aelt, belt)
            for tag, ai1, ai2, bj1, bj2 in cruncher.get_opcodes():
                la, lb = ai2 - ai1, bj2 - bj1
                if tag == 'replace':
                    atags += '^' * la
                    btags += '^' * lb
                elif tag == 'delete':
                    atags += '-' * la
                elif tag == 'insert':
                    btags += '+' * lb
                elif tag == 'equal':
                    atags += ' ' * la
                    btags += ' ' * lb
                else:
                    raise ValueError, 'unknown tag ' + `tag`
            for line in self._qformat(aelt, belt, atags, btags):
                yield line
        else:
            # the synch pair is identical
            yield '  ' + aelt

        # pump out diffs from after the synch point
        for line in self._fancy_helper(a, best_i+1, ahi, b, best_j+1, bhi):
            yield line

    def _fancy_helper(self, a, alo, ahi, b, blo, bhi):
        g = []
        if alo < ahi:
            if blo < bhi:
                g = self._fancy_replace(a, alo, ahi, b, blo, bhi)
            else:
                g = self._dump('-', a, alo, ahi)
        elif blo < bhi:
            g = self._dump('+', b, blo, bhi)

        for line in g:
            yield line

    def _qformat(self, aline, bline, atags, btags):
        r"""
        Format "?" output and deal with leading tabs.

        Example:

        >>> d = Differ()
        >>> results = d._qformat('\tabcDefghiJkl\n', '\t\tabcdefGhijkl\n',
        ...                      '  ^ ^  ^      ', '+  ^ ^  ^      ')
        >>> for line in results: print repr(line)
        ...
        '- \tabcDefghiJkl\n'
        '? \t ^ ^  ^\n'
        '+ \t\tabcdefGhijkl\n'
        '? \t  ^ ^  ^\n'
        """

        # Can hurt, but will probably help most of the time.
        common = min(_count_leading(aline, "\t"),
                     _count_leading(bline, "\t"))
        common = min(common, _count_leading(atags[:common], " "))
        atags = atags[common:].rstrip()
        btags = btags[common:].rstrip()

        yield "- " + aline
        if atags:
            yield "? %s%s\n" % ("\t" * common, atags)

        yield "+ " + bline
        if btags:
            yield "? %s%s\n" % ("\t" * common, btags)

# With respect to junk, an earlier version of ndiff simply refused to
# *start* a match with a junk element.  The result was cases like this:
#     before: private Thread currentThread;
#     after:  private volatile Thread currentThread;
# If you consider whitespace to be junk, the longest contiguous match
# not starting with junk is "e Thread currentThread".  So ndiff reported
# that "e volatil" was inserted between the 't' and the 'e' in "private".
# While an accurate view, to people that's absurd.  The current version
# looks for matching blocks that are entirely junk-free, then extends the
# longest one of those as far as possible but only with matching junk.
# So now "currentThread" is matched, then extended to suck up the
# preceding blank; then "private" is matched, and extended to suck up the
# following blank; then "Thread" is matched; and finally ndiff reports
# that "volatile " was inserted before "Thread".  The only quibble
# remaining is that perhaps it was really the case that " volatile"
# was inserted after "private".  I can live with that <wink>.
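
# A sketch of that behavior (illustrative only), treating blanks as junk:
#
#     sm = SequenceMatcher(lambda x: x == " ",
#                          "private Thread currentThread;",
#                          "private volatile Thread currentThread;")
#     print sm.find_longest_match(0, 29, 0, 38)
#
# prints (14, 23, 15): the junk-free match "currentThread;" is found
# first, then extended leftward through the single matching blank.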

import re

def IS_LINE_JUNK(line, pat=re.compile(r"\s*#?\s*$").match):
    r"""
    Return 1 for ignorable line: iff `line` is blank or contains a single '#'.

    Examples:

    >>> IS_LINE_JUNK('\n')
    1
    >>> IS_LINE_JUNK('  #   \n')
    1
    >>> IS_LINE_JUNK('hello\n')
    0
    """

    return pat(line) is not None

def IS_CHARACTER_JUNK(ch, ws=" \t"):
    r"""
    Return 1 for ignorable character: iff `ch` is a space or tab.

    Examples:

    >>> IS_CHARACTER_JUNK(' ')
    1
    >>> IS_CHARACTER_JUNK('\t')
    1
    >>> IS_CHARACTER_JUNK('\n')
    0
    >>> IS_CHARACTER_JUNK('x')
    0
    """

    return ch in ws

del re

def ndiff(a, b, linejunk=IS_LINE_JUNK, charjunk=IS_CHARACTER_JUNK):
    r"""
    Compare `a` and `b` (lists of strings); return a `Differ`-style delta.

    Optional keyword parameters `linejunk` and `charjunk` are for filter
    functions (or None):

    - linejunk: A function that should accept a single string argument, and
      return true iff the string is junk.  The default is module-level
      function IS_LINE_JUNK, which filters out lines without visible
      characters, except for at most one splat ('#').

    - charjunk: A function that should accept a string of length 1.  The
      default is module-level function IS_CHARACTER_JUNK, which filters out
      whitespace characters (a blank or tab; note: bad idea to include
      newline in this!).

    Tools/scripts/ndiff.py is a command-line front-end to this function.

    Example:

    >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
    ...              'ore\ntree\nemu\n'.splitlines(1))
    >>> print ''.join(diff),
    - one
    ?  ^
    + ore
    ?  ^
    - two
    - three
    ?  -
    + tree
    + emu
    """
    return Differ(linejunk, charjunk).compare(a, b)

def restore(delta, which):
    r"""
    Generate one of the two sequences that generated a delta.

    Given a `delta` produced by `Differ.compare()` or `ndiff()`, extract
    lines originating from file 1 or 2 (parameter `which`), stripping off
    line prefixes.

    Examples:

    >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
    ...              'ore\ntree\nemu\n'.splitlines(1))
    >>> diff = list(diff)
    >>> print ''.join(restore(diff, 1)),
    one
    two
    three
    >>> print ''.join(restore(diff, 2)),
    ore
    tree
    emu
    """
    try:
        tag = {1: "- ", 2: "+ "}[int(which)]
    except KeyError:
        raise ValueError, ('unknown delta choice (must be 1 or 2): %r'
                           % which)
    prefixes = ("  ", tag)
    for line in delta:
        if line[:2] in prefixes:
            yield line[2:]

def _test():
    import doctest, difflib
    return doctest.testmod(difflib)

if __name__ == "__main__":
    _test()