#! /usr/bin/env python

# Module ndiff version 1.6.0
# Released to the public domain 08-Dec-2000,
# by Tim Peters (tim.one@home.com).

# Provided as-is; use at your own risk; no warranty; no promises; enjoy!

"""ndiff [-q] file1 file2
    or
ndiff (-r1 | -r2) < ndiff_output > file1_or_file2

Print a human-friendly file difference report to stdout.  Both inter-
and intra-line differences are noted.  In the second form, recreate file1
(-r1) or file2 (-r2) on stdout, from an ndiff report on stdin.

In the first form, if -q ("quiet") is not specified, the first two lines
of output are

-: file1
+: file2

Each remaining line begins with a two-letter code:

    "- "    line unique to file1
    "+ "    line unique to file2
    "  "    line common to both files
    "? "    line not present in either input file

Lines beginning with "? " attempt to guide the eye to intraline
differences, and were not present in either input file.  These lines can be
confusing if the source files contain tab characters.

The first file can be recovered by retaining only lines that begin with
"  " or "- ", and deleting those 2-character prefixes; use ndiff with -r1.

The second file can be recovered similarly, but by retaining only "  " and
"+ " lines; use ndiff with -r2; or, on Unix, the second file can be
recovered by piping the output through

    sed -n '/^[+ ] /s/^..//p'

See module comments for details and programmatic interface.
"""

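# For example (hypothetical file names), a typical shell session looks
# like this; the second command recreates file2 from a saved report:
#
#     python ndiff.py old_version.py new_version.py > diff_report.txt
#     python ndiff.py -r2 < diff_report.txt > new_version_restored.py
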
__version__ = 1, 6, 0

# SequenceMatcher tries to compute a "human-friendly diff" between
# two sequences (chiefly picturing a file as a sequence of lines,
# and a line as a sequence of characters, here).  Unlike e.g. UNIX(tm)
# diff, the fundamental notion is the longest *contiguous* & junk-free
# matching subsequence.  That's what catches people's eyes.  The
# Windows(tm) windiff has another interesting notion, pairing up elements
# that appear uniquely in each sequence.  That, and the method here,
# appear to yield more intuitive difference reports than does diff.  This
# method appears to be the least vulnerable to synching up on blocks
# of "junk lines", though (like blank lines in ordinary text files,
# or maybe "<P>" lines in HTML files).  That may be because this is
# the only method of the 3 that has a *concept* of "junk" <wink>.
#
# Note that ndiff makes no claim to produce a *minimal* diff.  To the
# contrary, minimal diffs are often counter-intuitive, because they
# synch up anywhere possible, sometimes on accidental matches 100 pages
# apart.  Restricting synch points to contiguous matches preserves some
# notion of locality, at the occasional cost of producing a longer diff.
#
# With respect to junk, an earlier version of ndiff simply refused to
# *start* a match with a junk element.  The result was cases like this:
#     before: private Thread currentThread;
#     after:  private volatile Thread currentThread;
# If you consider whitespace to be junk, the longest contiguous match
# not starting with junk is "e Thread currentThread".  So ndiff reported
# that "e volatil" was inserted between the 't' and the 'e' in "private".
# While an accurate view, to people that's absurd.  The current version
# looks for matching blocks that are entirely junk-free, then extends the
# longest one of those as far as possible but only with matching junk.
# So now "currentThread" is matched, then extended to suck up the
# preceding blank; then "private" is matched, and extended to suck up the
# following blank; then "Thread" is matched; and finally ndiff reports
# that "volatile " was inserted before "Thread".  The only quibble
# remaining is that perhaps it was really the case that " volatile"
# was inserted after "private".  I can live with that <wink>.
#
# NOTE on junk:  the module-level names
#     IS_LINE_JUNK
#     IS_CHARACTER_JUNK
# can be set to any functions you like.  The first one should accept
# a single string argument, and return true iff the string is junk.
# The default is whether the regexp r"\s*#?\s*$" matches (i.e., a
# line without visible characters, except for at most one splat).
# The second should accept a single character (a string of length 1)
# and return true iff the character is junk; the default is whether the
# character is a blank or tab (note: bad idea to include newline in this!).
#
# After setting those, you can call fcompare(f1name, f2name) with the
# names of the files you want to compare.  The difference report
# is sent to stdout.  Or you can call main(args), passing what would
# have been in sys.argv[1:] had the cmd-line form been used.
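#
# A minimal programmatic sketch (hypothetical file names; assumes this
# file is importable as ndiff; replacing IS_LINE_JUNK is optional --
# the default works fine for ordinary text):
#
#     import ndiff
#     ndiff.IS_LINE_JUNK = lambda line: line.strip() == ""  # blank lines only
#     ndiff.fcompare("old_config.txt", "new_config.txt")    # report to stdout
#
# or, equivalently to the command-line form:
#
#     ndiff.main(["-q", "old_config.txt", "new_config.txt"])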

import string
TRACE = 0

# define what "junk" means
import re

def IS_LINE_JUNK(line, pat=re.compile(r"\s*#?\s*$").match):
    return pat(line) is not None

def IS_CHARACTER_JUNK(ch, ws=" \t"):
    return ch in ws

del re
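
# Illustrative values for the default junk predicates (a sketch; the line
# strings include their trailing newlines, as readlines() returns them):
#
#     IS_LINE_JUNK("\n")        -> true  (blank line)
#     IS_LINE_JUNK("  #  \n")   -> true  (one splat, otherwise blank)
#     IS_LINE_JUNK("hello\n")   -> false
#     IS_CHARACTER_JUNK(" ")    -> true
#     IS_CHARACTER_JUNK("\n")   -> false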

class SequenceMatcher:
    def __init__(self, isjunk=None, a='', b=''):
        # Members:
        # a
        #      first sequence
        # b
        #      second sequence; differences are computed as "what do
        #      we need to do to 'a' to change it into 'b'?"
        # b2j
        #      for x in b, b2j[x] is a list of the indices (into b)
        #      at which x appears; junk elements do not appear
        # b2jhas
        #      b2j.has_key
        # fullbcount
        #      for x in b, fullbcount[x] == the number of times x
        #      appears in b; only materialized if really needed (used
        #      only for computing quick_ratio())
        # matching_blocks
        #      a list of (i, j, k) triples, where a[i:i+k] == b[j:j+k];
        #      ascending & non-overlapping in i and in j; terminated by
        #      a dummy (len(a), len(b), 0) sentinel
        # opcodes
        #      a list of (tag, i1, i2, j1, j2) tuples, where tag is
        #      one of
        #          'replace'  a[i1:i2] should be replaced by b[j1:j2]
        #          'delete'   a[i1:i2] should be deleted
        #          'insert'   b[j1:j2] should be inserted
        #          'equal'    a[i1:i2] == b[j1:j2]
        # isjunk
        #      a user-supplied function taking a sequence element and
        #      returning true iff the element is "junk" -- this has
        #      subtle but helpful effects on the algorithm, which I'll
        #      get around to writing up someday <0.9 wink>.
        #      DON'T USE!  Only __chain_b uses this.  Use isbjunk.
        # isbjunk
        #      for x in b, isbjunk(x) == isjunk(x) but much faster;
        #      it's really the has_key method of a hidden dict.
        #      DOES NOT WORK for x in a!

        self.isjunk = isjunk
        self.a = self.b = None
        self.set_seqs(a, b)

    def set_seqs(self, a, b):
        self.set_seq1(a)
        self.set_seq2(b)

    def set_seq1(self, a):
        if a is self.a:
            return
        self.a = a
        self.matching_blocks = self.opcodes = None

    def set_seq2(self, b):
        if b is self.b:
            return
        self.b = b
        self.matching_blocks = self.opcodes = None
        self.fullbcount = None
        self.__chain_b()

    # For each element x in b, set b2j[x] to a list of the indices in
    # b where x appears; the indices are in increasing order; note that
    # the number of times x appears in b is len(b2j[x]) ...
    # when self.isjunk is defined, junk elements don't show up in this
    # map at all, which stops the central find_longest_match method
    # from starting any matching block at a junk element ...
    # also creates the fast isbjunk function ...
    # note that this is only called when b changes; so for cross-product
    # kinds of matches, it's best to call set_seq2 once, then set_seq1
    # repeatedly
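    #
    # For instance (a small sketch), if b is the string "abcb" and isjunk
    # is None, __chain_b leaves
    #     b2j == {'a': [0], 'b': [1, 3], 'c': [2]}
    # and isbjunk(x) is false for every x.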

    def __chain_b(self):
        # Because isjunk is a user-defined (not C) function, and we test
        # for junk a LOT, it's important to minimize the number of calls.
        # Before the tricks described here, __chain_b was by far the most
        # time-consuming routine in the whole module!  If anyone sees
        # Jim Roskind, thank him again for profile.py -- I never would
        # have guessed that.
        # The first trick is to build b2j ignoring the possibility
        # of junk.  I.e., we don't call isjunk at all yet.  Throwing
        # out the junk later is much cheaper than building b2j "right"
        # from the start.
        b = self.b
        self.b2j = b2j = {}
        self.b2jhas = b2jhas = b2j.has_key
        for i in xrange(len(b)):
            elt = b[i]
            if b2jhas(elt):
                b2j[elt].append(i)
            else:
                b2j[elt] = [i]

        # Now b2j.keys() contains elements uniquely, and especially when
        # the sequence is a string, that's usually a good deal smaller
        # than len(string).  The difference is the number of isjunk calls
        # saved.
        isjunk, junkdict = self.isjunk, {}
        if isjunk:
            for elt in b2j.keys():
                if isjunk(elt):
                    junkdict[elt] = 1   # value irrelevant; it's a set
                    del b2j[elt]

        # Now for x in b, isjunk(x) == junkdict.has_key(x), but the
        # latter is much faster.  Note too that while there may be a
        # lot of junk in the sequence, the number of *unique* junk
        # elements is probably small.  So the memory burden of keeping
        # this dict alive is likely trivial compared to the size of b2j.
        self.isbjunk = junkdict.has_key

    def find_longest_match(self, alo, ahi, blo, bhi):
        """Find longest matching block in a[alo:ahi] and b[blo:bhi].

        If isjunk is not defined:

        Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
            alo <= i <= i+k <= ahi
            blo <= j <= j+k <= bhi
        and for all (i',j',k') meeting those conditions,
            k >= k'
            i <= i'
            and if i == i', j <= j'
        In other words, of all maximal matching blocks, return one
        that starts earliest in a, and of all those maximal matching
        blocks that start earliest in a, return the one that starts
        earliest in b.

        If isjunk is defined, first the longest matching block is
        determined as above, but with the additional restriction that
        no junk element appears in the block.  Then that block is
        extended as far as possible by matching (only) junk elements on
        both sides.  So the resulting block never matches on junk except
        as identical junk happens to be adjacent to an "interesting"
        match.

        If no blocks match, return (alo, blo, 0).
        """
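
        # A worked illustration (a sketch, using strings for a and b):
        #     s = SequenceMatcher(None, " abcd", "abcd abcd")
        #     s.find_longest_match(0, 5, 0, 9)  ->  (0, 4, 5)
        # i.e. all of a (" abcd") is found at b[4:9].  But if blanks are
        # junk:
        #     s = SequenceMatcher(IS_CHARACTER_JUNK, " abcd", "abcd abcd")
        #     s.find_longest_match(0, 5, 0, 9)  ->  (1, 0, 4)
        # the junk-free block "abcd" is matched at the start of b instead,
        # and there happens to be no adjacent matching junk to extend it.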
252
253 # CAUTION: stripping common prefix or suffix would be incorrect.
254 # E.g.,
255 # ab
256 # acab
257 # Longest matching block is "ab", but if common prefix is
258 # stripped, it's "a" (tied with "b"). UNIX(tm) diff does so
259 # strip, so ends up claiming that ab is changed to acab by
260 # inserting "ca" in the middle. That's minimal but unintuitive:
261 # "it's obvious" that someone inserted "ac" at the front.
262 # Windiff ends up at the same place as diff, but by pairing up
263 # the unique 'b's and then matching the first two 'a's.
264
Guido van Rossum83b85181998-05-06 17:43:30 +0000265 a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.isbjunk
266 besti, bestj, bestsize = alo, blo, 0
Guido van Rossuma3433e81999-03-27 13:34:01 +0000267 # find longest junk-free match
268 # during an iteration of the loop, j2len[j] = length of longest
269 # junk-free match ending with a[i-1] and b[j]
270 j2len = {}
271 nothing = []
Guido van Rossum83b85181998-05-06 17:43:30 +0000272 for i in xrange(alo, ahi):
Guido van Rossum83b85181998-05-06 17:43:30 +0000273 # look at all instances of a[i] in b; note that because
274 # b2j has no junk keys, the loop is skipped if a[i] is junk
Guido van Rossuma3433e81999-03-27 13:34:01 +0000275 j2lenget = j2len.get
276 newj2len = {}
277 for j in b2j.get(a[i], nothing):
Guido van Rossum83b85181998-05-06 17:43:30 +0000278 # a[i] matches b[j]
279 if j < blo:
280 continue
Guido van Rossuma3433e81999-03-27 13:34:01 +0000281 if j >= bhi:
Guido van Rossum83b85181998-05-06 17:43:30 +0000282 break
Guido van Rossuma3433e81999-03-27 13:34:01 +0000283 k = newj2len[j] = j2lenget(j-1, 0) + 1
Guido van Rossum83b85181998-05-06 17:43:30 +0000284 if k > bestsize:
Guido van Rossuma3433e81999-03-27 13:34:01 +0000285 besti, bestj, bestsize = i-k+1, j-k+1, k
286 j2len = newj2len
Guido van Rossum83b85181998-05-06 17:43:30 +0000287
288 # Now that we have a wholly interesting match (albeit possibly
289 # empty!), we may as well suck up the matching junk on each
290 # side of it too. Can't think of a good reason not to, and it
291 # saves post-processing the (possibly considerable) expense of
292 # figuring out what to do with it. In the case of an empty
293 # interesting match, this is clearly the right thing to do,
294 # because no other kind of match is possible in the regions.
295 while besti > alo and bestj > blo and \
296 isbjunk(b[bestj-1]) and \
297 a[besti-1] == b[bestj-1]:
298 besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
299 while besti+bestsize < ahi and bestj+bestsize < bhi and \
300 isbjunk(b[bestj+bestsize]) and \
301 a[besti+bestsize] == b[bestj+bestsize]:
302 bestsize = bestsize + 1
303
304 if TRACE:
305 print "get_matching_blocks", alo, ahi, blo, bhi
306 print " returns", besti, bestj, bestsize
307 return besti, bestj, bestsize
308
Guido van Rossum83b85181998-05-06 17:43:30 +0000309 def get_matching_blocks(self):
310 if self.matching_blocks is not None:
311 return self.matching_blocks
312 self.matching_blocks = []
313 la, lb = len(self.a), len(self.b)
314 self.__helper(0, la, 0, lb, self.matching_blocks)
315 self.matching_blocks.append( (la, lb, 0) )
316 if TRACE:
317 print '*** matching blocks', self.matching_blocks
318 return self.matching_blocks
319
320 # builds list of matching blocks covering a[alo:ahi] and
321 # b[blo:bhi], appending them in increasing order to answer
322
323 def __helper(self, alo, ahi, blo, bhi, answer):
324 i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
325 # a[alo:i] vs b[blo:j] unknown
326 # a[i:i+k] same as b[j:j+k]
327 # a[i+k:ahi] vs b[j+k:bhi] unknown
328 if k:
329 if alo < i and blo < j:
330 self.__helper(alo, i, blo, j, answer)
Guido van Rossum02ef28b1999-03-28 17:55:32 +0000331 answer.append(x)
Guido van Rossum83b85181998-05-06 17:43:30 +0000332 if i+k < ahi and j+k < bhi:
333 self.__helper(i+k, ahi, j+k, bhi, answer)
334
335 def ratio(self):
336 """Return a measure of the sequences' similarity (float in [0,1]).
337
338 Where T is the total number of elements in both sequences, and
339 M is the number of matches, this is 2*M / T.
340 Note that this is 1 if the sequences are identical, and 0 if
341 they have nothing in common.
342 """
343
344 matches = reduce(lambda sum, triple: sum + triple[-1],
345 self.get_matching_blocks(), 0)
346 return 2.0 * matches / (len(self.a) + len(self.b))
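
        # Illustration (a sketch):
        #     SequenceMatcher(None, "abcd", "bcde").ratio()  ->  0.75
        # the single matching block "bcd" gives M == 3 matches out of
        # T == 8 total elements, and 2*3/8 == 0.75.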

    def quick_ratio(self):
        """Return an upper bound on ratio() relatively quickly."""
        # viewing a and b as multisets, set matches to the cardinality
        # of their intersection; this counts the number of matches
        # without regard to order, so is clearly an upper bound
        if self.fullbcount is None:
            self.fullbcount = fullbcount = {}
            for elt in self.b:
                fullbcount[elt] = fullbcount.get(elt, 0) + 1
        fullbcount = self.fullbcount
        # avail[x] is the number of times x appears in 'b' less the
        # number of times we've seen it in 'a' so far ... kinda
        avail = {}
        availhas, matches = avail.has_key, 0
        for elt in self.a:
            if availhas(elt):
                numb = avail[elt]
            else:
                numb = fullbcount.get(elt, 0)
            avail[elt] = numb - 1
            if numb > 0:
                matches = matches + 1
        return 2.0 * matches / (len(self.a) + len(self.b))

    def real_quick_ratio(self):
        """Return an upper bound on ratio() very quickly."""
        la, lb = len(self.a), len(self.b)
        # can't have more matches than the number of elements in the
        # shorter sequence
        return 2.0 * min(la, lb) / (la + lb)
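
    # For the same small example the cheap bounds hold:
    #     SequenceMatcher(None, "abcd", "bcde").quick_ratio()       ->  0.75
    #     SequenceMatcher(None, "abcd", "bcde").real_quick_ratio()  ->  1.0
    # both are >= the true ratio() of 0.75.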

    def get_opcodes(self):
        if self.opcodes is not None:
            return self.opcodes
        i = j = 0
        self.opcodes = answer = []
        for ai, bj, size in self.get_matching_blocks():
            # invariant:  we've pumped out correct diffs to change
            # a[:i] into b[:j], and the next matching block is
            # a[ai:ai+size] == b[bj:bj+size].  So we need to pump
            # out a diff to change a[i:ai] into b[j:bj], pump out
            # the matching block, and move (i,j) beyond the match
            tag = ''
            if i < ai and j < bj:
                tag = 'replace'
            elif i < ai:
                tag = 'delete'
            elif j < bj:
                tag = 'insert'
            if tag:
                answer.append( (tag, i, ai, j, bj) )
            i, j = ai+size, bj+size
            # the list of matching blocks is terminated by a
            # sentinel with size 0
            if size:
                answer.append( ('equal', ai, i, bj, j) )
        return answer
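
    # For example (a sketch), with a = "qabxcd" and b = "abycdf":
    #     get_matching_blocks()  ->  [(1, 0, 2), (4, 3, 2), (6, 6, 0)]
    #     get_opcodes()          ->  [('delete',  0, 1, 0, 0),
    #                                 ('equal',   1, 3, 0, 2),
    #                                 ('replace', 3, 4, 2, 3),
    #                                 ('equal',   4, 6, 3, 5),
    #                                 ('insert',  6, 6, 5, 6)]
    # i.e. delete "q", keep "ab", replace "x" with "y", keep "cd", and
    # append "f".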

# meant for dumping lines
def dump(tag, x, lo, hi):
    for i in xrange(lo, hi):
        print tag, x[i],

def plain_replace(a, alo, ahi, b, blo, bhi):
    assert alo < ahi and blo < bhi
    # dump the shorter block first -- reduces the burden on short-term
    # memory if the blocks are of very different sizes
    if bhi - blo < ahi - alo:
        dump('+', b, blo, bhi)
        dump('-', a, alo, ahi)
    else:
        dump('-', a, alo, ahi)
        dump('+', b, blo, bhi)

# When replacing one block of lines with another, this guy searches
# the blocks for *similar* lines; the best-matching pair (if any) is
# used as a synch point, and intraline difference marking is done on
# the similar pair.  Lots of work, but often worth it.
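#
# For example, replacing the single line
#     private Thread currentThread;
# with
#     private volatile Thread currentThread;
# comes out roughly like this (a sketch of what the default junk settings
# produce; the "?" guide line points at the inserted "volatile "):
#
#     - private Thread currentThread;
#     + private volatile Thread currentThread;
#     ?         +++++++++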

def fancy_replace(a, alo, ahi, b, blo, bhi):
    if TRACE:
        print '*** fancy_replace', alo, ahi, blo, bhi
        dump('>', a, alo, ahi)
        dump('<', b, blo, bhi)

    # don't synch up unless the lines have a similarity score of at
    # least cutoff; best_ratio tracks the best score seen so far
    best_ratio, cutoff = 0.74, 0.75
    cruncher = SequenceMatcher(IS_CHARACTER_JUNK)
    eqi, eqj = None, None   # 1st indices of equal lines (if any)

    # search for the pair that matches best without being identical
    # (identical lines must be junk lines, & we don't want to synch up
    # on junk -- unless we have to)
    for j in xrange(blo, bhi):
        bj = b[j]
        cruncher.set_seq2(bj)
        for i in xrange(alo, ahi):
            ai = a[i]
            if ai == bj:
                if eqi is None:
                    eqi, eqj = i, j
                continue
            cruncher.set_seq1(ai)
            # computing similarity is expensive, so use the quick
            # upper bounds first -- have seen this speed up messy
            # compares by a factor of 3.
            # note that ratio() is only expensive to compute the first
            # time it's called on a sequence pair; the expensive part
            # of the computation is cached by cruncher
            if cruncher.real_quick_ratio() > best_ratio and \
               cruncher.quick_ratio() > best_ratio and \
               cruncher.ratio() > best_ratio:
                best_ratio, best_i, best_j = cruncher.ratio(), i, j
    if best_ratio < cutoff:
        # no non-identical "pretty close" pair
        if eqi is None:
            # no identical pair either -- treat it as a straight replace
            plain_replace(a, alo, ahi, b, blo, bhi)
            return
        # no close pair, but an identical pair -- synch up on that
        best_i, best_j, best_ratio = eqi, eqj, 1.0
    else:
        # there's a close pair, so forget the identical pair (if any)
        eqi = None

    # a[best_i] very similar to b[best_j]; eqi is None iff they're not
    # identical
    if TRACE:
        print '*** best_ratio', best_ratio, best_i, best_j
        dump('>', a, best_i, best_i+1)
        dump('<', b, best_j, best_j+1)

    # pump out diffs from before the synch point
    fancy_helper(a, alo, best_i, b, blo, best_j)

    # do intraline marking on the synch pair
    aelt, belt = a[best_i], b[best_j]
    if eqi is None:
        # pump out a '-', '?', '+', '?' quad for the synched lines
        atags = btags = ""
        cruncher.set_seqs(aelt, belt)
        for tag, ai1, ai2, bj1, bj2 in cruncher.get_opcodes():
            la, lb = ai2 - ai1, bj2 - bj1
            if tag == 'replace':
                atags += '^' * la
                btags += '^' * lb
            elif tag == 'delete':
                atags += '-' * la
            elif tag == 'insert':
                btags += '+' * lb
            elif tag == 'equal':
                atags += ' ' * la
                btags += ' ' * lb
            else:
                raise ValueError, 'unknown tag ' + `tag`
        printq(aelt, belt, atags, btags)
    else:
        # the synch pair is identical
        print '  ', aelt,

    # pump out diffs from after the synch point
    fancy_helper(a, best_i+1, ahi, b, best_j+1, bhi)

def fancy_helper(a, alo, ahi, b, blo, bhi):
    if alo < ahi:
        if blo < bhi:
            fancy_replace(a, alo, ahi, b, blo, bhi)
        else:
            dump('-', a, alo, ahi)
    elif blo < bhi:
        dump('+', b, blo, bhi)

# Crap to deal with leading tabs in "?" output.  Can hurt, but will
# probably help most of the time.

def printq(aline, bline, atags, btags):
    common = min(count_leading(aline, "\t"),
                 count_leading(bline, "\t"))
    common = min(common, count_leading(atags[:common], " "))
    print "-", aline,
    if count_leading(atags, " ") < len(atags):
        print "?", "\t" * common + atags[common:]
    print "+", bline,
    if count_leading(btags, " ") < len(btags):
        print "?", "\t" * common + btags[common:]

def count_leading(line, ch):
    i, n = 0, len(line)
    while i < n and line[i] == ch:
        i += 1
    return i

def fail(msg):
    import sys
    out = sys.stderr.write
    out(msg + "\n\n")
    out(__doc__)
    return 0

# open a file & return the file object; gripe and return 0 if it
# couldn't be opened
def fopen(fname):
    try:
        return open(fname, 'r')
    except IOError, detail:
        return fail("couldn't open " + fname + ": " + str(detail))

# open two files & spray the diff to stdout; return false iff a problem
def fcompare(f1name, f2name):
    f1 = fopen(f1name)
    f2 = fopen(f2name)
    if not f1 or not f2:
        return 0

    a = f1.readlines(); f1.close()
    b = f2.readlines(); f2.close()

    cruncher = SequenceMatcher(IS_LINE_JUNK, a, b)
    for tag, alo, ahi, blo, bhi in cruncher.get_opcodes():
        if tag == 'replace':
            fancy_replace(a, alo, ahi, b, blo, bhi)
        elif tag == 'delete':
            dump('-', a, alo, ahi)
        elif tag == 'insert':
            dump('+', b, blo, bhi)
        elif tag == 'equal':
            dump(' ', a, alo, ahi)
        else:
            raise ValueError, 'unknown tag ' + `tag`

    return 1

# crack args (sys.argv[1:] is normal) & compare;
# return false iff a problem

def main(args):
    import getopt
    try:
        opts, args = getopt.getopt(args, "qr:")
    except getopt.error, detail:
        return fail(str(detail))
    noisy = 1
    qseen = rseen = 0
    for opt, val in opts:
        if opt == "-q":
            qseen = 1
            noisy = 0
        elif opt == "-r":
            rseen = 1
            whichfile = val
    if qseen and rseen:
        return fail("can't specify both -q and -r")
    if rseen:
        if args:
            return fail("no args allowed with -r option")
        if whichfile in "12":
            restore(whichfile)
            return 1
        return fail("-r value must be 1 or 2")
    if len(args) != 2:
        return fail("need 2 filename args")
    f1name, f2name = args
    if noisy:
        print '-:', f1name
        print '+:', f2name
    return fcompare(f1name, f2name)

def restore(which):
    import sys
    tag = {"1": "- ", "2": "+ "}[which]
    prefixes = ("  ", tag)
    for line in sys.stdin.readlines():
        if line[:2] in prefixes:
            print line[2:],

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    if "-profile" in args:
        import profile, pstats
        args.remove("-profile")
        statf = "ndiff.pro"
        profile.run("main(args)", statf)
        stats = pstats.Stats(statf)
        stats.strip_dirs().sort_stats('time').print_stats()
    else:
        main(args)