Blame - Lib/difflib.py - platform/external/python/cpython3

blob: eb0eccfd38a64d812e93f511c02cdf8f219d9711

Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	1	#! /usr/bin/env python
				2
				3	"""
				4	Module difflib -- helpers for computing deltas between objects.
				5
				6	Function get_close_matches(word, possibilities, n=3, cutoff=0.6):
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	7	Use SequenceMatcher to return list of the best "good enough" matches.
				8
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	9	Function context_diff(a, b):
				10	For two lists of strings, return a delta in context diff format.
				11
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	12	Function ndiff(a, b):
				13	Return a delta: the difference between `a` and `b` (lists of strings).
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	14
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	15	Function restore(delta, which):
				16	Return one of the two sequences that generated an ndiff delta.
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	17
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	18	Function unified_diff(a, b):
				19	For two lists of strings, return a delta in unified diff format.
				20
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	21	Class SequenceMatcher:
				22	A flexible class for comparing pairs of sequences of any type.
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	23
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	24	Class Differ:
				25	For producing human-readable deltas from sequences of lines of text.
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	26	"""
				27
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	28	__all__ = ['get_close_matches', 'ndiff', 'restore', 'SequenceMatcher',
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	29	'Differ', 'IS_CHARACTER_JUNK', 'IS_LINE_JUNK', 'context_diff',
				30	'unified_diff']
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	31
Neal Norwitz	e7dfe21	2003-07-01 14:59:46 +0000	[diff] [blame]	32	def _calculate_ratio(matches, length):
				33	if length:
				34	return 2.0 * matches / length
				35	return 1.0
				36
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	37	class SequenceMatcher:
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	38
				39	"""
				40	SequenceMatcher is a flexible class for comparing pairs of sequences of
				41	any type, so long as the sequence elements are hashable. The basic
				42	algorithm predates, and is a little fancier than, an algorithm
				43	published in the late 1980's by Ratcliff and Obershelp under the
				44	hyperbolic name "gestalt pattern matching". The basic idea is to find
				45	the longest contiguous matching subsequence that contains no "junk"
				46	elements (R-O doesn't address junk). The same idea is then applied
				47	recursively to the pieces of the sequences to the left and to the right
				48	of the matching subsequence. This does not yield minimal edit
				49	sequences, but does tend to yield matches that "look right" to people.
				50
				51	SequenceMatcher tries to compute a "human-friendly diff" between two
				52	sequences. Unlike e.g. UNIX(tm) diff, the fundamental notion is the
				53	longest contiguous & junk-free matching subsequence. That's what
				54	catches people's eyes. The Windows(tm) windiff has another interesting
				55	notion, pairing up elements that appear uniquely in each sequence.
				56	That, and the method here, appear to yield more intuitive difference
				57	reports than does diff. This method appears to be the least vulnerable
				58	to synching up on blocks of "junk lines", though (like blank lines in
				59	ordinary text files, or maybe "<P>" lines in HTML files). That may be
				60	because this is the only method of the 3 that has a concept of
				61	"junk" <wink>.
				62
				63	Example, comparing two strings, and considering blanks to be "junk":
				64
				65	>>> s = SequenceMatcher(lambda x: x == " ",
				66	... "private Thread currentThread;",
				67	... "private volatile Thread currentThread;")
				68	>>>
				69
				70	.ratio() returns a float in [0, 1], measuring the "similarity" of the
				71	sequences. As a rule of thumb, a .ratio() value over 0.6 means the
				72	sequences are close matches:
				73
				74	>>> print round(s.ratio(), 3)
				75	0.866
				76	>>>
				77
				78	If you're only interested in where the sequences match,
				79	.get_matching_blocks() is handy:
				80
				81	>>> for block in s.get_matching_blocks():
				82	... print "a[%d] and b[%d] match for %d elements" % block
				83	a[0] and b[0] match for 8 elements
				84	a[8] and b[17] match for 6 elements
				85	a[14] and b[23] match for 15 elements
				86	a[29] and b[38] match for 0 elements
				87
				88	Note that the last tuple returned by .get_matching_blocks() is always a
				89	dummy, (len(a), len(b), 0), and this is the only case in which the last
				90	tuple element (number of elements matched) is 0.
				91
				92	If you want to know how to change the first sequence into the second,
				93	use .get_opcodes():
				94
				95	>>> for opcode in s.get_opcodes():
				96	... print "%6s a[%d:%d] b[%d:%d]" % opcode
				97	equal a[0:8] b[0:8]
				98	insert a[8:8] b[8:17]
				99	equal a[8:14] b[17:23]
				100	equal a[14:29] b[23:38]
				101
				102	See the Differ class for a fancy human-friendly file differencer, which
				103	uses SequenceMatcher both to compare sequences of lines, and to compare
				104	sequences of characters within similar (near-matching) lines.
				105
				106	See also function get_close_matches() in this module, which shows how
				107	simple code building on SequenceMatcher can be used to do useful work.
				108
				109	Timing: Basic R-O is cubic time worst case and quadratic time expected
				110	case. SequenceMatcher is quadratic time for the worst case and has
				111	expected-case behavior dependent in a complicated way on how many
				112	elements the sequences have in common; best case time is linear.
				113
				114	Methods:
				115
				116	__init__(isjunk=None, a='', b='')
				117	Construct a SequenceMatcher.
				118
				119	set_seqs(a, b)
				120	Set the two sequences to be compared.
				121
				122	set_seq1(a)
				123	Set the first sequence to be compared.
				124
				125	set_seq2(b)
				126	Set the second sequence to be compared.
				127
				128	find_longest_match(alo, ahi, blo, bhi)
				129	Find longest matching block in a[alo:ahi] and b[blo:bhi].
				130
				131	get_matching_blocks()
				132	Return list of triples describing matching subsequences.
				133
				134	get_opcodes()
				135	Return list of 5-tuples describing how to turn a into b.
				136
				137	ratio()
				138	Return a measure of the sequences' similarity (float in [0,1]).
				139
				140	quick_ratio()
				141	Return an upper bound on .ratio() relatively quickly.
				142
				143	real_quick_ratio()
				144	Return an upper bound on ratio() very quickly.
				145	"""
				146
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	147	def __init__(self, isjunk=None, a='', b=''):
				148	"""Construct a SequenceMatcher.
				149
				150	Optional arg isjunk is None (the default), or a one-argument
				151	function that takes a sequence element and returns true iff the
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	152	element is junk. None is equivalent to passing "lambda x: 0", i.e.
Fred Drake	f1da628	2001-02-19 19:30:05 +0000	[diff] [blame]	153	no elements are considered to be junk. For example, pass
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	154	lambda x: x in " \\t"
				155	if you're comparing lines as sequences of characters, and don't
				156	want to synch up on blanks or hard tabs.
				157
				158	Optional arg a is the first of two sequences to be compared. By
				159	default, an empty string. The elements of a must be hashable. See
				160	also .set_seqs() and .set_seq1().
				161
				162	Optional arg b is the second of two sequences to be compared. By
Fred Drake	f1da628	2001-02-19 19:30:05 +0000	[diff] [blame]	163	default, an empty string. The elements of b must be hashable. See
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	164	also .set_seqs() and .set_seq2().
				165	"""
				166
				167	# Members:
				168	# a
				169	# first sequence
				170	# b
				171	# second sequence; differences are computed as "what do
				172	# we need to do to 'a' to change it into 'b'?"
				173	# b2j
				174	# for x in b, b2j[x] is a list of the indices (into b)
				175	# at which x appears; junk elements do not appear
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	176	# fullbcount
				177	# for x in b, fullbcount[x] == the number of times x
				178	# appears in b; only materialized if really needed (used
				179	# only for computing quick_ratio())
				180	# matching_blocks
				181	# a list of (i, j, k) triples, where a[i:i+k] == b[j:j+k];
				182	# ascending & non-overlapping in i and in j; terminated by
				183	# a dummy (len(a), len(b), 0) sentinel
				184	# opcodes
				185	# a list of (tag, i1, i2, j1, j2) tuples, where tag is
				186	# one of
				187	# 'replace' a[i1:i2] should be replaced by b[j1:j2]
				188	# 'delete' a[i1:i2] should be deleted
				189	# 'insert' b[j1:j2] should be inserted
				190	# 'equal' a[i1:i2] == b[j1:j2]
				191	# isjunk
				192	# a user-supplied function taking a sequence element and
				193	# returning true iff the element is "junk" -- this has
				194	# subtle but helpful effects on the algorithm, which I'll
				195	# get around to writing up someday <0.9 wink>.
				196	# DON'T USE! Only __chain_b uses this. Use isbjunk.
				197	# isbjunk
				198	# for x in b, isbjunk(x) == isjunk(x) but much faster;
				199	# it's really the has_key method of a hidden dict.
				200	# DOES NOT WORK for x in a!
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	201	# isbpopular
				202	# for x in b, isbpopular(x) is true iff b is reasonably long
				203	# (at least 200 elements) and x accounts for more than 1% of
				204	# its elements. DOES NOT WORK for x in a!
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	205
				206	self.isjunk = isjunk
				207	self.a = self.b = None
				208	self.set_seqs(a, b)
				209
				210	def set_seqs(self, a, b):
				211	"""Set the two sequences to be compared.
				212
				213	>>> s = SequenceMatcher()
				214	>>> s.set_seqs("abcd", "bcde")
				215	>>> s.ratio()
				216	0.75
				217	"""
				218
				219	self.set_seq1(a)
				220	self.set_seq2(b)
				221
				222	def set_seq1(self, a):
				223	"""Set the first sequence to be compared.
				224
				225	The second sequence to be compared is not changed.
				226
				227	>>> s = SequenceMatcher(None, "abcd", "bcde")
				228	>>> s.ratio()
				229	0.75
				230	>>> s.set_seq1("bcde")
				231	>>> s.ratio()
				232	1.0
				233	>>>
				234
				235	SequenceMatcher computes and caches detailed information about the
				236	second sequence, so if you want to compare one sequence S against
				237	many sequences, use .set_seq2(S) once and call .set_seq1(x)
				238	repeatedly for each of the other sequences.
				239
				240	See also set_seqs() and set_seq2().
				241	"""
				242
				243	if a is self.a:
				244	return
				245	self.a = a
				246	self.matching_blocks = self.opcodes = None
				247
				248	def set_seq2(self, b):
				249	"""Set the second sequence to be compared.
				250
				251	The first sequence to be compared is not changed.
				252
				253	>>> s = SequenceMatcher(None, "abcd", "bcde")
				254	>>> s.ratio()
				255	0.75
				256	>>> s.set_seq2("abcd")
				257	>>> s.ratio()
				258	1.0
				259	>>>
				260
				261	SequenceMatcher computes and caches detailed information about the
				262	second sequence, so if you want to compare one sequence S against
				263	many sequences, use .set_seq2(S) once and call .set_seq1(x)
				264	repeatedly for each of the other sequences.
				265
				266	See also set_seqs() and set_seq1().
				267	"""
				268
				269	if b is self.b:
				270	return
				271	self.b = b
				272	self.matching_blocks = self.opcodes = None
				273	self.fullbcount = None
				274	self.__chain_b()
				275
				276	# For each element x in b, set b2j[x] to a list of the indices in
				277	# b where x appears; the indices are in increasing order; note that
				278	# the number of times x appears in b is len(b2j[x]) ...
				279	# when self.isjunk is defined, junk elements don't show up in this
				280	# map at all, which stops the central find_longest_match method
				281	# from starting any matching block at a junk element ...
				282	# also creates the fast isbjunk function ...
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	283	# b2j also does not contain entries for "popular" elements, meaning
				284	# elements that account for more than 1% of the total elements, and
				285	# when the sequence is reasonably large (>= 200 elements); this can
				286	# be viewed as an adaptive notion of semi-junk, and yields an enormous
				287	# speedup when, e.g., comparing program files with hundreds of
				288	# instances of "return NULL;" ...
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	289	# note that this is only called when b changes; so for cross-product
				290	# kinds of matches, it's best to call set_seq2 once, then set_seq1
				291	# repeatedly
				292
				293	def __chain_b(self):
				294	# Because isjunk is a user-defined (not C) function, and we test
				295	# for junk a LOT, it's important to minimize the number of calls.
				296	# Before the tricks described here, __chain_b was by far the most
				297	# time-consuming routine in the whole module! If anyone sees
				298	# Jim Roskind, thank him again for profile.py -- I never would
				299	# have guessed that.
				300	# The first trick is to build b2j ignoring the possibility
				301	# of junk. I.e., we don't call isjunk at all yet. Throwing
				302	# out the junk later is much cheaper than building b2j "right"
				303	# from the start.
				304	b = self.b
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	305	n = len(b)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	306	self.b2j = b2j = {}
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	307	populardict = {}
				308	for i, elt in enumerate(b):
				309	if elt in b2j:
				310	indices = b2j[elt]
				311	if n >= 200 and len(indices) * 100 > n:
				312	populardict[elt] = 1
				313	del indices[:]
				314	else:
				315	indices.append(i)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	316	else:
				317	b2j[elt] = [i]
				318
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	319	# Purge leftover indices for popular elements.
				320	for elt in populardict:
				321	del b2j[elt]
				322
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	323	# Now b2j.keys() contains elements uniquely, and especially when
				324	# the sequence is a string, that's usually a good deal smaller
				325	# than len(string). The difference is the number of isjunk calls
				326	# saved.
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	327	isjunk = self.isjunk
				328	junkdict = {}
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	329	if isjunk:
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	330	for d in populardict, b2j:
				331	for elt in d.keys():
				332	if isjunk(elt):
				333	junkdict[elt] = 1
				334	del d[elt]
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	335
Raymond Hettinger	54f0222	2002-06-01 14:18:47 +0000	[diff] [blame]	336	# Now for x in b, isjunk(x) == x in junkdict, but the
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	337	# latter is much faster. Note too that while there may be a
				338	# lot of junk in the sequence, the number of unique junk
				339	# elements is probably small. So the memory burden of keeping
				340	# this dict alive is likely trivial compared to the size of b2j.
				341	self.isbjunk = junkdict.has_key
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	342	self.isbpopular = populardict.has_key
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	343
				344	def find_longest_match(self, alo, ahi, blo, bhi):
				345	"""Find longest matching block in a[alo:ahi] and b[blo:bhi].
				346
				347	If isjunk is not defined:
				348
				349	Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
				350	alo <= i <= i+k <= ahi
				351	blo <= j <= j+k <= bhi
				352	and for all (i',j',k') meeting those conditions,
				353	k >= k'
				354	i <= i'
				355	and if i == i', j <= j'
				356
				357	In other words, of all maximal matching blocks, return one that
				358	starts earliest in a, and of all those maximal matching blocks that
				359	start earliest in a, return the one that starts earliest in b.
				360
				361	>>> s = SequenceMatcher(None, " abcd", "abcd abcd")
				362	>>> s.find_longest_match(0, 5, 0, 9)
				363	(0, 4, 5)
				364
				365	If isjunk is defined, first the longest matching block is
				366	determined as above, but with the additional restriction that no
				367	junk element appears in the block. Then that block is extended as
				368	far as possible by matching (only) junk elements on both sides. So
				369	the resulting block never matches on junk except as identical junk
				370	happens to be adjacent to an "interesting" match.
				371
				372	Here's the same example as before, but considering blanks to be
				373	junk. That prevents " abcd" from matching the " abcd" at the tail
				374	end of the second sequence directly. Instead only the "abcd" can
				375	match, and matches the leftmost "abcd" in the second sequence:
				376
				377	>>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
				378	>>> s.find_longest_match(0, 5, 0, 9)
				379	(1, 0, 4)
				380
				381	If no blocks match, return (alo, blo, 0).
				382
				383	>>> s = SequenceMatcher(None, "ab", "c")
				384	>>> s.find_longest_match(0, 2, 0, 1)
				385	(0, 0, 0)
				386	"""
				387
				388	# CAUTION: stripping common prefix or suffix would be incorrect.
				389	# E.g.,
				390	# ab
				391	# acab
				392	# Longest matching block is "ab", but if common prefix is
				393	# stripped, it's "a" (tied with "b"). UNIX(tm) diff does so
				394	# strip, so ends up claiming that ab is changed to acab by
				395	# inserting "ca" in the middle. That's minimal but unintuitive:
				396	# "it's obvious" that someone inserted "ac" at the front.
				397	# Windiff ends up at the same place as diff, but by pairing up
				398	# the unique 'b's and then matching the first two 'a's.
				399
				400	a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.isbjunk
				401	besti, bestj, bestsize = alo, blo, 0
				402	# find longest junk-free match
				403	# during an iteration of the loop, j2len[j] = length of longest
				404	# junk-free match ending with a[i-1] and b[j]
				405	j2len = {}
				406	nothing = []
				407	for i in xrange(alo, ahi):
				408	# look at all instances of a[i] in b; note that because
				409	# b2j has no junk keys, the loop is skipped if a[i] is junk
				410	j2lenget = j2len.get
				411	newj2len = {}
				412	for j in b2j.get(a[i], nothing):
				413	# a[i] matches b[j]
				414	if j < blo:
				415	continue
				416	if j >= bhi:
				417	break
				418	k = newj2len[j] = j2lenget(j-1, 0) + 1
				419	if k > bestsize:
				420	besti, bestj, bestsize = i-k+1, j-k+1, k
				421	j2len = newj2len
				422
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	423	# Extend the best by non-junk elements on each end. In particular,
				424	# "popular" non-junk elements aren't in b2j, which greatly speeds
				425	# the inner loop above, but also means "the best" match so far
				426	# doesn't contain any junk or popular non-junk elements.
				427	while besti > alo and bestj > blo and \
				428	not isbjunk(b[bestj-1]) and \
				429	a[besti-1] == b[bestj-1]:
				430	besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
				431	while besti+bestsize < ahi and bestj+bestsize < bhi and \
				432	not isbjunk(b[bestj+bestsize]) and \
				433	a[besti+bestsize] == b[bestj+bestsize]:
				434	bestsize += 1
				435
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	436	# Now that we have a wholly interesting match (albeit possibly
				437	# empty!), we may as well suck up the matching junk on each
				438	# side of it too. Can't think of a good reason not to, and it
				439	# saves post-processing the (possibly considerable) expense of
				440	# figuring out what to do with it. In the case of an empty
				441	# interesting match, this is clearly the right thing to do,
				442	# because no other kind of match is possible in the regions.
				443	while besti > alo and bestj > blo and \
				444	isbjunk(b[bestj-1]) and \
				445	a[besti-1] == b[bestj-1]:
				446	besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
				447	while besti+bestsize < ahi and bestj+bestsize < bhi and \
				448	isbjunk(b[bestj+bestsize]) and \
				449	a[besti+bestsize] == b[bestj+bestsize]:
				450	bestsize = bestsize + 1
				451
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	452	return besti, bestj, bestsize
				453
				454	def get_matching_blocks(self):
				455	"""Return list of triples describing matching subsequences.
				456
				457	Each triple is of the form (i, j, n), and means that
				458	a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in
				459	i and in j.
				460
				461	The last triple is a dummy, (len(a), len(b), 0), and is the only
				462	triple with n==0.
				463
				464	>>> s = SequenceMatcher(None, "abxcd", "abcd")
				465	>>> s.get_matching_blocks()
				466	[(0, 0, 2), (3, 2, 2), (5, 4, 0)]
				467	"""
				468
				469	if self.matching_blocks is not None:
				470	return self.matching_blocks
				471	self.matching_blocks = []
				472	la, lb = len(self.a), len(self.b)
				473	self.__helper(0, la, 0, lb, self.matching_blocks)
				474	self.matching_blocks.append( (la, lb, 0) )
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	475	return self.matching_blocks
				476
				477	# builds list of matching blocks covering a[alo:ahi] and
				478	# b[blo:bhi], appending them in increasing order to answer
				479
				480	def __helper(self, alo, ahi, blo, bhi, answer):
				481	i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
				482	# a[alo:i] vs b[blo:j] unknown
				483	# a[i:i+k] same as b[j:j+k]
				484	# a[i+k:ahi] vs b[j+k:bhi] unknown
				485	if k:
				486	if alo < i and blo < j:
				487	self.__helper(alo, i, blo, j, answer)
				488	answer.append(x)
				489	if i+k < ahi and j+k < bhi:
				490	self.__helper(i+k, ahi, j+k, bhi, answer)
				491
				492	def get_opcodes(self):
				493	"""Return list of 5-tuples describing how to turn a into b.
				494
				495	Each tuple is of the form (tag, i1, i2, j1, j2). The first tuple
				496	has i1 == j1 == 0, and remaining tuples have i1 == the i2 from the
				497	tuple preceding it, and likewise for j1 == the previous j2.
				498
				499	The tags are strings, with these meanings:
				500
				501	'replace': a[i1:i2] should be replaced by b[j1:j2]
				502	'delete': a[i1:i2] should be deleted.
				503	Note that j1==j2 in this case.
				504	'insert': b[j1:j2] should be inserted at a[i1:i1].
				505	Note that i1==i2 in this case.
				506	'equal': a[i1:i2] == b[j1:j2]
				507
				508	>>> a = "qabxcd"
				509	>>> b = "abycdf"
				510	>>> s = SequenceMatcher(None, a, b)
				511	>>> for tag, i1, i2, j1, j2 in s.get_opcodes():
				512	... print ("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
				513	... (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))
				514	delete a[0:1] (q) b[0:0] ()
				515	equal a[1:3] (ab) b[0:2] (ab)
				516	replace a[3:4] (x) b[2:3] (y)
				517	equal a[4:6] (cd) b[3:5] (cd)
				518	insert a[6:6] () b[5:6] (f)
				519	"""
				520
				521	if self.opcodes is not None:
				522	return self.opcodes
				523	i = j = 0
				524	self.opcodes = answer = []
				525	for ai, bj, size in self.get_matching_blocks():
				526	# invariant: we've pumped out correct diffs to change
				527	# a[:i] into b[:j], and the next matching block is
				528	# a[ai:ai+size] == b[bj:bj+size]. So we need to pump
				529	# out a diff to change a[i:ai] into b[j:bj], pump out
				530	# the matching block, and move (i,j) beyond the match
				531	tag = ''
				532	if i < ai and j < bj:
				533	tag = 'replace'
				534	elif i < ai:
				535	tag = 'delete'
				536	elif j < bj:
				537	tag = 'insert'
				538	if tag:
				539	answer.append( (tag, i, ai, j, bj) )
				540	i, j = ai+size, bj+size
				541	# the list of matching blocks is terminated by a
				542	# sentinel with size 0
				543	if size:
				544	answer.append( ('equal', ai, i, bj, j) )
				545	return answer
				546
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	547	def get_grouped_opcodes(self, n=3):
				548	""" Isolate change clusters by eliminating ranges with no changes.
				549
				550	Return a generator of groups with up to n lines of context.
				551	Each group is in the same format as returned by get_opcodes().
				552
				553	>>> from pprint import pprint
				554	>>> a = map(str, range(1,40))
				555	>>> b = a[:]
				556	>>> b[8:8] = ['i'] # Make an insertion
				557	>>> b[20] += 'x' # Make a replacement
				558	>>> b[23:28] = [] # Make a deletion
				559	>>> b[30] += 'y' # Make another replacement
				560	>>> pprint(list(SequenceMatcher(None,a,b).get_grouped_opcodes()))
				561	[[('equal', 5, 8, 5, 8), ('insert', 8, 8, 8, 9), ('equal', 8, 11, 9, 12)],
				562	[('equal', 16, 19, 17, 20),
				563	('replace', 19, 20, 20, 21),
				564	('equal', 20, 22, 21, 23),
				565	('delete', 22, 27, 23, 23),
				566	('equal', 27, 30, 23, 26)],
				567	[('equal', 31, 34, 27, 30),
				568	('replace', 34, 35, 30, 31),
				569	('equal', 35, 38, 31, 34)]]
				570	"""
				571
				572	codes = self.get_opcodes()
				573	# Fixup leading and trailing groups if they show no changes.
				574	if codes[0][0] == 'equal':
				575	tag, i1, i2, j1, j2 = codes[0]
				576	codes[0] = tag, max(i1, i2-n), i2, max(j1, j2-n), j2
				577	if codes[-1][0] == 'equal':
				578	tag, i1, i2, j1, j2 = codes[-1]
				579	codes[-1] = tag, i1, min(i2, i1+n), j1, min(j2, j1+n)
				580
				581	nn = n + n
				582	group = []
				583	for tag, i1, i2, j1, j2 in codes:
				584	# End the current group and start a new one whenever
				585	# there is a large range with no changes.
				586	if tag == 'equal' and i2-i1 > nn:
				587	group.append((tag, i1, min(i2, i1+n), j1, min(j2, j1+n)))
				588	yield group
				589	group = []
				590	i1, j1 = max(i1, i2-n), max(j1, j2-n)
				591	group.append((tag, i1, i2, j1, j2))
				592	if group and not (len(group)==1 and group[0][0] == 'equal'):
				593	yield group
				594
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	595	def ratio(self):
				596	"""Return a measure of the sequences' similarity (float in [0,1]).
				597
				598	Where T is the total number of elements in both sequences, and
				599	M is the number of matches, this is 2.0*M / T.
				600	Note that this is 1 if the sequences are identical, and 0 if
				601	they have nothing in common.
				602
				603	.ratio() is expensive to compute if you haven't already computed
				604	.get_matching_blocks() or .get_opcodes(), in which case you may
				605	want to try .quick_ratio() or .real_quick_ratio() first to get an
				606	upper bound.
				607
				608	>>> s = SequenceMatcher(None, "abcd", "bcde")
				609	>>> s.ratio()
				610	0.75
				611	>>> s.quick_ratio()
				612	0.75
				613	>>> s.real_quick_ratio()
				614	1.0
				615	"""
				616
				617	matches = reduce(lambda sum, triple: sum + triple[-1],
				618	self.get_matching_blocks(), 0)
Neal Norwitz	e7dfe21	2003-07-01 14:59:46 +0000	[diff] [blame]	619	return _calculate_ratio(matches, len(self.a) + len(self.b))
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	620
				621	def quick_ratio(self):
				622	"""Return an upper bound on ratio() relatively quickly.
				623
				624	This isn't defined beyond that it is an upper bound on .ratio(), and
				625	is faster to compute.
				626	"""
				627
				628	# viewing a and b as multisets, set matches to the cardinality
				629	# of their intersection; this counts the number of matches
				630	# without regard to order, so is clearly an upper bound
				631	if self.fullbcount is None:
				632	self.fullbcount = fullbcount = {}
				633	for elt in self.b:
				634	fullbcount[elt] = fullbcount.get(elt, 0) + 1
				635	fullbcount = self.fullbcount
				636	# avail[x] is the number of times x appears in 'b' less the
				637	# number of times we've seen it in 'a' so far ... kinda
				638	avail = {}
				639	availhas, matches = avail.has_key, 0
				640	for elt in self.a:
				641	if availhas(elt):
				642	numb = avail[elt]
				643	else:
				644	numb = fullbcount.get(elt, 0)
				645	avail[elt] = numb - 1
				646	if numb > 0:
				647	matches = matches + 1
Neal Norwitz	e7dfe21	2003-07-01 14:59:46 +0000	[diff] [blame]	648	return _calculate_ratio(matches, len(self.a) + len(self.b))
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	649
				650	def real_quick_ratio(self):
				651	"""Return an upper bound on ratio() very quickly.
				652
				653	This isn't defined beyond that it is an upper bound on .ratio(), and
				654	is faster to compute than either .ratio() or .quick_ratio().
				655	"""
				656
				657	la, lb = len(self.a), len(self.b)
				658	# can't have more matches than the number of elements in the
				659	# shorter sequence
Neal Norwitz	e7dfe21	2003-07-01 14:59:46 +0000	[diff] [blame]	660	return _calculate_ratio(min(la, lb), la + lb)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	661
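The opcode and ratio docstrings above can be checked with a short sketch. It assumes the standard-library difflib of a current Python 3 rather than this Python 2 era snapshot, and apply_opcodes is a hypothetical helper name, not part of the module. It replays the (tag, i1, i2, j1, j2) tuples from get_opcodes() to rebuild b from a, then checks that quick_ratio() and real_quick_ratio() behave as upper bounds on ratio() for this pair.

```python
from difflib import SequenceMatcher


def apply_opcodes(a, b, opcodes):
    """Rebuild b from a by replaying (tag, i1, i2, j1, j2) opcodes.

    Hypothetical helper for illustration only.
    """
    out = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            # a[i1:i2] == b[j1:j2], so copy the slice straight from a.
            out.extend(a[i1:i2])
        elif tag in ("replace", "insert"):
            # New material comes from b; for 'insert', i1 == i2.
            out.extend(b[j1:j2])
        # 'delete' drops a[i1:i2], so nothing is appended.
    return out


a, b = "qabxcd", "abycdf"
s = SequenceMatcher(None, a, b)

# Replaying the opcodes against a should reproduce b exactly.
rebuilt = "".join(apply_opcodes(a, b, s.get_opcodes()))

# Each cheaper estimate should be no smaller than the exact ratio.
bounds_hold = s.ratio() <= s.quick_ratio() <= s.real_quick_ratio()
```

Because the cheaper estimates never fall below the exact value, a caller can reject a candidate as soon as one of them drops under a cutoff, which is the order get_close_matches() uses.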
				662	def get_close_matches(word, possibilities, n=3, cutoff=0.6):
				663	"""Use SequenceMatcher to return list of the best "good enough" matches.
				664
				665	word is a sequence for which close matches are desired (typically a
				666	string).
				667
				668	possibilities is a list of sequences against which to match word
				669	(typically a list of strings).
				670
				671	Optional arg n (default 3) is the maximum number of close matches to
				672	return. n must be > 0.
				673
				674	Optional arg cutoff (default 0.6) is a float in [0, 1]. Possibilities
				675	that don't score at least that similar to word are ignored.
				676
				677	The best (no more than n) matches among the possibilities are returned
				678	in a list, sorted by similarity score, most similar first.
				679
				680	>>> get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
				681	['apple', 'ape']
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	682	>>> import keyword as _keyword
				683	>>> get_close_matches("wheel", _keyword.kwlist)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	684	['while']
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	685	>>> get_close_matches("apple", _keyword.kwlist)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	686	[]
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	687	>>> get_close_matches("accept", _keyword.kwlist)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	688	['except']
				689	"""
				690
				691	if not n > 0:
Fred Drake	f1da628	2001-02-19 19:30:05 +0000	[diff] [blame]	692	raise ValueError("n must be > 0: " + `n`)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	693	if not 0.0 <= cutoff <= 1.0:
Fred Drake	f1da628	2001-02-19 19:30:05 +0000	[diff] [blame]	694	raise ValueError("cutoff must be in [0.0, 1.0]: " + `cutoff`)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	695	result = []
				696	s = SequenceMatcher()
				697	s.set_seq2(word)
				698	for x in possibilities:
				699	s.set_seq1(x)
				700	if s.real_quick_ratio() >= cutoff and \
				701	s.quick_ratio() >= cutoff and \
				702	s.ratio() >= cutoff:
				703	result.append((s.ratio(), x))
				704	# Sort by score.
				705	result.sort()
				706	# Retain only the best n.
				707	result = result[-n:]
				708	# Move best-scorer to head of list.
				709	result.reverse()
				710	# Strip scores.
				711	return [x for score, x in result]
				712
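The cascade of cheap upper bounds used by get_close_matches() can be tried in isolation. The following is a minimal sketch, assuming Python 3 and the standard-library difflib; `scored_matches` is a hypothetical helper (not part of this module) that keeps the similarity scores instead of stripping them.

```python
from difflib import SequenceMatcher, get_close_matches


def scored_matches(word, possibilities, cutoff=0.6):
    """Return (score, candidate) pairs scoring at least cutoff, best first.

    Hypothetical helper for illustration only.
    """
    matcher = SequenceMatcher()
    matcher.set_seq2(word)  # the fixed word goes in seq2, as in the source
    scored = []
    for candidate in possibilities:
        matcher.set_seq1(candidate)
        # Each test is an upper bound on the next, so the expensive
        # ratio() only runs when the two cheaper checks pass.
        if (matcher.real_quick_ratio() >= cutoff
                and matcher.quick_ratio() >= cutoff
                and matcher.ratio() >= cutoff):
            scored.append((matcher.ratio(), candidate))
    scored.sort(reverse=True)
    return scored


pairs = scored_matches("appel", ["ape", "apple", "peach", "puppy"])
names = [candidate for _, candidate in pairs]
print(names)
```

Stripping the scores from `pairs` should give the same list that get_close_matches() returns for the same arguments.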
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	713
				714	def _count_leading(line, ch):
				715	"""
				716	Return number of `ch` characters at the start of `line`.
				717
				718	Example:
				719
				720	>>> _count_leading('   abc', ' ')
				721	3
				722	"""
				723
				724	i, n = 0, len(line)
				725	while i < n and line[i] == ch:
				726	i += 1
				727	return i
				728
				729	class Differ:
				730	r"""
				731	Differ is a class for comparing sequences of lines of text, and
				732	producing human-readable differences or deltas. Differ uses
				733	SequenceMatcher both to compare sequences of lines, and to compare
				734	sequences of characters within similar (near-matching) lines.
				735
				736	Each line of a Differ delta begins with a two-letter code:
				737
				738	'- ' line unique to sequence 1
				739	'+ ' line unique to sequence 2
				740	'  ' line common to both sequences
				741	'? ' line not present in either input sequence
				742
				743	Lines beginning with '? ' attempt to guide the eye to intraline
				744	differences, and were not present in either input sequence. These lines
				745	can be confusing if the sequences contain tab characters.
				746
				747	Note that Differ makes no claim to produce a minimal diff. To the
				748	contrary, minimal diffs are often counter-intuitive, because they synch
				749	up anywhere possible, sometimes accidental matches 100 pages apart.
				750	Restricting synch points to contiguous matches preserves some notion of
				751	locality, at the occasional cost of producing a longer diff.
				752
				753	Example: Comparing two texts.
				754
				755	First we set up the texts, sequences of individual single-line strings
				756	ending with newlines (such sequences can also be obtained from the
				757	`readlines()` method of file-like objects):
				758
				759	>>> text1 = '''  1. Beautiful is better than ugly.
				760	...   2. Explicit is better than implicit.
				761	...   3. Simple is better than complex.
				762	...   4. Complex is better than complicated.
				763	... '''.splitlines(1)
				764	>>> len(text1)
				765	4
				766	>>> text1[0][-1]
				767	'\n'
				768	>>> text2 = '''  1. Beautiful is better than ugly.
				769	...   3.   Simple is better than complex.
				770	...   4. Complicated is better than complex.
				771	...   5. Flat is better than nested.
				772	... '''.splitlines(1)
				773
				774	Next we instantiate a Differ object:
				775
				776	>>> d = Differ()
				777
				778	Note that when instantiating a Differ object we may pass functions to
				779	filter out line and character 'junk'. See Differ.__init__ for details.
				780
				781	Finally, we compare the two:
				782
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	783	>>> result = list(d.compare(text1, text2))
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	784
				785	'result' is a list of strings, so let's pretty-print it:
				786
				787	>>> from pprint import pprint as _pprint
				788	>>> _pprint(result)
				789	['    1. Beautiful is better than ugly.\n',
				790	'-   2. Explicit is better than implicit.\n',
				791	'-   3. Simple is better than complex.\n',
				792	'+   3.   Simple is better than complex.\n',
				793	'?     ++\n',
				794	'-   4. Complex is better than complicated.\n',
				795	'?            ^                     ---- ^\n',
				796	'+   4. Complicated is better than complex.\n',
				797	'?           ++++ ^                      ^\n',
				798	'+   5. Flat is better than nested.\n']
				799
				800	As a single multi-line string it looks like this:
				801
				802	>>> print ''.join(result),
				803	    1. Beautiful is better than ugly.
				804	-   2. Explicit is better than implicit.
				805	-   3. Simple is better than complex.
				806	+   3.   Simple is better than complex.
				807	?     ++
				808	-   4. Complex is better than complicated.
				809	?            ^                     ---- ^
				810	+   4. Complicated is better than complex.
				811	?           ++++ ^                      ^
				812	+   5. Flat is better than nested.
				813
				814	Methods:
				815
				816	__init__(linejunk=None, charjunk=None)
				817	Construct a text differencer, with optional filters.
				818
				819	compare(a, b)
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	820	Compare two sequences of lines; generate the resulting delta.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	821	"""
				822
				823	def __init__(self, linejunk=None, charjunk=None):
				824	"""
				825	Construct a text differencer, with optional filters.
				826
				827	The two optional keyword parameters are for filter functions:
				828
				829	- `linejunk`: A function that should accept a single string argument,
				830	and return true iff the string is junk. The module-level function
				831	`IS_LINE_JUNK` may be used to filter out lines without visible
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	832	characters, except for at most one splat ('#'). It is recommended
				833	to leave linejunk None; as of Python 2.3, the underlying
				834	SequenceMatcher class has grown an adaptive notion of "noise" lines
				835	that's better than any static definition the author has ever been
				836	able to craft.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	837
				838	- `charjunk`: A function that should accept a string of length 1. The
				839	module-level function `IS_CHARACTER_JUNK` may be used to filter out
				840	whitespace characters (a blank or tab; note: bad idea to include
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	841	newline in this!). Use of IS_CHARACTER_JUNK is recommended.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	842	"""
				843
				844	self.linejunk = linejunk
				845	self.charjunk = charjunk
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	846
				847	def compare(self, a, b):
				848	r"""
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	849	Compare two sequences of lines; generate the resulting delta.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	850
				851	Each sequence must contain individual single-line strings ending with
				852	newlines. Such sequences can be obtained from the `readlines()` method
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	853	of file-like objects. The delta generated also consists of newline-
				854	terminated strings, ready to be printed as-is via the writelines()
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	855	method of a file-like object.
				856
				857	Example:
				858
				859	>>> print ''.join(Differ().compare('one\ntwo\nthree\n'.splitlines(1),
				860	... 'ore\ntree\nemu\n'.splitlines(1))),
				861	- one
				862	?  ^
				863	+ ore
				864	?  ^
				865	- two
				866	- three
				867	?  -
				868	+ tree
				869	+ emu
				870	"""
				871
				872	cruncher = SequenceMatcher(self.linejunk, a, b)
				873	for tag, alo, ahi, blo, bhi in cruncher.get_opcodes():
				874	if tag == 'replace':
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	875	g = self._fancy_replace(a, alo, ahi, b, blo, bhi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	876	elif tag == 'delete':
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	877	g = self._dump('-', a, alo, ahi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	878	elif tag == 'insert':
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	879	g = self._dump('+', b, blo, bhi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	880	elif tag == 'equal':
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	881	g = self._dump(' ', a, alo, ahi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	882	else:
				883	raise ValueError, 'unknown tag ' + `tag`
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	884
				885	for line in g:
				886	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	887
				888	def _dump(self, tag, x, lo, hi):
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	889	"""Generate comparison results for a same-tagged range."""
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	890	for i in xrange(lo, hi):
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	891	yield '%s %s' % (tag, x[i])
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	892
				893	def _plain_replace(self, a, alo, ahi, b, blo, bhi):
				894	assert alo < ahi and blo < bhi
				895	# dump the shorter block first -- reduces the burden on short-term
				896	# memory if the blocks are of very different sizes
				897	if bhi - blo < ahi - alo:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	898	first = self._dump('+', b, blo, bhi)
				899	second = self._dump('-', a, alo, ahi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	900	else:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	901	first = self._dump('-', a, alo, ahi)
				902	second = self._dump('+', b, blo, bhi)
				903
				904	for g in first, second:
				905	for line in g:
				906	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	907
				908	def _fancy_replace(self, a, alo, ahi, b, blo, bhi):
				909	r"""
				910	When replacing one block of lines with another, search the blocks
				911	for similar lines; the best-matching pair (if any) is used as a
				912	synch point, and intraline difference marking is done on the
				913	similar pair. Lots of work, but often worth it.
				914
				915	Example:
				916
				917	>>> d = Differ()
				918	>>> results = d._fancy_replace(['abcDefghiJkl\n'], 0, 1, ['abcdefGhijkl\n'], 0, 1)
				919	>>> print ''.join(results),
				920	- abcDefghiJkl
				921	?    ^  ^  ^
				922	+ abcdefGhijkl
				923	?    ^  ^  ^
				924	"""
				925
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	926	# don't synch up unless the lines have a similarity score of at
				927	# least cutoff; best_ratio tracks the best score seen so far
				928	best_ratio, cutoff = 0.74, 0.75
				929	cruncher = SequenceMatcher(self.charjunk)
				930	eqi, eqj = None, None # 1st indices of equal lines (if any)
				931
				932	# search for the pair that matches best without being identical
				933	# (identical lines must be junk lines, & we don't want to synch up
				934	# on junk -- unless we have to)
				935	for j in xrange(blo, bhi):
				936	bj = b[j]
				937	cruncher.set_seq2(bj)
				938	for i in xrange(alo, ahi):
				939	ai = a[i]
				940	if ai == bj:
				941	if eqi is None:
				942	eqi, eqj = i, j
				943	continue
				944	cruncher.set_seq1(ai)
				945	# computing similarity is expensive, so use the quick
				946	# upper bounds first -- have seen this speed up messy
				947	# compares by a factor of 3.
				948	# note that ratio() is only expensive to compute the first
				949	# time it's called on a sequence pair; the expensive part
				950	# of the computation is cached by cruncher
				951	if cruncher.real_quick_ratio() > best_ratio and \
				952	cruncher.quick_ratio() > best_ratio and \
				953	cruncher.ratio() > best_ratio:
				954	best_ratio, best_i, best_j = cruncher.ratio(), i, j
				955	if best_ratio < cutoff:
				956	# no non-identical "pretty close" pair
				957	if eqi is None:
				958	# no identical pair either -- treat it as a straight replace
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	959	for line in self._plain_replace(a, alo, ahi, b, blo, bhi):
				960	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	961	return
				962	# no close pair, but an identical pair -- synch up on that
				963	best_i, best_j, best_ratio = eqi, eqj, 1.0
				964	else:
				965	# there's a close pair, so forget the identical pair (if any)
				966	eqi = None
				967
				968	# a[best_i] very similar to b[best_j]; eqi is None iff they're not
				969	# identical
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	970
				971	# pump out diffs from before the synch point
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	972	for line in self._fancy_helper(a, alo, best_i, b, blo, best_j):
				973	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	974
				975	# do intraline marking on the synch pair
				976	aelt, belt = a[best_i], b[best_j]
				977	if eqi is None:
				978	# pump out a '-', '?', '+', '?' quad for the synched lines
				979	atags = btags = ""
				980	cruncher.set_seqs(aelt, belt)
				981	for tag, ai1, ai2, bj1, bj2 in cruncher.get_opcodes():
				982	la, lb = ai2 - ai1, bj2 - bj1
				983	if tag == 'replace':
				984	atags += '^' * la
				985	btags += '^' * lb
				986	elif tag == 'delete':
				987	atags += '-' * la
				988	elif tag == 'insert':
				989	btags += '+' * lb
				990	elif tag == 'equal':
				991	atags += ' ' * la
				992	btags += ' ' * lb
				993	else:
				994	raise ValueError, 'unknown tag ' + `tag`
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	995	for line in self._qformat(aelt, belt, atags, btags):
				996	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	997	else:
				998	# the synch pair is identical
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	999	yield '  ' + aelt
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1000
				1001	# pump out diffs from after the synch point
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1002	for line in self._fancy_helper(a, best_i+1, ahi, b, best_j+1, bhi):
				1003	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1004
				1005	def _fancy_helper(self, a, alo, ahi, b, blo, bhi):
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1006	g = []
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1007	if alo < ahi:
				1008	if blo < bhi:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1009	g = self._fancy_replace(a, alo, ahi, b, blo, bhi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1010	else:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1011	g = self._dump('-', a, alo, ahi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1012	elif blo < bhi:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1013	g = self._dump('+', b, blo, bhi)
				1014
				1015	for line in g:
				1016	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1017
				1018	def _qformat(self, aline, bline, atags, btags):
				1019	r"""
				1020	Format "?" output and deal with leading tabs.
				1021
				1022	Example:
				1023
				1024	>>> d = Differ()
				1025	>>> results = d._qformat('\tabcDefghiJkl\n', '\t\tabcdefGhijkl\n',
				1026	... ' ^ ^ ^ ', '+ ^ ^ ^ ')
				1027	>>> for line in results: print repr(line)
				1028	...
				1029	'- \tabcDefghiJkl\n'
				1030	'? \t ^ ^ ^\n'
				1031	'+ \t\tabcdefGhijkl\n'
				1032	'? \t ^ ^ ^\n'
				1033	"""
				1034
				1035	# Can hurt, but will probably help most of the time.
				1036	common = min(_count_leading(aline, "\t"),
				1037	_count_leading(bline, "\t"))
				1038	common = min(common, _count_leading(atags[:common], " "))
				1039	atags = atags[common:].rstrip()
				1040	btags = btags[common:].rstrip()
				1041
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1042	yield "- " + aline
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1043	if atags:
Tim Peters	527e64f	2001-10-04 05:36:56 +0000	[diff] [blame]	1044	yield "? %s%s\n" % ("\t" * common, atags)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1045
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1046	yield "+ " + bline
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1047	if btags:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1048	yield "? %s%s\n" % ("\t" * common, btags)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1049
				1050	# With respect to junk, an earlier version of ndiff simply refused to
				1051	# start a match with a junk element. The result was cases like this:
				1052	# before: private Thread currentThread;
				1053	# after: private volatile Thread currentThread;
				1054	# If you consider whitespace to be junk, the longest contiguous match
				1055	# not starting with junk is "e Thread currentThread". So ndiff reported
				1056	# that "e volatil" was inserted between the 't' and the 'e' in "private".
				1057	# While an accurate view, to people that's absurd. The current version
				1058	# looks for matching blocks that are entirely junk-free, then extends the
				1059	# longest one of those as far as possible but only with matching junk.
				1060	# So now "currentThread" is matched, then extended to suck up the
				1061	# preceding blank; then "private" is matched, and extended to suck up the
				1062	# following blank; then "Thread" is matched; and finally ndiff reports
				1063	# that "volatile " was inserted before "Thread". The only quibble
				1064	# remaining is that perhaps it was really the case that " volatile"
				1065	# was inserted after "private". I can live with that <wink>.
				1066
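The behaviour described in the comment above can be checked directly with SequenceMatcher. This is a small sketch, assuming Python 3 and the standard-library difflib, that treats a blank as junk (the lambda is an illustrative choice, not something this module defines) and collects the text reported as inserted.

```python
from difflib import SequenceMatcher

before = "private Thread currentThread;"
after = "private volatile Thread currentThread;"

# Treat the blank as junk, as in the scenario the comment describes.
matcher = SequenceMatcher(lambda ch: ch == " ", before, after)

# Collect the pieces of `after` that the matcher reports as insertions.
inserted = [after[j1:j2]
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag == "insert"]
print(inserted)
```

With the current matching strategy the single insertion should be "volatile " (with its trailing blank), rather than a fragment starting inside "private".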
				1067	import re
				1068
				1069	def IS_LINE_JUNK(line, pat=re.compile(r"\s*#?\s*$").match):
				1070	r"""
				1071	Return true for ignorable line: iff `line` is blank or contains a single '#'.
				1072
				1073	Examples:
				1074
				1075	>>> IS_LINE_JUNK('\n')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1076	True
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1077	>>> IS_LINE_JUNK(' # \n')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1078	True
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1079	>>> IS_LINE_JUNK('hello\n')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1080	False
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1081	"""
				1082
				1083	return pat(line) is not None
				1084
				1085	def IS_CHARACTER_JUNK(ch, ws=" \t"):
				1086	r"""
				1087	Return true for ignorable character: iff `ch` is a space or tab.
				1088
				1089	Examples:
				1090
				1091	>>> IS_CHARACTER_JUNK(' ')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1092	True
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1093	>>> IS_CHARACTER_JUNK('\t')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1094	True
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1095	>>> IS_CHARACTER_JUNK('\n')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1096	False
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1097	>>> IS_CHARACTER_JUNK('x')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1098	False
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1099	"""
				1100
				1101	return ch in ws
				1102
				1103	del re
				1104
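The two junk predicates can be exercised on a few inputs to see where the line is drawn. A brief sketch, assuming Python 3 and the standard-library difflib; the sample strings are made up for illustration.

```python
from difflib import IS_CHARACTER_JUNK, IS_LINE_JUNK

# A line is junk only if it is blank or holds nothing but a lone '#'.
sample_lines = ['\n', '  #   \n', 'hello\n', '# comment\n']
line_flags = [IS_LINE_JUNK(line) for line in sample_lines]

# A character is junk only if it is a blank or a tab; newline is not.
char_flags = [IS_CHARACTER_JUNK(ch) for ch in ' \t\nx']

print(line_flags)
print(char_flags)
```

Note that a commented line such as '# comment\n' is not junk: the pattern allows only whitespace around the single '#'.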
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1105
				1106	def unified_diff(a, b, fromfile='', tofile='', fromfiledate='',
				1107	tofiledate='', n=3, lineterm='\n'):
				1108	r"""
				1109	Compare two sequences of lines; generate the delta as a unified diff.
				1110
				1111	Unified diffs are a compact way of showing line changes and a few
				1112	lines of context. The number of context lines is set by 'n' which
				1113	defaults to three.
				1114
Raymond Hettinger	0887c73	2003-06-17 16:53:25 +0000	[diff] [blame]	1115	By default, the diff control lines (those with ---, +++, or @@) are
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1116	created with a trailing newline. This is helpful so that inputs
				1117	created from file.readlines() result in diffs that are suitable for
				1118	file.writelines() since both the inputs and outputs have trailing
				1119	newlines.
				1120
				1121	For inputs that do not have trailing newlines, set the lineterm
				1122	argument to "" so that the output will be uniformly newline free.
				1123
				1124	The unidiff format normally has a header for filenames and modification
				1125	times. Any or all of these may be specified using strings for
				1126	'fromfile', 'tofile', 'fromfiledate', and 'tofiledate'. The modification
				1127	times are normally expressed in the format returned by time.ctime().
				1128
				1129	Example:
				1130
				1131	>>> for line in unified_diff('one two three four'.split(),
				1132	... 'zero one tree four'.split(), 'Original', 'Current',
				1133	... 'Sat Jan 26 23:30:50 1991', 'Fri Jun 06 10:20:52 2003',
				1134	... lineterm=''):
				1135	... print line
				1136	--- Original Sat Jan 26 23:30:50 1991
				1137	+++ Current Fri Jun 06 10:20:52 2003
				1138	@@ -1,4 +1,4 @@
				1139	+zero
				1140	one
				1141	-two
				1142	-three
				1143	+tree
				1144	four
				1145	"""
				1146
				1147	started = False
				1148	for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n):
				1149	if not started:
				1150	yield '--- %s %s%s' % (fromfile, fromfiledate, lineterm)
				1151	yield '+++ %s %s%s' % (tofile, tofiledate, lineterm)
				1152	started = True
				1153	i1, i2, j1, j2 = group[0][1], group[-1][2], group[0][3], group[-1][4]
				1154	yield "@@ -%d,%d +%d,%d @@%s" % (i1+1, i2-i1, j1+1, j2-j1, lineterm)
				1155	for tag, i1, i2, j1, j2 in group:
				1156	if tag == 'equal':
				1157	for line in a[i1:i2]:
				1158	yield ' ' + line
				1159	continue
				1160	if tag == 'replace' or tag == 'delete':
				1161	for line in a[i1:i2]:
				1162	yield '-' + line
				1163	if tag == 'replace' or tag == 'insert':
				1164	for line in b[j1:j2]:
				1165	yield '+' + line
				1166
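The effect of lineterm on newline-free input can be seen by collecting the generated lines. A minimal sketch, assuming Python 3 and the standard-library difflib; only the hunk is inspected here, since the exact spacing of the two header lines differs between Python versions.

```python
from difflib import unified_diff

old = 'one two three four'.split()   # no trailing newlines on these lines
new = 'zero one tree four'.split()

# lineterm='' keeps the control lines newline-free, matching the input.
lines = list(unified_diff(old, new, 'Original', 'Current', lineterm=''))

hunk = lines[2:]   # skip the '---' and '+++' header lines
for line in hunk:
    print(line)
```

Each element can then be joined with '\n' for display, or the inputs can instead come from readlines() with the default lineterm so the result suits writelines().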
				1167	# See http://www.unix.org/single_unix_specification/
				1168	def context_diff(a, b, fromfile='', tofile='',
				1169	fromfiledate='', tofiledate='', n=3, lineterm='\n'):
				1170	r"""
				1171	Compare two sequences of lines; generate the delta as a context diff.
				1172
				1173	Context diffs are a compact way of showing line changes and a few
				1174	lines of context. The number of context lines is set by 'n' which
				1175	defaults to three.
				1176
				1177	By default, the diff control lines (those with *** or ---) are
				1178	created with a trailing newline. This is helpful so that inputs
				1179	created from file.readlines() result in diffs that are suitable for
				1180	file.writelines() since both the inputs and outputs have trailing
				1181	newlines.
				1182
				1183	For inputs that do not have trailing newlines, set the lineterm
				1184	argument to "" so that the output will be uniformly newline free.
				1185
				1186	The context diff format normally has a header for filenames and
				1187	modification times. Any or all of these may be specified using
				1188	strings for 'fromfile', 'tofile', 'fromfiledate', and 'tofiledate'.
				1189	The modification times are normally expressed in the format returned
				1190	by time.ctime(). If not specified, the strings default to blanks.
				1191
				1192	Example:
				1193
				1194	>>> print ''.join(context_diff('one\ntwo\nthree\nfour\n'.splitlines(1),
				1195	... 'zero\none\ntree\nfour\n'.splitlines(1), 'Original', 'Current',
				1196	... 'Sat Jan 26 23:30:50 1991', 'Fri Jun 06 10:22:46 2003')),
				1197	*** Original Sat Jan 26 23:30:50 1991
				1198	--- Current Fri Jun 06 10:22:46 2003
				1199	***************
				1200	*** 1,4 ****
				1201	one
				1202	! two
				1203	! three
				1204	four
				1205	--- 1,4 ----
				1206	+ zero
				1207	one
				1208	! tree
				1209	four
				1210	"""
				1211
				1212	started = False
Raymond Hettinger	7f2d302	2003-06-08 19:38:42 +0000	[diff] [blame]	1213	prefixmap = {'insert':'+ ', 'delete':'- ', 'replace':'! ', 'equal':'  '}
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1214	for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n):
				1215	if not started:
				1216	yield '*** %s %s%s' % (fromfile, fromfiledate, lineterm)
				1217	yield '--- %s %s%s' % (tofile, tofiledate, lineterm)
				1218	started = True
Raymond Hettinger	7f2d302	2003-06-08 19:38:42 +0000	[diff] [blame]	1219
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1220	yield '***************%s' % (lineterm,)
				1221	if group[-1][2] - group[0][1] >= 2:
				1222	yield '*** %d,%d ****%s' % (group[0][1]+1, group[-1][2], lineterm)
				1223	else:
				1224	yield '*** %d ****%s' % (group[-1][2], lineterm)
Raymond Hettinger	7f2d302	2003-06-08 19:38:42 +0000	[diff] [blame]	1225	visiblechanges = [e for e in group if e[0] in ('replace', 'delete')]
				1226	if visiblechanges:
				1227	for tag, i1, i2, _, _ in group:
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1228	if tag != 'insert':
				1229	for line in a[i1:i2]:
				1230	yield prefixmap[tag] + line
Raymond Hettinger	7f2d302	2003-06-08 19:38:42 +0000	[diff] [blame]	1231
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1232	if group[-1][4] - group[0][3] >= 2:
				1233	yield '--- %d,%d ----%s' % (group[0][3]+1, group[-1][4], lineterm)
				1234	else:
				1235	yield '--- %d ----%s' % (group[-1][4], lineterm)
Raymond Hettinger	7f2d302	2003-06-08 19:38:42 +0000	[diff] [blame]	1236	visiblechanges = [e for e in group if e[0] in ('replace', 'insert')]
				1237	if visiblechanges:
				1238	for tag, _, _, j1, j2 in group:
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1239	if tag != 'delete':
				1240	for line in b[j1:j2]:
				1241	yield prefixmap[tag] + line
				1242
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	1243	def ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK):
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1244	r"""
				1245	Compare `a` and `b` (lists of strings); return a `Differ`-style delta.
				1246
				1247	Optional keyword parameters `linejunk` and `charjunk` are for filter
				1248	functions (or None):
				1249
				1250	- linejunk: A function that should accept a single string argument, and
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	1251	return true iff the string is junk. The default is None, and is
				1252	recommended; as of Python 2.3, an adaptive notion of "noise" lines is
				1253	used that does a good job on its own.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1254
				1255	- charjunk: A function that should accept a string of length 1. The
				1256	default is module-level function IS_CHARACTER_JUNK, which filters out
				1257	whitespace characters (a blank or tab; note: bad idea to include newline
				1258	in this!).
				1259
				1260	Tools/scripts/ndiff.py is a command-line front-end to this function.
				1261
				1262	Example:
				1263
				1264	>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
				1265	... 'ore\ntree\nemu\n'.splitlines(1))
				1266	>>> print ''.join(diff),
				1267	- one
				1268	?  ^
				1269	+ ore
				1270	?  ^
				1271	- two
				1272	- three
				1273	?  -
				1274	+ tree
				1275	+ emu
				1276	"""
				1277	return Differ(linejunk, charjunk).compare(a, b)
				1278
				1279	def restore(delta, which):
				1280	r"""
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1281	Generate one of the two sequences that generated a delta.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1282
				1283	Given a `delta` produced by `Differ.compare()` or `ndiff()`, extract
				1284	lines originating from file 1 or 2 (parameter `which`), stripping off line
				1285	prefixes.
				1286
				1287	Examples:
				1288
				1289	>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
				1290	... 'ore\ntree\nemu\n'.splitlines(1))
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1291	>>> diff = list(diff)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1292	>>> print ''.join(restore(diff, 1)),
				1293	one
				1294	two
				1295	three
				1296	>>> print ''.join(restore(diff, 2)),
				1297	ore
				1298	tree
				1299	emu
				1300	"""
				1301	try:
				1302	tag = {1: "- ", 2: "+ "}[int(which)]
				1303	except KeyError:
				1304	raise ValueError, ('unknown delta choice (must be 1 or 2): %r'
				1305	% which)
				1306	prefixes = ("  ", tag)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1307	for line in delta:
				1308	if line[:2] in prefixes:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1309	yield line[2:]
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1310
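Because restore() only filters on the two-character prefix, a delta can be turned back into either input. The following is a small round-trip sketch, assuming Python 3 and the standard-library difflib.

```python
from difflib import ndiff, restore

a = 'one\ntwo\nthree\n'.splitlines(keepends=True)
b = 'ore\ntree\nemu\n'.splitlines(keepends=True)

# ndiff() returns a one-shot generator, so materialize it before
# calling restore() twice on the same delta.
delta = list(ndiff(a, b))

first = list(restore(delta, 1))    # keeps '  ' and '- ' lines
second = list(restore(delta, 2))   # keeps '  ' and '+ ' lines
print(first == a, second == b)
```

The '? ' guide lines never appear in either result, since their prefix matches neither choice.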
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	1311	def _test():
				1312	import doctest, difflib
				1313	return doctest.testmod(difflib)
				1314
				1315	if __name__ == "__main__":
				1316	_test()