Blame - Lib/difflib.py - platform/external/python/cpython3

blob: e82c703a7d875fda158345b30e88ea337f05e12c

Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	1	#! /usr/bin/env python
				2
				3	"""
				4	Module difflib -- helpers for computing deltas between objects.
				5
				6	Function get_close_matches(word, possibilities, n=3, cutoff=0.6):
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	7	Use SequenceMatcher to return list of the best "good enough" matches.
				8
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	9	Function context_diff(a, b):
				10	For two lists of strings, return a delta in context diff format.
				11
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	12	Function ndiff(a, b):
				13	Return a delta: the difference between `a` and `b` (lists of strings).
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	14
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	15	Function restore(delta, which):
				16	Return one of the two sequences that generated an ndiff delta.
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	17
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	18	Function unified_diff(a, b):
				19	For two lists of strings, return a delta in unified diff format.
				20
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	21	Class SequenceMatcher:
				22	A flexible class for comparing pairs of sequences of any type.
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	23
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	24	Class Differ:
				25	For producing human-readable deltas from sequences of lines of text.
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	26	"""
				27
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	28	__all__ = ['get_close_matches', 'ndiff', 'restore', 'SequenceMatcher',
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	29	'Differ','IS_CHARACTER_JUNK', 'IS_LINE_JUNK', 'context_diff',
				30	'unified_diff']
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	31
Raymond Hettinger	bb6b734	2004-06-13 09:57:33 +0000	[diff] [blame]	32	import heapq
				33
Neal Norwitz	e7dfe21	2003-07-01 14:59:46 +0000	[diff] [blame]	34	def _calculate_ratio(matches, length):
				35	    if length:
				36	        return 2.0 * matches / length
				37	    return 1.0
				38
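The helper above encodes the formula 2.0*M / T, returning 1.0 when both sequences are empty. As a minimal sketch, the score can be recomputed by hand from the matching blocks. This assumes a current Python 3 interpreter and its standard-library difflib rather than this 2004 revision; the name `similarity` is hypothetical and not part of the module.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Recompute SequenceMatcher.ratio() by hand as 2.0 * M / T."""
    sm = SequenceMatcher(None, a, b)
    # M: total number of matched elements; the final dummy block adds 0.
    matches = sum(size for _, _, size in sm.get_matching_blocks())
    # T: combined length of both sequences.
    total = len(a) + len(b)
    # Two empty sequences are treated as identical, as in _calculate_ratio.
    return 2.0 * matches / total if total else 1.0


print(similarity("abcd", "bcde"))  # 0.75 ("bcd" matches: 2 * 3 / 8)
print(similarity("", ""))          # 1.0
```

The empty-input guard is what keeps the division safe when T is zero.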
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	39	class SequenceMatcher:
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	40
				41	"""
				42	SequenceMatcher is a flexible class for comparing pairs of sequences of
				43	any type, so long as the sequence elements are hashable. The basic
				44	algorithm predates, and is a little fancier than, an algorithm
				45	published in the late 1980's by Ratcliff and Obershelp under the
				46	hyperbolic name "gestalt pattern matching". The basic idea is to find
				47	the longest contiguous matching subsequence that contains no "junk"
				48	elements (R-O doesn't address junk). The same idea is then applied
				49	recursively to the pieces of the sequences to the left and to the right
				50	of the matching subsequence. This does not yield minimal edit
				51	sequences, but does tend to yield matches that "look right" to people.
				52
				53	SequenceMatcher tries to compute a "human-friendly diff" between two
				54	sequences. Unlike e.g. UNIX(tm) diff, the fundamental notion is the
				55	longest contiguous & junk-free matching subsequence. That's what
				56	catches people's eyes. The Windows(tm) windiff has another interesting
				57	notion, pairing up elements that appear uniquely in each sequence.
				58	That, and the method here, appear to yield more intuitive difference
				59	reports than does diff. This method appears to be the least vulnerable
				60	to synching up on blocks of "junk lines", though (like blank lines in
				61	ordinary text files, or maybe "<P>" lines in HTML files). That may be
				62	because this is the only method of the 3 that has a concept of
				63	"junk" <wink>.
				64
				65	Example, comparing two strings, and considering blanks to be "junk":
				66
				67	>>> s = SequenceMatcher(lambda x: x == " ",
				68	... "private Thread currentThread;",
				69	... "private volatile Thread currentThread;")
				70	>>>
				71
				72	.ratio() returns a float in [0, 1], measuring the "similarity" of the
				73	sequences. As a rule of thumb, a .ratio() value over 0.6 means the
				74	sequences are close matches:
				75
				76	>>> print round(s.ratio(), 3)
				77	0.866
				78	>>>
				79
				80	If you're only interested in where the sequences match,
				81	.get_matching_blocks() is handy:
				82
				83	>>> for block in s.get_matching_blocks():
				84	... print "a[%d] and b[%d] match for %d elements" % block
				85	a[0] and b[0] match for 8 elements
				86	a[8] and b[17] match for 6 elements
				87	a[14] and b[23] match for 15 elements
				88	a[29] and b[38] match for 0 elements
				89
				90	Note that the last tuple returned by .get_matching_blocks() is always a
				91	dummy, (len(a), len(b), 0), and this is the only case in which the last
				92	tuple element (number of elements matched) is 0.
				93
				94	If you want to know how to change the first sequence into the second,
				95	use .get_opcodes():
				96
				97	>>> for opcode in s.get_opcodes():
				98	... print "%6s a[%d:%d] b[%d:%d]" % opcode
				99	equal a[0:8] b[0:8]
				100	insert a[8:8] b[8:17]
				101	equal a[8:14] b[17:23]
				102	equal a[14:29] b[23:38]
				103
				104	See the Differ class for a fancy human-friendly file differencer, which
				105	uses SequenceMatcher both to compare sequences of lines, and to compare
				106	sequences of characters within similar (near-matching) lines.
				107
				108	See also function get_close_matches() in this module, which shows how
				109	simple code building on SequenceMatcher can be used to do useful work.
				110
				111	Timing: Basic R-O is cubic time worst case and quadratic time expected
				112	case. SequenceMatcher is quadratic time for the worst case and has
				113	expected-case behavior dependent in a complicated way on how many
				114	elements the sequences have in common; best case time is linear.
				115
				116	Methods:
				117
				118	__init__(isjunk=None, a='', b='')
				119	Construct a SequenceMatcher.
				120
				121	set_seqs(a, b)
				122	Set the two sequences to be compared.
				123
				124	set_seq1(a)
				125	Set the first sequence to be compared.
				126
				127	set_seq2(b)
				128	Set the second sequence to be compared.
				129
				130	find_longest_match(alo, ahi, blo, bhi)
				131	Find longest matching block in a[alo:ahi] and b[blo:bhi].
				132
				133	get_matching_blocks()
				134	Return list of triples describing matching subsequences.
				135
				136	get_opcodes()
				137	Return list of 5-tuples describing how to turn a into b.
				138
				139	ratio()
				140	Return a measure of the sequences' similarity (float in [0,1]).
				141
				142	quick_ratio()
				143	Return an upper bound on .ratio() relatively quickly.
				144
				145	real_quick_ratio()
				146	Return an upper bound on ratio() very quickly.
				147	"""
				148
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	149	def __init__(self, isjunk=None, a='', b=''):
				150	"""Construct a SequenceMatcher.
				151
				152	Optional arg isjunk is None (the default), or a one-argument
				153	function that takes a sequence element and returns true iff the
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	154	element is junk. None is equivalent to passing "lambda x: 0", i.e.
Fred Drake	f1da628	2001-02-19 19:30:05 +0000	[diff] [blame]	155	no elements are considered to be junk. For example, pass
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	156	lambda x: x in " \\t"
				157	if you're comparing lines as sequences of characters, and don't
				158	want to synch up on blanks or hard tabs.
				159
				160	Optional arg a is the first of two sequences to be compared. By
				161	default, an empty string. The elements of a must be hashable. See
				162	also .set_seqs() and .set_seq1().
				163
				164	Optional arg b is the second of two sequences to be compared. By
Fred Drake	f1da628	2001-02-19 19:30:05 +0000	[diff] [blame]	165	default, an empty string. The elements of b must be hashable. See
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	166	also .set_seqs() and .set_seq2().
				167	"""
				168
				169	# Members:
				170	# a
				171	# first sequence
				172	# b
				173	# second sequence; differences are computed as "what do
				174	# we need to do to 'a' to change it into 'b'?"
				175	# b2j
				176	# for x in b, b2j[x] is a list of the indices (into b)
				177	# at which x appears; junk elements do not appear
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	178	# fullbcount
				179	# for x in b, fullbcount[x] == the number of times x
				180	# appears in b; only materialized if really needed (used
				181	# only for computing quick_ratio())
				182	# matching_blocks
				183	# a list of (i, j, k) triples, where a[i:i+k] == b[j:j+k];
				184	# ascending & non-overlapping in i and in j; terminated by
				185	# a dummy (len(a), len(b), 0) sentinel
				186	# opcodes
				187	# a list of (tag, i1, i2, j1, j2) tuples, where tag is
				188	# one of
				189	# 'replace' a[i1:i2] should be replaced by b[j1:j2]
				190	# 'delete' a[i1:i2] should be deleted
				191	# 'insert' b[j1:j2] should be inserted
				192	# 'equal' a[i1:i2] == b[j1:j2]
				193	# isjunk
				194	# a user-supplied function taking a sequence element and
				195	# returning true iff the element is "junk" -- this has
				196	# subtle but helpful effects on the algorithm, which I'll
				197	# get around to writing up someday <0.9 wink>.
				198	# DON'T USE! Only __chain_b uses this. Use isbjunk.
				199	# isbjunk
				200	# for x in b, isbjunk(x) == isjunk(x) but much faster;
				201	# it's really the has_key method of a hidden dict.
				202	# DOES NOT WORK for x in a!
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	203	# isbpopular
				204	# for x in b, isbpopular(x) is true iff b is reasonably long
				205	# (at least 200 elements) and x accounts for more than 1% of
				206	# its elements. DOES NOT WORK for x in a!
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	207
				208	self.isjunk = isjunk
				209	self.a = self.b = None
				210	self.set_seqs(a, b)
				211
				212	def set_seqs(self, a, b):
				213	"""Set the two sequences to be compared.
				214
				215	>>> s = SequenceMatcher()
				216	>>> s.set_seqs("abcd", "bcde")
				217	>>> s.ratio()
				218	0.75
				219	"""
				220
				221	self.set_seq1(a)
				222	self.set_seq2(b)
				223
				224	def set_seq1(self, a):
				225	"""Set the first sequence to be compared.
				226
				227	The second sequence to be compared is not changed.
				228
				229	>>> s = SequenceMatcher(None, "abcd", "bcde")
				230	>>> s.ratio()
				231	0.75
				232	>>> s.set_seq1("bcde")
				233	>>> s.ratio()
				234	1.0
				235	>>>
				236
				237	SequenceMatcher computes and caches detailed information about the
				238	second sequence, so if you want to compare one sequence S against
				239	many sequences, use .set_seq2(S) once and call .set_seq1(x)
				240	repeatedly for each of the other sequences.
				241
				242	See also set_seqs() and set_seq2().
				243	"""
				244
				245	if a is self.a:
				246	return
				247	self.a = a
				248	self.matching_blocks = self.opcodes = None
				249
				250	def set_seq2(self, b):
				251	"""Set the second sequence to be compared.
				252
				253	The first sequence to be compared is not changed.
				254
				255	>>> s = SequenceMatcher(None, "abcd", "bcde")
				256	>>> s.ratio()
				257	0.75
				258	>>> s.set_seq2("abcd")
				259	>>> s.ratio()
				260	1.0
				261	>>>
				262
				263	SequenceMatcher computes and caches detailed information about the
				264	second sequence, so if you want to compare one sequence S against
				265	many sequences, use .set_seq2(S) once and call .set_seq1(x)
				266	repeatedly for each of the other sequences.
				267
				268	See also set_seqs() and set_seq1().
				269	"""
				270
				271	if b is self.b:
				272	return
				273	self.b = b
				274	self.matching_blocks = self.opcodes = None
				275	self.fullbcount = None
				276	self.__chain_b()
				277
				278	# For each element x in b, set b2j[x] to a list of the indices in
				279	# b where x appears; the indices are in increasing order; note that
				280	# the number of times x appears in b is len(b2j[x]) ...
				281	# when self.isjunk is defined, junk elements don't show up in this
				282	# map at all, which stops the central find_longest_match method
				283	# from starting any matching block at a junk element ...
				284	# also creates the fast isbjunk function ...
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	285	# b2j also does not contain entries for "popular" elements, meaning
				286	# elements that account for more than 1% of the total elements, and
				287	# when the sequence is reasonably large (>= 200 elements); this can
				288	# be viewed as an adaptive notion of semi-junk, and yields an enormous
				289	# speedup when, e.g., comparing program files with hundreds of
				290	# instances of "return NULL;" ...
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	291	# note that this is only called when b changes; so for cross-product
				292	# kinds of matches, it's best to call set_seq2 once, then set_seq1
				293	# repeatedly
				294
				295	def __chain_b(self):
				296	# Because isjunk is a user-defined (not C) function, and we test
				297	# for junk a LOT, it's important to minimize the number of calls.
				298	# Before the tricks described here, __chain_b was by far the most
				299	# time-consuming routine in the whole module! If anyone sees
				300	# Jim Roskind, thank him again for profile.py -- I never would
				301	# have guessed that.
				302	# The first trick is to build b2j ignoring the possibility
				303	# of junk. I.e., we don't call isjunk at all yet. Throwing
				304	# out the junk later is much cheaper than building b2j "right"
				305	# from the start.
				306	b = self.b
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	307	n = len(b)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	308	self.b2j = b2j = {}
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	309	populardict = {}
				310	for i, elt in enumerate(b):
				311	if elt in b2j:
				312	indices = b2j[elt]
				313	if n >= 200 and len(indices) * 100 > n:
				314	populardict[elt] = 1
				315	del indices[:]
				316	else:
				317	indices.append(i)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	318	else:
				319	b2j[elt] = [i]
				320
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	321	# Purge leftover indices for popular elements.
				322	for elt in populardict:
				323	del b2j[elt]
				324
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	325	# Now b2j.keys() contains elements uniquely, and especially when
				326	# the sequence is a string, that's usually a good deal smaller
				327	# than len(string). The difference is the number of isjunk calls
				328	# saved.
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	329	isjunk = self.isjunk
				330	junkdict = {}
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	331	if isjunk:
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	332	for d in populardict, b2j:
				333	for elt in d.keys():
				334	if isjunk(elt):
				335	junkdict[elt] = 1
				336	del d[elt]
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	337
Raymond Hettinger	54f0222	2002-06-01 14:18:47 +0000	[diff] [blame]	338	# Now for x in b, isjunk(x) == x in junkdict, but the
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	339	# latter is much faster. Note too that while there may be a
				340	# lot of junk in the sequence, the number of unique junk
				341	# elements is probably small. So the memory burden of keeping
				342	# this dict alive is likely trivial compared to the size of b2j.
				343	self.isbjunk = junkdict.has_key
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	344	self.isbpopular = populardict.has_key
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	345
				346	def find_longest_match(self, alo, ahi, blo, bhi):
				347	"""Find longest matching block in a[alo:ahi] and b[blo:bhi].
				348
				349	If isjunk is not defined:
				350
				351	Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
				352	alo <= i <= i+k <= ahi
				353	blo <= j <= j+k <= bhi
				354	and for all (i',j',k') meeting those conditions,
				355	k >= k'
				356	i <= i'
				357	and if i == i', j <= j'
				358
				359	In other words, of all maximal matching blocks, return one that
				360	starts earliest in a, and of all those maximal matching blocks that
				361	start earliest in a, return the one that starts earliest in b.
				362
				363	>>> s = SequenceMatcher(None, " abcd", "abcd abcd")
				364	>>> s.find_longest_match(0, 5, 0, 9)
				365	(0, 4, 5)
				366
				367	If isjunk is defined, first the longest matching block is
				368	determined as above, but with the additional restriction that no
				369	junk element appears in the block. Then that block is extended as
				370	far as possible by matching (only) junk elements on both sides. So
				371	the resulting block never matches on junk except as identical junk
				372	happens to be adjacent to an "interesting" match.
				373
				374	Here's the same example as before, but considering blanks to be
				375	junk. That prevents " abcd" from matching the " abcd" at the tail
				376	end of the second sequence directly. Instead only the "abcd" can
				377	match, and matches the leftmost "abcd" in the second sequence:
				378
				379	>>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
				380	>>> s.find_longest_match(0, 5, 0, 9)
				381	(1, 0, 4)
				382
				383	If no blocks match, return (alo, blo, 0).
				384
				385	>>> s = SequenceMatcher(None, "ab", "c")
				386	>>> s.find_longest_match(0, 2, 0, 1)
				387	(0, 0, 0)
				388	"""
				389
				390	# CAUTION: stripping common prefix or suffix would be incorrect.
				391	# E.g.,
				392	# ab
				393	# acab
				394	# Longest matching block is "ab", but if common prefix is
				395	# stripped, it's "a" (tied with "b"). UNIX(tm) diff does so
				396	# strip, so ends up claiming that ab is changed to acab by
				397	# inserting "ca" in the middle. That's minimal but unintuitive:
				398	# "it's obvious" that someone inserted "ac" at the front.
				399	# Windiff ends up at the same place as diff, but by pairing up
				400	# the unique 'b's and then matching the first two 'a's.
				401
				402	a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.isbjunk
				403	besti, bestj, bestsize = alo, blo, 0
				404	# find longest junk-free match
				405	# during an iteration of the loop, j2len[j] = length of longest
				406	# junk-free match ending with a[i-1] and b[j]
				407	j2len = {}
				408	nothing = []
				409	for i in xrange(alo, ahi):
				410	# look at all instances of a[i] in b; note that because
				411	# b2j has no junk keys, the loop is skipped if a[i] is junk
				412	j2lenget = j2len.get
				413	newj2len = {}
				414	for j in b2j.get(a[i], nothing):
				415	# a[i] matches b[j]
				416	if j < blo:
				417	continue
				418	if j >= bhi:
				419	break
				420	k = newj2len[j] = j2lenget(j-1, 0) + 1
				421	if k > bestsize:
				422	besti, bestj, bestsize = i-k+1, j-k+1, k
				423	j2len = newj2len
				424
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	425	# Extend the best by non-junk elements on each end. In particular,
				426	# "popular" non-junk elements aren't in b2j, which greatly speeds
				427	# the inner loop above, but also means "the best" match so far
				428	# doesn't contain any junk or popular non-junk elements.
				429	while besti > alo and bestj > blo and \
				430	not isbjunk(b[bestj-1]) and \
				431	a[besti-1] == b[bestj-1]:
				432	besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
				433	while besti+bestsize < ahi and bestj+bestsize < bhi and \
				434	not isbjunk(b[bestj+bestsize]) and \
				435	a[besti+bestsize] == b[bestj+bestsize]:
				436	bestsize += 1
				437
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	438	# Now that we have a wholly interesting match (albeit possibly
				439	# empty!), we may as well suck up the matching junk on each
				440	# side of it too. Can't think of a good reason not to, and it
				441	# saves post-processing the (possibly considerable) expense of
				442	# figuring out what to do with it. In the case of an empty
				443	# interesting match, this is clearly the right thing to do,
				444	# because no other kind of match is possible in the regions.
				445	while besti > alo and bestj > blo and \
				446	isbjunk(b[bestj-1]) and \
				447	a[besti-1] == b[bestj-1]:
				448	besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
				449	while besti+bestsize < ahi and bestj+bestsize < bhi and \
				450	isbjunk(b[bestj+bestsize]) and \
				451	a[besti+bestsize] == b[bestj+bestsize]:
				452	bestsize = bestsize + 1
				453
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	454	return besti, bestj, bestsize
				455
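The effect of the junk predicate on find_longest_match can be reproduced with the two docstring examples side by side. This is a sketch against the current Python 3 standard library, where the method returns a named tuple; it is converted with `tuple()` so the output matches the plain triples shown above. The variable names are illustrative only.

```python
from difflib import SequenceMatcher

a, b = " abcd", "abcd abcd"

# No junk predicate: the whole of a, including its leading blank,
# matches the five-element run at the tail of b.
plain = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))

# Blanks are junk: a match may not start on a blank, so only "abcd"
# can match, and the leftmost "abcd" in b wins.
junky = SequenceMatcher(lambda ch: ch == " ", a, b).find_longest_match(
    0, len(a), 0, len(b)
)

print(tuple(plain))  # (0, 4, 5)
print(tuple(junky))  # (1, 0, 4)
```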
				456	def get_matching_blocks(self):
				457	"""Return list of triples describing matching subsequences.
				458
				459	Each triple is of the form (i, j, n), and means that
				460	a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in
				461	i and in j.
				462
				463	The last triple is a dummy, (len(a), len(b), 0), and is the only
				464	triple with n==0.
				465
				466	>>> s = SequenceMatcher(None, "abxcd", "abcd")
				467	>>> s.get_matching_blocks()
				468	[(0, 0, 2), (3, 2, 2), (5, 4, 0)]
				469	"""
				470
				471	if self.matching_blocks is not None:
				472	return self.matching_blocks
				473	self.matching_blocks = []
				474	la, lb = len(self.a), len(self.b)
				475	self.__helper(0, la, 0, lb, self.matching_blocks)
				476	self.matching_blocks.append( (la, lb, 0) )
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	477	return self.matching_blocks
				478
				479	# builds list of matching blocks covering a[alo:ahi] and
				480	# b[blo:bhi], appending them in increasing order to answer
				481
				482	def __helper(self, alo, ahi, blo, bhi, answer):
				483	i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
				484	# a[alo:i] vs b[blo:j] unknown
				485	# a[i:i+k] same as b[j:j+k]
				486	# a[i+k:ahi] vs b[j+k:bhi] unknown
				487	if k:
				488	if alo < i and blo < j:
				489	self.__helper(alo, i, blo, j, answer)
				490	answer.append(x)
				491	if i+k < ahi and j+k < bhi:
				492	self.__helper(i+k, ahi, j+k, bhi, answer)
				493
				494	def get_opcodes(self):
				495	"""Return list of 5-tuples describing how to turn a into b.
				496
				497	Each tuple is of the form (tag, i1, i2, j1, j2). The first tuple
				498	has i1 == j1 == 0, and remaining tuples have i1 == the i2 from the
				499	tuple preceding it, and likewise for j1 == the previous j2.
				500
				501	The tags are strings, with these meanings:
				502
				503	'replace': a[i1:i2] should be replaced by b[j1:j2]
				504	'delete': a[i1:i2] should be deleted.
				505	Note that j1==j2 in this case.
				506	'insert': b[j1:j2] should be inserted at a[i1:i1].
				507	Note that i1==i2 in this case.
				508	'equal': a[i1:i2] == b[j1:j2]
				509
				510	>>> a = "qabxcd"
				511	>>> b = "abycdf"
				512	>>> s = SequenceMatcher(None, a, b)
				513	>>> for tag, i1, i2, j1, j2 in s.get_opcodes():
				514	... print ("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
				515	... (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))
				516	delete a[0:1] (q) b[0:0] ()
				517	equal a[1:3] (ab) b[0:2] (ab)
				518	replace a[3:4] (x) b[2:3] (y)
				519	equal a[4:6] (cd) b[3:5] (cd)
				520	insert a[6:6] () b[5:6] (f)
				521	"""
				522
				523	if self.opcodes is not None:
				524	return self.opcodes
				525	i = j = 0
				526	self.opcodes = answer = []
				527	for ai, bj, size in self.get_matching_blocks():
				528	# invariant: we've pumped out correct diffs to change
				529	# a[:i] into b[:j], and the next matching block is
				530	# a[ai:ai+size] == b[bj:bj+size]. So we need to pump
				531	# out a diff to change a[i:ai] into b[j:bj], pump out
				532	# the matching block, and move (i,j) beyond the match
				533	tag = ''
				534	if i < ai and j < bj:
				535	tag = 'replace'
				536	elif i < ai:
				537	tag = 'delete'
				538	elif j < bj:
				539	tag = 'insert'
				540	if tag:
				541	answer.append( (tag, i, ai, j, bj) )
				542	i, j = ai+size, bj+size
				543	# the list of matching blocks is terminated by a
				544	# sentinel with size 0
				545	if size:
				546	answer.append( ('equal', ai, i, bj, j) )
				547	return answer
				548
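The opcode list is a complete recipe for turning a into b, so replaying it should rebuild b exactly. The sketch below does that for strings, using the current Python 3 standard library; `apply_opcodes` is a hypothetical helper name, not part of difflib.

```python
from difflib import SequenceMatcher


def apply_opcodes(a, b):
    """Rebuild b by replaying the opcodes that turn a into b."""
    sm = SequenceMatcher(None, a, b)
    pieces = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            # Unchanged run: copy it from a.
            pieces.append(a[i1:i2])
        elif tag in ("replace", "insert"):
            # New material always comes from b.
            pieces.append(b[j1:j2])
        # 'delete' contributes nothing to the result.
    return "".join(pieces)


print(apply_opcodes("qabxcd", "abycdf"))  # abycdf
```

Because consecutive opcodes tile both sequences without gaps, the reconstruction holds for any pair of strings, whatever matches the algorithm happens to pick.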
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	549	def get_grouped_opcodes(self, n=3):
				550	""" Isolate change clusters by eliminating ranges with no changes.
				551
				552	Return a generator of groups with up to n lines of context.
				553	Each group is in the same format as returned by get_opcodes().
				554
				555	>>> from pprint import pprint
				556	>>> a = map(str, range(1,40))
				557	>>> b = a[:]
				558	>>> b[8:8] = ['i'] # Make an insertion
				559	>>> b[20] += 'x' # Make a replacement
				560	>>> b[23:28] = [] # Make a deletion
				561	>>> b[30] += 'y' # Make another replacement
				562	>>> pprint(list(SequenceMatcher(None,a,b).get_grouped_opcodes()))
				563	[[('equal', 5, 8, 5, 8), ('insert', 8, 8, 8, 9), ('equal', 8, 11, 9, 12)],
				564	[('equal', 16, 19, 17, 20),
				565	('replace', 19, 20, 20, 21),
				566	('equal', 20, 22, 21, 23),
				567	('delete', 22, 27, 23, 23),
				568	('equal', 27, 30, 23, 26)],
				569	[('equal', 31, 34, 27, 30),
				570	('replace', 34, 35, 30, 31),
				571	('equal', 35, 38, 31, 34)]]
				572	"""
				573
				574	codes = self.get_opcodes()
Brett Cannon	d2c5b4b	2004-07-10 23:54:07 +0000	[diff] [blame^]	575	if not codes:
				576	codes = [("equal", 0, 1, 0, 1)]
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	577	# Fixup leading and trailing groups if they show no changes.
				578	if codes[0][0] == 'equal':
				579	tag, i1, i2, j1, j2 = codes[0]
				580	codes[0] = tag, max(i1, i2-n), i2, max(j1, j2-n), j2
				581	if codes[-1][0] == 'equal':
				582	tag, i1, i2, j1, j2 = codes[-1]
				583	codes[-1] = tag, i1, min(i2, i1+n), j1, min(j2, j1+n)
				584
				585	nn = n + n
				586	group = []
				587	for tag, i1, i2, j1, j2 in codes:
				588	# End the current group and start a new one whenever
				589	# there is a large range with no changes.
				590	if tag == 'equal' and i2-i1 > nn:
				591	group.append((tag, i1, min(i2, i1+n), j1, min(j2, j1+n)))
				592	yield group
				593	group = []
				594	i1, j1 = max(i1, i2-n), max(j1, j2-n)
				595	group.append((tag, i1, i2, j1 ,j2))
				596	if group and not (len(group)==1 and group[0][0] == 'equal'):
				597	yield group
				598
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	599	def ratio(self):
				600	"""Return a measure of the sequences' similarity (float in [0,1]).
				601
				602	Where T is the total number of elements in both sequences, and
				603	M is the number of matches, this is 2.0*M / T.
				604	Note that this is 1 if the sequences are identical, and 0 if
				605	they have nothing in common.
				606
				607	.ratio() is expensive to compute if you haven't already computed
				608	.get_matching_blocks() or .get_opcodes(), in which case you may
				609	want to try .quick_ratio() or .real_quick_ratio() first to get an
				610	upper bound.
				611
				612	>>> s = SequenceMatcher(None, "abcd", "bcde")
				613	>>> s.ratio()
				614	0.75
				615	>>> s.quick_ratio()
				616	0.75
				617	>>> s.real_quick_ratio()
				618	1.0
				619	"""
				620
				621	matches = reduce(lambda sum, triple: sum + triple[-1],
				622	self.get_matching_blocks(), 0)
Neal Norwitz	e7dfe21	2003-07-01 14:59:46 +0000	[diff] [blame]	623	return _calculate_ratio(matches, len(self.a) + len(self.b))
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	624
				625	def quick_ratio(self):
				626	"""Return an upper bound on ratio() relatively quickly.
				627
				628	This isn't defined beyond that it is an upper bound on .ratio(), and
				629	is faster to compute.
				630	"""
				631
				632	# viewing a and b as multisets, set matches to the cardinality
				633	# of their intersection; this counts the number of matches
				634	# without regard to order, so is clearly an upper bound
				635	if self.fullbcount is None:
				636	self.fullbcount = fullbcount = {}
				637	for elt in self.b:
				638	fullbcount[elt] = fullbcount.get(elt, 0) + 1
				639	fullbcount = self.fullbcount
				640	# avail[x] is the number of times x appears in 'b' less the
				641	# number of times we've seen it in 'a' so far ... kinda
				642	avail = {}
				643	availhas, matches = avail.has_key, 0
				644	for elt in self.a:
				645	if availhas(elt):
				646	numb = avail[elt]
				647	else:
				648	numb = fullbcount.get(elt, 0)
				649	avail[elt] = numb - 1
				650	if numb > 0:
				651	matches = matches + 1
Neal Norwitz	e7dfe21	2003-07-01 14:59:46 +0000	[diff] [blame]	652	return _calculate_ratio(matches, len(self.a) + len(self.b))
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	653
				654	def real_quick_ratio(self):
				655	"""Return an upper bound on ratio() very quickly.
				656
				657	This isn't defined beyond that it is an upper bound on .ratio(), and
				658	is faster to compute than either .ratio() or .quick_ratio().
				659	"""
				660
				661	la, lb = len(self.a), len(self.b)
				662	# can't have more matches than the number of elements in the
				663	# shorter sequence
Neal Norwitz	e7dfe21	2003-07-01 14:59:46 +0000	[diff] [blame]	664	return _calculate_ratio(min(la, lb), la + lb)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	665
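Since real_quick_ratio() and quick_ratio() are both upper bounds on ratio(), they can serve as cheap filters before the expensive computation. The sketch below shows that pattern against the current Python 3 standard library; `score_with_prefilter` and its `cutoff` parameter are hypothetical names chosen for illustration.

```python
from difflib import SequenceMatcher


def score_with_prefilter(a, b, cutoff=0.6):
    """Return ratio() if it reaches cutoff, else None, cheapest test first."""
    sm = SequenceMatcher(None, a, b)
    # Length-only bound: cannot match more than the shorter sequence.
    if sm.real_quick_ratio() < cutoff:
        return None
    # Multiset bound: counts common elements, ignoring their order.
    if sm.quick_ratio() < cutoff:
        return None
    # Only now pay for the full matching-block computation.
    score = sm.ratio()
    return score if score >= cutoff else None


print(score_with_prefilter("abcd", "bcde"))        # 0.75
print(score_with_prefilter("abcd", "wxyz"))        # None (no common elements)
print(score_with_prefilter("a", "abcdefghij"))     # None (lengths too different)
```

If an upper bound already falls below the cutoff, the exact ratio cannot reach it, so skipping the full computation loses nothing.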
				666	def get_close_matches(word, possibilities, n=3, cutoff=0.6):
				667	"""Use SequenceMatcher to return list of the best "good enough" matches.
				668
				669	word is a sequence for which close matches are desired (typically a
				670	string).
				671
				672	possibilities is a list of sequences against which to match word
				673	(typically a list of strings).
				674
				675	Optional arg n (default 3) is the maximum number of close matches to
				676	return. n must be > 0.
				677
				678	Optional arg cutoff (default 0.6) is a float in [0, 1]. Possibilities
				679	that don't score at least that similar to word are ignored.
				680
				681	The best (no more than n) matches among the possibilities are returned
				682	in a list, sorted by similarity score, most similar first.
				683
				684	>>> get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
				685	['apple', 'ape']
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	686	>>> import keyword as _keyword
				687	>>> get_close_matches("wheel", _keyword.kwlist)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	688	['while']
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	689	>>> get_close_matches("apple", _keyword.kwlist)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	690	[]
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	691	>>> get_close_matches("accept", _keyword.kwlist)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	692	['except']
				693	"""
				694
				695	if not n > 0:
Walter Dörwald	70a6b49	2004-02-12 17:35:32 +0000	[diff] [blame]	696	raise ValueError("n must be > 0: %r" % (n,))
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	697	if not 0.0 <= cutoff <= 1.0:
Walter Dörwald	70a6b49	2004-02-12 17:35:32 +0000	[diff] [blame]	698	raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	699	result = []
				700	s = SequenceMatcher()
				701	s.set_seq2(word)
				702	for x in possibilities:
				703	s.set_seq1(x)
				704	if s.real_quick_ratio() >= cutoff and \
				705	s.quick_ratio() >= cutoff and \
				706	s.ratio() >= cutoff:
				707	result.append((s.ratio(), x))
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	708
Raymond Hettinger	6b59f5f	2003-10-16 05:53:16 +0000	[diff] [blame]	709	# Move the best scorers to head of list
Raymond Hettinger	aefde43	2004-06-15 23:53:35 +0000	[diff] [blame]	710	result = heapq.nlargest(n, result)
Raymond Hettinger	6b59f5f	2003-10-16 05:53:16 +0000	[diff] [blame]	711	# Strip scores for the best n matches
Raymond Hettinger	bb6b734	2004-06-13 09:57:33 +0000	[diff] [blame]	712	return [x for score, x in result]
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	713
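As a rough illustration of how `n` and `cutoff` interact in `get_close_matches`, the sketch below prints the raw similarity scores and then calls the function with a few different settings. It is written for Python 3 (the listing itself uses Python 2 syntax). The helper name `scored` and the candidate list are assumptions made for this example, not part of the module.

```python
from difflib import SequenceMatcher, get_close_matches

def scored(word, possibilities):
    """Hypothetical helper: return (ratio, candidate) pairs, best first."""
    pairs = [(SequenceMatcher(None, x, word).ratio(), x) for x in possibilities]
    return sorted(pairs, reverse=True)

candidates = ["ape", "apple", "peach", "puppy"]

# Inspect the raw similarity scores that the cutoff is compared against.
for ratio, candidate in scored("appel", candidates):
    print("%-6s %.2f" % (candidate, ratio))

# Default settings: at most 3 matches scoring at least 0.6.
default_matches = get_close_matches("appel", candidates)

# Keep only the single best match.
best_only = get_close_matches("appel", candidates, n=1)

# A cutoff of 0.0 admits every candidate, still capped at n results.
everything = get_close_matches("appel", candidates, n=10, cutoff=0.0)
```

Lowering `cutoff` widens the candidate pool, while `n` only truncates the already sorted result, so the two parameters can be tuned independently.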
				714	def _count_leading(line, ch):
				715	"""
				716	Return number of `ch` characters at the start of `line`.
				717
				718	Example:
				719
				720	>>> _count_leading('   abc', ' ')
				721	3
				722	"""
				723
				724	i, n = 0, len(line)
				725	while i < n and line[i] == ch:
				726	i += 1
				727	return i
				728
				729	class Differ:
				730	r"""
				731	Differ is a class for comparing sequences of lines of text, and
				732	producing human-readable differences or deltas. Differ uses
				733	SequenceMatcher both to compare sequences of lines, and to compare
				734	sequences of characters within similar (near-matching) lines.
				735
				736	Each line of a Differ delta begins with a two-letter code:
				737
				738	'- ' line unique to sequence 1
				739	'+ ' line unique to sequence 2
				740	'  ' line common to both sequences
				741	'? ' line not present in either input sequence
				742
				743	Lines beginning with '? ' attempt to guide the eye to intraline
				744	differences, and were not present in either input sequence. These lines
				745	can be confusing if the sequences contain tab characters.
				746
				747	Note that Differ makes no claim to produce a minimal diff. To the
				748	contrary, minimal diffs are often counter-intuitive, because they synch
				749	up anywhere possible, sometimes accidental matches 100 pages apart.
				750	Restricting synch points to contiguous matches preserves some notion of
				751	locality, at the occasional cost of producing a longer diff.
				752
				753	Example: Comparing two texts.
				754
				755	First we set up the texts, sequences of individual single-line strings
				756	ending with newlines (such sequences can also be obtained from the
				757	`readlines()` method of file-like objects):
				758
				759	>>> text1 = ''' 1. Beautiful is better than ugly.
				760	... 2. Explicit is better than implicit.
				761	... 3. Simple is better than complex.
				762	... 4. Complex is better than complicated.
				763	... '''.splitlines(1)
				764	>>> len(text1)
				765	4
				766	>>> text1[0][-1]
				767	'\n'
				768	>>> text2 = ''' 1. Beautiful is better than ugly.
				769	... 3. Simple is better than complex.
				770	... 4. Complicated is better than complex.
				771	... 5. Flat is better than nested.
				772	... '''.splitlines(1)
				773
				774	Next we instantiate a Differ object:
				775
				776	>>> d = Differ()
				777
				778	Note that when instantiating a Differ object we may pass functions to
				779	filter out line and character 'junk'. See Differ.__init__ for details.
				780
				781	Finally, we compare the two:
				782
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	783	>>> result = list(d.compare(text1, text2))
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	784
				785	'result' is a list of strings, so let's pretty-print it:
				786
				787	>>> from pprint import pprint as _pprint
				788	>>> _pprint(result)
				789	[' 1. Beautiful is better than ugly.\n',
				790	'- 2. Explicit is better than implicit.\n',
				791	'- 3. Simple is better than complex.\n',
				792	'+ 3. Simple is better than complex.\n',
				793	'? ++\n',
				794	'- 4. Complex is better than complicated.\n',
				795	'? ^ ---- ^\n',
				796	'+ 4. Complicated is better than complex.\n',
				797	'? ++++ ^ ^\n',
				798	'+ 5. Flat is better than nested.\n']
				799
				800	As a single multi-line string it looks like this:
				801
				802	>>> print ''.join(result),
				803	1. Beautiful is better than ugly.
				804	- 2. Explicit is better than implicit.
				805	- 3. Simple is better than complex.
				806	+ 3. Simple is better than complex.
				807	? ++
				808	- 4. Complex is better than complicated.
				809	? ^ ---- ^
				810	+ 4. Complicated is better than complex.
				811	? ++++ ^ ^
				812	+ 5. Flat is better than nested.
				813
				814	Methods:
				815
				816	__init__(linejunk=None, charjunk=None)
				817	Construct a text differencer, with optional filters.
				818
				819	compare(a, b)
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	820	Compare two sequences of lines; generate the resulting delta.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	821	"""
				822
				823	def __init__(self, linejunk=None, charjunk=None):
				824	"""
				825	Construct a text differencer, with optional filters.
				826
				827	The two optional keyword parameters are for filter functions:
				828
				829	- `linejunk`: A function that should accept a single string argument,
				830	and return true iff the string is junk. The module-level function
				831	`IS_LINE_JUNK` may be used to filter out lines without visible
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	832	characters, except for at most one splat ('#'). It is recommended
				833	to leave linejunk None; as of Python 2.3, the underlying
				834	SequenceMatcher class has grown an adaptive notion of "noise" lines
				835	that's better than any static definition the author has ever been
				836	able to craft.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	837
				838	- `charjunk`: A function that should accept a string of length 1. The
				839	module-level function `IS_CHARACTER_JUNK` may be used to filter out
				840	whitespace characters (a blank or tab; note: bad idea to include
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	841	newline in this!). Use of IS_CHARACTER_JUNK is recommended.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	842	"""
				843
				844	self.linejunk = linejunk
				845	self.charjunk = charjunk
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	846
				847	def compare(self, a, b):
				848	r"""
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	849	Compare two sequences of lines; generate the resulting delta.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	850
				851	Each sequence must contain individual single-line strings ending with
				852	newlines. Such sequences can be obtained from the `readlines()` method
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	853	of file-like objects. The delta generated also consists of newline-
				854	terminated strings, ready to be printed as-is via the writelines()
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	855	method of a file-like object.
				856
				857	Example:
				858
				859	>>> print ''.join(Differ().compare('one\ntwo\nthree\n'.splitlines(1),
				860	... 'ore\ntree\nemu\n'.splitlines(1))),
				861	- one
				862	?  ^
				863	+ ore
				864	?  ^
				865	- two
				866	- three
				867	?  -
				868	+ tree
				869	+ emu
				870	"""
				871
				872	cruncher = SequenceMatcher(self.linejunk, a, b)
				873	for tag, alo, ahi, blo, bhi in cruncher.get_opcodes():
				874	if tag == 'replace':
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	875	g = self._fancy_replace(a, alo, ahi, b, blo, bhi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	876	elif tag == 'delete':
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	877	g = self._dump('-', a, alo, ahi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	878	elif tag == 'insert':
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	879	g = self._dump('+', b, blo, bhi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	880	elif tag == 'equal':
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	881	g = self._dump(' ', a, alo, ahi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	882	else:
Walter Dörwald	70a6b49	2004-02-12 17:35:32 +0000	[diff] [blame]	883	raise ValueError, 'unknown tag %r' % (tag,)
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	884
				885	for line in g:
				886	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	887
				888	def _dump(self, tag, x, lo, hi):
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	889	"""Generate comparison results for a same-tagged range."""
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	890	for i in xrange(lo, hi):
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	891	yield '%s %s' % (tag, x[i])
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	892
				893	def _plain_replace(self, a, alo, ahi, b, blo, bhi):
				894	assert alo < ahi and blo < bhi
				895	# dump the shorter block first -- reduces the burden on short-term
				896	# memory if the blocks are of very different sizes
				897	if bhi - blo < ahi - alo:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	898	first = self._dump('+', b, blo, bhi)
				899	second = self._dump('-', a, alo, ahi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	900	else:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	901	first = self._dump('-', a, alo, ahi)
				902	second = self._dump('+', b, blo, bhi)
				903
				904	for g in first, second:
				905	for line in g:
				906	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	907
				908	def _fancy_replace(self, a, alo, ahi, b, blo, bhi):
				909	r"""
				910	When replacing one block of lines with another, search the blocks
				911	for similar lines; the best-matching pair (if any) is used as a
				912	synch point, and intraline difference marking is done on the
				913	similar pair. Lots of work, but often worth it.
				914
				915	Example:
				916
				917	>>> d = Differ()
Raymond Hettinger	83325e9	2003-07-16 04:32:32 +0000	[diff] [blame]	918	>>> results = d._fancy_replace(['abcDefghiJkl\n'], 0, 1,
				919	... ['abcdefGhijkl\n'], 0, 1)
				920	>>> print ''.join(results),
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	921	- abcDefghiJkl
				922	?    ^  ^  ^
				923	+ abcdefGhijkl
				924	?    ^  ^  ^
				925	"""
				926
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	927	# don't synch up unless the lines have a similarity score of at
				928	# least cutoff; best_ratio tracks the best score seen so far
				929	best_ratio, cutoff = 0.74, 0.75
				930	cruncher = SequenceMatcher(self.charjunk)
				931	eqi, eqj = None, None # 1st indices of equal lines (if any)
				932
				933	# search for the pair that matches best without being identical
				934	# (identical lines must be junk lines, & we don't want to synch up
				935	# on junk -- unless we have to)
				936	for j in xrange(blo, bhi):
				937	bj = b[j]
				938	cruncher.set_seq2(bj)
				939	for i in xrange(alo, ahi):
				940	ai = a[i]
				941	if ai == bj:
				942	if eqi is None:
				943	eqi, eqj = i, j
				944	continue
				945	cruncher.set_seq1(ai)
				946	# computing similarity is expensive, so use the quick
				947	# upper bounds first -- have seen this speed up messy
				948	# compares by a factor of 3.
				949	# note that ratio() is only expensive to compute the first
				950	# time it's called on a sequence pair; the expensive part
				951	# of the computation is cached by cruncher
				952	if cruncher.real_quick_ratio() > best_ratio and \
				953	cruncher.quick_ratio() > best_ratio and \
				954	cruncher.ratio() > best_ratio:
				955	best_ratio, best_i, best_j = cruncher.ratio(), i, j
				956	if best_ratio < cutoff:
				957	# no non-identical "pretty close" pair
				958	if eqi is None:
				959	# no identical pair either -- treat it as a straight replace
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	960	for line in self._plain_replace(a, alo, ahi, b, blo, bhi):
				961	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	962	return
				963	# no close pair, but an identical pair -- synch up on that
				964	best_i, best_j, best_ratio = eqi, eqj, 1.0
				965	else:
				966	# there's a close pair, so forget the identical pair (if any)
				967	eqi = None
				968
				969	# a[best_i] very similar to b[best_j]; eqi is None iff they're not
				970	# identical
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	971
				972	# pump out diffs from before the synch point
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	973	for line in self._fancy_helper(a, alo, best_i, b, blo, best_j):
				974	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	975
				976	# do intraline marking on the synch pair
				977	aelt, belt = a[best_i], b[best_j]
				978	if eqi is None:
				979	# pump out a '-', '?', '+', '?' quad for the synched lines
				980	atags = btags = ""
				981	cruncher.set_seqs(aelt, belt)
				982	for tag, ai1, ai2, bj1, bj2 in cruncher.get_opcodes():
				983	la, lb = ai2 - ai1, bj2 - bj1
				984	if tag == 'replace':
				985	atags += '^' * la
				986	btags += '^' * lb
				987	elif tag == 'delete':
				988	atags += '-' * la
				989	elif tag == 'insert':
				990	btags += '+' * lb
				991	elif tag == 'equal':
				992	atags += ' ' * la
				993	btags += ' ' * lb
				994	else:
Walter Dörwald	70a6b49	2004-02-12 17:35:32 +0000	[diff] [blame]	995	raise ValueError, 'unknown tag %r' % (tag,)
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	996	for line in self._qformat(aelt, belt, atags, btags):
				997	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	998	else:
				999	# the synch pair is identical
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1000	yield '  ' + aelt
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1001
				1002	# pump out diffs from after the synch point
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1003	for line in self._fancy_helper(a, best_i+1, ahi, b, best_j+1, bhi):
				1004	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1005
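The intraline marking step in `_fancy_replace` can be tried in isolation. The sketch below is a Python 3 rendering of that loop under the assumption that no character junk filter is supplied. `intraline_tags` is a hypothetical name introduced here, not a function in the module.

```python
from difflib import SequenceMatcher

def intraline_tags(aline, bline):
    """Hypothetical helper: build the '?' marker strings for two similar lines."""
    atags = btags = ""
    matcher = SequenceMatcher(None, aline, bline)
    for tag, ai1, ai2, bj1, bj2 in matcher.get_opcodes():
        la, lb = ai2 - ai1, bj2 - bj1
        if tag == "replace":
            atags += "^" * la
            btags += "^" * lb
        elif tag == "delete":
            atags += "-" * la
        elif tag == "insert":
            btags += "+" * lb
        elif tag == "equal":
            atags += " " * la
            btags += " " * lb
        else:
            raise ValueError("unknown tag %r" % (tag,))
    return atags, btags

atags, btags = intraline_tags("abcDefghiJkl", "abcdefGhijkl")
print("- abcDefghiJkl")
print("? " + atags.rstrip())
print("+ abcdefGhijkl")
print("? " + btags.rstrip())
```

Each marker string lines up column for column with its own line, which is why the two can differ in length when text is inserted or deleted.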
				1006	def _fancy_helper(self, a, alo, ahi, b, blo, bhi):
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1007	g = []
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1008	if alo < ahi:
				1009	if blo < bhi:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1010	g = self._fancy_replace(a, alo, ahi, b, blo, bhi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1011	else:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1012	g = self._dump('-', a, alo, ahi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1013	elif blo < bhi:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1014	g = self._dump('+', b, blo, bhi)
				1015
				1016	for line in g:
				1017	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1018
				1019	def _qformat(self, aline, bline, atags, btags):
				1020	r"""
				1021	Format "?" output and deal with leading tabs.
				1022
				1023	Example:
				1024
				1025	>>> d = Differ()
Raymond Hettinger	83325e9	2003-07-16 04:32:32 +0000	[diff] [blame]	1026	>>> results = d._qformat('\tabcDefghiJkl\n', '\t\tabcdefGhijkl\n',
				1027	... ' ^ ^ ^ ', '+ ^ ^ ^ ')
				1028	>>> for line in results: print repr(line)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1029	...
				1030	'- \tabcDefghiJkl\n'
				1031	'? \t ^ ^ ^\n'
				1032	'+ \t\tabcdefGhijkl\n'
				1033	'? \t ^ ^ ^\n'
				1034	"""
				1035
				1036	# Can hurt, but will probably help most of the time.
				1037	common = min(_count_leading(aline, "\t"),
				1038	_count_leading(bline, "\t"))
				1039	common = min(common, _count_leading(atags[:common], " "))
				1040	atags = atags[common:].rstrip()
				1041	btags = btags[common:].rstrip()
				1042
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1043	yield "- " + aline
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1044	if atags:
Tim Peters	527e64f	2001-10-04 05:36:56 +0000	[diff] [blame]	1045	yield "? %s%s\n" % ("\t" * common, atags)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1046
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1047	yield "+ " + bline
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1048	if btags:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1049	yield "? %s%s\n" % ("\t" * common, btags)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1050
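To make the two-character codes of a `Differ` delta concrete, here is a small Python 3 sketch that buckets the output of `compare()` by code. The variable names are illustrative only. The inputs are the same three-line sequences used in the `compare` docstring.

```python
from difflib import Differ

a = "one\ntwo\nthree\n".splitlines(keepends=True)
b = "ore\ntree\nemu\n".splitlines(keepends=True)

delta = list(Differ().compare(a, b))

# Bucket the delta lines by their two-character code.
by_code = {}
for line in delta:
    by_code.setdefault(line[:2], []).append(line[2:])

# Lines coded '  ' or '+ ' together make up the second sequence;
# '? ' guide lines belong to neither input.
second = [line[2:] for line in delta if line[:2] in ("  ", "+ ")]

for code in sorted(by_code):
    print(repr(code), len(by_code[code]))
```

Because the code always occupies exactly two characters, slicing with `line[:2]` and `line[2:]` is enough to separate the marker from the payload.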
				1051	# With respect to junk, an earlier version of ndiff simply refused to
				1052	# start a match with a junk element. The result was cases like this:
				1053	# before: private Thread currentThread;
				1054	# after: private volatile Thread currentThread;
				1055	# If you consider whitespace to be junk, the longest contiguous match
				1056	# not starting with junk is "e Thread currentThread". So ndiff reported
				1057	# that "e volatil" was inserted between the 't' and the 'e' in "private".
				1058	# While an accurate view, to people that's absurd. The current version
				1059	# looks for matching blocks that are entirely junk-free, then extends the
				1060	# longest one of those as far as possible but only with matching junk.
				1061	# So now "currentThread" is matched, then extended to suck up the
				1062	# preceding blank; then "private" is matched, and extended to suck up the
				1063	# following blank; then "Thread" is matched; and finally ndiff reports
				1064	# that "volatile " was inserted before "Thread". The only quibble
				1065	# remaining is that perhaps it was really the case that " volatile"
				1066	# was inserted after "private". I can live with that <wink>.
				1067
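The `currentThread` case from the comment above can be explored directly with `SequenceMatcher` and a junk predicate. This Python 3 sketch only prints the opcodes and checks that replaying them rebuilds the second string. It does not claim a particular opcode sequence, and `apply_opcodes` is a hypothetical helper written for this example.

```python
from difflib import SequenceMatcher

before = "private Thread currentThread;"
after = "private volatile Thread currentThread;"

# Treat blanks as junk, as the comment above does.
matcher = SequenceMatcher(lambda ch: ch == " ", before, after)

def apply_opcodes(a, b, opcodes):
    """Hypothetical helper: rebuild b from a by replaying the opcodes."""
    pieces = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            pieces.append(a[i1:i2])
        elif tag in ("replace", "insert"):
            pieces.append(b[j1:j2])
        # 'delete' contributes nothing to the rebuilt sequence
    return "".join(pieces)

for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print("%-7s a[%d:%d]=%r b[%d:%d]=%r"
          % (tag, i1, i2, before[i1:i2], j1, j2, after[j1:j2]))

rebuilt = apply_opcodes(before, after, matcher.get_opcodes())
```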
				1068	import re
				1069
				1070	def IS_LINE_JUNK(line, pat=re.compile(r"\s*#?\s*$").match):
				1071	r"""
				1072	Return true for ignorable line: iff `line` is blank or contains a single '#'.
				1073
				1074	Examples:
				1075
				1076	>>> IS_LINE_JUNK('\n')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1077	True
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1078	>>> IS_LINE_JUNK(' # \n')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1079	True
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1080	>>> IS_LINE_JUNK('hello\n')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1081	False
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1082	"""
				1083
				1084	return pat(line) is not None
				1085
				1086	def IS_CHARACTER_JUNK(ch, ws=" \t"):
				1087	r"""
				1088	Return true for ignorable character: iff `ch` is a space or tab.
				1089
				1090	Examples:
				1091
				1092	>>> IS_CHARACTER_JUNK(' ')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1093	True
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1094	>>> IS_CHARACTER_JUNK('\t')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1095	True
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1096	>>> IS_CHARACTER_JUNK('\n')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1097	False
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1098	>>> IS_CHARACTER_JUNK('x')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1099	False
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1100	"""
				1101
				1102	return ch in ws
				1103
				1104	del re
				1105
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1106
				1107	def unified_diff(a, b, fromfile='', tofile='', fromfiledate='',
				1108	tofiledate='', n=3, lineterm='\n'):
				1109	r"""
				1110	Compare two sequences of lines; generate the delta as a unified diff.
				1111
				1112	Unified diffs are a compact way of showing line changes and a few
				1113	lines of context. The number of context lines is set by 'n' which
				1114	defaults to three.
				1115
Raymond Hettinger	0887c73	2003-06-17 16:53:25 +0000	[diff] [blame]	1116	By default, the diff control lines (those with ---, +++, or @@) are
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1117	created with a trailing newline. This is helpful so that inputs
				1118	created from file.readlines() result in diffs that are suitable for
				1119	file.writelines() since both the inputs and outputs have trailing
				1120	newlines.
				1121
				1122	For inputs that do not have trailing newlines, set the lineterm
				1123	argument to "" so that the output will be uniformly newline free.
				1124
				1125	The unidiff format normally has a header for filenames and modification
				1126	times. Any or all of these may be specified using strings for
				1127	'fromfile', 'tofile', 'fromfiledate', and 'tofiledate'. The modification
				1128	times are normally expressed in the format returned by time.ctime().
				1129
				1130	Example:
				1131
				1132	>>> for line in unified_diff('one two three four'.split(),
				1133	... 'zero one tree four'.split(), 'Original', 'Current',
				1134	... 'Sat Jan 26 23:30:50 1991', 'Fri Jun 06 10:20:52 2003',
				1135	... lineterm=''):
				1136	... print line
				1137	--- Original Sat Jan 26 23:30:50 1991
				1138	+++ Current Fri Jun 06 10:20:52 2003
				1139	@@ -1,4 +1,4 @@
				1140	+zero
				1141	 one
				1142	-two
				1143	-three
				1144	+tree
				1145	 four
				1146	"""
				1147
				1148	started = False
				1149	for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n):
				1150	if not started:
				1151	yield '--- %s %s%s' % (fromfile, fromfiledate, lineterm)
				1152	yield '+++ %s %s%s' % (tofile, tofiledate, lineterm)
				1153	started = True
				1154	i1, i2, j1, j2 = group[0][1], group[-1][2], group[0][3], group[-1][4]
				1155	yield "@@ -%d,%d +%d,%d @@%s" % (i1+1, i2-i1, j1+1, j2-j1, lineterm)
				1156	for tag, i1, i2, j1, j2 in group:
				1157	if tag == 'equal':
				1158	for line in a[i1:i2]:
				1159	yield ' ' + line
				1160	continue
				1161	if tag == 'replace' or tag == 'delete':
				1162	for line in a[i1:i2]:
				1163	yield '-' + line
				1164	if tag == 'replace' or tag == 'insert':
				1165	for line in b[j1:j2]:
				1166	yield '+' + line
				1167
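The `@@` hunk header arithmetic in `unified_diff` can be checked on its own. The Python 3 sketch below applies the formula from the listing above to each group returned by `get_grouped_opcodes`. `hunk_headers` is a hypothetical helper, and current Python releases format some edge cases (such as one-line ranges) differently from this older formula.

```python
from difflib import SequenceMatcher

def hunk_headers(a, b, n=3):
    """Hypothetical helper: compute '@@' headers with the formula shown above."""
    headers = []
    for group in SequenceMatcher(None, a, b).get_grouped_opcodes(n):
        # First and last opcode of the group bound the hunk in both sequences.
        i1, i2, j1, j2 = group[0][1], group[-1][2], group[0][3], group[-1][4]
        headers.append("@@ -%d,%d +%d,%d @@" % (i1 + 1, i2 - i1, j1 + 1, j2 - j1))
    return headers

old = "one two three four".split()
new = "zero one tree four".split()

headers = hunk_headers(old, new)
for header in headers:
    print(header)
```

The start positions are converted from 0-based slice indices to 1-based line numbers, while the counts are plain slice lengths.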
				1168	# See http://www.unix.org/single_unix_specification/
				1169	def context_diff(a, b, fromfile='', tofile='',
				1170	fromfiledate='', tofiledate='', n=3, lineterm='\n'):
				1171	r"""
				1172	Compare two sequences of lines; generate the delta as a context diff.
				1173
				1174	Context diffs are a compact way of showing line changes and a few
				1175	lines of context. The number of context lines is set by 'n' which
				1176	defaults to three.
				1177
				1178	By default, the diff control lines (those with *** or ---) are
				1179	created with a trailing newline. This is helpful so that inputs
				1180	created from file.readlines() result in diffs that are suitable for
				1181	file.writelines() since both the inputs and outputs have trailing
				1182	newlines.
				1183
				1184	For inputs that do not have trailing newlines, set the lineterm
				1185	argument to "" so that the output will be uniformly newline free.
				1186
				1187	The context diff format normally has a header for filenames and
				1188	modification times. Any or all of these may be specified using
				1189	strings for 'fromfile', 'tofile', 'fromfiledate', and 'tofiledate'.
				1190	The modification times are normally expressed in the format returned
				1191	by time.ctime(). If not specified, the strings default to blanks.
				1192
				1193	Example:
				1194
				1195	>>> print ''.join(context_diff('one\ntwo\nthree\nfour\n'.splitlines(1),
				1196	... 'zero\none\ntree\nfour\n'.splitlines(1), 'Original', 'Current',
				1197	... 'Sat Jan 26 23:30:50 1991', 'Fri Jun 06 10:22:46 2003')),
				1198	*** Original Sat Jan 26 23:30:50 1991
				1199	--- Current Fri Jun 06 10:22:46 2003
				1200	***************
				1201	*** 1,4 ****
				1202	  one
				1203	! two
				1204	! three
				1205	  four
				1206	--- 1,4 ----
				1207	+ zero
				1208	  one
				1209	! tree
				1210	  four
				1211	"""
				1212
				1213	started = False
Raymond Hettinger	7f2d302	2003-06-08 19:38:42 +0000	[diff] [blame]	1214	prefixmap = {'insert':'+ ', 'delete':'- ', 'replace':'! ', 'equal':'  '}
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1215	for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n):
				1216	if not started:
				1217	yield '*** %s %s%s' % (fromfile, fromfiledate, lineterm)
				1218	yield '--- %s %s%s' % (tofile, tofiledate, lineterm)
				1219	started = True
Raymond Hettinger	7f2d302	2003-06-08 19:38:42 +0000	[diff] [blame]	1220
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1221	yield '***************%s' % (lineterm,)
				1222	if group[-1][2] - group[0][1] >= 2:
				1223	yield '*** %d,%d ****%s' % (group[0][1]+1, group[-1][2], lineterm)
				1224	else:
				1225	yield '*** %d ****%s' % (group[-1][2], lineterm)
Raymond Hettinger	7f2d302	2003-06-08 19:38:42 +0000	[diff] [blame]	1226	visiblechanges = [e for e in group if e[0] in ('replace', 'delete')]
				1227	if visiblechanges:
				1228	for tag, i1, i2, _, _ in group:
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1229	if tag != 'insert':
				1230	for line in a[i1:i2]:
				1231	yield prefixmap[tag] + line
Raymond Hettinger	7f2d302	2003-06-08 19:38:42 +0000	[diff] [blame]	1232
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1233	if group[-1][4] - group[0][3] >= 2:
				1234	yield '--- %d,%d ----%s' % (group[0][3]+1, group[-1][4], lineterm)
				1235	else:
				1236	yield '--- %d ----%s' % (group[-1][4], lineterm)
Raymond Hettinger	7f2d302	2003-06-08 19:38:42 +0000	[diff] [blame]	1237	visiblechanges = [e for e in group if e[0] in ('replace', 'insert')]
				1238	if visiblechanges:
				1239	for tag, _, _, j1, j2 in group:
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1240	if tag != 'delete':
				1241	for line in b[j1:j2]:
				1242	yield prefixmap[tag] + line
				1243
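A context diff hunk has a "from" half and a "to" half, each with its own range marker. The Python 3 sketch below splits the output of `context_diff` at the `--- ... ----` marker to show which prefixes appear on each side. The variable names are illustrative, and the file dates are omitted to keep the header lines simple.

```python
from difflib import context_diff

a = "one\ntwo\nthree\nfour\n".splitlines(keepends=True)
b = "zero\none\ntree\nfour\n".splitlines(keepends=True)

lines = list(context_diff(a, b, fromfile="Original", tofile="Current"))

# The first three lines are the two file headers and the row of asterisks.
# Find the '--- <range> ----' marker that starts the "to" half of the hunk.
split_at = next(
    i for i, line in enumerate(lines)
    if line.startswith("--- ") and line.rstrip().endswith("----")
)
from_half = lines[3:split_at]
to_half = lines[split_at:]

print("".join(from_half), end="")
print("".join(to_half), end="")
```

Deleted lines can only show up in the "from" half and inserted lines only in the "to" half, while replaced lines are marked with `! ` on both sides.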
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	1244	def ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK):
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1245	r"""
				1246	Compare `a` and `b` (lists of strings); return a `Differ`-style delta.
				1247
				1248	Optional keyword parameters `linejunk` and `charjunk` are for filter
				1249	functions (or None):
				1250
				1251	- linejunk: A function that should accept a single string argument, and
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	1252	return true iff the string is junk. The default is None, and is
				1253	recommended; as of Python 2.3, an adaptive notion of "noise" lines is
				1254	used that does a good job on its own.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1255
				1256	- charjunk: A function that should accept a string of length 1. The
				1257	default is module-level function IS_CHARACTER_JUNK, which filters out
				1258	whitespace characters (a blank or tab; note: bad idea to include newline
				1259	in this!).
				1260
				1261	Tools/scripts/ndiff.py is a command-line front-end to this function.
				1262
				1263	Example:
				1264
				1265	>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
				1266	... 'ore\ntree\nemu\n'.splitlines(1))
				1267	>>> print ''.join(diff),
				1268	- one
				1269	?  ^
				1270	+ ore
				1271	?  ^
				1272	- two
				1273	- three
				1274	?  -
				1275	+ tree
				1276	+ emu
				1277	"""
				1278	return Differ(linejunk, charjunk).compare(a, b)
				1279
				1280	def restore(delta, which):
				1281	r"""
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1282	Generate one of the two sequences that generated a delta.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1283
				1284	Given a `delta` produced by `Differ.compare()` or `ndiff()`, extract
				1285	lines originating from file 1 or 2 (parameter `which`), stripping off line
				1286	prefixes.
				1287
				1288	Examples:
				1289
				1290	>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
				1291	... 'ore\ntree\nemu\n'.splitlines(1))
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1292	>>> diff = list(diff)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1293	>>> print ''.join(restore(diff, 1)),
				1294	one
				1295	two
				1296	three
				1297	>>> print ''.join(restore(diff, 2)),
				1298	ore
				1299	tree
				1300	emu
				1301	"""
				1302	try:
				1303	tag = {1: "- ", 2: "+ "}[int(which)]
				1304	except KeyError:
				1305	raise ValueError, ('unknown delta choice (must be 1 or 2): %r'
				1306	% which)
				1307	prefixes = ("  ", tag)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1308	for line in delta:
				1309	if line[:2] in prefixes:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1310	yield line[2:]
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1311
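The round trip between `ndiff` and `restore` can be sketched as follows in Python 3. The variable names are illustrative. Note that `ndiff` returns a generator, so the delta is turned into a list before it is replayed twice.

```python
from difflib import ndiff, restore

a = "one\ntwo\nthree\n".splitlines(keepends=True)
b = "ore\ntree\nemu\n".splitlines(keepends=True)

# ndiff returns a generator; materialise it so it can be replayed twice.
delta = list(ndiff(a, b))

first = list(restore(delta, 1))
second = list(restore(delta, 2))

# Any other choice is rejected once the generator is iterated.
try:
    list(restore(delta, 3))
    rejected = False
except ValueError:
    rejected = True

print("".join(first), end="")
print("".join(second), end="")
```

Since `restore` is itself a generator, an invalid `which` value is only reported when the result is iterated, not at the call site.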
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	1312	def _test():
				1313	import doctest, difflib
				1314	return doctest.testmod(difflib)
				1315
				1316	if __name__ == "__main__":
				1317	_test()