Blame - Lib/difflib.py - platform/external/python/cpython3

blob: 529c78638c6b2e54f313f6265a5f91d0b3f6faa1

Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	1	#! /usr/bin/env python
				2
				3	"""
				4	Module difflib -- helpers for computing deltas between objects.
				5
				6	Function get_close_matches(word, possibilities, n=3, cutoff=0.6):
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	7	Use SequenceMatcher to return list of the best "good enough" matches.
				8
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	9	Function context_diff(a, b):
				10	For two lists of strings, return a delta in context diff format.
				11
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	12	Function ndiff(a, b):
				13	Return a delta: the difference between `a` and `b` (lists of strings).
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	14
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	15	Function restore(delta, which):
				16	Return one of the two sequences that generated an ndiff delta.
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	17
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	18	Function unified_diff(a, b):
				19	For two lists of strings, return a delta in unified diff format.
				20
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	21	Class SequenceMatcher:
				22	A flexible class for comparing pairs of sequences of any type.
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	23
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	24	Class Differ:
				25	For producing human-readable deltas from sequences of lines of text.
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	26	"""
				27
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	28	__all__ = ['get_close_matches', 'ndiff', 'restore', 'SequenceMatcher',
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	29	'Differ','IS_CHARACTER_JUNK', 'IS_LINE_JUNK', 'context_diff',
				30	'unified_diff']
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	31
Raymond Hettinger	bb6b734	2004-06-13 09:57:33 +0000	[diff] [blame]	32	import heapq
				33
Neal Norwitz	e7dfe21	2003-07-01 14:59:46 +0000	[diff] [blame]	34	def _calculate_ratio(matches, length):
				35	if length:
				36	return 2.0 * matches / length
				37	return 1.0
				38
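As a worked instance of the formula in `_calculate_ratio`: for "abcd" and "bcde" the matched run is "bcd", so matches = 3 and length = 4 + 4 = 8, giving 2.0 * 3 / 8 = 0.75. The sketch below is a Python 3 restatement for illustration only; the public name `calculate_ratio` is an assumption and not part of the module.

```python
def calculate_ratio(matches, length):
    """Similarity as 2*M/T; two empty sequences count as identical."""
    if length:
        return 2.0 * matches / length
    return 1.0


# "abcd" vs "bcde": 3 matched elements out of 8 total.
print(calculate_ratio(3, 8))
# Both sequences empty: the guard avoids a division by zero.
print(calculate_ratio(0, 0))
```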
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	39	class SequenceMatcher:
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	40
				41	"""
				42	SequenceMatcher is a flexible class for comparing pairs of sequences of
				43	any type, so long as the sequence elements are hashable. The basic
				44	algorithm predates, and is a little fancier than, an algorithm
				45	published in the late 1980's by Ratcliff and Obershelp under the
				46	hyperbolic name "gestalt pattern matching". The basic idea is to find
				47	the longest contiguous matching subsequence that contains no "junk"
				48	elements (R-O doesn't address junk). The same idea is then applied
				49	recursively to the pieces of the sequences to the left and to the right
				50	of the matching subsequence. This does not yield minimal edit
				51	sequences, but does tend to yield matches that "look right" to people.
				52
				53	SequenceMatcher tries to compute a "human-friendly diff" between two
				54	sequences. Unlike e.g. UNIX(tm) diff, the fundamental notion is the
				55	longest contiguous & junk-free matching subsequence. That's what
				56	catches people's eyes. The Windows(tm) windiff has another interesting
				57	notion, pairing up elements that appear uniquely in each sequence.
				58	That, and the method here, appear to yield more intuitive difference
				59	reports than does diff. This method appears to be the least vulnerable
				60	to synching up on blocks of "junk lines", though (like blank lines in
				61	ordinary text files, or maybe "<P>" lines in HTML files). That may be
				62	because this is the only method of the 3 that has a concept of
				63	"junk" <wink>.
				64
				65	Example, comparing two strings, and considering blanks to be "junk":
				66
				67	>>> s = SequenceMatcher(lambda x: x == " ",
				68	... "private Thread currentThread;",
				69	... "private volatile Thread currentThread;")
				70	>>>
				71
				72	.ratio() returns a float in [0, 1], measuring the "similarity" of the
				73	sequences. As a rule of thumb, a .ratio() value over 0.6 means the
				74	sequences are close matches:
				75
				76	>>> print round(s.ratio(), 3)
				77	0.866
				78	>>>
				79
				80	If you're only interested in where the sequences match,
				81	.get_matching_blocks() is handy:
				82
				83	>>> for block in s.get_matching_blocks():
				84	... print "a[%d] and b[%d] match for %d elements" % block
				85	a[0] and b[0] match for 8 elements
				86	a[8] and b[17] match for 6 elements
				87	a[14] and b[23] match for 15 elements
				88	a[29] and b[38] match for 0 elements
				89
				90	Note that the last tuple returned by .get_matching_blocks() is always a
				91	dummy, (len(a), len(b), 0), and this is the only case in which the last
				92	tuple element (number of elements matched) is 0.
				93
				94	If you want to know how to change the first sequence into the second,
				95	use .get_opcodes():
				96
				97	>>> for opcode in s.get_opcodes():
				98	... print "%6s a[%d:%d] b[%d:%d]" % opcode
				99	equal a[0:8] b[0:8]
				100	insert a[8:8] b[8:17]
				101	equal a[8:14] b[17:23]
				102	equal a[14:29] b[23:38]
				103
				104	See the Differ class for a fancy human-friendly file differencer, which
				105	uses SequenceMatcher both to compare sequences of lines, and to compare
				106	sequences of characters within similar (near-matching) lines.
				107
				108	See also function get_close_matches() in this module, which shows how
				109	simple code building on SequenceMatcher can be used to do useful work.
				110
				111	Timing: Basic R-O is cubic time worst case and quadratic time expected
				112	case. SequenceMatcher is quadratic time for the worst case and has
				113	expected-case behavior dependent in a complicated way on how many
				114	elements the sequences have in common; best case time is linear.
				115
				116	Methods:
				117
				118	__init__(isjunk=None, a='', b='')
				119	Construct a SequenceMatcher.
				120
				121	set_seqs(a, b)
				122	Set the two sequences to be compared.
				123
				124	set_seq1(a)
				125	Set the first sequence to be compared.
				126
				127	set_seq2(b)
				128	Set the second sequence to be compared.
				129
				130	find_longest_match(alo, ahi, blo, bhi)
				131	Find longest matching block in a[alo:ahi] and b[blo:bhi].
				132
				133	get_matching_blocks()
				134	Return list of triples describing matching subsequences.
				135
				136	get_opcodes()
				137	Return list of 5-tuples describing how to turn a into b.
				138
				139	ratio()
				140	Return a measure of the sequences' similarity (float in [0,1]).
				141
				142	quick_ratio()
				143	Return an upper bound on .ratio() relatively quickly.
				144
				145	real_quick_ratio()
				146	Return an upper bound on ratio() very quickly.
				147	"""
				148
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	149	def __init__(self, isjunk=None, a='', b=''):
				150	"""Construct a SequenceMatcher.
				151
				152	Optional arg isjunk is None (the default), or a one-argument
				153	function that takes a sequence element and returns true iff the
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	154	element is junk. None is equivalent to passing "lambda x: 0", i.e.
Fred Drake	f1da628	2001-02-19 19:30:05 +0000	[diff] [blame]	155	no elements are considered to be junk. For example, pass
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	156	lambda x: x in " \\t"
				157	if you're comparing lines as sequences of characters, and don't
				158	want to synch up on blanks or hard tabs.
				159
				160	Optional arg a is the first of two sequences to be compared. By
				161	default, an empty string. The elements of a must be hashable. See
				162	also .set_seqs() and .set_seq1().
				163
				164	Optional arg b is the second of two sequences to be compared. By
Fred Drake	f1da628	2001-02-19 19:30:05 +0000	[diff] [blame]	165	default, an empty string. The elements of b must be hashable. See
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	166	also .set_seqs() and .set_seq2().
				167	"""
				168
				169	# Members:
				170	# a
				171	# first sequence
				172	# b
				173	# second sequence; differences are computed as "what do
				174	# we need to do to 'a' to change it into 'b'?"
				175	# b2j
				176	# for x in b, b2j[x] is a list of the indices (into b)
				177	# at which x appears; junk elements do not appear
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	178	# fullbcount
				179	# for x in b, fullbcount[x] == the number of times x
				180	# appears in b; only materialized if really needed (used
				181	# only for computing quick_ratio())
				182	# matching_blocks
				183	# a list of (i, j, k) triples, where a[i:i+k] == b[j:j+k];
				184	# ascending & non-overlapping in i and in j; terminated by
				185	# a dummy (len(a), len(b), 0) sentinel
				186	# opcodes
				187	# a list of (tag, i1, i2, j1, j2) tuples, where tag is
				188	# one of
				189	# 'replace' a[i1:i2] should be replaced by b[j1:j2]
				190	# 'delete' a[i1:i2] should be deleted
				191	# 'insert' b[j1:j2] should be inserted
				192	# 'equal' a[i1:i2] == b[j1:j2]
				193	# isjunk
				194	# a user-supplied function taking a sequence element and
				195	# returning true iff the element is "junk" -- this has
				196	# subtle but helpful effects on the algorithm, which I'll
				197	# get around to writing up someday <0.9 wink>.
				198	# DON'T USE! Only __chain_b uses this. Use isbjunk.
				199	# isbjunk
				200	# for x in b, isbjunk(x) == isjunk(x) but much faster;
				201	# it's really the has_key method of a hidden dict.
				202	# DOES NOT WORK for x in a!
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	203	# isbpopular
				204	# for x in b, isbpopular(x) is true iff b is reasonably long
				205	# (at least 200 elements) and x accounts for more than 1% of
				206	# its elements. DOES NOT WORK for x in a!
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	207
				208	self.isjunk = isjunk
				209	self.a = self.b = None
				210	self.set_seqs(a, b)
				211
				212	def set_seqs(self, a, b):
				213	"""Set the two sequences to be compared.
				214
				215	>>> s = SequenceMatcher()
				216	>>> s.set_seqs("abcd", "bcde")
				217	>>> s.ratio()
				218	0.75
				219	"""
				220
				221	self.set_seq1(a)
				222	self.set_seq2(b)
				223
				224	def set_seq1(self, a):
				225	"""Set the first sequence to be compared.
				226
				227	The second sequence to be compared is not changed.
				228
				229	>>> s = SequenceMatcher(None, "abcd", "bcde")
				230	>>> s.ratio()
				231	0.75
				232	>>> s.set_seq1("bcde")
				233	>>> s.ratio()
				234	1.0
				235	>>>
				236
				237	SequenceMatcher computes and caches detailed information about the
				238	second sequence, so if you want to compare one sequence S against
				239	many sequences, use .set_seq2(S) once and call .set_seq1(x)
				240	repeatedly for each of the other sequences.
				241
				242	See also set_seqs() and set_seq2().
				243	"""
				244
				245	if a is self.a:
				246	return
				247	self.a = a
				248	self.matching_blocks = self.opcodes = None
				249
				250	def set_seq2(self, b):
				251	"""Set the second sequence to be compared.
				252
				253	The first sequence to be compared is not changed.
				254
				255	>>> s = SequenceMatcher(None, "abcd", "bcde")
				256	>>> s.ratio()
				257	0.75
				258	>>> s.set_seq2("abcd")
				259	>>> s.ratio()
				260	1.0
				261	>>>
				262
				263	SequenceMatcher computes and caches detailed information about the
				264	second sequence, so if you want to compare one sequence S against
				265	many sequences, use .set_seq2(S) once and call .set_seq1(x)
				266	repeatedly for each of the other sequences.
				267
				268	See also set_seqs() and set_seq1().
				269	"""
				270
				271	if b is self.b:
				272	return
				273	self.b = b
				274	self.matching_blocks = self.opcodes = None
				275	self.fullbcount = None
				276	self.__chain_b()
				277
				278	# For each element x in b, set b2j[x] to a list of the indices in
				279	# b where x appears; the indices are in increasing order; note that
				280	# the number of times x appears in b is len(b2j[x]) ...
				281	# when self.isjunk is defined, junk elements don't show up in this
				282	# map at all, which stops the central find_longest_match method
				283	# from starting any matching block at a junk element ...
				284	# also creates the fast isbjunk function ...
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	285	# b2j also does not contain entries for "popular" elements, meaning
				286	# elements that account for more than 1% of the total elements, and
				287	# when the sequence is reasonably large (>= 200 elements); this can
				288	# be viewed as an adaptive notion of semi-junk, and yields an enormous
				289	# speedup when, e.g., comparing program files with hundreds of
				290	# instances of "return NULL;" ...
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	291	# note that this is only called when b changes; so for cross-product
				292	# kinds of matches, it's best to call set_seq2 once, then set_seq1
				293	# repeatedly
				294
				295	def __chain_b(self):
				296	# Because isjunk is a user-defined (not C) function, and we test
				297	# for junk a LOT, it's important to minimize the number of calls.
				298	# Before the tricks described here, __chain_b was by far the most
				299	# time-consuming routine in the whole module! If anyone sees
				300	# Jim Roskind, thank him again for profile.py -- I never would
				301	# have guessed that.
				302	# The first trick is to build b2j ignoring the possibility
				303	# of junk. I.e., we don't call isjunk at all yet. Throwing
				304	# out the junk later is much cheaper than building b2j "right"
				305	# from the start.
				306	b = self.b
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	307	n = len(b)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	308	self.b2j = b2j = {}
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	309	populardict = {}
				310	for i, elt in enumerate(b):
				311	if elt in b2j:
				312	indices = b2j[elt]
				313	if n >= 200 and len(indices) * 100 > n:
				314	populardict[elt] = 1
				315	del indices[:]
				316	else:
				317	indices.append(i)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	318	else:
				319	b2j[elt] = [i]
				320
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	321	# Purge leftover indices for popular elements.
				322	for elt in populardict:
				323	del b2j[elt]
				324
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	325	# Now b2j.keys() contains elements uniquely, and especially when
				326	# the sequence is a string, that's usually a good deal smaller
				327	# than len(string). The difference is the number of isjunk calls
				328	# saved.
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	329	isjunk = self.isjunk
				330	junkdict = {}
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	331	if isjunk:
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	332	for d in populardict, b2j:
				333	for elt in d.keys():
				334	if isjunk(elt):
				335	junkdict[elt] = 1
				336	del d[elt]
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	337
Raymond Hettinger	54f0222	2002-06-01 14:18:47 +0000	[diff] [blame]	338	# Now for x in b, isjunk(x) == x in junkdict, but the
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	339	# latter is much faster. Note too that while there may be a
				340	# lot of junk in the sequence, the number of unique junk
				341	# elements is probably small. So the memory burden of keeping
				342	# this dict alive is likely trivial compared to the size of b2j.
				343	self.isbjunk = junkdict.has_key
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	344	self.isbpopular = populardict.has_key
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	345
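The index built by `__chain_b` can be sketched in Python 3 roughly as follows. This is a simplified illustration, not the module's code: the name `chain_b` and its return shape are assumptions, and it counts occurrences after the fact, so the exact popularity threshold may differ by one occurrence from the incremental check above.

```python
def chain_b(b, isjunk=None):
    """Return (b2j, junk, popular) for sequence b.

    b2j maps each element that is neither junk nor popular to the
    increasing list of indices at which it appears in b.
    """
    n = len(b)
    b2j = {}
    for i, elt in enumerate(b):
        b2j.setdefault(elt, []).append(i)

    # "Popular": more than 1% of a sequence of at least 200 elements.
    popular = set()
    if n >= 200:
        popular = {elt for elt, indices in b2j.items()
                   if len(indices) * 100 > n}

    # isjunk is called once per distinct element, not once per position.
    junk = set()
    if isjunk:
        junk = {elt for elt in b2j if isjunk(elt)}

    for elt in junk | popular:
        del b2j[elt]
    return b2j, junk, popular - junk


b2j, junk, popular = chain_b("abcd abcd", lambda x: x == " ")
print(b2j, junk, popular)
```

Calling `isjunk` only on distinct elements is the saving the comments above describe: for a string, the number of distinct characters is usually far smaller than the string's length.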
				346	def find_longest_match(self, alo, ahi, blo, bhi):
				347	"""Find longest matching block in a[alo:ahi] and b[blo:bhi].
				348
				349	If isjunk is not defined:
				350
				351	Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where
				352	alo <= i <= i+k <= ahi
				353	blo <= j <= j+k <= bhi
				354	and for all (i',j',k') meeting those conditions,
				355	k >= k'
				356	i <= i'
				357	and if i == i', j <= j'
				358
				359	In other words, of all maximal matching blocks, return one that
				360	starts earliest in a, and of all those maximal matching blocks that
				361	start earliest in a, return the one that starts earliest in b.
				362
				363	>>> s = SequenceMatcher(None, " abcd", "abcd abcd")
				364	>>> s.find_longest_match(0, 5, 0, 9)
				365	(0, 4, 5)
				366
				367	If isjunk is defined, first the longest matching block is
				368	determined as above, but with the additional restriction that no
				369	junk element appears in the block. Then that block is extended as
				370	far as possible by matching (only) junk elements on both sides. So
				371	the resulting block never matches on junk except as identical junk
				372	happens to be adjacent to an "interesting" match.
				373
				374	Here's the same example as before, but considering blanks to be
				375	junk. That prevents " abcd" from matching the " abcd" at the tail
				376	end of the second sequence directly. Instead only the "abcd" can
				377	match, and matches the leftmost "abcd" in the second sequence:
				378
				379	>>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
				380	>>> s.find_longest_match(0, 5, 0, 9)
				381	(1, 0, 4)
				382
				383	If no blocks match, return (alo, blo, 0).
				384
				385	>>> s = SequenceMatcher(None, "ab", "c")
				386	>>> s.find_longest_match(0, 2, 0, 1)
				387	(0, 0, 0)
				388	"""
				389
				390	# CAUTION: stripping common prefix or suffix would be incorrect.
				391	# E.g.,
				392	# ab
				393	# acab
				394	# Longest matching block is "ab", but if common prefix is
				395	# stripped, it's "a" (tied with "b"). UNIX(tm) diff does so
				396	# strip, so ends up claiming that ab is changed to acab by
				397	# inserting "ca" in the middle. That's minimal but unintuitive:
				398	# "it's obvious" that someone inserted "ac" at the front.
				399	# Windiff ends up at the same place as diff, but by pairing up
				400	# the unique 'b's and then matching the first two 'a's.
				401
				402	a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.isbjunk
				403	besti, bestj, bestsize = alo, blo, 0
				404	# find longest junk-free match
				405	# during an iteration of the loop, j2len[j] = length of longest
				406	# junk-free match ending with a[i-1] and b[j]
				407	j2len = {}
				408	nothing = []
				409	for i in xrange(alo, ahi):
				410	# look at all instances of a[i] in b; note that because
				411	# b2j has no junk keys, the loop is skipped if a[i] is junk
				412	j2lenget = j2len.get
				413	newj2len = {}
				414	for j in b2j.get(a[i], nothing):
				415	# a[i] matches b[j]
				416	if j < blo:
				417	continue
				418	if j >= bhi:
				419	break
				420	k = newj2len[j] = j2lenget(j-1, 0) + 1
				421	if k > bestsize:
				422	besti, bestj, bestsize = i-k+1, j-k+1, k
				423	j2len = newj2len
				424
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	425	# Extend the best by non-junk elements on each end. In particular,
				426	# "popular" non-junk elements aren't in b2j, which greatly speeds
				427	# the inner loop above, but also means "the best" match so far
				428	# doesn't contain any junk or popular non-junk elements.
				429	while besti > alo and bestj > blo and \
				430	not isbjunk(b[bestj-1]) and \
				431	a[besti-1] == b[bestj-1]:
				432	besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
				433	while besti+bestsize < ahi and bestj+bestsize < bhi and \
				434	not isbjunk(b[bestj+bestsize]) and \
				435	a[besti+bestsize] == b[bestj+bestsize]:
				436	bestsize += 1
				437
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	438	# Now that we have a wholly interesting match (albeit possibly
				439	# empty!), we may as well suck up the matching junk on each
				440	# side of it too. Can't think of a good reason not to, and it
				441	# saves post-processing the (possibly considerable) expense of
				442	# figuring out what to do with it. In the case of an empty
				443	# interesting match, this is clearly the right thing to do,
				444	# because no other kind of match is possible in the regions.
				445	while besti > alo and bestj > blo and \
				446	isbjunk(b[bestj-1]) and \
				447	a[besti-1] == b[bestj-1]:
				448	besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
				449	while besti+bestsize < ahi and bestj+bestsize < bhi and \
				450	isbjunk(b[bestj+bestsize]) and \
				451	a[besti+bestsize] == b[bestj+bestsize]:
				452	bestsize = bestsize + 1
				453
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	454	return besti, bestj, bestsize
				455
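The core loop of `find_longest_match` can be isolated as a small Python 3 sketch. It is a simplified illustration under stated assumptions: the function name `longest_match` is hypothetical, it always searches the whole of both sequences, and it omits junk and popular-element handling and the two extension passes.

```python
def longest_match(a, b):
    """Return (i, j, k) with a[i:i+k] == b[j:j+k] and k maximal.

    Ties are broken by earliest start in a, then earliest start in b.
    Returns (0, 0, 0) when nothing matches.
    """
    b2j = {}
    for j, elt in enumerate(b):
        b2j.setdefault(elt, []).append(j)

    besti, bestj, bestsize = 0, 0, 0
    # j2len[j] = length of the longest match ending at a[i-1] and b[j]
    j2len = {}
    for i, elt in enumerate(a):
        newj2len = {}
        for j in b2j.get(elt, ()):
            # a[i] == b[j], so extend whatever match ended at (i-1, j-1)
            k = newj2len[j] = j2len.get(j - 1, 0) + 1
            if k > bestsize:
                besti, bestj, bestsize = i - k + 1, j - k + 1, k
        j2len = newj2len
    return besti, bestj, bestsize


print(longest_match(" abcd", "abcd abcd"))
print(longest_match("ab", "acab"))
```

Only the previous row of match lengths is kept, so memory stays proportional to the number of positions in b that match the current element of a.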
				456	def get_matching_blocks(self):
				457	"""Return list of triples describing matching subsequences.
				458
				459	Each triple is of the form (i, j, n), and means that
				460	a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in
				461	i and in j.
				462
				463	The last triple is a dummy, (len(a), len(b), 0), and is the only
				464	triple with n==0.
				465
				466	>>> s = SequenceMatcher(None, "abxcd", "abcd")
				467	>>> s.get_matching_blocks()
				468	[(0, 0, 2), (3, 2, 2), (5, 4, 0)]
				469	"""
				470
				471	if self.matching_blocks is not None:
				472	return self.matching_blocks
				473	self.matching_blocks = []
				474	la, lb = len(self.a), len(self.b)
				475	self.__helper(0, la, 0, lb, self.matching_blocks)
				476	self.matching_blocks.append( (la, lb, 0) )
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	477	return self.matching_blocks
				478
				479	# builds list of matching blocks covering a[alo:ahi] and
				480	# b[blo:bhi], appending them in increasing order to answer
				481
				482	def __helper(self, alo, ahi, blo, bhi, answer):
				483	i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
				484	# a[alo:i] vs b[blo:j] unknown
				485	# a[i:i+k] same as b[j:j+k]
				486	# a[i+k:ahi] vs b[j+k:bhi] unknown
				487	if k:
				488	if alo < i and blo < j:
				489	self.__helper(alo, i, blo, j, answer)
				490	answer.append(x)
				491	if i+k < ahi and j+k < bhi:
				492	self.__helper(i+k, ahi, j+k, bhi, answer)
				493
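The divide-and-conquer step in `__helper` can be sketched in Python 3 on top of the standard library's `SequenceMatcher.find_longest_match`. The wrapper name `matching_blocks` is an assumption made for this illustration. The sketch passes `autojunk=False`, a parameter of later Python versions, so that only the recursion is shown.

```python
from difflib import SequenceMatcher


def matching_blocks(a, b):
    """Collect (i, j, k) blocks in increasing order, plus the sentinel."""
    s = SequenceMatcher(None, a, b, autojunk=False)
    answer = []

    def helper(alo, ahi, blo, bhi):
        i, j, k = s.find_longest_match(alo, ahi, blo, bhi)
        if k:
            # Recurse on the unmatched region to the left of the block...
            if alo < i and blo < j:
                helper(alo, i, blo, j)
            answer.append((i, j, k))
            # ...and on the unmatched region to its right.
            if i + k < ahi and j + k < bhi:
                helper(i + k, ahi, j + k, bhi)

    helper(0, len(a), 0, len(b))
    answer.append((len(a), len(b), 0))  # dummy terminator
    return answer


print(matching_blocks("abxcd", "abcd"))
```

Because the left region is visited before the block is appended and the right region after, the blocks come out already sorted in both i and j.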
				494	def get_opcodes(self):
				495	"""Return list of 5-tuples describing how to turn a into b.
				496
				497	Each tuple is of the form (tag, i1, i2, j1, j2). The first tuple
				498	has i1 == j1 == 0, and remaining tuples have i1 == the i2 from the
				499	tuple preceding it, and likewise for j1 == the previous j2.
				500
				501	The tags are strings, with these meanings:
				502
				503	'replace': a[i1:i2] should be replaced by b[j1:j2]
				504	'delete': a[i1:i2] should be deleted.
				505	Note that j1==j2 in this case.
				506	'insert': b[j1:j2] should be inserted at a[i1:i1].
				507	Note that i1==i2 in this case.
				508	'equal': a[i1:i2] == b[j1:j2]
				509
				510	>>> a = "qabxcd"
				511	>>> b = "abycdf"
				512	>>> s = SequenceMatcher(None, a, b)
				513	>>> for tag, i1, i2, j1, j2 in s.get_opcodes():
				514	... print ("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
				515	... (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))
				516	delete a[0:1] (q) b[0:0] ()
				517	equal a[1:3] (ab) b[0:2] (ab)
				518	replace a[3:4] (x) b[2:3] (y)
				519	equal a[4:6] (cd) b[3:5] (cd)
				520	insert a[6:6] () b[5:6] (f)
				521	"""
				522
				523	if self.opcodes is not None:
				524	return self.opcodes
				525	i = j = 0
				526	self.opcodes = answer = []
				527	for ai, bj, size in self.get_matching_blocks():
				528	# invariant: we've pumped out correct diffs to change
				529	# a[:i] into b[:j], and the next matching block is
				530	# a[ai:ai+size] == b[bj:bj+size]. So we need to pump
				531	# out a diff to change a[i:ai] into b[j:bj], pump out
				532	# the matching block, and move (i,j) beyond the match
				533	tag = ''
				534	if i < ai and j < bj:
				535	tag = 'replace'
				536	elif i < ai:
				537	tag = 'delete'
				538	elif j < bj:
				539	tag = 'insert'
				540	if tag:
				541	answer.append( (tag, i, ai, j, bj) )
				542	i, j = ai+size, bj+size
				543	# the list of matching blocks is terminated by a
				544	# sentinel with size 0
				545	if size:
				546	answer.append( ('equal', ai, i, bj, j) )
				547	return answer
				548
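The conversion from matching blocks to opcodes can be illustrated as a standalone Python 3 function. The name `opcodes_from_blocks` is hypothetical; the input is assumed to be a block list ending in the (len(a), len(b), 0) sentinel, as described above.

```python
def opcodes_from_blocks(blocks):
    """Turn sorted (ai, bj, size) blocks into (tag, i1, i2, j1, j2) opcodes."""
    i = j = 0
    answer = []
    for ai, bj, size in blocks:
        # The gap a[i:ai] vs b[j:bj] lies between two matching blocks.
        if i < ai and j < bj:
            tag = 'replace'
        elif i < ai:
            tag = 'delete'
        elif j < bj:
            tag = 'insert'
        else:
            tag = ''
        if tag:
            answer.append((tag, i, ai, j, bj))
        i, j = ai + size, bj + size
        if size:  # the sentinel has size 0 and emits no 'equal'
            answer.append(('equal', ai, i, bj, j))
    return answer


# Blocks for a = "qabxcd", b = "abycdf"
for op in opcodes_from_blocks([(1, 0, 2), (4, 3, 2), (6, 6, 0)]):
    print(op)
```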
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	549	def get_grouped_opcodes(self, n=3):
				550	""" Isolate change clusters by eliminating ranges with no changes.
				551
				552	Return a generator of groups with up to n lines of context.
				553	Each group is in the same format as returned by get_opcodes().
				554
				555	>>> from pprint import pprint
				556	>>> a = map(str, range(1,40))
				557	>>> b = a[:]
				558	>>> b[8:8] = ['i'] # Make an insertion
				559	>>> b[20] += 'x' # Make a replacement
				560	>>> b[23:28] = [] # Make a deletion
				561	>>> b[30] += 'y' # Make another replacement
				562	>>> pprint(list(SequenceMatcher(None,a,b).get_grouped_opcodes()))
				563	[[('equal', 5, 8, 5, 8), ('insert', 8, 8, 8, 9), ('equal', 8, 11, 9, 12)],
				564	[('equal', 16, 19, 17, 20),
				565	('replace', 19, 20, 20, 21),
				566	('equal', 20, 22, 21, 23),
				567	('delete', 22, 27, 23, 23),
				568	('equal', 27, 30, 23, 26)],
				569	[('equal', 31, 34, 27, 30),
				570	('replace', 34, 35, 30, 31),
				571	('equal', 35, 38, 31, 34)]]
				572	"""
				573
				574	codes = self.get_opcodes()
				575	# Fixup leading and trailing groups if they show no changes.
				576	if codes[0][0] == 'equal':
				577	tag, i1, i2, j1, j2 = codes[0]
				578	codes[0] = tag, max(i1, i2-n), i2, max(j1, j2-n), j2
				579	if codes[-1][0] == 'equal':
				580	tag, i1, i2, j1, j2 = codes[-1]
				581	codes[-1] = tag, i1, min(i2, i1+n), j1, min(j2, j1+n)
				582
				583	nn = n + n
				584	group = []
				585	for tag, i1, i2, j1, j2 in codes:
				586	# End the current group and start a new one whenever
				587	# there is a large range with no changes.
				588	if tag == 'equal' and i2-i1 > nn:
				589	group.append((tag, i1, min(i2, i1+n), j1, min(j2, j1+n)))
				590	yield group
				591	group = []
				592	i1, j1 = max(i1, i2-n), max(j1, j2-n)
				593	group.append((tag, i1, i2, j1 ,j2))
				594	if group and not (len(group)==1 and group[0][0] == 'equal'):
				595	yield group
				596
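A short Python 3 usage sketch of `get_grouped_opcodes`, reusing the docstring's data. It uses a list comprehension because `map` returns an iterator in Python 3; the summary format printed per group is just an illustration.

```python
from difflib import SequenceMatcher

a = [str(i) for i in range(1, 40)]
b = a[:]
b[8:8] = ['i']      # make an insertion
b[20] += 'x'        # make a replacement
b[23:28] = []       # make a deletion
b[30] += 'y'        # make another replacement

groups = list(SequenceMatcher(None, a, b).get_grouped_opcodes(n=3))

# Each group covers one cluster of changes plus up to n lines of
# context on either side; summarize the span each group covers.
for group in groups:
    first, last = group[0], group[-1]
    print("a[%d:%d] -> b[%d:%d]" % (first[1], last[2], first[3], last[4]))
```

Each group's span is what a unified or context diff would print as one hunk.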
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	597	def ratio(self):
				598	"""Return a measure of the sequences' similarity (float in [0,1]).
				599
				600	Where T is the total number of elements in both sequences, and
				601	M is the number of matches, this is 2.0*M / T.
				602	Note that this is 1 if the sequences are identical, and 0 if
				603	they have nothing in common.
				604
				605	.ratio() is expensive to compute if you haven't already computed
				606	.get_matching_blocks() or .get_opcodes(), in which case you may
				607	want to try .quick_ratio() or .real_quick_ratio() first to get an
				608	upper bound.
				609
				610	>>> s = SequenceMatcher(None, "abcd", "bcde")
				611	>>> s.ratio()
				612	0.75
				613	>>> s.quick_ratio()
				614	0.75
				615	>>> s.real_quick_ratio()
				616	1.0
				617	"""
				618
				619	matches = reduce(lambda sum, triple: sum + triple[-1],
				620	self.get_matching_blocks(), 0)
Neal Norwitz	e7dfe21	2003-07-01 14:59:46 +0000	[diff] [blame]	621	return _calculate_ratio(matches, len(self.a) + len(self.b))
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	622
				623	def quick_ratio(self):
				624	"""Return an upper bound on ratio() relatively quickly.
				625
				626	This isn't defined beyond that it is an upper bound on .ratio(), and
				627	is faster to compute.
				628	"""
				629
				630	# viewing a and b as multisets, set matches to the cardinality
				631	# of their intersection; this counts the number of matches
				632	# without regard to order, so is clearly an upper bound
				633	if self.fullbcount is None:
				634	self.fullbcount = fullbcount = {}
				635	for elt in self.b:
				636	fullbcount[elt] = fullbcount.get(elt, 0) + 1
				637	fullbcount = self.fullbcount
				638	# avail[x] is the number of times x appears in 'b' less the
				639	# number of times we've seen it in 'a' so far ... kinda
				640	avail = {}
				641	availhas, matches = avail.has_key, 0
				642	for elt in self.a:
				643	if availhas(elt):
				644	numb = avail[elt]
				645	else:
				646	numb = fullbcount.get(elt, 0)
				647	avail[elt] = numb - 1
				648	if numb > 0:
				649	matches = matches + 1
Neal Norwitz	e7dfe21	2003-07-01 14:59:46 +0000	[diff] [blame]	650	return _calculate_ratio(matches, len(self.a) + len(self.b))
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	651
				652	def real_quick_ratio(self):
				653	"""Return an upper bound on ratio() very quickly.
				654
				655	This isn't defined beyond that it is an upper bound on .ratio(), and
				656	is faster to compute than either .ratio() or .quick_ratio().
				657	"""
				658
				659	la, lb = len(self.a), len(self.b)
				660	# can't have more matches than the number of elements in the
				661	# shorter sequence
Neal Norwitz	e7dfe21	2003-07-01 14:59:46 +0000	[diff] [blame]	662	return _calculate_ratio(min(la, lb), la + lb)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	663
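The two upper bounds can be restated compactly in Python 3. This is a sketch that assumes `collections.Counter` as the multiset; the module itself uses plain dicts, and the free-function names here are illustrative only.

```python
from collections import Counter
from difflib import SequenceMatcher


def quick_ratio(a, b):
    """Upper bound on ratio(): size of the multiset intersection."""
    matches = sum((Counter(a) & Counter(b)).values())
    total = len(a) + len(b)
    return 2.0 * matches / total if total else 1.0


def real_quick_ratio(a, b):
    """Cruder upper bound: the shorter sequence limits the matches."""
    total = len(a) + len(b)
    return 2.0 * min(len(a), len(b)) / total if total else 1.0


for a, b in [("abcd", "bcde"), ("abcd", "dcba")]:
    exact = SequenceMatcher(None, a, b).ratio()
    print(a, b, exact, quick_ratio(a, b), real_quick_ratio(a, b))
```

The pair "abcd" and "dcba" shows why these are only bounds: every element is shared, so both quick estimates are 1.0, yet the order differs and the exact ratio is much lower.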
				664	def get_close_matches(word, possibilities, n=3, cutoff=0.6):
				665	"""Use SequenceMatcher to return list of the best "good enough" matches.
				666
				667	word is a sequence for which close matches are desired (typically a
				668	string).
				669
				670	possibilities is a list of sequences against which to match word
				671	(typically a list of strings).
				672
				673	Optional arg n (default 3) is the maximum number of close matches to
				674	return. n must be > 0.
				675
				676	Optional arg cutoff (default 0.6) is a float in [0, 1]. Possibilities
				677	that don't score at least that similar to word are ignored.
				678
				679	The best (no more than n) matches among the possibilities are returned
				680	in a list, sorted by similarity score, most similar first.
				681
				682	>>> get_close_matches("appel", ["ape", "apple", "peach", "puppy"])
				683	['apple', 'ape']
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	684	>>> import keyword as _keyword
				685	>>> get_close_matches("wheel", _keyword.kwlist)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	686	['while']
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	687	>>> get_close_matches("apple", _keyword.kwlist)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	688	[]
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	689	>>> get_close_matches("accept", _keyword.kwlist)
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	690	['except']
				691	"""
				692
				693	if not n > 0:
Walter Dörwald	70a6b49	2004-02-12 17:35:32 +0000	[diff] [blame]	694	raise ValueError("n must be > 0: %r" % (n,))
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	695	if not 0.0 <= cutoff <= 1.0:
Walter Dörwald	70a6b49	2004-02-12 17:35:32 +0000	[diff] [blame]	696	raise ValueError("cutoff must be in [0.0, 1.0]: %r" % (cutoff,))
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	697	result = []
				698	s = SequenceMatcher()
				699	s.set_seq2(word)
				700	for x in possibilities:
				701	s.set_seq1(x)
				702	if s.real_quick_ratio() >= cutoff and \
				703	s.quick_ratio() >= cutoff and \
				704	s.ratio() >= cutoff:
				705	result.append((s.ratio(), x))
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	706
Raymond Hettinger	6b59f5f	2003-10-16 05:53:16 +0000	[diff] [blame]	707	# Move the best scorers to head of list
Raymond Hettinger	aefde43	2004-06-15 23:53:35 +0000	[diff] [blame^]	708	result = heapq.nlargest(n, result)
Raymond Hettinger	6b59f5f	2003-10-16 05:53:16 +0000	[diff] [blame]	709	# Strip scores for the best n matches
Raymond Hettinger	bb6b734	2004-06-13 09:57:33 +0000	[diff] [blame]	710	return [x for score, x in result]
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	711
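# A minimal usage sketch for get_close_matches(), written for Python 3 (the
# listing here is Python 2 era). It reuses the docstring's candidate list to
# show how `n` and `cutoff` shape the result; the variable names are
# illustrative only and are not part of difflib.

```python
from difflib import get_close_matches

candidates = ["ape", "apple", "peach", "puppy"]

# Defaults (n=3, cutoff=0.6): best matches first.
default_matches = get_close_matches("appel", candidates)

# n caps how many matches come back.
best_only = get_close_matches("appel", candidates, n=1)

# A stricter cutoff rejects every candidate here ("apple" scores 0.8).
strict_matches = get_close_matches("appel", candidates, cutoff=0.9)

# n must be positive; zero is rejected before any matching is done.
try:
    get_close_matches("appel", candidates, n=0)
    rejected = False
except ValueError:
    rejected = True

print(default_matches, best_only, strict_matches, rejected)
```

# Raising the cutoff is usually the simplest way to suppress weak suggestions.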
				712	def _count_leading(line, ch):
				713	"""
				714	Return number of `ch` characters at the start of `line`.
				715
				716	Example:
				717
				718	>>> _count_leading('   abc', ' ')
				719	3
				720	"""
				721
				722	i, n = 0, len(line)
				723	while i < n and line[i] == ch:
				724	i += 1
				725	return i
				726
				727	class Differ:
				728	r"""
				729	Differ is a class for comparing sequences of lines of text, and
				730	producing human-readable differences or deltas. Differ uses
				731	SequenceMatcher both to compare sequences of lines, and to compare
				732	sequences of characters within similar (near-matching) lines.
				733
				734	Each line of a Differ delta begins with a two-letter code:
				735
				736	'- ' line unique to sequence 1
				737	'+ ' line unique to sequence 2
				738	'  ' line common to both sequences
				739	'? ' line not present in either input sequence
				740
				741	Lines beginning with '? ' attempt to guide the eye to intraline
				742	differences, and were not present in either input sequence. These lines
				743	can be confusing if the sequences contain tab characters.
				744
				745	Note that Differ makes no claim to produce a minimal diff. To the
				746	contrary, minimal diffs are often counter-intuitive, because they synch
				747	up anywhere possible, sometimes accidental matches 100 pages apart.
				748	Restricting synch points to contiguous matches preserves some notion of
				749	locality, at the occasional cost of producing a longer diff.
				750
				751	Example: Comparing two texts.
				752
				753	First we set up the texts, sequences of individual single-line strings
				754	ending with newlines (such sequences can also be obtained from the
				755	`readlines()` method of file-like objects):
				756
				757	>>> text1 = ''' 1. Beautiful is better than ugly.
				758	... 2. Explicit is better than implicit.
				759	... 3. Simple is better than complex.
				760	... 4. Complex is better than complicated.
				761	... '''.splitlines(1)
				762	>>> len(text1)
				763	4
				764	>>> text1[0][-1]
				765	'\n'
				766	>>> text2 = ''' 1. Beautiful is better than ugly.
				767	... 3. Simple is better than complex.
				768	... 4. Complicated is better than complex.
				769	... 5. Flat is better than nested.
				770	... '''.splitlines(1)
				771
				772	Next we instantiate a Differ object:
				773
				774	>>> d = Differ()
				775
				776	Note that when instantiating a Differ object we may pass functions to
				777	filter out line and character 'junk'. See Differ.__init__ for details.
				778
				779	Finally, we compare the two:
				780
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	781	>>> result = list(d.compare(text1, text2))
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	782
				783	'result' is a list of strings, so let's pretty-print it:
				784
				785	>>> from pprint import pprint as _pprint
				786	>>> _pprint(result)
				787	[' 1. Beautiful is better than ugly.\n',
				788	'- 2. Explicit is better than implicit.\n',
				789	'- 3. Simple is better than complex.\n',
				790	'+ 3. Simple is better than complex.\n',
				791	'? ++\n',
				792	'- 4. Complex is better than complicated.\n',
				793	'? ^ ---- ^\n',
				794	'+ 4. Complicated is better than complex.\n',
				795	'? ++++ ^ ^\n',
				796	'+ 5. Flat is better than nested.\n']
				797
				798	As a single multi-line string it looks like this:
				799
				800	>>> print ''.join(result),
				801	1. Beautiful is better than ugly.
				802	- 2. Explicit is better than implicit.
				803	- 3. Simple is better than complex.
				804	+ 3. Simple is better than complex.
				805	? ++
				806	- 4. Complex is better than complicated.
				807	? ^ ---- ^
				808	+ 4. Complicated is better than complex.
				809	? ++++ ^ ^
				810	+ 5. Flat is better than nested.
				811
				812	Methods:
				813
				814	__init__(linejunk=None, charjunk=None)
				815	Construct a text differencer, with optional filters.
				816
				817	compare(a, b)
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	818	Compare two sequences of lines; generate the resulting delta.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	819	"""
				820
				821	def __init__(self, linejunk=None, charjunk=None):
				822	"""
				823	Construct a text differencer, with optional filters.
				824
				825	The two optional keyword parameters are for filter functions:
				826
				827	- `linejunk`: A function that should accept a single string argument,
				828	and return true iff the string is junk. The module-level function
				829	`IS_LINE_JUNK` may be used to filter out lines without visible
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	830	characters, except for at most one splat ('#'). It is recommended
				831	to leave linejunk None; as of Python 2.3, the underlying
				832	SequenceMatcher class has grown an adaptive notion of "noise" lines
				833	that's better than any static definition the author has ever been
				834	able to craft.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	835
				836	- `charjunk`: A function that should accept a string of length 1. The
				837	module-level function `IS_CHARACTER_JUNK` may be used to filter out
				838	whitespace characters (a blank or tab; note: bad idea to include
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	839	newline in this!). Use of IS_CHARACTER_JUNK is recommended.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	840	"""
				841
				842	self.linejunk = linejunk
				843	self.charjunk = charjunk
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	844
				845	def compare(self, a, b):
				846	r"""
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	847	Compare two sequences of lines; generate the resulting delta.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	848
				849	Each sequence must contain individual single-line strings ending with
				850	newlines. Such sequences can be obtained from the `readlines()` method
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	851	of file-like objects. The delta generated also consists of newline-
				852	terminated strings, ready to be printed as-is via the writelines()
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	853	method of a file-like object.
				854
				855	Example:
				856
				857	>>> print ''.join(Differ().compare('one\ntwo\nthree\n'.splitlines(1),
				858	... 'ore\ntree\nemu\n'.splitlines(1))),
				859	- one
				860	? ^
				861	+ ore
				862	? ^
				863	- two
				864	- three
				865	? -
				866	+ tree
				867	+ emu
				868	"""
				869
				870	cruncher = SequenceMatcher(self.linejunk, a, b)
				871	for tag, alo, ahi, blo, bhi in cruncher.get_opcodes():
				872	if tag == 'replace':
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	873	g = self._fancy_replace(a, alo, ahi, b, blo, bhi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	874	elif tag == 'delete':
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	875	g = self._dump('-', a, alo, ahi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	876	elif tag == 'insert':
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	877	g = self._dump('+', b, blo, bhi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	878	elif tag == 'equal':
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	879	g = self._dump(' ', a, alo, ahi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	880	else:
Walter Dörwald	70a6b49	2004-02-12 17:35:32 +0000	[diff] [blame]	881	raise ValueError, 'unknown tag %r' % (tag,)
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	882
				883	for line in g:
				884	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	885
				886	def _dump(self, tag, x, lo, hi):
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	887	"""Generate comparison results for a same-tagged range."""
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	888	for i in xrange(lo, hi):
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	889	yield '%s %s' % (tag, x[i])
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	890
				891	def _plain_replace(self, a, alo, ahi, b, blo, bhi):
				892	assert alo < ahi and blo < bhi
				893	# dump the shorter block first -- reduces the burden on short-term
				894	# memory if the blocks are of very different sizes
				895	if bhi - blo < ahi - alo:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	896	first = self._dump('+', b, blo, bhi)
				897	second = self._dump('-', a, alo, ahi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	898	else:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	899	first = self._dump('-', a, alo, ahi)
				900	second = self._dump('+', b, blo, bhi)
				901
				902	for g in first, second:
				903	for line in g:
				904	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	905
				906	def _fancy_replace(self, a, alo, ahi, b, blo, bhi):
				907	r"""
				908	When replacing one block of lines with another, search the blocks
				909	for similar lines; the best-matching pair (if any) is used as a
				910	synch point, and intraline difference marking is done on the
				911	similar pair. Lots of work, but often worth it.
				912
				913	Example:
				914
				915	>>> d = Differ()
Raymond Hettinger	83325e9	2003-07-16 04:32:32 +0000	[diff] [blame]	916	>>> results = d._fancy_replace(['abcDefghiJkl\n'], 0, 1,
				917	... ['abcdefGhijkl\n'], 0, 1)
				918	>>> print ''.join(results),
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	919	- abcDefghiJkl
				920	? ^ ^ ^
				921	+ abcdefGhijkl
				922	? ^ ^ ^
				923	"""
				924
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	925	# don't synch up unless the lines have a similarity score of at
				926	# least cutoff; best_ratio tracks the best score seen so far
				927	best_ratio, cutoff = 0.74, 0.75
				928	cruncher = SequenceMatcher(self.charjunk)
				929	eqi, eqj = None, None # 1st indices of equal lines (if any)
				930
				931	# search for the pair that matches best without being identical
				932	# (identical lines must be junk lines, & we don't want to synch up
				933	# on junk -- unless we have to)
				934	for j in xrange(blo, bhi):
				935	bj = b[j]
				936	cruncher.set_seq2(bj)
				937	for i in xrange(alo, ahi):
				938	ai = a[i]
				939	if ai == bj:
				940	if eqi is None:
				941	eqi, eqj = i, j
				942	continue
				943	cruncher.set_seq1(ai)
				944	# computing similarity is expensive, so use the quick
				945	# upper bounds first -- have seen this speed up messy
				946	# compares by a factor of 3.
				947	# note that ratio() is only expensive to compute the first
				948	# time it's called on a sequence pair; the expensive part
				949	# of the computation is cached by cruncher
				950	if cruncher.real_quick_ratio() > best_ratio and \
				951	cruncher.quick_ratio() > best_ratio and \
				952	cruncher.ratio() > best_ratio:
				953	best_ratio, best_i, best_j = cruncher.ratio(), i, j
				954	if best_ratio < cutoff:
				955	# no non-identical "pretty close" pair
				956	if eqi is None:
				957	# no identical pair either -- treat it as a straight replace
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	958	for line in self._plain_replace(a, alo, ahi, b, blo, bhi):
				959	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	960	return
				961	# no close pair, but an identical pair -- synch up on that
				962	best_i, best_j, best_ratio = eqi, eqj, 1.0
				963	else:
				964	# there's a close pair, so forget the identical pair (if any)
				965	eqi = None
				966
				967	# a[best_i] very similar to b[best_j]; eqi is None iff they're not
				968	# identical
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	969
				970	# pump out diffs from before the synch point
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	971	for line in self._fancy_helper(a, alo, best_i, b, blo, best_j):
				972	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	973
				974	# do intraline marking on the synch pair
				975	aelt, belt = a[best_i], b[best_j]
				976	if eqi is None:
				977	# pump out a '-', '?', '+', '?' quad for the synched lines
				978	atags = btags = ""
				979	cruncher.set_seqs(aelt, belt)
				980	for tag, ai1, ai2, bj1, bj2 in cruncher.get_opcodes():
				981	la, lb = ai2 - ai1, bj2 - bj1
				982	if tag == 'replace':
				983	atags += '^' * la
				984	btags += '^' * lb
				985	elif tag == 'delete':
				986	atags += '-' * la
				987	elif tag == 'insert':
				988	btags += '+' * lb
				989	elif tag == 'equal':
				990	atags += ' ' * la
				991	btags += ' ' * lb
				992	else:
Walter Dörwald	70a6b49	2004-02-12 17:35:32 +0000	[diff] [blame]	993	raise ValueError, 'unknown tag %r' % (tag,)
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	994	for line in self._qformat(aelt, belt, atags, btags):
				995	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	996	else:
				997	# the synch pair is identical
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	998	yield ' ' + aelt
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	999
				1000	# pump out diffs from after the synch point
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1001	for line in self._fancy_helper(a, best_i+1, ahi, b, best_j+1, bhi):
				1002	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1003
				1004	def _fancy_helper(self, a, alo, ahi, b, blo, bhi):
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1005	g = []
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1006	if alo < ahi:
				1007	if blo < bhi:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1008	g = self._fancy_replace(a, alo, ahi, b, blo, bhi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1009	else:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1010	g = self._dump('-', a, alo, ahi)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1011	elif blo < bhi:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1012	g = self._dump('+', b, blo, bhi)
				1013
				1014	for line in g:
				1015	yield line
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1016
				1017	def _qformat(self, aline, bline, atags, btags):
				1018	r"""
				1019	Format "?" output and deal with leading tabs.
				1020
				1021	Example:
				1022
				1023	>>> d = Differ()
Raymond Hettinger	83325e9	2003-07-16 04:32:32 +0000	[diff] [blame]	1024	>>> results = d._qformat('\tabcDefghiJkl\n', '\t\tabcdefGhijkl\n',
				1025	... ' ^ ^ ^ ', '+ ^ ^ ^ ')
				1026	>>> for line in results: print repr(line)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1027	...
				1028	'- \tabcDefghiJkl\n'
				1029	'? \t ^ ^ ^\n'
				1030	'+ \t\tabcdefGhijkl\n'
				1031	'? \t ^ ^ ^\n'
				1032	"""
				1033
				1034	# Can hurt, but will probably help most of the time.
				1035	common = min(_count_leading(aline, "\t"),
				1036	_count_leading(bline, "\t"))
				1037	common = min(common, _count_leading(atags[:common], " "))
				1038	atags = atags[common:].rstrip()
				1039	btags = btags[common:].rstrip()
				1040
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1041	yield "- " + aline
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1042	if atags:
Tim Peters	527e64f	2001-10-04 05:36:56 +0000	[diff] [blame]	1043	yield "? %s%s\n" % ("\t" * common, atags)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1044
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1045	yield "+ " + bline
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1046	if btags:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1047	yield "? %s%s\n" % ("\t" * common, btags)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1048
				1049	# With respect to junk, an earlier version of ndiff simply refused to
				1050	# start a match with a junk element. The result was cases like this:
				1051	# before: private Thread currentThread;
				1052	# after: private volatile Thread currentThread;
				1053	# If you consider whitespace to be junk, the longest contiguous match
				1054	# not starting with junk is "e Thread currentThread". So ndiff reported
				1055	# that "e volatil" was inserted between the 't' and the 'e' in "private".
				1056	# While an accurate view, to people that's absurd. The current version
				1057	# looks for matching blocks that are entirely junk-free, then extends the
				1058	# longest one of those as far as possible but only with matching junk.
				1059	# So now "currentThread" is matched, then extended to suck up the
				1060	# preceding blank; then "private" is matched, and extended to suck up the
				1061	# following blank; then "Thread" is matched; and finally ndiff reports
				1062	# that "volatile " was inserted before "Thread". The only quibble
				1063	# remaining is that perhaps it was really the case that " volatile"
				1064	# was inserted after "private". I can live with that <wink>.
				1065
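# The behaviour described in the comment above can be observed directly with
# SequenceMatcher. This is a small Python 3 sketch, assuming a junk filter that
# treats only a blank as junk; the names `before`, `after` and `inserted` are
# illustrative.

```python
from difflib import SequenceMatcher

before = "private Thread currentThread;"
after = "private volatile Thread currentThread;"

# Treat a blank as junk, as the comment above assumes.
matcher = SequenceMatcher(lambda ch: ch == " ", before, after)

opcodes = matcher.get_opcodes()

# Collect the text that the matcher reports as inserted into `after`.
inserted = "".join(after[j1:j2]
                   for tag, i1, i2, j1, j2 in opcodes
                   if tag == "insert")

for tag, i1, i2, j1, j2 in opcodes:
    print("%7s a[%d:%d] b[%d:%d]" % (tag, i1, i2, j1, j2))
print(repr(inserted))
```

# The single insertion is reported as a whole word plus its trailing blank,
# rather than as a fragment spliced into the middle of "private".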
				1066	import re
				1067
				1068	def IS_LINE_JUNK(line, pat=re.compile(r"\s*#?\s*$").match):
				1069	r"""
				1070	Return True for ignorable line: iff `line` is blank or contains a single '#'.
				1071
				1072	Examples:
				1073
				1074	>>> IS_LINE_JUNK('\n')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1075	True
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1076	>>> IS_LINE_JUNK(' # \n')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1077	True
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1078	>>> IS_LINE_JUNK('hello\n')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1079	False
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1080	"""
				1081
				1082	return pat(line) is not None
				1083
				1084	def IS_CHARACTER_JUNK(ch, ws=" \t"):
				1085	r"""
				1086	Return True for ignorable character: iff `ch` is a space or tab.
				1087
				1088	Examples:
				1089
				1090	>>> IS_CHARACTER_JUNK(' ')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1091	True
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1092	>>> IS_CHARACTER_JUNK('\t')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1093	True
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1094	>>> IS_CHARACTER_JUNK('\n')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1095	False
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1096	>>> IS_CHARACTER_JUNK('x')
Guido van Rossum	77f6a65	2002-04-03 22:41:51 +0000	[diff] [blame]	1097	False
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1098	"""
				1099
				1100	return ch in ws
				1101
				1102	del re
				1103
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1104
				1105	def unified_diff(a, b, fromfile='', tofile='', fromfiledate='',
				1106	tofiledate='', n=3, lineterm='\n'):
				1107	r"""
				1108	Compare two sequences of lines; generate the delta as a unified diff.
				1109
				1110	Unified diffs are a compact way of showing line changes and a few
				1111	lines of context. The number of context lines is set by 'n' which
				1112	defaults to three.
				1113
Raymond Hettinger	0887c73	2003-06-17 16:53:25 +0000	[diff] [blame]	1114	By default, the diff control lines (those with ---, +++, or @@) are
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1115	created with a trailing newline. This is helpful so that inputs
				1116	created from file.readlines() result in diffs that are suitable for
				1117	file.writelines() since both the inputs and outputs have trailing
				1118	newlines.
				1119
				1120	For inputs that do not have trailing newlines, set the lineterm
				1121	argument to "" so that the output will be uniformly newline free.
				1122
				1123	The unidiff format normally has a header for filenames and modification
				1124	times. Any or all of these may be specified using strings for
				1125	'fromfile', 'tofile', 'fromfiledate', and 'tofiledate'. The modification
				1126	times are normally expressed in the format returned by time.ctime().
				1127
				1128	Example:
				1129
				1130	>>> for line in unified_diff('one two three four'.split(),
				1131	... 'zero one tree four'.split(), 'Original', 'Current',
				1132	... 'Sat Jan 26 23:30:50 1991', 'Fri Jun 06 10:20:52 2003',
				1133	... lineterm=''):
				1134	... print line
				1135	--- Original Sat Jan 26 23:30:50 1991
				1136	+++ Current Fri Jun 06 10:20:52 2003
				1137	@@ -1,4 +1,4 @@
				1138	+zero
				1139	one
				1140	-two
				1141	-three
				1142	+tree
				1143	four
				1144	"""
				1145
				1146	started = False
				1147	for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n):
				1148	if not started:
				1149	yield '--- %s %s%s' % (fromfile, fromfiledate, lineterm)
				1150	yield '+++ %s %s%s' % (tofile, tofiledate, lineterm)
				1151	started = True
				1152	i1, i2, j1, j2 = group[0][1], group[-1][2], group[0][3], group[-1][4]
				1153	yield "@@ -%d,%d +%d,%d @@%s" % (i1+1, i2-i1, j1+1, j2-j1, lineterm)
				1154	for tag, i1, i2, j1, j2 in group:
				1155	if tag == 'equal':
				1156	for line in a[i1:i2]:
				1157	yield ' ' + line
				1158	continue
				1159	if tag == 'replace' or tag == 'delete':
				1160	for line in a[i1:i2]:
				1161	yield '-' + line
				1162	if tag == 'replace' or tag == 'insert':
				1163	for line in b[j1:j2]:
				1164	yield '+' + line
				1165
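# A short Python 3 sketch of unified_diff() on inputs without trailing
# newlines, which is the case where lineterm="" matters. The file labels are
# placeholders taken from the docstring example; nothing is read from disk.

```python
from difflib import unified_diff

old = "one two three four".split()
new = "zero one tree four".split()

# lineterm="" because the inputs carry no trailing newlines.
lines = list(unified_diff(old, new, fromfile="Original", tofile="Current",
                          lineterm=""))
for line in lines:
    print(line)

# Identical inputs produce no output at all, not even the header lines.
no_change = list(unified_diff(old, old, lineterm=""))
```

# Because the result is a generator, wrap it in list() when it has to be
# inspected more than once.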
				1166	# See http://www.unix.org/single_unix_specification/
				1167	def context_diff(a, b, fromfile='', tofile='',
				1168	fromfiledate='', tofiledate='', n=3, lineterm='\n'):
				1169	r"""
				1170	Compare two sequences of lines; generate the delta as a context diff.
				1171
				1172	Context diffs are a compact way of showing line changes and a few
				1173	lines of context. The number of context lines is set by 'n' which
				1174	defaults to three.
				1175
				1176	By default, the diff control lines (those with *** or ---) are
				1177	created with a trailing newline. This is helpful so that inputs
				1178	created from file.readlines() result in diffs that are suitable for
				1179	file.writelines() since both the inputs and outputs have trailing
				1180	newlines.
				1181
				1182	For inputs that do not have trailing newlines, set the lineterm
				1183	argument to "" so that the output will be uniformly newline free.
				1184
				1185	The context diff format normally has a header for filenames and
				1186	modification times. Any or all of these may be specified using
				1187	strings for 'fromfile', 'tofile', 'fromfiledate', and 'tofiledate'.
				1188	The modification times are normally expressed in the format returned
				1189	by time.ctime(). If not specified, the strings default to blanks.
				1190
				1191	Example:
				1192
				1193	>>> print ''.join(context_diff('one\ntwo\nthree\nfour\n'.splitlines(1),
				1194	... 'zero\none\ntree\nfour\n'.splitlines(1), 'Original', 'Current',
				1195	... 'Sat Jan 26 23:30:50 1991', 'Fri Jun 06 10:22:46 2003')),
				1196	*** Original Sat Jan 26 23:30:50 1991
				1197	--- Current Fri Jun 06 10:22:46 2003
				1198	***************
				1199	*** 1,4 ****
				1200	one
				1201	! two
				1202	! three
				1203	four
				1204	--- 1,4 ----
				1205	+ zero
				1206	one
				1207	! tree
				1208	four
				1209	"""
				1210
				1211	started = False
Raymond Hettinger	7f2d302	2003-06-08 19:38:42 +0000	[diff] [blame]	1212	prefixmap = {'insert':'+ ', 'delete':'- ', 'replace':'! ', 'equal':' '}
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1213	for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n):
				1214	if not started:
				1215	yield '*** %s %s%s' % (fromfile, fromfiledate, lineterm)
				1216	yield '--- %s %s%s' % (tofile, tofiledate, lineterm)
				1217	started = True
Raymond Hettinger	7f2d302	2003-06-08 19:38:42 +0000	[diff] [blame]	1218
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1219	yield '***************%s' % (lineterm,)
				1220	if group[-1][2] - group[0][1] >= 2:
				1221	yield '*** %d,%d ****%s' % (group[0][1]+1, group[-1][2], lineterm)
				1222	else:
				1223	yield '*** %d ****%s' % (group[-1][2], lineterm)
Raymond Hettinger	7f2d302	2003-06-08 19:38:42 +0000	[diff] [blame]	1224	visiblechanges = [e for e in group if e[0] in ('replace', 'delete')]
				1225	if visiblechanges:
				1226	for tag, i1, i2, _, _ in group:
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1227	if tag != 'insert':
				1228	for line in a[i1:i2]:
				1229	yield prefixmap[tag] + line
Raymond Hettinger	7f2d302	2003-06-08 19:38:42 +0000	[diff] [blame]	1230
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1231	if group[-1][4] - group[0][3] >= 2:
				1232	yield '--- %d,%d ----%s' % (group[0][3]+1, group[-1][4], lineterm)
				1233	else:
				1234	yield '--- %d ----%s' % (group[-1][4], lineterm)
Raymond Hettinger	7f2d302	2003-06-08 19:38:42 +0000	[diff] [blame]	1235	visiblechanges = [e for e in group if e[0] in ('replace', 'insert')]
				1236	if visiblechanges:
				1237	for tag, _, _, j1, j2 in group:
Raymond Hettinger	f0b1a1f	2003-06-08 11:07:08 +0000	[diff] [blame]	1238	if tag != 'delete':
				1239	for line in b[j1:j2]:
				1240	yield prefixmap[tag] + line
				1241
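# A Python 3 sketch of context_diff() on newline-terminated input, the shape
# that readlines() would produce. It shows the hunk separator and the two range
# headers of the context format; the labels "Original" and "Current" are
# placeholders.

```python
from difflib import context_diff

old = "one\ntwo\nthree\nfour\n".splitlines(keepends=True)
new = "zero\none\ntree\nfour\n".splitlines(keepends=True)

# Default lineterm="\n" matches the newline-terminated inputs.
lines = list(context_diff(old, new, fromfile="Original", tofile="Current"))
print("".join(lines), end="")
```

# Changed lines are marked "! " on both sides, while pure additions and
# deletions use "+ " and "- ".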
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	1242	def ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK):
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1243	r"""
				1244	Compare `a` and `b` (lists of strings); return a `Differ`-style delta.
				1245
				1246	Optional keyword parameters `linejunk` and `charjunk` are for filter
				1247	functions (or None):
				1248
				1249	- linejunk: A function that should accept a single string argument, and
Tim Peters	81b9251	2002-04-29 01:37:32 +0000	[diff] [blame]	1250	return true iff the string is junk. The default is None, and is
				1251	recommended; as of Python 2.3, an adaptive notion of "noise" lines is
				1252	used that does a good job on its own.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1253
				1254	- charjunk: A function that should accept a string of length 1. The
				1255	default is module-level function IS_CHARACTER_JUNK, which filters out
				1256	whitespace characters (a blank or tab; note: bad idea to include newline
				1257	in this!).
				1258
				1259	Tools/scripts/ndiff.py is a command-line front-end to this function.
				1260
				1261	Example:
				1262
				1263	>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
				1264	... 'ore\ntree\nemu\n'.splitlines(1))
				1265	>>> print ''.join(diff),
				1266	- one
				1267	? ^
				1268	+ ore
				1269	? ^
				1270	- two
				1271	- three
				1272	? -
				1273	+ tree
				1274	+ emu
				1275	"""
				1276	return Differ(linejunk, charjunk).compare(a, b)
				1277
				1278	def restore(delta, which):
				1279	r"""
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1280	Generate one of the two sequences that generated a delta.
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1281
				1282	Given a `delta` produced by `Differ.compare()` or `ndiff()`, extract
				1283	lines originating from file 1 or 2 (parameter `which`), stripping off line
				1284	prefixes.
				1285
				1286	Examples:
				1287
				1288	>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
				1289	... 'ore\ntree\nemu\n'.splitlines(1))
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1290	>>> diff = list(diff)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1291	>>> print ''.join(restore(diff, 1)),
				1292	one
				1293	two
				1294	three
				1295	>>> print ''.join(restore(diff, 2)),
				1296	ore
				1297	tree
				1298	emu
				1299	"""
				1300	try:
				1301	tag = {1: "- ", 2: "+ "}[int(which)]
				1302	except KeyError:
				1303	raise ValueError, ('unknown delta choice (must be 1 or 2): %r'
				1304	% which)
				1305	prefixes = (" ", tag)
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1306	for line in delta:
				1307	if line[:2] in prefixes:
Tim Peters	8a9c284	2001-09-22 21:30:22 +0000	[diff] [blame]	1308	yield line[2:]
Tim Peters	5e824c3	2001-08-12 22:25:01 +0000	[diff] [blame]	1309
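# A round-trip sketch in Python 3: an ndiff() delta carries enough information
# for restore() to rebuild either input. The variable names are illustrative.

```python
from difflib import ndiff, restore

first = "one\ntwo\nthree\n".splitlines(keepends=True)
second = "ore\ntree\nemu\n".splitlines(keepends=True)

# ndiff() returns a generator; materialize it so it can be replayed twice.
delta = list(ndiff(first, second))

recovered_first = list(restore(delta, 1))
recovered_second = list(restore(delta, 2))

print(recovered_first == first, recovered_second == second)
```

# The '? ' hint lines in the delta are skipped by restore(), since they belong
# to neither input.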
Tim Peters	9ae2148	2001-02-10 08:00:53 +0000	[diff] [blame]	1310	def _test():
				1311	import doctest, difflib
				1312	return doctest.testmod(difflib)
				1313
				1314	if __name__ == "__main__":
				1315	_test()