Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 1 | #! /usr/bin/env python |
| 2 | |
| 3 | """ |
| 4 | Module difflib -- helpers for computing deltas between objects. |
| 5 | |
| 6 | Function get_close_matches(word, possibilities, n=3, cutoff=0.6): |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 7 | Use SequenceMatcher to return list of the best "good enough" matches. |
| 8 | |
Raymond Hettinger | f0b1a1f | 2003-06-08 11:07:08 +0000 | [diff] [blame] | 9 | Function context_diff(a, b): |
| 10 | For two lists of strings, return a delta in context diff format. |
| 11 | |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 12 | Function ndiff(a, b): |
| 13 | Return a delta: the difference between `a` and `b` (lists of strings). |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 14 | |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 15 | Function restore(delta, which): |
| 16 | Return one of the two sequences that generated an ndiff delta. |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 17 | |
Raymond Hettinger | f0b1a1f | 2003-06-08 11:07:08 +0000 | [diff] [blame] | 18 | Function unified_diff(a, b): |
| 19 | For two lists of strings, return a delta in unified diff format. |
| 20 | |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 21 | Class SequenceMatcher: |
| 22 | A flexible class for comparing pairs of sequences of any type. |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 23 | |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 24 | Class Differ: |
| 25 | For producing human-readable deltas from sequences of lines of text. |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 26 | """ |
| 27 | |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 28 | __all__ = ['get_close_matches', 'ndiff', 'restore', 'SequenceMatcher', |
Raymond Hettinger | f0b1a1f | 2003-06-08 11:07:08 +0000 | [diff] [blame] | 29 | 'Differ','IS_CHARACTER_JUNK', 'IS_LINE_JUNK', 'context_diff', |
| 30 | 'unified_diff'] |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 31 | |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 32 | class SequenceMatcher: |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 33 | |
| 34 | """ |
| 35 | SequenceMatcher is a flexible class for comparing pairs of sequences of |
| 36 | any type, so long as the sequence elements are hashable. The basic |
| 37 | algorithm predates, and is a little fancier than, an algorithm |
| 38 | published in the late 1980's by Ratcliff and Obershelp under the |
| 39 | hyperbolic name "gestalt pattern matching". The basic idea is to find |
| 40 | the longest contiguous matching subsequence that contains no "junk" |
| 41 | elements (R-O doesn't address junk). The same idea is then applied |
| 42 | recursively to the pieces of the sequences to the left and to the right |
| 43 | of the matching subsequence. This does not yield minimal edit |
| 44 | sequences, but does tend to yield matches that "look right" to people. |
| 45 | |
| 46 | SequenceMatcher tries to compute a "human-friendly diff" between two |
| 47 | sequences. Unlike e.g. UNIX(tm) diff, the fundamental notion is the |
| 48 | longest *contiguous* & junk-free matching subsequence. That's what |
| 49 | catches people's eyes. The Windows(tm) windiff has another interesting |
| 50 | notion, pairing up elements that appear uniquely in each sequence. |
| 51 | That, and the method here, appear to yield more intuitive difference |
| 52 | reports than does diff. This method appears to be the least vulnerable |
| 53 | to synching up on blocks of "junk lines", though (like blank lines in |
| 54 | ordinary text files, or maybe "<P>" lines in HTML files). That may be |
| 55 | because this is the only method of the 3 that has a *concept* of |
| 56 | "junk" <wink>. |
| 57 | |
| 58 | Example, comparing two strings, and considering blanks to be "junk": |
| 59 | |
| 60 | >>> s = SequenceMatcher(lambda x: x == " ", |
| 61 | ... "private Thread currentThread;", |
| 62 | ... "private volatile Thread currentThread;") |
| 63 | >>> |
| 64 | |
| 65 | .ratio() returns a float in [0, 1], measuring the "similarity" of the |
| 66 | sequences. As a rule of thumb, a .ratio() value over 0.6 means the |
| 67 | sequences are close matches: |
| 68 | |
| 69 | >>> print round(s.ratio(), 3) |
| 70 | 0.866 |
| 71 | >>> |
| 72 | |
| 73 | If you're only interested in where the sequences match, |
| 74 | .get_matching_blocks() is handy: |
| 75 | |
| 76 | >>> for block in s.get_matching_blocks(): |
| 77 | ... print "a[%d] and b[%d] match for %d elements" % block |
| 78 | a[0] and b[0] match for 8 elements |
| 79 | a[8] and b[17] match for 6 elements |
| 80 | a[14] and b[23] match for 15 elements |
| 81 | a[29] and b[38] match for 0 elements |
| 82 | |
| 83 | Note that the last tuple returned by .get_matching_blocks() is always a |
| 84 | dummy, (len(a), len(b), 0), and this is the only case in which the last |
| 85 | tuple element (number of elements matched) is 0. |
| 86 | |
| 87 | If you want to know how to change the first sequence into the second, |
| 88 | use .get_opcodes(): |
| 89 | |
| 90 | >>> for opcode in s.get_opcodes(): |
| 91 | ... print "%6s a[%d:%d] b[%d:%d]" % opcode |
| 92 | equal a[0:8] b[0:8] |
| 93 | insert a[8:8] b[8:17] |
| 94 | equal a[8:14] b[17:23] |
| 95 | equal a[14:29] b[23:38] |
| 96 | |
| 97 | See the Differ class for a fancy human-friendly file differencer, which |
| 98 | uses SequenceMatcher both to compare sequences of lines, and to compare |
| 99 | sequences of characters within similar (near-matching) lines. |
| 100 | |
| 101 | See also function get_close_matches() in this module, which shows how |
| 102 | simple code building on SequenceMatcher can be used to do useful work. |
| 103 | |
| 104 | Timing: Basic R-O is cubic time worst case and quadratic time expected |
| 105 | case. SequenceMatcher is quadratic time for the worst case and has |
| 106 | expected-case behavior dependent in a complicated way on how many |
| 107 | elements the sequences have in common; best case time is linear. |
| 108 | |
| 109 | Methods: |
| 110 | |
| 111 | __init__(isjunk=None, a='', b='') |
| 112 | Construct a SequenceMatcher. |
| 113 | |
| 114 | set_seqs(a, b) |
| 115 | Set the two sequences to be compared. |
| 116 | |
| 117 | set_seq1(a) |
| 118 | Set the first sequence to be compared. |
| 119 | |
| 120 | set_seq2(b) |
| 121 | Set the second sequence to be compared. |
| 122 | |
| 123 | find_longest_match(alo, ahi, blo, bhi) |
| 124 | Find longest matching block in a[alo:ahi] and b[blo:bhi]. |
| 125 | |
| 126 | get_matching_blocks() |
| 127 | Return list of triples describing matching subsequences. |
| 128 | |
| 129 | get_opcodes() |
| 130 | Return list of 5-tuples describing how to turn a into b. |
| 131 | |
| 132 | ratio() |
| 133 | Return a measure of the sequences' similarity (float in [0,1]). |
| 134 | |
| 135 | quick_ratio() |
| 136 | Return an upper bound on .ratio() relatively quickly. |
| 137 | |
| 138 | real_quick_ratio() |
| 139 | Return an upper bound on ratio() very quickly. |
| 140 | """ |
| 141 | |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 142 | def __init__(self, isjunk=None, a='', b=''): |
| 143 | """Construct a SequenceMatcher. |
| 144 | |
| 145 | Optional arg isjunk is None (the default), or a one-argument |
| 146 | function that takes a sequence element and returns true iff the |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 147 | element is junk. None is equivalent to passing "lambda x: 0", i.e. |
Fred Drake | f1da628 | 2001-02-19 19:30:05 +0000 | [diff] [blame] | 148 | no elements are considered to be junk. For example, pass |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 149 | lambda x: x in " \\t" |
| 150 | if you're comparing lines as sequences of characters, and don't |
| 151 | want to synch up on blanks or hard tabs. |
| 152 | |
| 153 | Optional arg a is the first of two sequences to be compared. By |
| 154 | default, an empty string. The elements of a must be hashable. See |
| 155 | also .set_seqs() and .set_seq1(). |
| 156 | |
| 157 | Optional arg b is the second of two sequences to be compared. By |
Fred Drake | f1da628 | 2001-02-19 19:30:05 +0000 | [diff] [blame] | 158 | default, an empty string. The elements of b must be hashable. See |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 159 | also .set_seqs() and .set_seq2(). |
| 160 | """ |
| 161 | |
| 162 | # Members: |
| 163 | # a |
| 164 | # first sequence |
| 165 | # b |
| 166 | # second sequence; differences are computed as "what do |
| 167 | # we need to do to 'a' to change it into 'b'?" |
| 168 | # b2j |
| 169 | # for x in b, b2j[x] is a list of the indices (into b) |
| 170 | # at which x appears; junk elements do not appear |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 171 | # fullbcount |
| 172 | # for x in b, fullbcount[x] == the number of times x |
| 173 | # appears in b; only materialized if really needed (used |
| 174 | # only for computing quick_ratio()) |
| 175 | # matching_blocks |
| 176 | # a list of (i, j, k) triples, where a[i:i+k] == b[j:j+k]; |
| 177 | # ascending & non-overlapping in i and in j; terminated by |
| 178 | # a dummy (len(a), len(b), 0) sentinel |
| 179 | # opcodes |
| 180 | # a list of (tag, i1, i2, j1, j2) tuples, where tag is |
| 181 | # one of |
| 182 | # 'replace' a[i1:i2] should be replaced by b[j1:j2] |
| 183 | # 'delete' a[i1:i2] should be deleted |
| 184 | # 'insert' b[j1:j2] should be inserted |
| 185 | # 'equal' a[i1:i2] == b[j1:j2] |
| 186 | # isjunk |
| 187 | # a user-supplied function taking a sequence element and |
| 188 | # returning true iff the element is "junk" -- this has |
| 189 | # subtle but helpful effects on the algorithm, which I'll |
| 190 | # get around to writing up someday <0.9 wink>. |
| 191 | # DON'T USE! Only __chain_b uses this. Use isbjunk. |
| 192 | # isbjunk |
| 193 | # for x in b, isbjunk(x) == isjunk(x) but much faster; |
| 194 | # it's really the has_key method of a hidden dict. |
| 195 | # DOES NOT WORK for x in a! |
Tim Peters | 81b9251 | 2002-04-29 01:37:32 +0000 | [diff] [blame] | 196 | # isbpopular |
| 197 | # for x in b, isbpopular(x) is true iff b is reasonably long |
| 198 | # (at least 200 elements) and x accounts for more than 1% of |
| 199 | # its elements. DOES NOT WORK for x in a! |
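# For illustration, with isjunk=None, a = "abab", b = "babab":
#     b2j             == {'a': [1, 3], 'b': [0, 2, 4]}
#     fullbcount      == {'a': 2, 'b': 3}   (only built when quick_ratio() needs it)
#     matching_blocks == [(0, 1, 4), (4, 5, 0)]
#     opcodes         == [('insert', 0, 0, 0, 1), ('equal', 0, 4, 1, 5)]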
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 200 | |
| 201 | self.isjunk = isjunk |
| 202 | self.a = self.b = None |
| 203 | self.set_seqs(a, b) |
| 204 | |
| 205 | def set_seqs(self, a, b): |
| 206 | """Set the two sequences to be compared. |
| 207 | |
| 208 | >>> s = SequenceMatcher() |
| 209 | >>> s.set_seqs("abcd", "bcde") |
| 210 | >>> s.ratio() |
| 211 | 0.75 |
| 212 | """ |
| 213 | |
| 214 | self.set_seq1(a) |
| 215 | self.set_seq2(b) |
| 216 | |
| 217 | def set_seq1(self, a): |
| 218 | """Set the first sequence to be compared. |
| 219 | |
| 220 | The second sequence to be compared is not changed. |
| 221 | |
| 222 | >>> s = SequenceMatcher(None, "abcd", "bcde") |
| 223 | >>> s.ratio() |
| 224 | 0.75 |
| 225 | >>> s.set_seq1("bcde") |
| 226 | >>> s.ratio() |
| 227 | 1.0 |
| 228 | >>> |
| 229 | |
| 230 | SequenceMatcher computes and caches detailed information about the |
| 231 | second sequence, so if you want to compare one sequence S against |
| 232 | many sequences, use .set_seq2(S) once and call .set_seq1(x) |
| 233 | repeatedly for each of the other sequences. |
| 234 | |
| 235 | See also set_seqs() and set_seq2(). |
| 236 | """ |
| 237 | |
| 238 | if a is self.a: |
| 239 | return |
| 240 | self.a = a |
| 241 | self.matching_blocks = self.opcodes = None |
| 242 | |
| 243 | def set_seq2(self, b): |
| 244 | """Set the second sequence to be compared. |
| 245 | |
| 246 | The first sequence to be compared is not changed. |
| 247 | |
| 248 | >>> s = SequenceMatcher(None, "abcd", "bcde") |
| 249 | >>> s.ratio() |
| 250 | 0.75 |
| 251 | >>> s.set_seq2("abcd") |
| 252 | >>> s.ratio() |
| 253 | 1.0 |
| 254 | >>> |
| 255 | |
| 256 | SequenceMatcher computes and caches detailed information about the |
| 257 | second sequence, so if you want to compare one sequence S against |
| 258 | many sequences, use .set_seq2(S) once and call .set_seq1(x) |
| 259 | repeatedly for each of the other sequences. |
| 260 | |
| 261 | See also set_seqs() and set_seq1(). |
| 262 | """ |
| 263 | |
| 264 | if b is self.b: |
| 265 | return |
| 266 | self.b = b |
| 267 | self.matching_blocks = self.opcodes = None |
| 268 | self.fullbcount = None |
| 269 | self.__chain_b() |
| 270 | |
| 271 | # For each element x in b, set b2j[x] to a list of the indices in |
| 272 | # b where x appears; the indices are in increasing order; note that |
| 273 | # the number of times x appears in b is len(b2j[x]) ... |
| 274 | # when self.isjunk is defined, junk elements don't show up in this |
| 275 | # map at all, which stops the central find_longest_match method |
| 276 | # from starting any matching block at a junk element ... |
| 277 | # also creates the fast isbjunk function ... |
Tim Peters | 81b9251 | 2002-04-29 01:37:32 +0000 | [diff] [blame] | 278 | # b2j also does not contain entries for "popular" elements, meaning |
| 279 | # elements that account for more than 1% of the total elements, and |
| 280 | # when the sequence is reasonably large (>= 200 elements); this can |
| 281 | # be viewed as an adaptive notion of semi-junk, and yields an enormous |
| 282 | # speedup when, e.g., comparing program files with hundreds of |
| 283 | # instances of "return NULL;" ... |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 284 | # note that this is only called when b changes; so for cross-product |
| 285 | # kinds of matches, it's best to call set_seq2 once, then set_seq1 |
| 286 | # repeatedly |
| 287 | |
| 288 | def __chain_b(self): |
| 289 | # Because isjunk is a user-defined (not C) function, and we test |
| 290 | # for junk a LOT, it's important to minimize the number of calls. |
| 291 | # Before the tricks described here, __chain_b was by far the most |
| 292 | # time-consuming routine in the whole module! If anyone sees |
| 293 | # Jim Roskind, thank him again for profile.py -- I never would |
| 294 | # have guessed that. |
| 295 | # The first trick is to build b2j ignoring the possibility |
| 296 | # of junk. I.e., we don't call isjunk at all yet. Throwing |
| 297 | # out the junk later is much cheaper than building b2j "right" |
| 298 | # from the start. |
| 299 | b = self.b |
Tim Peters | 81b9251 | 2002-04-29 01:37:32 +0000 | [diff] [blame] | 300 | n = len(b) |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 301 | self.b2j = b2j = {} |
Tim Peters | 81b9251 | 2002-04-29 01:37:32 +0000 | [diff] [blame] | 302 | populardict = {} |
| 303 | for i, elt in enumerate(b): |
| 304 | if elt in b2j: |
| 305 | indices = b2j[elt] |
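# e.g. with len(b) == 1000, an element whose earlier occurrences
# already exceed 10 (more than 1% of b) has its index list emptied
# here, and its leftover b2j entry is purged below.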
| 306 | if n >= 200 and len(indices) * 100 > n: |
| 307 | populardict[elt] = 1 |
| 308 | del indices[:] |
| 309 | else: |
| 310 | indices.append(i) |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 311 | else: |
| 312 | b2j[elt] = [i] |
| 313 | |
Tim Peters | 81b9251 | 2002-04-29 01:37:32 +0000 | [diff] [blame] | 314 | # Purge leftover indices for popular elements. |
| 315 | for elt in populardict: |
| 316 | del b2j[elt] |
| 317 | |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 318 | # Now b2j.keys() contains elements uniquely, and especially when |
| 319 | # the sequence is a string, that's usually a good deal smaller |
| 320 | # than len(string). The difference is the number of isjunk calls |
| 321 | # saved. |
Tim Peters | 81b9251 | 2002-04-29 01:37:32 +0000 | [diff] [blame] | 322 | isjunk = self.isjunk |
| 323 | junkdict = {} |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 324 | if isjunk: |
Tim Peters | 81b9251 | 2002-04-29 01:37:32 +0000 | [diff] [blame] | 325 | for d in populardict, b2j: |
| 326 | for elt in d.keys(): |
| 327 | if isjunk(elt): |
| 328 | junkdict[elt] = 1 |
| 329 | del d[elt] |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 330 | |
Raymond Hettinger | 54f0222 | 2002-06-01 14:18:47 +0000 | [diff] [blame] | 331 | # Now for x in b, isjunk(x) == x in junkdict, but the |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 332 | # latter is much faster. Note too that while there may be a |
| 333 | # lot of junk in the sequence, the number of *unique* junk |
| 334 | # elements is probably small. So the memory burden of keeping |
| 335 | # this dict alive is likely trivial compared to the size of b2j. |
| 336 | self.isbjunk = junkdict.has_key |
Tim Peters | 81b9251 | 2002-04-29 01:37:32 +0000 | [diff] [blame] | 337 | self.isbpopular = populardict.has_key |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 338 | |
| 339 | def find_longest_match(self, alo, ahi, blo, bhi): |
| 340 | """Find longest matching block in a[alo:ahi] and b[blo:bhi]. |
| 341 | |
| 342 | If isjunk is not defined: |
| 343 | |
| 344 | Return (i,j,k) such that a[i:i+k] is equal to b[j:j+k], where |
| 345 | alo <= i <= i+k <= ahi |
| 346 | blo <= j <= j+k <= bhi |
| 347 | and for all (i',j',k') meeting those conditions, |
| 348 | k >= k' |
| 349 | i <= i' |
| 350 | and if i == i', j <= j' |
| 351 | |
| 352 | In other words, of all maximal matching blocks, return one that |
| 353 | starts earliest in a, and of all those maximal matching blocks that |
| 354 | start earliest in a, return the one that starts earliest in b. |
| 355 | |
| 356 | >>> s = SequenceMatcher(None, " abcd", "abcd abcd") |
| 357 | >>> s.find_longest_match(0, 5, 0, 9) |
| 358 | (0, 4, 5) |
| 359 | |
| 360 | If isjunk is defined, first the longest matching block is |
| 361 | determined as above, but with the additional restriction that no |
| 362 | junk element appears in the block. Then that block is extended as |
| 363 | far as possible by matching (only) junk elements on both sides. So |
| 364 | the resulting block never matches on junk except as identical junk |
| 365 | happens to be adjacent to an "interesting" match. |
| 366 | |
| 367 | Here's the same example as before, but considering blanks to be |
| 368 | junk. That prevents " abcd" from matching the " abcd" at the tail |
| 369 | end of the second sequence directly. Instead only the "abcd" can |
| 370 | match, and matches the leftmost "abcd" in the second sequence: |
| 371 | |
| 372 | >>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd") |
| 373 | >>> s.find_longest_match(0, 5, 0, 9) |
| 374 | (1, 0, 4) |
| 375 | |
| 376 | If no blocks match, return (alo, blo, 0). |
| 377 | |
| 378 | >>> s = SequenceMatcher(None, "ab", "c") |
| 379 | >>> s.find_longest_match(0, 2, 0, 1) |
| 380 | (0, 0, 0) |
| 381 | """ |
| 382 | |
| 383 | # CAUTION: stripping common prefix or suffix would be incorrect. |
| 384 | # E.g., |
| 385 | # ab |
| 386 | # acab |
| 387 | # Longest matching block is "ab", but if common prefix is |
| 388 | # stripped, it's "a" (tied with "b"). UNIX(tm) diff does so |
| 389 | # strip, so ends up claiming that ab is changed to acab by |
| 390 | # inserting "ca" in the middle. That's minimal but unintuitive: |
| 391 | # "it's obvious" that someone inserted "ac" at the front. |
| 392 | # Windiff ends up at the same place as diff, but by pairing up |
| 393 | # the unique 'b's and then matching the first two 'a's. |
| 394 | |
| 395 | a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.isbjunk |
| 396 | besti, bestj, bestsize = alo, blo, 0 |
| 397 | # find longest junk-free match |
| 398 | # during an iteration of the loop, j2len[j] = length of longest |
| 399 | # junk-free match ending with a[i-1] and b[j] |
| 400 | j2len = {} |
| 401 | nothing = [] |
| 402 | for i in xrange(alo, ahi): |
| 403 | # look at all instances of a[i] in b; note that because |
| 404 | # b2j has no junk keys, the loop is skipped if a[i] is junk |
| 405 | j2lenget = j2len.get |
| 406 | newj2len = {} |
| 407 | for j in b2j.get(a[i], nothing): |
| 408 | # a[i] matches b[j] |
| 409 | if j < blo: |
| 410 | continue |
| 411 | if j >= bhi: |
| 412 | break |
| 413 | k = newj2len[j] = j2lenget(j-1, 0) + 1 |
| 414 | if k > bestsize: |
| 415 | besti, bestj, bestsize = i-k+1, j-k+1, k |
| 416 | j2len = newj2len |
| 417 | |
Tim Peters | 81b9251 | 2002-04-29 01:37:32 +0000 | [diff] [blame] | 418 | # Extend the best by non-junk elements on each end. In particular, |
| 419 | # "popular" non-junk elements aren't in b2j, which greatly speeds |
| 420 | # the inner loop above, but also means "the best" match so far |
| 421 | # doesn't contain any junk *or* popular non-junk elements. |
| 422 | while besti > alo and bestj > blo and \ |
| 423 | not isbjunk(b[bestj-1]) and \ |
| 424 | a[besti-1] == b[bestj-1]: |
| 425 | besti, bestj, bestsize = besti-1, bestj-1, bestsize+1 |
| 426 | while besti+bestsize < ahi and bestj+bestsize < bhi and \ |
| 427 | not isbjunk(b[bestj+bestsize]) and \ |
| 428 | a[besti+bestsize] == b[bestj+bestsize]: |
| 429 | bestsize += 1 |
| 430 | |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 431 | # Now that we have a wholly interesting match (albeit possibly |
| 432 | # empty!), we may as well suck up the matching junk on each |
| 433 | # side of it too. Can't think of a good reason not to, and it |
| 434 | # saves post-processing the (possibly considerable) expense of |
| 435 | # figuring out what to do with it. In the case of an empty |
| 436 | # interesting match, this is clearly the right thing to do, |
| 437 | # because no other kind of match is possible in the regions. |
| 438 | while besti > alo and bestj > blo and \ |
| 439 | isbjunk(b[bestj-1]) and \ |
| 440 | a[besti-1] == b[bestj-1]: |
| 441 | besti, bestj, bestsize = besti-1, bestj-1, bestsize+1 |
| 442 | while besti+bestsize < ahi and bestj+bestsize < bhi and \ |
| 443 | isbjunk(b[bestj+bestsize]) and \ |
| 444 | a[besti+bestsize] == b[bestj+bestsize]: |
| 445 | bestsize = bestsize + 1 |
| 446 | |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 447 | return besti, bestj, bestsize |
| 448 | |
| 449 | def get_matching_blocks(self): |
| 450 | """Return list of triples describing matching subsequences. |
| 451 | |
| 452 | Each triple is of the form (i, j, n), and means that |
| 453 | a[i:i+n] == b[j:j+n]. The triples are monotonically increasing in |
| 454 | i and in j. |
| 455 | |
| 456 | The last triple is a dummy, (len(a), len(b), 0), and is the only |
| 457 | triple with n==0. |
| 458 | |
| 459 | >>> s = SequenceMatcher(None, "abxcd", "abcd") |
| 460 | >>> s.get_matching_blocks() |
| 461 | [(0, 0, 2), (3, 2, 2), (5, 4, 0)] |
| 462 | """ |
| 463 | |
| 464 | if self.matching_blocks is not None: |
| 465 | return self.matching_blocks |
| 466 | self.matching_blocks = [] |
| 467 | la, lb = len(self.a), len(self.b) |
| 468 | self.__helper(0, la, 0, lb, self.matching_blocks) |
| 469 | self.matching_blocks.append( (la, lb, 0) ) |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 470 | return self.matching_blocks |
| 471 | |
| 472 | # builds list of matching blocks covering a[alo:ahi] and |
| 473 | # b[blo:bhi], appending them in increasing order to answer |
| 474 | |
| 475 | def __helper(self, alo, ahi, blo, bhi, answer): |
| 476 | i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi) |
| 477 | # a[alo:i] vs b[blo:j] unknown |
| 478 | # a[i:i+k] same as b[j:j+k] |
| 479 | # a[i+k:ahi] vs b[j+k:bhi] unknown |
| 480 | if k: |
| 481 | if alo < i and blo < j: |
| 482 | self.__helper(alo, i, blo, j, answer) |
| 483 | answer.append(x) |
| 484 | if i+k < ahi and j+k < bhi: |
| 485 | self.__helper(i+k, ahi, j+k, bhi, answer) |
| 486 | |
| 487 | def get_opcodes(self): |
| 488 | """Return list of 5-tuples describing how to turn a into b. |
| 489 | |
| 490 | Each tuple is of the form (tag, i1, i2, j1, j2). The first tuple |
| 491 | has i1 == j1 == 0, and remaining tuples have i1 == the i2 from the |
| 492 | tuple preceding it, and likewise for j1 == the previous j2. |
| 493 | |
| 494 | The tags are strings, with these meanings: |
| 495 | |
| 496 | 'replace': a[i1:i2] should be replaced by b[j1:j2] |
| 497 | 'delete': a[i1:i2] should be deleted. |
| 498 | Note that j1==j2 in this case. |
| 499 | 'insert': b[j1:j2] should be inserted at a[i1:i1]. |
| 500 | Note that i1==i2 in this case. |
| 501 | 'equal': a[i1:i2] == b[j1:j2] |
| 502 | |
| 503 | >>> a = "qabxcd" |
| 504 | >>> b = "abycdf" |
| 505 | >>> s = SequenceMatcher(None, a, b) |
| 506 | >>> for tag, i1, i2, j1, j2 in s.get_opcodes(): |
| 507 | ... print ("%7s a[%d:%d] (%s) b[%d:%d] (%s)" % |
| 508 | ... (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2])) |
| 509 | delete a[0:1] (q) b[0:0] () |
| 510 | equal a[1:3] (ab) b[0:2] (ab) |
| 511 | replace a[3:4] (x) b[2:3] (y) |
| 512 | equal a[4:6] (cd) b[3:5] (cd) |
| 513 | insert a[6:6] () b[5:6] (f) |
| 514 | """ |
| 515 | |
| 516 | if self.opcodes is not None: |
| 517 | return self.opcodes |
| 518 | i = j = 0 |
| 519 | self.opcodes = answer = [] |
| 520 | for ai, bj, size in self.get_matching_blocks(): |
| 521 | # invariant: we've pumped out correct diffs to change |
| 522 | # a[:i] into b[:j], and the next matching block is |
| 523 | # a[ai:ai+size] == b[bj:bj+size]. So we need to pump |
| 524 | # out a diff to change a[i:ai] into b[j:bj], pump out |
| 525 | # the matching block, and move (i,j) beyond the match |
| 526 | tag = '' |
| 527 | if i < ai and j < bj: |
| 528 | tag = 'replace' |
| 529 | elif i < ai: |
| 530 | tag = 'delete' |
| 531 | elif j < bj: |
| 532 | tag = 'insert' |
| 533 | if tag: |
| 534 | answer.append( (tag, i, ai, j, bj) ) |
| 535 | i, j = ai+size, bj+size |
| 536 | # the list of matching blocks is terminated by a |
| 537 | # sentinel with size 0 |
| 538 | if size: |
| 539 | answer.append( ('equal', ai, i, bj, j) ) |
| 540 | return answer |
| 541 | |
Raymond Hettinger | f0b1a1f | 2003-06-08 11:07:08 +0000 | [diff] [blame] | 542 | def get_grouped_opcodes(self, n=3): |
| 543 | """ Isolate change clusters by eliminating ranges with no changes. |
| 544 | |
| 545 | Return a generator of groups with up to n lines of context. |
| 546 | Each group is in the same format as returned by get_opcodes(). |
| 547 | |
| 548 | >>> from pprint import pprint |
| 549 | >>> a = map(str, range(1,40)) |
| 550 | >>> b = a[:] |
| 551 | >>> b[8:8] = ['i'] # Make an insertion |
| 552 | >>> b[20] += 'x' # Make a replacement |
| 553 | >>> b[23:28] = [] # Make a deletion |
| 554 | >>> b[30] += 'y' # Make another replacement |
| 555 | >>> pprint(list(SequenceMatcher(None,a,b).get_grouped_opcodes())) |
| 556 | [[('equal', 5, 8, 5, 8), ('insert', 8, 8, 8, 9), ('equal', 8, 11, 9, 12)], |
| 557 | [('equal', 16, 19, 17, 20), |
| 558 | ('replace', 19, 20, 20, 21), |
| 559 | ('equal', 20, 22, 21, 23), |
| 560 | ('delete', 22, 27, 23, 23), |
| 561 | ('equal', 27, 30, 23, 26)], |
| 562 | [('equal', 31, 34, 27, 30), |
| 563 | ('replace', 34, 35, 30, 31), |
| 564 | ('equal', 35, 38, 31, 34)]] |
| 565 | """ |
| 566 | |
| 567 | codes = self.get_opcodes() |
| 568 | # Fixup leading and trailing groups if they show no changes. |
| 569 | if codes[0][0] == 'equal': |
| 570 | tag, i1, i2, j1, j2 = codes[0] |
| 571 | codes[0] = tag, max(i1, i2-n), i2, max(j1, j2-n), j2 |
| 572 | if codes[-1][0] == 'equal': |
| 573 | tag, i1, i2, j1, j2 = codes[-1] |
| 574 | codes[-1] = tag, i1, min(i2, i1+n), j1, min(j2, j1+n) |
| 575 | |
| 576 | nn = n + n |
| 577 | group = [] |
| 578 | for tag, i1, i2, j1, j2 in codes: |
| 579 | # End the current group and start a new one whenever |
| 580 | # there is a large range with no changes. |
| 581 | if tag == 'equal' and i2-i1 > nn: |
| 582 | group.append((tag, i1, min(i2, i1+n), j1, min(j2, j1+n))) |
| 583 | yield group |
| 584 | group = [] |
| 585 | i1, j1 = max(i1, i2-n), max(j1, j2-n) |
| 586 | group.append((tag, i1, i2, j1 ,j2)) |
| 587 | if group and not (len(group)==1 and group[0][0] == 'equal'): |
| 588 | yield group |
| 589 | |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 590 | def ratio(self): |
| 591 | """Return a measure of the sequences' similarity (float in [0,1]). |
| 592 | |
| 593 | Where T is the total number of elements in both sequences, and |
| 594 | M is the number of matches, this is 2.0*M / T. |
| 595 | Note that this is 1 if the sequences are identical, and 0 if |
| 596 | they have nothing in common. |
| 597 | |
| 598 | .ratio() is expensive to compute if you haven't already computed |
| 599 | .get_matching_blocks() or .get_opcodes(), in which case you may |
| 600 | want to try .quick_ratio() or .real_quick_ratio() first to get an |
| 601 | upper bound. |
| 602 | |
| 603 | >>> s = SequenceMatcher(None, "abcd", "bcde") |
| 604 | >>> s.ratio() |
| 605 | 0.75 |
| 606 | >>> s.quick_ratio() |
| 607 | 0.75 |
| 608 | >>> s.real_quick_ratio() |
| 609 | 1.0 |
| 610 | """ |
| 611 | |
| 612 | matches = reduce(lambda sum, triple: sum + triple[-1], |
| 613 | self.get_matching_blocks(), 0) |
| 614 | return 2.0 * matches / (len(self.a) + len(self.b)) |
| 615 | |
| 616 | def quick_ratio(self): |
| 617 | """Return an upper bound on ratio() relatively quickly. |
| 618 | |
| 619 | This isn't defined beyond that it is an upper bound on .ratio(), and |
| 620 | is faster to compute. |
| 621 | """ |
| 622 | |
| 623 | # viewing a and b as multisets, set matches to the cardinality |
| 624 | # of their intersection; this counts the number of matches |
| 625 | # without regard to order, so is clearly an upper bound |
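# e.g. for a = "abcd", b = "bcde" the multiset intersection is
# {'b', 'c', 'd'}, so matches == 3 and the bound is 2*3/8 == 0.75,
# which here happens to equal the true ratio().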
| 626 | if self.fullbcount is None: |
| 627 | self.fullbcount = fullbcount = {} |
| 628 | for elt in self.b: |
| 629 | fullbcount[elt] = fullbcount.get(elt, 0) + 1 |
| 630 | fullbcount = self.fullbcount |
| 631 | # avail[x] is the number of times x appears in 'b' less the |
| 632 | # number of times we've seen it in 'a' so far ... kinda |
| 633 | avail = {} |
| 634 | availhas, matches = avail.has_key, 0 |
| 635 | for elt in self.a: |
| 636 | if availhas(elt): |
| 637 | numb = avail[elt] |
| 638 | else: |
| 639 | numb = fullbcount.get(elt, 0) |
| 640 | avail[elt] = numb - 1 |
| 641 | if numb > 0: |
| 642 | matches = matches + 1 |
| 643 | return 2.0 * matches / (len(self.a) + len(self.b)) |
| 644 | |
| 645 | def real_quick_ratio(self): |
| 646 | """Return an upper bound on ratio() very quickly. |
| 647 | |
| 648 | This isn't defined beyond that it is an upper bound on .ratio(), and |
| 649 | is faster to compute than either .ratio() or .quick_ratio(). |
| 650 | """ |
| 651 | |
| 652 | la, lb = len(self.a), len(self.b) |
| 653 | # can't have more matches than the number of elements in the |
| 654 | # shorter sequence |
| 655 | return 2.0 * min(la, lb) / (la + lb) |
| 656 | |
| 657 | def get_close_matches(word, possibilities, n=3, cutoff=0.6): |
| 658 | """Use SequenceMatcher to return list of the best "good enough" matches. |
| 659 | |
| 660 | word is a sequence for which close matches are desired (typically a |
| 661 | string). |
| 662 | |
| 663 | possibilities is a list of sequences against which to match word |
| 664 | (typically a list of strings). |
| 665 | |
| 666 | Optional arg n (default 3) is the maximum number of close matches to |
| 667 | return. n must be > 0. |
| 668 | |
| 669 | Optional arg cutoff (default 0.6) is a float in [0, 1]. Possibilities |
| 670 | that don't score at least that similar to word are ignored. |
| 671 | |
| 672 | The best (no more than n) matches among the possibilities are returned |
| 673 | in a list, sorted by similarity score, most similar first. |
| 674 | |
| 675 | >>> get_close_matches("appel", ["ape", "apple", "peach", "puppy"]) |
| 676 | ['apple', 'ape'] |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 677 | >>> import keyword as _keyword |
| 678 | >>> get_close_matches("wheel", _keyword.kwlist) |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 679 | ['while'] |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 680 | >>> get_close_matches("apple", _keyword.kwlist) |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 681 | [] |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 682 | >>> get_close_matches("accept", _keyword.kwlist) |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 683 | ['except'] |
| 684 | """ |
| 685 | |
| 686 | if not n > 0: |
Fred Drake | f1da628 | 2001-02-19 19:30:05 +0000 | [diff] [blame] | 687 | raise ValueError("n must be > 0: " + `n`) |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 688 | if not 0.0 <= cutoff <= 1.0: |
Fred Drake | f1da628 | 2001-02-19 19:30:05 +0000 | [diff] [blame] | 689 | raise ValueError("cutoff must be in [0.0, 1.0]: " + `cutoff`) |
Tim Peters | 9ae2148 | 2001-02-10 08:00:53 +0000 | [diff] [blame] | 690 | result = [] |
| 691 | s = SequenceMatcher() |
| 692 | s.set_seq2(word) |
| 693 | for x in possibilities: |
| 694 | s.set_seq1(x) |
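# The three ratios form successively tighter upper bounds
# (real_quick_ratio() >= quick_ratio() >= ratio()), so when a cheap
# bound already falls short of cutoff the expensive ratio() call
# is skipped entirely.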
| 695 | if s.real_quick_ratio() >= cutoff and \ |
| 696 | s.quick_ratio() >= cutoff and \ |
| 697 | s.ratio() >= cutoff: |
| 698 | result.append((s.ratio(), x)) |
| 699 | # Sort by score. |
| 700 | result.sort() |
| 701 | # Retain only the best n. |
| 702 | result = result[-n:] |
| 703 | # Move best-scorer to head of list. |
| 704 | result.reverse() |
| 705 | # Strip scores. |
| 706 | return [x for score, x in result] |
| 707 | |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 708 | |
| 709 | def _count_leading(line, ch): |
| 710 | """ |
| 711 | Return number of `ch` characters at the start of `line`. |
| 712 | |
| 713 | Example: |
| 714 | |
| 715 | >>> _count_leading(' abc', ' ') |
| 716 | 3 |
| 717 | """ |
| 718 | |
| 719 | i, n = 0, len(line) |
| 720 | while i < n and line[i] == ch: |
| 721 | i += 1 |
| 722 | return i |
| 723 | |
| 724 | class Differ: |
| 725 | r""" |
| 726 | Differ is a class for comparing sequences of lines of text, and |
| 727 | producing human-readable differences or deltas. Differ uses |
| 728 | SequenceMatcher both to compare sequences of lines, and to compare |
| 729 | sequences of characters within similar (near-matching) lines. |
| 730 | |
| 731 | Each line of a Differ delta begins with a two-letter code: |
| 732 | |
| 733 | '- ' line unique to sequence 1 |
| 734 | '+ ' line unique to sequence 2 |
| 735 | ' ' line common to both sequences |
| 736 | '? ' line not present in either input sequence |
| 737 | |
| 738 | Lines beginning with '? ' attempt to guide the eye to intraline |
| 739 | differences, and were not present in either input sequence. These lines |
| 740 | can be confusing if the sequences contain tab characters. |
| 741 | |
| 742 | Note that Differ makes no claim to produce a *minimal* diff. To the |
| 743 | contrary, minimal diffs are often counter-intuitive, because they synch |
| 744 | up anywhere possible, sometimes accidental matches 100 pages apart. |
| 745 | Restricting synch points to contiguous matches preserves some notion of |
| 746 | locality, at the occasional cost of producing a longer diff. |
| 747 | |
| 748 | Example: Comparing two texts. |
| 749 | |
| 750 | First we set up the texts, sequences of individual single-line strings |
| 751 | ending with newlines (such sequences can also be obtained from the |
| 752 | `readlines()` method of file-like objects): |
| 753 | |
| 754 | >>> text1 = ''' 1. Beautiful is better than ugly. |
| 755 | ... 2. Explicit is better than implicit. |
| 756 | ... 3. Simple is better than complex. |
| 757 | ... 4. Complex is better than complicated. |
| 758 | ... '''.splitlines(1) |
| 759 | >>> len(text1) |
| 760 | 4 |
| 761 | >>> text1[0][-1] |
| 762 | '\n' |
| 763 | >>> text2 = ''' 1. Beautiful is better than ugly. |
| 764 | ... 3. Simple is better than complex. |
| 765 | ... 4. Complicated is better than complex. |
| 766 | ... 5. Flat is better than nested. |
| 767 | ... '''.splitlines(1) |
| 768 | |
| 769 | Next we instantiate a Differ object: |
| 770 | |
| 771 | >>> d = Differ() |
| 772 | |
| 773 | Note that when instantiating a Differ object we may pass functions to |
| 774 | filter out line and character 'junk'. See Differ.__init__ for details. |
| 775 | |
| 776 | Finally, we compare the two: |
| 777 | |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 778 | >>> result = list(d.compare(text1, text2)) |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 779 | |
| 780 | 'result' is a list of strings, so let's pretty-print it: |
| 781 | |
| 782 | >>> from pprint import pprint as _pprint |
| 783 | >>> _pprint(result) |
| 784 | [' 1. Beautiful is better than ugly.\n', |
| 785 | '- 2. Explicit is better than implicit.\n', |
| 786 | '- 3. Simple is better than complex.\n', |
| 787 | '+ 3. Simple is better than complex.\n', |
| 788 | '? ++\n', |
| 789 | '- 4. Complex is better than complicated.\n', |
| 790 | '? ^ ---- ^\n', |
| 791 | '+ 4. Complicated is better than complex.\n', |
| 792 | '? ++++ ^ ^\n', |
| 793 | '+ 5. Flat is better than nested.\n'] |
| 794 | |
| 795 | As a single multi-line string it looks like this: |
| 796 | |
| 797 | >>> print ''.join(result), |
| 798 | 1. Beautiful is better than ugly. |
| 799 | - 2. Explicit is better than implicit. |
| 800 | - 3. Simple is better than complex. |
| 801 | + 3. Simple is better than complex. |
| 802 | ? ++ |
| 803 | - 4. Complex is better than complicated. |
| 804 | ? ^ ---- ^ |
| 805 | + 4. Complicated is better than complex. |
| 806 | ? ++++ ^ ^ |
| 807 | + 5. Flat is better than nested. |
| 808 | |
| 809 | Methods: |
| 810 | |
| 811 | __init__(linejunk=None, charjunk=None) |
| 812 | Construct a text differencer, with optional filters. |
| 813 | |
| 814 | compare(a, b) |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 815 | Compare two sequences of lines; generate the resulting delta. |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 816 | """ |
| 817 | |
| 818 | def __init__(self, linejunk=None, charjunk=None): |
| 819 | """ |
| 820 | Construct a text differencer, with optional filters. |
| 821 | |
| 822 | The two optional keyword parameters are for filter functions: |
| 823 | |
| 824 | - `linejunk`: A function that should accept a single string argument, |
| 825 | and return true iff the string is junk. The module-level function |
| 826 | `IS_LINE_JUNK` may be used to filter out lines without visible |
Tim Peters | 81b9251 | 2002-04-29 01:37:32 +0000 | [diff] [blame] | 827 | characters, except for at most one splat ('#'). It is recommended |
| 828 | to leave linejunk None; as of Python 2.3, the underlying |
| 829 | SequenceMatcher class has grown an adaptive notion of "noise" lines |
| 830 | that's better than any static definition the author has ever been |
| 831 | able to craft. |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 832 | |
| 833 | - `charjunk`: A function that should accept a string of length 1. The |
| 834 | module-level function `IS_CHARACTER_JUNK` may be used to filter out |
| 835 | whitespace characters (a blank or tab; **note**: bad idea to include |
Tim Peters | 81b9251 | 2002-04-29 01:37:32 +0000 | [diff] [blame] | 836 | newline in this!). Use of IS_CHARACTER_JUNK is recommended. |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 837 | """ |
| 838 | |
| 839 | self.linejunk = linejunk |
| 840 | self.charjunk = charjunk |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 841 | |
| 842 | def compare(self, a, b): |
| 843 | r""" |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 844 | Compare two sequences of lines; generate the resulting delta. |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 845 | |
| 846 | Each sequence must contain individual single-line strings ending with |
| 847 | newlines. Such sequences can be obtained from the `readlines()` method |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 848 | of file-like objects. The delta generated also consists of newline- |
| 849 | terminated strings, ready to be printed as-is via the writelines() |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 850 | method of a file-like object. |
| 851 | |
| 852 | Example: |
| 853 | |
| 854 | >>> print ''.join(Differ().compare('one\ntwo\nthree\n'.splitlines(1), |
| 855 | ... 'ore\ntree\nemu\n'.splitlines(1))), |
| 856 | - one |
| 857 | ? ^ |
| 858 | + ore |
| 859 | ? ^ |
| 860 | - two |
| 861 | - three |
| 862 | ? - |
| 863 | + tree |
| 864 | + emu |
| 865 | """ |
| 866 | |
| 867 | cruncher = SequenceMatcher(self.linejunk, a, b) |
| 868 | for tag, alo, ahi, blo, bhi in cruncher.get_opcodes(): |
| 869 | if tag == 'replace': |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 870 | g = self._fancy_replace(a, alo, ahi, b, blo, bhi) |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 871 | elif tag == 'delete': |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 872 | g = self._dump('-', a, alo, ahi) |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 873 | elif tag == 'insert': |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 874 | g = self._dump('+', b, blo, bhi) |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 875 | elif tag == 'equal': |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 876 | g = self._dump(' ', a, alo, ahi) |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 877 | else: |
| 878 | raise ValueError, 'unknown tag ' + `tag` |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 879 | |
| 880 | for line in g: |
| 881 | yield line |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 882 | |
| 883 | def _dump(self, tag, x, lo, hi): |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 884 | """Generate comparison results for a same-tagged range.""" |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 885 | for i in xrange(lo, hi): |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 886 | yield '%s %s' % (tag, x[i]) |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 887 | |
| 888 | def _plain_replace(self, a, alo, ahi, b, blo, bhi): |
| 889 | assert alo < ahi and blo < bhi |
| 890 | # dump the shorter block first -- reduces the burden on short-term |
| 891 | # memory if the blocks are of very different sizes |
| 892 | if bhi - blo < ahi - alo: |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 893 | first = self._dump('+', b, blo, bhi) |
| 894 | second = self._dump('-', a, alo, ahi) |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 895 | else: |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 896 | first = self._dump('-', a, alo, ahi) |
| 897 | second = self._dump('+', b, blo, bhi) |
| 898 | |
| 899 | for g in first, second: |
| 900 | for line in g: |
| 901 | yield line |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 902 | |
| 903 | def _fancy_replace(self, a, alo, ahi, b, blo, bhi): |
| 904 | r""" |
| 905 | When replacing one block of lines with another, search the blocks |
| 906 | for *similar* lines; the best-matching pair (if any) is used as a |
| 907 | synch point, and intraline difference marking is done on the |
| 908 | similar pair. Lots of work, but often worth it. |
| 909 | |
| 910 | Example: |
| 911 | |
| 912 | >>> d = Differ() |
| 913 | >>> results = d._fancy_replace(['abcDefghiJkl\n'], 0, 1, ['abcdefGhijkl\n'], 0, 1) |
| 914 | >>> print ''.join(results), |
| 915 | - abcDefghiJkl |
| 916 | ? ^ ^ ^ |
| 917 | + abcdefGhijkl |
| 918 | ? ^ ^ ^ |
| 919 | """ |
| 920 | |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 921 | # don't synch up unless the lines have a similarity score of at |
| 922 | # least cutoff; best_ratio tracks the best score seen so far |
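# (best_ratio starts just below cutoff, so any candidate that beats
# it is remembered; the "best_ratio < cutoff" test further down then
# decides whether that best candidate was actually close enough)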
| 923 | best_ratio, cutoff = 0.74, 0.75 |
| 924 | cruncher = SequenceMatcher(self.charjunk) |
| 925 | eqi, eqj = None, None # 1st indices of equal lines (if any) |
| 926 | |
| 927 | # search for the pair that matches best without being identical |
| 928 | # (identical lines must be junk lines, & we don't want to synch up |
| 929 | # on junk -- unless we have to) |
| 930 | for j in xrange(blo, bhi): |
| 931 | bj = b[j] |
| 932 | cruncher.set_seq2(bj) |
| 933 | for i in xrange(alo, ahi): |
| 934 | ai = a[i] |
| 935 | if ai == bj: |
| 936 | if eqi is None: |
| 937 | eqi, eqj = i, j |
| 938 | continue |
| 939 | cruncher.set_seq1(ai) |
| 940 | # computing similarity is expensive, so use the quick |
| 941 | # upper bounds first -- have seen this speed up messy |
| 942 | # compares by a factor of 3. |
| 943 | # note that ratio() is only expensive to compute the first |
| 944 | # time it's called on a sequence pair; the expensive part |
| 945 | # of the computation is cached by cruncher |
| 946 | if cruncher.real_quick_ratio() > best_ratio and \ |
| 947 | cruncher.quick_ratio() > best_ratio and \ |
| 948 | cruncher.ratio() > best_ratio: |
| 949 | best_ratio, best_i, best_j = cruncher.ratio(), i, j |
| 950 | if best_ratio < cutoff: |
| 951 | # no non-identical "pretty close" pair |
| 952 | if eqi is None: |
| 953 | # no identical pair either -- treat it as a straight replace |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 954 | for line in self._plain_replace(a, alo, ahi, b, blo, bhi): |
| 955 | yield line |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 956 | return |
| 957 | # no close pair, but an identical pair -- synch up on that |
| 958 | best_i, best_j, best_ratio = eqi, eqj, 1.0 |
| 959 | else: |
| 960 | # there's a close pair, so forget the identical pair (if any) |
| 961 | eqi = None |
| 962 | |
| 963 | # a[best_i] very similar to b[best_j]; eqi is None iff they're not |
| 964 | # identical |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 965 | |
| 966 | # pump out diffs from before the synch point |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 967 | for line in self._fancy_helper(a, alo, best_i, b, blo, best_j): |
| 968 | yield line |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 969 | |
| 970 | # do intraline marking on the synch pair |
| 971 | aelt, belt = a[best_i], b[best_j] |
| 972 | if eqi is None: |
| 973 | # pump out a '-', '?', '+', '?' quad for the synched lines |
| 974 | atags = btags = "" |
| 975 | cruncher.set_seqs(aelt, belt) |
| 976 | for tag, ai1, ai2, bj1, bj2 in cruncher.get_opcodes(): |
| 977 | la, lb = ai2 - ai1, bj2 - bj1 |
| 978 | if tag == 'replace': |
| 979 | atags += '^' * la |
| 980 | btags += '^' * lb |
| 981 | elif tag == 'delete': |
| 982 | atags += '-' * la |
| 983 | elif tag == 'insert': |
| 984 | btags += '+' * lb |
| 985 | elif tag == 'equal': |
| 986 | atags += ' ' * la |
| 987 | btags += ' ' * lb |
| 988 | else: |
| 989 | raise ValueError, 'unknown tag ' + `tag` |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 990 | for line in self._qformat(aelt, belt, atags, btags): |
| 991 | yield line |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 992 | else: |
| 993 | # the synch pair is identical |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 994 | yield ' ' + aelt |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 995 | |
| 996 | # pump out diffs from after the synch point |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 997 | for line in self._fancy_helper(a, best_i+1, ahi, b, best_j+1, bhi): |
| 998 | yield line |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 999 | |
| 1000 | def _fancy_helper(self, a, alo, ahi, b, blo, bhi): |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 1001 | g = [] |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 1002 | if alo < ahi: |
| 1003 | if blo < bhi: |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 1004 | g = self._fancy_replace(a, alo, ahi, b, blo, bhi) |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 1005 | else: |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 1006 | g = self._dump('-', a, alo, ahi) |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 1007 | elif blo < bhi: |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 1008 | g = self._dump('+', b, blo, bhi) |
| 1009 | |
| 1010 | for line in g: |
| 1011 | yield line |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 1012 | |
| 1013 | def _qformat(self, aline, bline, atags, btags): |
| 1014 | r""" |
| 1015 | Format "?" output and deal with leading tabs. |
| 1016 | |
| 1017 | Example: |
| 1018 | |
| 1019 | >>> d = Differ() |
| 1020 | >>> results = d._qformat('\tabcDefghiJkl\n', '\t\tabcdefGhijkl\n', |
| 1021 | ... ' ^ ^ ^ ', '+ ^ ^ ^ ') |
| 1022 | >>> for line in results: print repr(line) |
| 1023 | ... |
| 1024 | '- \tabcDefghiJkl\n' |
| 1025 | '? \t ^ ^ ^\n' |
| 1026 | '+ \t\tabcdefGhijkl\n' |
| 1027 | '? \t ^ ^ ^\n' |
| 1028 | """ |
| 1029 | |
| 1030 | # Can hurt, but will probably help most of the time. |
| 1031 | common = min(_count_leading(aline, "\t"), |
| 1032 | _count_leading(bline, "\t")) |
| 1033 | common = min(common, _count_leading(atags[:common], " ")) |
| 1034 | atags = atags[common:].rstrip() |
| 1035 | btags = btags[common:].rstrip() |
| 1036 | |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 1037 | yield "- " + aline |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 1038 | if atags: |
Tim Peters | 527e64f | 2001-10-04 05:36:56 +0000 | [diff] [blame] | 1039 | yield "? %s%s\n" % ("\t" * common, atags) |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 1040 | |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 1041 | yield "+ " + bline |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 1042 | if btags: |
Tim Peters | 8a9c284 | 2001-09-22 21:30:22 +0000 | [diff] [blame] | 1043 | yield "? %s%s\n" % ("\t" * common, btags) |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 1044 | |
| 1045 | # With respect to junk, an earlier version of ndiff simply refused to |
| 1046 | # *start* a match with a junk element. The result was cases like this: |
| 1047 | # before: private Thread currentThread; |
| 1048 | # after: private volatile Thread currentThread; |
| 1049 | # If you consider whitespace to be junk, the longest contiguous match |
| 1050 | # not starting with junk is "e Thread currentThread". So ndiff reported |
| 1051 | # that "e volatil" was inserted between the 't' and the 'e' in "private". |
| 1052 | # While an accurate view, to people that's absurd. The current version |
| 1053 | # looks for matching blocks that are entirely junk-free, then extends the |
| 1054 | # longest one of those as far as possible but only with matching junk. |
| 1055 | # So now "currentThread" is matched, then extended to suck up the |
| 1056 | # preceding blank; then "private" is matched, and extended to suck up the |
| 1057 | # following blank; then "Thread" is matched; and finally ndiff reports |
| 1058 | # that "volatile " was inserted before "Thread". The only quibble |
| 1059 | # remaining is that perhaps it was really the case that " volatile" |
| 1060 | # was inserted after "private". I can live with that <wink>. |
| 1061 | |
| 1062 | import re |
| 1063 | |
| 1064 | def IS_LINE_JUNK(line, pat=re.compile(r"\s*#?\s*$").match): |
| 1065 | r""" |
| 1066 | Return True for ignorable line: iff `line` is blank or contains a single '#'. |
| 1067 | |
| 1068 | Examples: |
| 1069 | |
| 1070 | >>> IS_LINE_JUNK('\n') |
Guido van Rossum | 77f6a65 | 2002-04-03 22:41:51 +0000 | [diff] [blame] | 1071 | True |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 1072 | >>> IS_LINE_JUNK(' # \n') |
Guido van Rossum | 77f6a65 | 2002-04-03 22:41:51 +0000 | [diff] [blame] | 1073 | True |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 1074 | >>> IS_LINE_JUNK('hello\n') |
Guido van Rossum | 77f6a65 | 2002-04-03 22:41:51 +0000 | [diff] [blame] | 1075 | False |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 1076 | """ |
| 1077 | |
| 1078 | return pat(line) is not None |
| 1079 | |
| 1080 | def IS_CHARACTER_JUNK(ch, ws=" \t"): |
| 1081 | r""" |
| 1082 | Return True for ignorable character: iff `ch` is a space or tab. |
| 1083 | |
| 1084 | Examples: |
| 1085 | |
| 1086 | >>> IS_CHARACTER_JUNK(' ') |
Guido van Rossum | 77f6a65 | 2002-04-03 22:41:51 +0000 | [diff] [blame] | 1087 | True |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 1088 | >>> IS_CHARACTER_JUNK('\t') |
Guido van Rossum | 77f6a65 | 2002-04-03 22:41:51 +0000 | [diff] [blame] | 1089 | True |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 1090 | >>> IS_CHARACTER_JUNK('\n') |
Guido van Rossum | 77f6a65 | 2002-04-03 22:41:51 +0000 | [diff] [blame] | 1091 | False |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 1092 | >>> IS_CHARACTER_JUNK('x') |
Guido van Rossum | 77f6a65 | 2002-04-03 22:41:51 +0000 | [diff] [blame] | 1093 | False |
Tim Peters | 5e824c3 | 2001-08-12 22:25:01 +0000 | [diff] [blame] | 1094 | """ |
| 1095 | |
| 1096 | return ch in ws |
| 1097 | |
| 1098 | del re |
| 1099 | |
Raymond Hettinger | f0b1a1f | 2003-06-08 11:07:08 +0000 | [diff] [blame] | 1100 | |
| 1101 | def unified_diff(a, b, fromfile='', tofile='', fromfiledate='', |
| 1102 | tofiledate='', n=3, lineterm='\n'): |
| 1103 | r""" |
| 1104 | Compare two sequences of lines; generate the delta as a unified diff. |
| 1105 | |
| 1106 | Unified diffs are a compact way of showing line changes and a few |
| 1107 | lines of context. The number of context lines is set by 'n' which |
| 1108 | defaults to three. |
| 1109 | |
Raymond Hettinger | 0887c73 | 2003-06-17 16:53:25 +0000 | [diff] [blame] | 1110 | By default, the diff control lines (those with ---, +++, or @@) are |
Raymond Hettinger | f0b1a1f | 2003-06-08 11:07:08 +0000 | [diff] [blame] | 1111 | created with a trailing newline. This is helpful so that inputs |
| 1112 | created from file.readlines() result in diffs that are suitable for |
| 1113 | file.writelines() since both the inputs and outputs have trailing |
| 1114 | newlines. |
| 1115 | |
| 1116 | For inputs that do not have trailing newlines, set the lineterm |
| 1117 | argument to "" so that the output will be uniformly newline free. |
| 1118 | |
| 1119 | The unidiff format normally has a header for filenames and modification |
| 1120 | times. Any or all of these may be specified using strings for |
| 1121 | 'fromfile', 'tofile', 'fromfiledate', and 'tofiledate'. The modification |
| 1122 | times are normally expressed in the format returned by time.ctime(). |
| 1123 | |
| 1124 | Example: |
| 1125 | |
| 1126 | >>> for line in unified_diff('one two three four'.split(), |
| 1127 | ... 'zero one tree four'.split(), 'Original', 'Current', |
| 1128 | ... 'Sat Jan 26 23:30:50 1991', 'Fri Jun 06 10:20:52 2003', |
| 1129 | ... lineterm=''): |
| 1130 | ... print line |
| 1131 | --- Original Sat Jan 26 23:30:50 1991 |
| 1132 | +++ Current Fri Jun 06 10:20:52 2003 |
| 1133 | @@ -1,4 +1,4 @@ |
| 1134 | +zero |
| 1135 | one |
| 1136 | -two |
| 1137 | -three |
| 1138 | +tree |
| 1139 | four |
| 1140 | """ |
| 1141 | |
| 1142 | started = False |
| 1143 | for group in SequenceMatcher(None,a,b).get_grouped_opcodes(n): |
| 1144 | if not started: |
| 1145 | yield '--- %s %s%s' % (fromfile, fromfiledate, lineterm) |
| 1146 | yield '+++ %s %s%s' % (tofile, tofiledate, lineterm) |
| 1147 | started = True |
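# The unified hunk header is 1-based and uses "start,length" ranges;
# e.g. a group spanning a[4:9] and b[4:10] yields "@@ -5,5 +5,6 @@".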
| 1148 | i1, i2, j1, j2 = group[0][1], group[-1][2], group[0][3], group[-1][4] |
| 1149 | yield "@@ -%d,%d +%d,%d @@%s" % (i1+1, i2-i1, j1+1, j2-j1, lineterm) |
| 1150 | for tag, i1, i2, j1, j2 in group: |
| 1151 | if tag == 'equal': |
| 1152 | for line in a[i1:i2]: |
| 1153 | yield ' ' + line |
| 1154 | continue |
| 1155 | if tag == 'replace' or tag == 'delete': |
| 1156 | for line in a[i1:i2]: |
| 1157 | yield '-' + line |
| 1158 | if tag == 'replace' or tag == 'insert': |
| 1159 | for line in b[j1:j2]: |
| 1160 | yield '+' + line |
| 1161 | |
# See http://www.unix.org/single_unix_specification/
def context_diff(a, b, fromfile='', tofile='',
                 fromfiledate='', tofiledate='', n=3, lineterm='\n'):
    r"""
    Compare two sequences of lines; generate the delta as a context diff.

    Context diffs are a compact way of showing line changes and a few
    lines of context.  The number of context lines is set by 'n' which
    defaults to three.

    By default, the diff control lines (those with *** or ---) are
    created with a trailing newline.  This is helpful so that inputs
    created from file.readlines() result in diffs that are suitable for
    file.writelines() since both the inputs and outputs have trailing
    newlines.

    For inputs that do not have trailing newlines, set the lineterm
    argument to "" so that the output will be uniformly newline free.

    The context diff format normally has a header for filenames and
    modification times.  Any or all of these may be specified using
    strings for 'fromfile', 'tofile', 'fromfiledate', and 'tofiledate'.
    The modification times are normally expressed in the format returned
    by time.ctime().  If not specified, the strings default to blanks.

    Example:

    >>> print ''.join(context_diff('one\ntwo\nthree\nfour\n'.splitlines(1),
    ...       'zero\none\ntree\nfour\n'.splitlines(1), 'Original', 'Current',
    ...       'Sat Jan 26 23:30:50 1991', 'Fri Jun 06 10:22:46 2003')),
    *** Original Sat Jan 26 23:30:50 1991
    --- Current Fri Jun 06 10:22:46 2003
    ***************
    *** 1,4 ****
      one
    ! two
    ! three
      four
    --- 1,4 ----
    + zero
      one
    ! tree
      four
    """

    started = False
    # Per-tag line prefixes used by the context diff format.
    prefixmap = {'insert':'+ ', 'delete':'- ', 'replace':'! ', 'equal':'  '}
    for group in SequenceMatcher(None, a, b).get_grouped_opcodes(n):
        if not started:
            yield '*** %s %s%s' % (fromfile, fromfiledate, lineterm)
            yield '--- %s %s%s' % (tofile, tofiledate, lineterm)
            started = True

        yield '***************%s' % (lineterm,)
        if group[-1][2] - group[0][1] >= 2:
            yield '*** %d,%d ****%s' % (group[0][1]+1, group[-1][2], lineterm)
        else:
            yield '*** %d ****%s' % (group[-1][2], lineterm)
        # The "from" half of the hunk is shown only if it contains changes.
        visiblechanges = [e for e in group if e[0] in ('replace', 'delete')]
        if visiblechanges:
            for tag, i1, i2, _, _ in group:
                if tag != 'insert':
                    for line in a[i1:i2]:
                        yield prefixmap[tag] + line

        if group[-1][4] - group[0][3] >= 2:
            yield '--- %d,%d ----%s' % (group[0][3]+1, group[-1][4], lineterm)
        else:
            yield '--- %d ----%s' % (group[-1][4], lineterm)
        # Likewise, the "to" half is shown only if it contains changes.
        visiblechanges = [e for e in group if e[0] in ('replace', 'insert')]
        if visiblechanges:
            for tag, _, _, j1, j2 in group:
                if tag != 'delete':
                    for line in b[j1:j2]:
                        yield prefixmap[tag] + line

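# Illustrative sketch (an assumed helper, not part of difflib's API):
# context_diff() takes the same arguments as unified_diff().  For inputs
# without trailing newlines, pass lineterm='' and re-add newlines only when
# assembling the output for display.
def _example_context_diff_text(a, b):
    # 'before'/'after' are placeholder file labels for the diff header.
    return '\n'.join(context_diff(a, b, 'before', 'after', lineterm=''))
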
def ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK):
    r"""
    Compare `a` and `b` (lists of strings); return a `Differ`-style delta.

    Optional keyword parameters `linejunk` and `charjunk` are for filter
    functions (or None):

    - linejunk: A function that should accept a single string argument, and
      return true iff the string is junk.  The default is None, which is
      recommended: as of Python 2.3, an adaptive notion of "noise" lines is
      used that does a good job on its own.

    - charjunk: A function that should accept a string of length 1.  The
      default is module-level function IS_CHARACTER_JUNK, which filters out
      whitespace characters (a blank or tab; note: it's a bad idea to include
      newline in this!).

    Tools/scripts/ndiff.py is a command-line front-end to this function.

    Example:

    >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
    ...              'ore\ntree\nemu\n'.splitlines(1))
    >>> print ''.join(diff),
    - one
    ?  ^
    + ore
    ?  ^
    - two
    - three
    ?  -
    + tree
    + emu
    """
    return Differ(linejunk, charjunk).compare(a, b)

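# Sketch of plugging a custom junk filter into ndiff() (the helper and the
# filter below are assumptions for illustration, not part of this module).
# Note that the default linejunk=None is usually the better choice.
def _example_ndiff_ignore_blank_lines(a, b):
    def blank(line):
        return line.strip() == ''
    return ndiff(a, b, linejunk=blank)
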
def restore(delta, which):
    r"""
    Generate one of the two sequences that generated a delta.

    Given a `delta` produced by `Differ.compare()` or `ndiff()`, extract
    lines originating from file 1 or 2 (parameter `which`), stripping off line
    prefixes.

    Examples:

    >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
    ...              'ore\ntree\nemu\n'.splitlines(1))
    >>> diff = list(diff)
    >>> print ''.join(restore(diff, 1)),
    one
    two
    three
    >>> print ''.join(restore(diff, 2)),
    ore
    tree
    emu
    """
    try:
        tag = {1: "- ", 2: "+ "}[int(which)]
    except KeyError:
        raise ValueError('unknown delta choice (must be 1 or 2): %r'
                         % which)
    prefixes = ("  ", tag)
    for line in delta:
        if line[:2] in prefixes:
            yield line[2:]

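# Round-trip sketch (an assumed helper, for illustration only): an ndiff
# delta retains both inputs, so restore() can rebuild either side exactly.
def _example_restore_roundtrip(a, b):
    delta = list(ndiff(a, b))
    assert list(restore(delta, 1)) == a
    assert list(restore(delta, 2)) == b
    return delta
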
def _test():
    import doctest, difflib
    return doctest.testmod(difflib)

if __name__ == "__main__":
    _test()