blob: 299369307f2400b010d81d7b9f6ef76a1c042390 [file] [log] [blame]
Fred Drakebaf71422001-02-19 16:31:02 +00001\section{\module{difflib} ---
2 Helpers for computing deltas}
3
4\declaremodule{standard}{difflib}
5\modulesynopsis{Helpers for computing differences between objects.}
Fred Drake34929f22003-12-30 16:12:27 +00006\moduleauthor{Tim Peters}{tim_one@users.sourceforge.net}
7\sectionauthor{Tim Peters}{tim_one@users.sourceforge.net}
Fred Drakebaf71422001-02-19 16:31:02 +00008% LaTeXification by Fred L. Drake, Jr. <fdrake@acm.org>.
9
Fred Drakeda00cda2001-04-10 19:56:09 +000010\versionadded{2.1}
11
12
Fred Drake6943a292001-08-13 19:31:59 +000013\begin{classdesc*}{SequenceMatcher}
14 This is a flexible class for comparing pairs of sequences of any
15 type, so long as the sequence elements are hashable. The basic
16 algorithm predates, and is a little fancier than, an algorithm
17 published in the late 1980's by Ratcliff and Obershelp under the
18 hyperbolic name ``gestalt pattern matching.'' The idea is to find
19 the longest contiguous matching subsequence that contains no
20 ``junk'' elements (the Ratcliff and Obershelp algorithm doesn't
21 address junk). The same idea is then applied recursively to the
22 pieces of the sequences to the left and to the right of the matching
23 subsequence. This does not yield minimal edit sequences, but does
24 tend to yield matches that ``look right'' to people.
25
26 \strong{Timing:} The basic Ratcliff-Obershelp algorithm is cubic
27 time in the worst case and quadratic time in the expected case.
28 \class{SequenceMatcher} is quadratic time for the worst case and has
29 expected-case behavior dependent in a complicated way on how many
30 elements the sequences have in common; best case time is linear.
31\end{classdesc*}
32
33\begin{classdesc*}{Differ}
34 This is a class for comparing sequences of lines of text, and
Tim Peters8a9c2842001-09-22 21:30:22 +000035 producing human-readable differences or deltas. Differ uses
Fred Drake6943a292001-08-13 19:31:59 +000036 \class{SequenceMatcher} both to compare sequences of lines, and to
37 compare sequences of characters within similar (near-matching)
38 lines.
39
40 Each line of a \class{Differ} delta begins with a two-letter code:
41
42\begin{tableii}{l|l}{code}{Code}{Meaning}
43 \lineii{'- '}{line unique to sequence 1}
44 \lineii{'+ '}{line unique to sequence 2}
45 \lineii{' '}{line common to both sequences}
46 \lineii{'? '}{line not present in either input sequence}
47\end{tableii}
48
49 Lines beginning with `\code{?~}' attempt to guide the eye to
50 intraline differences, and were not present in either input
51 sequence. These lines can be confusing if the sequences contain tab
52 characters.
53\end{classdesc*}
54
Raymond Hettingere07b8352003-06-09 21:44:59 +000055\begin{funcdesc}{context_diff}{a, b\optional{, fromfile\optional{, tofile
56 \optional{, fromfiledate\optional{, tofiledate\optional{, n
57 \optional{, lineterm}}}}}}}
Raymond Hettingere07b8352003-06-09 21:44:59 +000058 Compare \var{a} and \var{b} (lists of strings); return a
59 delta (a generator generating the delta lines) in context diff
60 format.
61
62 Context diffs are a compact way of showing just the lines that have
63 changed plus a few lines of context. The changes are shown in a
64 before/after style. The number of context lines is set by \var{n}
65 which defaults to three.
66
67 By default, the diff control lines (those with \code{***} or \code{---})
68 are created with a trailing newline. This is helpful so that inputs created
69 from \function{file.readlines()} result in diffs that are suitable for use
70 with \function{file.writelines()} since both the inputs and outputs have
71 trailing newlines.
72
73 For inputs that do not have trailing newlines, set the \var{lineterm}
74 argument to \code{""} so that the output will be uniformly newline free.
75
76 The context diff format normally has a header for filenames and
77 modification times. Any or all of these may be specified using strings for
78 \var{fromfile}, \var{tofile}, \var{fromfiledate}, and \var{tofiledate}.
79 The modification times are normally expressed in the format returned by
80 \function{time.ctime()}. If not specified, the strings default to blanks.
81
82 \file{Tools/scripts/diff.py} is a command-line front-end for this
Raymond Hettinger132fa372003-06-11 07:50:44 +000083 function.
84
85 \versionadded{2.3}
Raymond Hettingere07b8352003-06-09 21:44:59 +000086\end{funcdesc}
87
Fred Drakebaf71422001-02-19 16:31:02 +000088\begin{funcdesc}{get_close_matches}{word, possibilities\optional{,
89 n\optional{, cutoff}}}
90 Return a list of the best ``good enough'' matches. \var{word} is a
91 sequence for which close matches are desired (typically a string),
92 and \var{possibilities} is a list of sequences against which to
93 match \var{word} (typically a list of strings).
94
95 Optional argument \var{n} (default \code{3}) is the maximum number
96 of close matches to return; \var{n} must be greater than \code{0}.
97
98 Optional argument \var{cutoff} (default \code{0.6}) is a float in
99 the range [0, 1]. Possibilities that don't score at least that
100 similar to \var{word} are ignored.
101
102 The best (no more than \var{n}) matches among the possibilities are
103 returned in a list, sorted by similarity score, most similar first.
104
105\begin{verbatim}
106>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
107['apple', 'ape']
108>>> import keyword
109>>> get_close_matches('wheel', keyword.kwlist)
110['while']
111>>> get_close_matches('apple', keyword.kwlist)
112[]
113>>> get_close_matches('accept', keyword.kwlist)
114['except']
115\end{verbatim}
116\end{funcdesc}
117
Fred Drake6943a292001-08-13 19:31:59 +0000118\begin{funcdesc}{ndiff}{a, b\optional{, linejunk\optional{,
119 charjunk}}}
120 Compare \var{a} and \var{b} (lists of strings); return a
Tim Peters8a9c2842001-09-22 21:30:22 +0000121 \class{Differ}-style delta (a generator generating the delta lines).
Fred Drakebaf71422001-02-19 16:31:02 +0000122
Fred Drake6943a292001-08-13 19:31:59 +0000123 Optional keyword parameters \var{linejunk} and \var{charjunk} are
124 for filter functions (or \code{None}):
125
Tim Peters81b92512002-04-29 01:37:32 +0000126 \var{linejunk}: A function that accepts a single string
127 argument, and returns true if the string is junk, or false if not.
128 The default is (\code{None}), starting with Python 2.3. Before then,
129 the default was the module-level function
Fred Drake6943a292001-08-13 19:31:59 +0000130 \function{IS_LINE_JUNK()}, which filters out lines without visible
131 characters, except for at most one pound character (\character{\#}).
Tim Peters81b92512002-04-29 01:37:32 +0000132 As of Python 2.3, the underlying \class{SequenceMatcher} class
133 does a dynamic analysis of which lines are so frequent as to
134 constitute noise, and this usually works better than the pre-2.3
135 default.
Fred Drake6943a292001-08-13 19:31:59 +0000136
Tim Peters81b92512002-04-29 01:37:32 +0000137 \var{charjunk}: A function that accepts a character (a string of
138 length 1), and returns if the character is junk, or false if not.
Fred Drake6943a292001-08-13 19:31:59 +0000139 The default is module-level function \function{IS_CHARACTER_JUNK()},
140 which filters out whitespace characters (a blank or tab; note: bad
141 idea to include newline in this!).
142
143 \file{Tools/scripts/ndiff.py} is a command-line front-end to this
144 function.
145
146\begin{verbatim}
147>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
Fred Drakeedb635f2002-12-06 18:52:28 +0000148... 'ore\ntree\nemu\n'.splitlines(1))
Fred Drake6943a292001-08-13 19:31:59 +0000149>>> print ''.join(diff),
150- one
Tim Peters8a9c2842001-09-22 21:30:22 +0000151? ^
Fred Drake6943a292001-08-13 19:31:59 +0000152+ ore
Tim Peters8a9c2842001-09-22 21:30:22 +0000153? ^
Fred Drake6943a292001-08-13 19:31:59 +0000154- two
155- three
Tim Peters8a9c2842001-09-22 21:30:22 +0000156? -
Fred Drake6943a292001-08-13 19:31:59 +0000157+ tree
158+ emu
159\end{verbatim}
160\end{funcdesc}
161
162\begin{funcdesc}{restore}{sequence, which}
163 Return one of the two sequences that generated a delta.
164
165 Given a \var{sequence} produced by \method{Differ.compare()} or
166 \function{ndiff()}, extract lines originating from file 1 or 2
167 (parameter \var{which}), stripping off line prefixes.
168
169 Example:
170
171\begin{verbatim}
172>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
173... 'ore\ntree\nemu\n'.splitlines(1))
Tim Peters8a9c2842001-09-22 21:30:22 +0000174>>> diff = list(diff) # materialize the generated delta into a list
Fred Drake6943a292001-08-13 19:31:59 +0000175>>> print ''.join(restore(diff, 1)),
176one
177two
178three
179>>> print ''.join(restore(diff, 2)),
180ore
181tree
182emu
183\end{verbatim}
184
185\end{funcdesc}
186
Raymond Hettingere07b8352003-06-09 21:44:59 +0000187\begin{funcdesc}{unified_diff}{a, b\optional{, fromfile\optional{, tofile
188 \optional{, fromfiledate\optional{, tofiledate\optional{, n
189 \optional{, lineterm}}}}}}}
Raymond Hettingere07b8352003-06-09 21:44:59 +0000190 Compare \var{a} and \var{b} (lists of strings); return a
191 delta (a generator generating the delta lines) in unified diff
192 format.
193
194 Unified diffs are a compact way of showing just the lines that have
195 changed plus a few lines of context. The changes are shown in a
196 inline style (instead of separate before/after blocks). The number
197 of context lines is set by \var{n} which defaults to three.
198
199 By default, the diff control lines (those with \code{---}, \code{+++},
200 or \code{@@}) are created with a trailing newline. This is helpful so
201 that inputs created from \function{file.readlines()} result in diffs
202 that are suitable for use with \function{file.writelines()} since both
203 the inputs and outputs have trailing newlines.
204
205 For inputs that do not have trailing newlines, set the \var{lineterm}
206 argument to \code{""} so that the output will be uniformly newline free.
207
208 The context diff format normally has a header for filenames and
209 modification times. Any or all of these may be specified using strings for
210 \var{fromfile}, \var{tofile}, \var{fromfiledate}, and \var{tofiledate}.
211 The modification times are normally expressed in the format returned by
212 \function{time.ctime()}. If not specified, the strings default to blanks.
213
214 \file{Tools/scripts/diff.py} is a command-line front-end for this
Raymond Hettinger132fa372003-06-11 07:50:44 +0000215 function.
216
217 \versionadded{2.3}
Raymond Hettingere07b8352003-06-09 21:44:59 +0000218\end{funcdesc}
Fred Drake6943a292001-08-13 19:31:59 +0000219
Fred Drake7f10cce2001-10-26 03:04:23 +0000220\begin{funcdesc}{IS_LINE_JUNK}{line}
221 Return true for ignorable lines. The line \var{line} is ignorable
222 if \var{line} is blank or contains a single \character{\#},
223 otherwise it is not ignorable. Used as a default for parameter
Tim Peters81b92512002-04-29 01:37:32 +0000224 \var{linejunk} in \function{ndiff()} before Python 2.3.
Fred Drake6943a292001-08-13 19:31:59 +0000225\end{funcdesc}
226
227
Fred Drake7f10cce2001-10-26 03:04:23 +0000228\begin{funcdesc}{IS_CHARACTER_JUNK}{ch}
229 Return true for ignorable characters. The character \var{ch} is
230 ignorable if \var{ch} is a space or tab, otherwise it is not
231 ignorable. Used as a default for parameter \var{charjunk} in
Fred Drake6943a292001-08-13 19:31:59 +0000232 \function{ndiff()}.
Fred Drake6943a292001-08-13 19:31:59 +0000233\end{funcdesc}
Fred Drakebaf71422001-02-19 16:31:02 +0000234
235
Fred Drake6fda3ac2001-04-10 18:41:16 +0000236\begin{seealso}
237 \seetitle{Pattern Matching: The Gestalt Approach}{Discussion of a
238 similar algorithm by John W. Ratcliff and D. E. Metzener.
239 This was published in
240 \citetitle[http://www.ddj.com/]{Dr. Dobb's Journal} in
241 July, 1988.}
242\end{seealso}
243
244
Fred Drakebaf71422001-02-19 16:31:02 +0000245\subsection{SequenceMatcher Objects \label{sequence-matcher}}
246
Fred Drake96d7a702001-05-11 01:08:13 +0000247The \class{SequenceMatcher} class has this constructor:
248
Fred Drakebaf71422001-02-19 16:31:02 +0000249\begin{classdesc}{SequenceMatcher}{\optional{isjunk\optional{,
250 a\optional{, b}}}}
251 Optional argument \var{isjunk} must be \code{None} (the default) or
252 a one-argument function that takes a sequence element and returns
253 true if and only if the element is ``junk'' and should be ignored.
Fred Drake7f10cce2001-10-26 03:04:23 +0000254 Passing \code{None} for \var{b} is equivalent to passing
255 \code{lambda x: 0}; in other words, no elements are ignored. For
256 example, pass:
Fred Drakebaf71422001-02-19 16:31:02 +0000257
258\begin{verbatim}
Fred Drake447f5452001-02-23 19:13:07 +0000259lambda x: x in " \t"
Fred Drakebaf71422001-02-19 16:31:02 +0000260\end{verbatim}
261
262 if you're comparing lines as sequences of characters, and don't want
263 to synch up on blanks or hard tabs.
264
265 The optional arguments \var{a} and \var{b} are sequences to be
266 compared; both default to empty strings. The elements of both
267 sequences must be hashable.
268\end{classdesc}
269
270
271\class{SequenceMatcher} objects have the following methods:
272
273\begin{methoddesc}{set_seqs}{a, b}
274 Set the two sequences to be compared.
275\end{methoddesc}
276
277\class{SequenceMatcher} computes and caches detailed information about
278the second sequence, so if you want to compare one sequence against
279many sequences, use \method{set_seq2()} to set the commonly used
280sequence once and call \method{set_seq1()} repeatedly, once for each
281of the other sequences.
282
283\begin{methoddesc}{set_seq1}{a}
284 Set the first sequence to be compared. The second sequence to be
285 compared is not changed.
286\end{methoddesc}
287
288\begin{methoddesc}{set_seq2}{b}
289 Set the second sequence to be compared. The first sequence to be
290 compared is not changed.
291\end{methoddesc}
292
293\begin{methoddesc}{find_longest_match}{alo, ahi, blo, bhi}
294 Find longest matching block in \code{\var{a}[\var{alo}:\var{ahi}]}
295 and \code{\var{b}[\var{blo}:\var{bhi}]}.
296
297 If \var{isjunk} was omitted or \code{None},
298 \method{get_longest_match()} returns \code{(\var{i}, \var{j},
299 \var{k})} such that \code{\var{a}[\var{i}:\var{i}+\var{k}]} is equal
Tim Peters8a9c2842001-09-22 21:30:22 +0000300 to \code{\var{b}[\var{j}:\var{j}+\var{k}]}, where
Fred Drakebaf71422001-02-19 16:31:02 +0000301 \code{\var{alo} <= \var{i} <= \var{i}+\var{k} <= \var{ahi}} and
302 \code{\var{blo} <= \var{j} <= \var{j}+\var{k} <= \var{bhi}}.
303 For all \code{(\var{i'}, \var{j'}, \var{k'})} meeting those
304 conditions, the additional conditions
305 \code{\var{k} >= \var{k'}},
306 \code{\var{i} <= \var{i'}},
307 and if \code{\var{i} == \var{i'}}, \code{\var{j} <= \var{j'}}
308 are also met.
309 In other words, of all maximal matching blocks, return one that
310 starts earliest in \var{a}, and of all those maximal matching blocks
311 that start earliest in \var{a}, return the one that starts earliest
312 in \var{b}.
313
314\begin{verbatim}
315>>> s = SequenceMatcher(None, " abcd", "abcd abcd")
316>>> s.find_longest_match(0, 5, 0, 9)
317(0, 4, 5)
318\end{verbatim}
319
320 If \var{isjunk} was provided, first the longest matching block is
321 determined as above, but with the additional restriction that no
322 junk element appears in the block. Then that block is extended as
323 far as possible by matching (only) junk elements on both sides.
324 So the resulting block never matches on junk except as identical
325 junk happens to be adjacent to an interesting match.
326
327 Here's the same example as before, but considering blanks to be junk.
Tim Peters754ba582001-02-20 11:24:35 +0000328 That prevents \code{' abcd'} from matching the \code{' abcd'} at the
Fred Drakebaf71422001-02-19 16:31:02 +0000329 tail end of the second sequence directly. Instead only the
330 \code{'abcd'} can match, and matches the leftmost \code{'abcd'} in
331 the second sequence:
332
333\begin{verbatim}
334>>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
335>>> s.find_longest_match(0, 5, 0, 9)
336(1, 0, 4)
337\end{verbatim}
338
339 If no blocks match, this returns \code{(\var{alo}, \var{blo}, 0)}.
340\end{methoddesc}
341
342\begin{methoddesc}{get_matching_blocks}{}
343 Return list of triples describing matching subsequences.
344 Each triple is of the form \code{(\var{i}, \var{j}, \var{n})}, and
345 means that \code{\var{a}[\var{i}:\var{i}+\var{n}] ==
346 \var{b}[\var{j}:\var{j}+\var{n}]}. The triples are monotonically
347 increasing in \var{i} and \var{j}.
348
349 The last triple is a dummy, and has the value \code{(len(\var{a}),
350 len(\var{b}), 0)}. It is the only triple with \code{\var{n} == 0}.
351 % Explain why a dummy is used!
352
353\begin{verbatim}
354>>> s = SequenceMatcher(None, "abxcd", "abcd")
355>>> s.get_matching_blocks()
356[(0, 0, 2), (3, 2, 2), (5, 4, 0)]
357\end{verbatim}
358\end{methoddesc}
359
360\begin{methoddesc}{get_opcodes}{}
361 Return list of 5-tuples describing how to turn \var{a} into \var{b}.
362 Each tuple is of the form \code{(\var{tag}, \var{i1}, \var{i2},
363 \var{j1}, \var{j2})}. The first tuple has \code{\var{i1} ==
364 \var{j1} == 0}, and remaining tuples have \var{i1} equal to the
365 \var{i2} from the preceeding tuple, and, likewise, \var{j1} equal to
366 the previous \var{j2}.
367
368 The \var{tag} values are strings, with these meanings:
369
370\begin{tableii}{l|l}{code}{Value}{Meaning}
371 \lineii{'replace'}{\code{\var{a}[\var{i1}:\var{i2}]} should be
372 replaced by \code{\var{b}[\var{j1}:\var{j2}]}.}
373 \lineii{'delete'}{\code{\var{a}[\var{i1}:\var{i2}]} should be
374 deleted. Note that \code{\var{j1} == \var{j2}} in
375 this case.}
376 \lineii{'insert'}{\code{\var{b}[\var{j1}:\var{j2}]} should be
Tim Peters8a9c2842001-09-22 21:30:22 +0000377 inserted at \code{\var{a}[\var{i1}:\var{i1}]}.
Fred Drakebaf71422001-02-19 16:31:02 +0000378 Note that \code{\var{i1} == \var{i2}} in this
379 case.}
380 \lineii{'equal'}{\code{\var{a}[\var{i1}:\var{i2}] ==
381 \var{b}[\var{j1}:\var{j2}]} (the sub-sequences are
382 equal).}
383\end{tableii}
384
385For example:
386
387\begin{verbatim}
388>>> a = "qabxcd"
389>>> b = "abycdf"
390>>> s = SequenceMatcher(None, a, b)
391>>> for tag, i1, i2, j1, j2 in s.get_opcodes():
392... print ("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
393... (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))
394 delete a[0:1] (q) b[0:0] ()
395 equal a[1:3] (ab) b[0:2] (ab)
396replace a[3:4] (x) b[2:3] (y)
397 equal a[4:6] (cd) b[3:5] (cd)
398 insert a[6:6] () b[5:6] (f)
399\end{verbatim}
400\end{methoddesc}
401
Raymond Hettinger132fa372003-06-11 07:50:44 +0000402\begin{methoddesc}{get_grouped_opcodes}{\optional{n}}
403 Return a generator of groups with up to \var{n} lines of context.
404
405 Starting with the groups returned by \method{get_opcodes()},
406 this method splits out smaller change clusters and eliminates
407 intervening ranges which have no changes.
408
409 The groups are returned in the same format as \method{get_opcodes()}.
410 \versionadded{2.3}
411\end{methoddesc}
412
Fred Drakebaf71422001-02-19 16:31:02 +0000413\begin{methoddesc}{ratio}{}
414 Return a measure of the sequences' similarity as a float in the
415 range [0, 1].
416
417 Where T is the total number of elements in both sequences, and M is
Fred Drake6943a292001-08-13 19:31:59 +0000418 the number of matches, this is 2.0*M / T. Note that this is
419 \code{1.0} if the sequences are identical, and \code{0.0} if they
420 have nothing in common.
Fred Drakebaf71422001-02-19 16:31:02 +0000421
422 This is expensive to compute if \method{get_matching_blocks()} or
423 \method{get_opcodes()} hasn't already been called, in which case you
424 may want to try \method{quick_ratio()} or
425 \method{real_quick_ratio()} first to get an upper bound.
426\end{methoddesc}
427
428\begin{methoddesc}{quick_ratio}{}
429 Return an upper bound on \method{ratio()} relatively quickly.
430
431 This isn't defined beyond that it is an upper bound on
432 \method{ratio()}, and is faster to compute.
433\end{methoddesc}
434
435\begin{methoddesc}{real_quick_ratio}{}
436 Return an upper bound on \method{ratio()} very quickly.
437
438 This isn't defined beyond that it is an upper bound on
439 \method{ratio()}, and is faster to compute than either
440 \method{ratio()} or \method{quick_ratio()}.
441\end{methoddesc}
442
Tim Peters754ba582001-02-20 11:24:35 +0000443The three methods that return the ratio of matching to total characters
444can give different results due to differing levels of approximation,
445although \method{quick_ratio()} and \method{real_quick_ratio()} are always
446at least as large as \method{ratio()}:
Fred Drakebaf71422001-02-19 16:31:02 +0000447
448\begin{verbatim}
449>>> s = SequenceMatcher(None, "abcd", "bcde")
450>>> s.ratio()
4510.75
452>>> s.quick_ratio()
4530.75
454>>> s.real_quick_ratio()
4551.0
456\end{verbatim}
457
458
Fred Drake6943a292001-08-13 19:31:59 +0000459\subsection{SequenceMatcher Examples \label{sequencematcher-examples}}
Fred Drakebaf71422001-02-19 16:31:02 +0000460
461
462This example compares two strings, considering blanks to be ``junk:''
463
464\begin{verbatim}
465>>> s = SequenceMatcher(lambda x: x == " ",
466... "private Thread currentThread;",
467... "private volatile Thread currentThread;")
468\end{verbatim}
469
470\method{ratio()} returns a float in [0, 1], measuring the similarity
471of the sequences. As a rule of thumb, a \method{ratio()} value over
4720.6 means the sequences are close matches:
473
474\begin{verbatim}
475>>> print round(s.ratio(), 3)
4760.866
477\end{verbatim}
478
479If you're only interested in where the sequences match,
480\method{get_matching_blocks()} is handy:
481
482\begin{verbatim}
483>>> for block in s.get_matching_blocks():
484... print "a[%d] and b[%d] match for %d elements" % block
485a[0] and b[0] match for 8 elements
486a[8] and b[17] match for 6 elements
487a[14] and b[23] match for 15 elements
488a[29] and b[38] match for 0 elements
489\end{verbatim}
490
491Note that the last tuple returned by \method{get_matching_blocks()} is
492always a dummy, \code{(len(\var{a}), len(\var{b}), 0)}, and this is
493the only case in which the last tuple element (number of elements
494matched) is \code{0}.
495
496If you want to know how to change the first sequence into the second,
497use \method{get_opcodes()}:
498
499\begin{verbatim}
500>>> for opcode in s.get_opcodes():
501... print "%6s a[%d:%d] b[%d:%d]" % opcode
502 equal a[0:8] b[0:8]
503insert a[8:8] b[8:17]
504 equal a[8:14] b[17:23]
505 equal a[14:29] b[23:38]
506\end{verbatim}
507
Fred Drakebaf71422001-02-19 16:31:02 +0000508See also the function \function{get_close_matches()} in this module,
509which shows how simple code building on \class{SequenceMatcher} can be
510used to do useful work.
Fred Drake6943a292001-08-13 19:31:59 +0000511
512
513\subsection{Differ Objects \label{differ-objects}}
514
515Note that \class{Differ}-generated deltas make no claim to be
516\strong{minimal} diffs. To the contrary, minimal diffs are often
517counter-intuitive, because they synch up anywhere possible, sometimes
518accidental matches 100 pages apart. Restricting synch points to
519contiguous matches preserves some notion of locality, at the
520occasional cost of producing a longer diff.
521
522The \class{Differ} class has this constructor:
523
524\begin{classdesc}{Differ}{\optional{linejunk\optional{, charjunk}}}
525 Optional keyword parameters \var{linejunk} and \var{charjunk} are
526 for filter functions (or \code{None}):
527
Tim Peters81b92512002-04-29 01:37:32 +0000528 \var{linejunk}: A function that accepts a single string
529 argument, and returns true if the string is junk. The default is
530 \code{None}, meaning that no line is considered junk.
Fred Drake6943a292001-08-13 19:31:59 +0000531
Tim Peters81b92512002-04-29 01:37:32 +0000532 \var{charjunk}: A function that accepts a single character argument
533 (a string of length 1), and returns true if the character is junk.
534 The default is \code{None}, meaning that no character is
535 considered junk.
Fred Drake6943a292001-08-13 19:31:59 +0000536\end{classdesc}
537
538\class{Differ} objects are used (deltas generated) via a single
539method:
540
541\begin{methoddesc}{compare}{a, b}
Tim Peters8a9c2842001-09-22 21:30:22 +0000542 Compare two sequences of lines, and generate the delta (a sequence
543 of lines).
Fred Drake6943a292001-08-13 19:31:59 +0000544
545 Each sequence must contain individual single-line strings ending
546 with newlines. Such sequences can be obtained from the
Tim Peters8a9c2842001-09-22 21:30:22 +0000547 \method{readlines()} method of file-like objects. The delta generated
548 also consists of newline-terminated strings, ready to be printed as-is
Fred Drake389aa172001-11-29 19:04:50 +0000549 via the \method{writelines()} method of a file-like object.
Fred Drake6943a292001-08-13 19:31:59 +0000550\end{methoddesc}
551
552
553\subsection{Differ Example \label{differ-examples}}
554
555This example compares two texts. First we set up the texts, sequences
556of individual single-line strings ending with newlines (such sequences
557can also be obtained from the \method{readlines()} method of file-like
558objects):
559
560\begin{verbatim}
561>>> text1 = ''' 1. Beautiful is better than ugly.
562... 2. Explicit is better than implicit.
563... 3. Simple is better than complex.
564... 4. Complex is better than complicated.
565... '''.splitlines(1)
566>>> len(text1)
5674
568>>> text1[0][-1]
569'\n'
570>>> text2 = ''' 1. Beautiful is better than ugly.
571... 3. Simple is better than complex.
572... 4. Complicated is better than complex.
573... 5. Flat is better than nested.
574... '''.splitlines(1)
575\end{verbatim}
576
577Next we instantiate a Differ object:
578
579\begin{verbatim}
580>>> d = Differ()
581\end{verbatim}
582
583Note that when instantiating a \class{Differ} object we may pass
584functions to filter out line and character ``junk.'' See the
585\method{Differ()} constructor for details.
586
587Finally, we compare the two:
588
589\begin{verbatim}
Tim Peters8a9c2842001-09-22 21:30:22 +0000590>>> result = list(d.compare(text1, text2))
Fred Drake6943a292001-08-13 19:31:59 +0000591\end{verbatim}
592
593\code{result} is a list of strings, so let's pretty-print it:
594
595\begin{verbatim}
596>>> from pprint import pprint
597>>> pprint(result)
598[' 1. Beautiful is better than ugly.\n',
599 '- 2. Explicit is better than implicit.\n',
600 '- 3. Simple is better than complex.\n',
601 '+ 3. Simple is better than complex.\n',
602 '? ++ \n',
603 '- 4. Complex is better than complicated.\n',
604 '? ^ ---- ^ \n',
605 '+ 4. Complicated is better than complex.\n',
606 '? ++++ ^ ^ \n',
607 '+ 5. Flat is better than nested.\n']
608\end{verbatim}
609
610As a single multi-line string it looks like this:
611
612\begin{verbatim}
613>>> import sys
614>>> sys.stdout.writelines(result)
615 1. Beautiful is better than ugly.
616- 2. Explicit is better than implicit.
617- 3. Simple is better than complex.
618+ 3. Simple is better than complex.
619? ++
620- 4. Complex is better than complicated.
621? ^ ---- ^
622+ 4. Complicated is better than complex.
623? ++++ ^ ^
624+ 5. Flat is better than nested.
625\end{verbatim}