blob: fc588ffdf8ef67c63dcbc3d45a146fc5212c2f20 [file] [log] [blame]
Fred Drakebaf71422001-02-19 16:31:02 +00001\section{\module{difflib} ---
2 Helpers for computing deltas}
3
4\declaremodule{standard}{difflib}
5\modulesynopsis{Helpers for computing differences between objects.}
Fred Drake34929f22003-12-30 16:12:27 +00006\moduleauthor{Tim Peters}{tim_one@users.sourceforge.net}
7\sectionauthor{Tim Peters}{tim_one@users.sourceforge.net}
Fred Drakebaf71422001-02-19 16:31:02 +00008% LaTeXification by Fred L. Drake, Jr. <fdrake@acm.org>.
9
Fred Drakeda00cda2001-04-10 19:56:09 +000010\versionadded{2.1}
11
12
Fred Drake6943a292001-08-13 19:31:59 +000013\begin{classdesc*}{SequenceMatcher}
14 This is a flexible class for comparing pairs of sequences of any
15 type, so long as the sequence elements are hashable. The basic
16 algorithm predates, and is a little fancier than, an algorithm
17 published in the late 1980's by Ratcliff and Obershelp under the
18 hyperbolic name ``gestalt pattern matching.'' The idea is to find
19 the longest contiguous matching subsequence that contains no
20 ``junk'' elements (the Ratcliff and Obershelp algorithm doesn't
21 address junk). The same idea is then applied recursively to the
22 pieces of the sequences to the left and to the right of the matching
23 subsequence. This does not yield minimal edit sequences, but does
24 tend to yield matches that ``look right'' to people.
25
26 \strong{Timing:} The basic Ratcliff-Obershelp algorithm is cubic
27 time in the worst case and quadratic time in the expected case.
28 \class{SequenceMatcher} is quadratic time for the worst case and has
29 expected-case behavior dependent in a complicated way on how many
30 elements the sequences have in common; best case time is linear.
31\end{classdesc*}
32
33\begin{classdesc*}{Differ}
34 This is a class for comparing sequences of lines of text, and
Tim Peters8a9c2842001-09-22 21:30:22 +000035 producing human-readable differences or deltas. Differ uses
Fred Drake6943a292001-08-13 19:31:59 +000036 \class{SequenceMatcher} both to compare sequences of lines, and to
37 compare sequences of characters within similar (near-matching)
38 lines.
39
40 Each line of a \class{Differ} delta begins with a two-letter code:
41
42\begin{tableii}{l|l}{code}{Code}{Meaning}
43 \lineii{'- '}{line unique to sequence 1}
44 \lineii{'+ '}{line unique to sequence 2}
45 \lineii{' '}{line common to both sequences}
46 \lineii{'? '}{line not present in either input sequence}
47\end{tableii}
48
49 Lines beginning with `\code{?~}' attempt to guide the eye to
50 intraline differences, and were not present in either input
51 sequence. These lines can be confusing if the sequences contain tab
52 characters.
53\end{classdesc*}
54
Raymond Hettingere07b8352003-06-09 21:44:59 +000055\begin{funcdesc}{context_diff}{a, b\optional{, fromfile\optional{, tofile
56 \optional{, fromfiledate\optional{, tofiledate\optional{, n
57 \optional{, lineterm}}}}}}}
Raymond Hettingere07b8352003-06-09 21:44:59 +000058 Compare \var{a} and \var{b} (lists of strings); return a
59 delta (a generator generating the delta lines) in context diff
60 format.
61
62 Context diffs are a compact way of showing just the lines that have
63 changed plus a few lines of context. The changes are shown in a
64 before/after style. The number of context lines is set by \var{n}
65 which defaults to three.
66
67 By default, the diff control lines (those with \code{***} or \code{---})
68 are created with a trailing newline. This is helpful so that inputs created
69 from \function{file.readlines()} result in diffs that are suitable for use
70 with \function{file.writelines()} since both the inputs and outputs have
71 trailing newlines.
72
73 For inputs that do not have trailing newlines, set the \var{lineterm}
74 argument to \code{""} so that the output will be uniformly newline free.
75
76 The context diff format normally has a header for filenames and
77 modification times. Any or all of these may be specified using strings for
78 \var{fromfile}, \var{tofile}, \var{fromfiledate}, and \var{tofiledate}.
79 The modification times are normally expressed in the format returned by
80 \function{time.ctime()}. If not specified, the strings default to blanks.
81
82 \file{Tools/scripts/diff.py} is a command-line front-end for this
Raymond Hettinger132fa372003-06-11 07:50:44 +000083 function.
84
85 \versionadded{2.3}
Raymond Hettingere07b8352003-06-09 21:44:59 +000086\end{funcdesc}
87
Fred Drakebaf71422001-02-19 16:31:02 +000088\begin{funcdesc}{get_close_matches}{word, possibilities\optional{,
89 n\optional{, cutoff}}}
90 Return a list of the best ``good enough'' matches. \var{word} is a
91 sequence for which close matches are desired (typically a string),
92 and \var{possibilities} is a list of sequences against which to
93 match \var{word} (typically a list of strings).
94
95 Optional argument \var{n} (default \code{3}) is the maximum number
96 of close matches to return; \var{n} must be greater than \code{0}.
97
98 Optional argument \var{cutoff} (default \code{0.6}) is a float in
99 the range [0, 1]. Possibilities that don't score at least that
100 similar to \var{word} are ignored.
101
102 The best (no more than \var{n}) matches among the possibilities are
103 returned in a list, sorted by similarity score, most similar first.
104
105\begin{verbatim}
106>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
107['apple', 'ape']
108>>> import keyword
109>>> get_close_matches('wheel', keyword.kwlist)
110['while']
111>>> get_close_matches('apple', keyword.kwlist)
112[]
113>>> get_close_matches('accept', keyword.kwlist)
114['except']
115\end{verbatim}
116\end{funcdesc}
117
Fred Drake6943a292001-08-13 19:31:59 +0000118\begin{funcdesc}{ndiff}{a, b\optional{, linejunk\optional{,
119 charjunk}}}
120 Compare \var{a} and \var{b} (lists of strings); return a
Tim Peters8a9c2842001-09-22 21:30:22 +0000121 \class{Differ}-style delta (a generator generating the delta lines).
Fred Drakebaf71422001-02-19 16:31:02 +0000122
Fred Drake6943a292001-08-13 19:31:59 +0000123 Optional keyword parameters \var{linejunk} and \var{charjunk} are
124 for filter functions (or \code{None}):
125
Tim Peters81b92512002-04-29 01:37:32 +0000126 \var{linejunk}: A function that accepts a single string
127 argument, and returns true if the string is junk, or false if not.
128 The default is (\code{None}), starting with Python 2.3. Before then,
129 the default was the module-level function
Fred Drake6943a292001-08-13 19:31:59 +0000130 \function{IS_LINE_JUNK()}, which filters out lines without visible
131 characters, except for at most one pound character (\character{\#}).
Tim Peters81b92512002-04-29 01:37:32 +0000132 As of Python 2.3, the underlying \class{SequenceMatcher} class
133 does a dynamic analysis of which lines are so frequent as to
134 constitute noise, and this usually works better than the pre-2.3
135 default.
Fred Drake6943a292001-08-13 19:31:59 +0000136
Tim Peters81b92512002-04-29 01:37:32 +0000137 \var{charjunk}: A function that accepts a character (a string of
138 length 1), and returns if the character is junk, or false if not.
Fred Drake6943a292001-08-13 19:31:59 +0000139 The default is module-level function \function{IS_CHARACTER_JUNK()},
140 which filters out whitespace characters (a blank or tab; note: bad
141 idea to include newline in this!).
142
143 \file{Tools/scripts/ndiff.py} is a command-line front-end to this
144 function.
145
146\begin{verbatim}
147>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
Fred Drakeedb635f2002-12-06 18:52:28 +0000148... 'ore\ntree\nemu\n'.splitlines(1))
Fred Drake6943a292001-08-13 19:31:59 +0000149>>> print ''.join(diff),
150- one
Tim Peters8a9c2842001-09-22 21:30:22 +0000151? ^
Fred Drake6943a292001-08-13 19:31:59 +0000152+ ore
Tim Peters8a9c2842001-09-22 21:30:22 +0000153? ^
Fred Drake6943a292001-08-13 19:31:59 +0000154- two
155- three
Tim Peters8a9c2842001-09-22 21:30:22 +0000156? -
Fred Drake6943a292001-08-13 19:31:59 +0000157+ tree
158+ emu
159\end{verbatim}
160\end{funcdesc}
161
162\begin{funcdesc}{restore}{sequence, which}
163 Return one of the two sequences that generated a delta.
164
165 Given a \var{sequence} produced by \method{Differ.compare()} or
166 \function{ndiff()}, extract lines originating from file 1 or 2
167 (parameter \var{which}), stripping off line prefixes.
168
169 Example:
170
171\begin{verbatim}
172>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
173... 'ore\ntree\nemu\n'.splitlines(1))
Tim Peters8a9c2842001-09-22 21:30:22 +0000174>>> diff = list(diff) # materialize the generated delta into a list
Fred Drake6943a292001-08-13 19:31:59 +0000175>>> print ''.join(restore(diff, 1)),
176one
177two
178three
179>>> print ''.join(restore(diff, 2)),
180ore
181tree
182emu
183\end{verbatim}
184
185\end{funcdesc}
186
Raymond Hettingere07b8352003-06-09 21:44:59 +0000187\begin{funcdesc}{unified_diff}{a, b\optional{, fromfile\optional{, tofile
188 \optional{, fromfiledate\optional{, tofiledate\optional{, n
189 \optional{, lineterm}}}}}}}
Raymond Hettingere07b8352003-06-09 21:44:59 +0000190 Compare \var{a} and \var{b} (lists of strings); return a
191 delta (a generator generating the delta lines) in unified diff
192 format.
193
194 Unified diffs are a compact way of showing just the lines that have
195 changed plus a few lines of context. The changes are shown in a
196 inline style (instead of separate before/after blocks). The number
197 of context lines is set by \var{n} which defaults to three.
198
199 By default, the diff control lines (those with \code{---}, \code{+++},
200 or \code{@@}) are created with a trailing newline. This is helpful so
201 that inputs created from \function{file.readlines()} result in diffs
202 that are suitable for use with \function{file.writelines()} since both
203 the inputs and outputs have trailing newlines.
204
205 For inputs that do not have trailing newlines, set the \var{lineterm}
206 argument to \code{""} so that the output will be uniformly newline free.
207
208 The context diff format normally has a header for filenames and
209 modification times. Any or all of these may be specified using strings for
210 \var{fromfile}, \var{tofile}, \var{fromfiledate}, and \var{tofiledate}.
211 The modification times are normally expressed in the format returned by
212 \function{time.ctime()}. If not specified, the strings default to blanks.
213
214 \file{Tools/scripts/diff.py} is a command-line front-end for this
Raymond Hettinger132fa372003-06-11 07:50:44 +0000215 function.
216
217 \versionadded{2.3}
Raymond Hettingere07b8352003-06-09 21:44:59 +0000218\end{funcdesc}
Fred Drake6943a292001-08-13 19:31:59 +0000219
Fred Drake7f10cce2001-10-26 03:04:23 +0000220\begin{funcdesc}{IS_LINE_JUNK}{line}
221 Return true for ignorable lines. The line \var{line} is ignorable
222 if \var{line} is blank or contains a single \character{\#},
223 otherwise it is not ignorable. Used as a default for parameter
Tim Peters81b92512002-04-29 01:37:32 +0000224 \var{linejunk} in \function{ndiff()} before Python 2.3.
Fred Drake6943a292001-08-13 19:31:59 +0000225\end{funcdesc}
226
227
Fred Drake7f10cce2001-10-26 03:04:23 +0000228\begin{funcdesc}{IS_CHARACTER_JUNK}{ch}
229 Return true for ignorable characters. The character \var{ch} is
230 ignorable if \var{ch} is a space or tab, otherwise it is not
231 ignorable. Used as a default for parameter \var{charjunk} in
Fred Drake6943a292001-08-13 19:31:59 +0000232 \function{ndiff()}.
Fred Drake6943a292001-08-13 19:31:59 +0000233\end{funcdesc}
Fred Drakebaf71422001-02-19 16:31:02 +0000234
235
Fred Drake6fda3ac2001-04-10 18:41:16 +0000236\begin{seealso}
Fred Drake1fe97502004-01-21 18:30:28 +0000237 \seetitle[http://www.ddj.com/documents/s=1103/ddj8807c/]
238 {Pattern Matching: The Gestalt Approach}{Discussion of a
Fred Drake6fda3ac2001-04-10 18:41:16 +0000239 similar algorithm by John W. Ratcliff and D. E. Metzener.
240 This was published in
241 \citetitle[http://www.ddj.com/]{Dr. Dobb's Journal} in
242 July, 1988.}
243\end{seealso}
244
245
Fred Drakebaf71422001-02-19 16:31:02 +0000246\subsection{SequenceMatcher Objects \label{sequence-matcher}}
247
Fred Drake96d7a702001-05-11 01:08:13 +0000248The \class{SequenceMatcher} class has this constructor:
249
Fred Drakebaf71422001-02-19 16:31:02 +0000250\begin{classdesc}{SequenceMatcher}{\optional{isjunk\optional{,
251 a\optional{, b}}}}
252 Optional argument \var{isjunk} must be \code{None} (the default) or
253 a one-argument function that takes a sequence element and returns
254 true if and only if the element is ``junk'' and should be ignored.
Fred Drake7f10cce2001-10-26 03:04:23 +0000255 Passing \code{None} for \var{b} is equivalent to passing
256 \code{lambda x: 0}; in other words, no elements are ignored. For
257 example, pass:
Fred Drakebaf71422001-02-19 16:31:02 +0000258
259\begin{verbatim}
Fred Drake447f5452001-02-23 19:13:07 +0000260lambda x: x in " \t"
Fred Drakebaf71422001-02-19 16:31:02 +0000261\end{verbatim}
262
263 if you're comparing lines as sequences of characters, and don't want
264 to synch up on blanks or hard tabs.
265
266 The optional arguments \var{a} and \var{b} are sequences to be
267 compared; both default to empty strings. The elements of both
268 sequences must be hashable.
269\end{classdesc}
270
271
272\class{SequenceMatcher} objects have the following methods:
273
274\begin{methoddesc}{set_seqs}{a, b}
275 Set the two sequences to be compared.
276\end{methoddesc}
277
278\class{SequenceMatcher} computes and caches detailed information about
279the second sequence, so if you want to compare one sequence against
280many sequences, use \method{set_seq2()} to set the commonly used
281sequence once and call \method{set_seq1()} repeatedly, once for each
282of the other sequences.
283
284\begin{methoddesc}{set_seq1}{a}
285 Set the first sequence to be compared. The second sequence to be
286 compared is not changed.
287\end{methoddesc}
288
289\begin{methoddesc}{set_seq2}{b}
290 Set the second sequence to be compared. The first sequence to be
291 compared is not changed.
292\end{methoddesc}
293
294\begin{methoddesc}{find_longest_match}{alo, ahi, blo, bhi}
295 Find longest matching block in \code{\var{a}[\var{alo}:\var{ahi}]}
296 and \code{\var{b}[\var{blo}:\var{bhi}]}.
297
298 If \var{isjunk} was omitted or \code{None},
299 \method{get_longest_match()} returns \code{(\var{i}, \var{j},
300 \var{k})} such that \code{\var{a}[\var{i}:\var{i}+\var{k}]} is equal
Tim Peters8a9c2842001-09-22 21:30:22 +0000301 to \code{\var{b}[\var{j}:\var{j}+\var{k}]}, where
Fred Drakebaf71422001-02-19 16:31:02 +0000302 \code{\var{alo} <= \var{i} <= \var{i}+\var{k} <= \var{ahi}} and
303 \code{\var{blo} <= \var{j} <= \var{j}+\var{k} <= \var{bhi}}.
304 For all \code{(\var{i'}, \var{j'}, \var{k'})} meeting those
305 conditions, the additional conditions
306 \code{\var{k} >= \var{k'}},
307 \code{\var{i} <= \var{i'}},
308 and if \code{\var{i} == \var{i'}}, \code{\var{j} <= \var{j'}}
309 are also met.
310 In other words, of all maximal matching blocks, return one that
311 starts earliest in \var{a}, and of all those maximal matching blocks
312 that start earliest in \var{a}, return the one that starts earliest
313 in \var{b}.
314
315\begin{verbatim}
316>>> s = SequenceMatcher(None, " abcd", "abcd abcd")
317>>> s.find_longest_match(0, 5, 0, 9)
318(0, 4, 5)
319\end{verbatim}
320
321 If \var{isjunk} was provided, first the longest matching block is
322 determined as above, but with the additional restriction that no
323 junk element appears in the block. Then that block is extended as
324 far as possible by matching (only) junk elements on both sides.
325 So the resulting block never matches on junk except as identical
326 junk happens to be adjacent to an interesting match.
327
328 Here's the same example as before, but considering blanks to be junk.
Tim Peters754ba582001-02-20 11:24:35 +0000329 That prevents \code{' abcd'} from matching the \code{' abcd'} at the
Fred Drakebaf71422001-02-19 16:31:02 +0000330 tail end of the second sequence directly. Instead only the
331 \code{'abcd'} can match, and matches the leftmost \code{'abcd'} in
332 the second sequence:
333
334\begin{verbatim}
335>>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
336>>> s.find_longest_match(0, 5, 0, 9)
337(1, 0, 4)
338\end{verbatim}
339
340 If no blocks match, this returns \code{(\var{alo}, \var{blo}, 0)}.
341\end{methoddesc}
342
343\begin{methoddesc}{get_matching_blocks}{}
344 Return list of triples describing matching subsequences.
345 Each triple is of the form \code{(\var{i}, \var{j}, \var{n})}, and
346 means that \code{\var{a}[\var{i}:\var{i}+\var{n}] ==
347 \var{b}[\var{j}:\var{j}+\var{n}]}. The triples are monotonically
348 increasing in \var{i} and \var{j}.
349
350 The last triple is a dummy, and has the value \code{(len(\var{a}),
351 len(\var{b}), 0)}. It is the only triple with \code{\var{n} == 0}.
352 % Explain why a dummy is used!
353
354\begin{verbatim}
355>>> s = SequenceMatcher(None, "abxcd", "abcd")
356>>> s.get_matching_blocks()
357[(0, 0, 2), (3, 2, 2), (5, 4, 0)]
358\end{verbatim}
359\end{methoddesc}
360
361\begin{methoddesc}{get_opcodes}{}
362 Return list of 5-tuples describing how to turn \var{a} into \var{b}.
363 Each tuple is of the form \code{(\var{tag}, \var{i1}, \var{i2},
364 \var{j1}, \var{j2})}. The first tuple has \code{\var{i1} ==
365 \var{j1} == 0}, and remaining tuples have \var{i1} equal to the
366 \var{i2} from the preceeding tuple, and, likewise, \var{j1} equal to
367 the previous \var{j2}.
368
369 The \var{tag} values are strings, with these meanings:
370
371\begin{tableii}{l|l}{code}{Value}{Meaning}
372 \lineii{'replace'}{\code{\var{a}[\var{i1}:\var{i2}]} should be
373 replaced by \code{\var{b}[\var{j1}:\var{j2}]}.}
374 \lineii{'delete'}{\code{\var{a}[\var{i1}:\var{i2}]} should be
375 deleted. Note that \code{\var{j1} == \var{j2}} in
376 this case.}
377 \lineii{'insert'}{\code{\var{b}[\var{j1}:\var{j2}]} should be
Tim Peters8a9c2842001-09-22 21:30:22 +0000378 inserted at \code{\var{a}[\var{i1}:\var{i1}]}.
Fred Drakebaf71422001-02-19 16:31:02 +0000379 Note that \code{\var{i1} == \var{i2}} in this
380 case.}
381 \lineii{'equal'}{\code{\var{a}[\var{i1}:\var{i2}] ==
382 \var{b}[\var{j1}:\var{j2}]} (the sub-sequences are
383 equal).}
384\end{tableii}
385
386For example:
387
388\begin{verbatim}
389>>> a = "qabxcd"
390>>> b = "abycdf"
391>>> s = SequenceMatcher(None, a, b)
392>>> for tag, i1, i2, j1, j2 in s.get_opcodes():
393... print ("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
394... (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))
395 delete a[0:1] (q) b[0:0] ()
396 equal a[1:3] (ab) b[0:2] (ab)
397replace a[3:4] (x) b[2:3] (y)
398 equal a[4:6] (cd) b[3:5] (cd)
399 insert a[6:6] () b[5:6] (f)
400\end{verbatim}
401\end{methoddesc}
402
Raymond Hettinger132fa372003-06-11 07:50:44 +0000403\begin{methoddesc}{get_grouped_opcodes}{\optional{n}}
404 Return a generator of groups with up to \var{n} lines of context.
405
406 Starting with the groups returned by \method{get_opcodes()},
407 this method splits out smaller change clusters and eliminates
408 intervening ranges which have no changes.
409
410 The groups are returned in the same format as \method{get_opcodes()}.
411 \versionadded{2.3}
412\end{methoddesc}
413
Fred Drakebaf71422001-02-19 16:31:02 +0000414\begin{methoddesc}{ratio}{}
415 Return a measure of the sequences' similarity as a float in the
416 range [0, 1].
417
418 Where T is the total number of elements in both sequences, and M is
Fred Drake6943a292001-08-13 19:31:59 +0000419 the number of matches, this is 2.0*M / T. Note that this is
420 \code{1.0} if the sequences are identical, and \code{0.0} if they
421 have nothing in common.
Fred Drakebaf71422001-02-19 16:31:02 +0000422
423 This is expensive to compute if \method{get_matching_blocks()} or
424 \method{get_opcodes()} hasn't already been called, in which case you
425 may want to try \method{quick_ratio()} or
426 \method{real_quick_ratio()} first to get an upper bound.
427\end{methoddesc}
428
429\begin{methoddesc}{quick_ratio}{}
430 Return an upper bound on \method{ratio()} relatively quickly.
431
432 This isn't defined beyond that it is an upper bound on
433 \method{ratio()}, and is faster to compute.
434\end{methoddesc}
435
436\begin{methoddesc}{real_quick_ratio}{}
437 Return an upper bound on \method{ratio()} very quickly.
438
439 This isn't defined beyond that it is an upper bound on
440 \method{ratio()}, and is faster to compute than either
441 \method{ratio()} or \method{quick_ratio()}.
442\end{methoddesc}
443
Tim Peters754ba582001-02-20 11:24:35 +0000444The three methods that return the ratio of matching to total characters
445can give different results due to differing levels of approximation,
446although \method{quick_ratio()} and \method{real_quick_ratio()} are always
447at least as large as \method{ratio()}:
Fred Drakebaf71422001-02-19 16:31:02 +0000448
449\begin{verbatim}
450>>> s = SequenceMatcher(None, "abcd", "bcde")
451>>> s.ratio()
4520.75
453>>> s.quick_ratio()
4540.75
455>>> s.real_quick_ratio()
4561.0
457\end{verbatim}
458
459
Fred Drake6943a292001-08-13 19:31:59 +0000460\subsection{SequenceMatcher Examples \label{sequencematcher-examples}}
Fred Drakebaf71422001-02-19 16:31:02 +0000461
462
463This example compares two strings, considering blanks to be ``junk:''
464
465\begin{verbatim}
466>>> s = SequenceMatcher(lambda x: x == " ",
467... "private Thread currentThread;",
468... "private volatile Thread currentThread;")
469\end{verbatim}
470
471\method{ratio()} returns a float in [0, 1], measuring the similarity
472of the sequences. As a rule of thumb, a \method{ratio()} value over
4730.6 means the sequences are close matches:
474
475\begin{verbatim}
476>>> print round(s.ratio(), 3)
4770.866
478\end{verbatim}
479
480If you're only interested in where the sequences match,
481\method{get_matching_blocks()} is handy:
482
483\begin{verbatim}
484>>> for block in s.get_matching_blocks():
485... print "a[%d] and b[%d] match for %d elements" % block
486a[0] and b[0] match for 8 elements
487a[8] and b[17] match for 6 elements
488a[14] and b[23] match for 15 elements
489a[29] and b[38] match for 0 elements
490\end{verbatim}
491
492Note that the last tuple returned by \method{get_matching_blocks()} is
493always a dummy, \code{(len(\var{a}), len(\var{b}), 0)}, and this is
494the only case in which the last tuple element (number of elements
495matched) is \code{0}.
496
497If you want to know how to change the first sequence into the second,
498use \method{get_opcodes()}:
499
500\begin{verbatim}
501>>> for opcode in s.get_opcodes():
502... print "%6s a[%d:%d] b[%d:%d]" % opcode
503 equal a[0:8] b[0:8]
504insert a[8:8] b[8:17]
505 equal a[8:14] b[17:23]
506 equal a[14:29] b[23:38]
507\end{verbatim}
508
Fred Drakebaf71422001-02-19 16:31:02 +0000509See also the function \function{get_close_matches()} in this module,
510which shows how simple code building on \class{SequenceMatcher} can be
511used to do useful work.
Fred Drake6943a292001-08-13 19:31:59 +0000512
513
514\subsection{Differ Objects \label{differ-objects}}
515
516Note that \class{Differ}-generated deltas make no claim to be
517\strong{minimal} diffs. To the contrary, minimal diffs are often
518counter-intuitive, because they synch up anywhere possible, sometimes
519accidental matches 100 pages apart. Restricting synch points to
520contiguous matches preserves some notion of locality, at the
521occasional cost of producing a longer diff.
522
523The \class{Differ} class has this constructor:
524
525\begin{classdesc}{Differ}{\optional{linejunk\optional{, charjunk}}}
526 Optional keyword parameters \var{linejunk} and \var{charjunk} are
527 for filter functions (or \code{None}):
528
Tim Peters81b92512002-04-29 01:37:32 +0000529 \var{linejunk}: A function that accepts a single string
530 argument, and returns true if the string is junk. The default is
531 \code{None}, meaning that no line is considered junk.
Fred Drake6943a292001-08-13 19:31:59 +0000532
Tim Peters81b92512002-04-29 01:37:32 +0000533 \var{charjunk}: A function that accepts a single character argument
534 (a string of length 1), and returns true if the character is junk.
535 The default is \code{None}, meaning that no character is
536 considered junk.
Fred Drake6943a292001-08-13 19:31:59 +0000537\end{classdesc}
538
539\class{Differ} objects are used (deltas generated) via a single
540method:
541
542\begin{methoddesc}{compare}{a, b}
Tim Peters8a9c2842001-09-22 21:30:22 +0000543 Compare two sequences of lines, and generate the delta (a sequence
544 of lines).
Fred Drake6943a292001-08-13 19:31:59 +0000545
546 Each sequence must contain individual single-line strings ending
547 with newlines. Such sequences can be obtained from the
Tim Peters8a9c2842001-09-22 21:30:22 +0000548 \method{readlines()} method of file-like objects. The delta generated
549 also consists of newline-terminated strings, ready to be printed as-is
Fred Drake389aa172001-11-29 19:04:50 +0000550 via the \method{writelines()} method of a file-like object.
Fred Drake6943a292001-08-13 19:31:59 +0000551\end{methoddesc}
552
553
554\subsection{Differ Example \label{differ-examples}}
555
556This example compares two texts. First we set up the texts, sequences
557of individual single-line strings ending with newlines (such sequences
558can also be obtained from the \method{readlines()} method of file-like
559objects):
560
561\begin{verbatim}
562>>> text1 = ''' 1. Beautiful is better than ugly.
563... 2. Explicit is better than implicit.
564... 3. Simple is better than complex.
565... 4. Complex is better than complicated.
566... '''.splitlines(1)
567>>> len(text1)
5684
569>>> text1[0][-1]
570'\n'
571>>> text2 = ''' 1. Beautiful is better than ugly.
572... 3. Simple is better than complex.
573... 4. Complicated is better than complex.
574... 5. Flat is better than nested.
575... '''.splitlines(1)
576\end{verbatim}
577
578Next we instantiate a Differ object:
579
580\begin{verbatim}
581>>> d = Differ()
582\end{verbatim}
583
584Note that when instantiating a \class{Differ} object we may pass
585functions to filter out line and character ``junk.'' See the
586\method{Differ()} constructor for details.
587
588Finally, we compare the two:
589
590\begin{verbatim}
Tim Peters8a9c2842001-09-22 21:30:22 +0000591>>> result = list(d.compare(text1, text2))
Fred Drake6943a292001-08-13 19:31:59 +0000592\end{verbatim}
593
594\code{result} is a list of strings, so let's pretty-print it:
595
596\begin{verbatim}
597>>> from pprint import pprint
598>>> pprint(result)
599[' 1. Beautiful is better than ugly.\n',
600 '- 2. Explicit is better than implicit.\n',
601 '- 3. Simple is better than complex.\n',
602 '+ 3. Simple is better than complex.\n',
603 '? ++ \n',
604 '- 4. Complex is better than complicated.\n',
605 '? ^ ---- ^ \n',
606 '+ 4. Complicated is better than complex.\n',
607 '? ++++ ^ ^ \n',
608 '+ 5. Flat is better than nested.\n']
609\end{verbatim}
610
611As a single multi-line string it looks like this:
612
613\begin{verbatim}
614>>> import sys
615>>> sys.stdout.writelines(result)
616 1. Beautiful is better than ugly.
617- 2. Explicit is better than implicit.
618- 3. Simple is better than complex.
619+ 3. Simple is better than complex.
620? ++
621- 4. Complex is better than complicated.
622? ^ ---- ^
623+ 4. Complicated is better than complex.
624? ++++ ^ ^
625+ 5. Flat is better than nested.
626\end{verbatim}