blob: 593416fb9516adc2bc8b5333da80ca5fd3d99487 [file] [log] [blame]
Fred Drakebaf71422001-02-19 16:31:02 +00001\section{\module{difflib} ---
2 Helpers for computing deltas}
3
4\declaremodule{standard}{difflib}
5\modulesynopsis{Helpers for computing differences between objects.}
6\moduleauthor{Tim Peters}{tim.one@home.com}
7\sectionauthor{Tim Peters}{tim.one@home.com}
8% LaTeXification by Fred L. Drake, Jr. <fdrake@acm.org>.
9
Fred Drakeda00cda2001-04-10 19:56:09 +000010\versionadded{2.1}
11
12
Fred Drake6943a292001-08-13 19:31:59 +000013\begin{classdesc*}{SequenceMatcher}
14 This is a flexible class for comparing pairs of sequences of any
15 type, so long as the sequence elements are hashable. The basic
16 algorithm predates, and is a little fancier than, an algorithm
17 published in the late 1980's by Ratcliff and Obershelp under the
18 hyperbolic name ``gestalt pattern matching.'' The idea is to find
19 the longest contiguous matching subsequence that contains no
20 ``junk'' elements (the Ratcliff and Obershelp algorithm doesn't
21 address junk). The same idea is then applied recursively to the
22 pieces of the sequences to the left and to the right of the matching
23 subsequence. This does not yield minimal edit sequences, but does
24 tend to yield matches that ``look right'' to people.
25
26 \strong{Timing:} The basic Ratcliff-Obershelp algorithm is cubic
27 time in the worst case and quadratic time in the expected case.
28 \class{SequenceMatcher} is quadratic time for the worst case and has
29 expected-case behavior dependent in a complicated way on how many
30 elements the sequences have in common; best case time is linear.
31\end{classdesc*}
32
33\begin{classdesc*}{Differ}
34 This is a class for comparing sequences of lines of text, and
Tim Peters8a9c2842001-09-22 21:30:22 +000035 producing human-readable differences or deltas. Differ uses
Fred Drake6943a292001-08-13 19:31:59 +000036 \class{SequenceMatcher} both to compare sequences of lines, and to
37 compare sequences of characters within similar (near-matching)
38 lines.
39
40 Each line of a \class{Differ} delta begins with a two-letter code:
41
42\begin{tableii}{l|l}{code}{Code}{Meaning}
43 \lineii{'- '}{line unique to sequence 1}
44 \lineii{'+ '}{line unique to sequence 2}
45 \lineii{' '}{line common to both sequences}
46 \lineii{'? '}{line not present in either input sequence}
47\end{tableii}
48
49 Lines beginning with `\code{?~}' attempt to guide the eye to
50 intraline differences, and were not present in either input
51 sequence. These lines can be confusing if the sequences contain tab
52 characters.
53\end{classdesc*}
54
Raymond Hettingere07b8352003-06-09 21:44:59 +000055\begin{funcdesc}{context_diff}{a, b\optional{, fromfile\optional{, tofile
56 \optional{, fromfiledate\optional{, tofiledate\optional{, n
57 \optional{, lineterm}}}}}}}
58
59 Compare \var{a} and \var{b} (lists of strings); return a
60 delta (a generator generating the delta lines) in context diff
61 format.
62
63 Context diffs are a compact way of showing just the lines that have
64 changed plus a few lines of context. The changes are shown in a
65 before/after style. The number of context lines is set by \var{n}
66 which defaults to three.
67
68 By default, the diff control lines (those with \code{***} or \code{---})
69 are created with a trailing newline. This is helpful so that inputs created
70 from \function{file.readlines()} result in diffs that are suitable for use
71 with \function{file.writelines()} since both the inputs and outputs have
72 trailing newlines.
73
74 For inputs that do not have trailing newlines, set the \var{lineterm}
75 argument to \code{""} so that the output will be uniformly newline free.
76
77 The context diff format normally has a header for filenames and
78 modification times. Any or all of these may be specified using strings for
79 \var{fromfile}, \var{tofile}, \var{fromfiledate}, and \var{tofiledate}.
80 The modification times are normally expressed in the format returned by
81 \function{time.ctime()}. If not specified, the strings default to blanks.
82
83 \file{Tools/scripts/diff.py} is a command-line front-end for this
Raymond Hettinger132fa372003-06-11 07:50:44 +000084 function.
85
86 \versionadded{2.3}
Raymond Hettingere07b8352003-06-09 21:44:59 +000087\end{funcdesc}
88
Fred Drakebaf71422001-02-19 16:31:02 +000089\begin{funcdesc}{get_close_matches}{word, possibilities\optional{,
90 n\optional{, cutoff}}}
91 Return a list of the best ``good enough'' matches. \var{word} is a
92 sequence for which close matches are desired (typically a string),
93 and \var{possibilities} is a list of sequences against which to
94 match \var{word} (typically a list of strings).
95
96 Optional argument \var{n} (default \code{3}) is the maximum number
97 of close matches to return; \var{n} must be greater than \code{0}.
98
99 Optional argument \var{cutoff} (default \code{0.6}) is a float in
100 the range [0, 1]. Possibilities that don't score at least that
101 similar to \var{word} are ignored.
102
103 The best (no more than \var{n}) matches among the possibilities are
104 returned in a list, sorted by similarity score, most similar first.
105
106\begin{verbatim}
107>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
108['apple', 'ape']
109>>> import keyword
110>>> get_close_matches('wheel', keyword.kwlist)
111['while']
112>>> get_close_matches('apple', keyword.kwlist)
113[]
114>>> get_close_matches('accept', keyword.kwlist)
115['except']
116\end{verbatim}
117\end{funcdesc}
118
Fred Drake6943a292001-08-13 19:31:59 +0000119\begin{funcdesc}{ndiff}{a, b\optional{, linejunk\optional{,
120 charjunk}}}
121 Compare \var{a} and \var{b} (lists of strings); return a
Tim Peters8a9c2842001-09-22 21:30:22 +0000122 \class{Differ}-style delta (a generator generating the delta lines).
Fred Drakebaf71422001-02-19 16:31:02 +0000123
Fred Drake6943a292001-08-13 19:31:59 +0000124 Optional keyword parameters \var{linejunk} and \var{charjunk} are
125 for filter functions (or \code{None}):
126
Tim Peters81b92512002-04-29 01:37:32 +0000127 \var{linejunk}: A function that accepts a single string
128 argument, and returns true if the string is junk, or false if not.
129 The default is (\code{None}), starting with Python 2.3. Before then,
130 the default was the module-level function
Fred Drake6943a292001-08-13 19:31:59 +0000131 \function{IS_LINE_JUNK()}, which filters out lines without visible
132 characters, except for at most one pound character (\character{\#}).
Tim Peters81b92512002-04-29 01:37:32 +0000133 As of Python 2.3, the underlying \class{SequenceMatcher} class
134 does a dynamic analysis of which lines are so frequent as to
135 constitute noise, and this usually works better than the pre-2.3
136 default.
Fred Drake6943a292001-08-13 19:31:59 +0000137
Tim Peters81b92512002-04-29 01:37:32 +0000138 \var{charjunk}: A function that accepts a character (a string of
139 length 1), and returns if the character is junk, or false if not.
Fred Drake6943a292001-08-13 19:31:59 +0000140 The default is module-level function \function{IS_CHARACTER_JUNK()},
141 which filters out whitespace characters (a blank or tab; note: bad
142 idea to include newline in this!).
143
144 \file{Tools/scripts/ndiff.py} is a command-line front-end to this
145 function.
146
147\begin{verbatim}
148>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
Fred Drakeedb635f2002-12-06 18:52:28 +0000149... 'ore\ntree\nemu\n'.splitlines(1))
Fred Drake6943a292001-08-13 19:31:59 +0000150>>> print ''.join(diff),
151- one
Tim Peters8a9c2842001-09-22 21:30:22 +0000152? ^
Fred Drake6943a292001-08-13 19:31:59 +0000153+ ore
Tim Peters8a9c2842001-09-22 21:30:22 +0000154? ^
Fred Drake6943a292001-08-13 19:31:59 +0000155- two
156- three
Tim Peters8a9c2842001-09-22 21:30:22 +0000157? -
Fred Drake6943a292001-08-13 19:31:59 +0000158+ tree
159+ emu
160\end{verbatim}
161\end{funcdesc}
162
163\begin{funcdesc}{restore}{sequence, which}
164 Return one of the two sequences that generated a delta.
165
166 Given a \var{sequence} produced by \method{Differ.compare()} or
167 \function{ndiff()}, extract lines originating from file 1 or 2
168 (parameter \var{which}), stripping off line prefixes.
169
170 Example:
171
172\begin{verbatim}
173>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
174... 'ore\ntree\nemu\n'.splitlines(1))
Tim Peters8a9c2842001-09-22 21:30:22 +0000175>>> diff = list(diff) # materialize the generated delta into a list
Fred Drake6943a292001-08-13 19:31:59 +0000176>>> print ''.join(restore(diff, 1)),
177one
178two
179three
180>>> print ''.join(restore(diff, 2)),
181ore
182tree
183emu
184\end{verbatim}
185
186\end{funcdesc}
187
Raymond Hettingere07b8352003-06-09 21:44:59 +0000188\begin{funcdesc}{unified_diff}{a, b\optional{, fromfile\optional{, tofile
189 \optional{, fromfiledate\optional{, tofiledate\optional{, n
190 \optional{, lineterm}}}}}}}
191
192 Compare \var{a} and \var{b} (lists of strings); return a
193 delta (a generator generating the delta lines) in unified diff
194 format.
195
196 Unified diffs are a compact way of showing just the lines that have
197 changed plus a few lines of context. The changes are shown in a
198 inline style (instead of separate before/after blocks). The number
199 of context lines is set by \var{n} which defaults to three.
200
201 By default, the diff control lines (those with \code{---}, \code{+++},
202 or \code{@@}) are created with a trailing newline. This is helpful so
203 that inputs created from \function{file.readlines()} result in diffs
204 that are suitable for use with \function{file.writelines()} since both
205 the inputs and outputs have trailing newlines.
206
207 For inputs that do not have trailing newlines, set the \var{lineterm}
208 argument to \code{""} so that the output will be uniformly newline free.
209
210 The context diff format normally has a header for filenames and
211 modification times. Any or all of these may be specified using strings for
212 \var{fromfile}, \var{tofile}, \var{fromfiledate}, and \var{tofiledate}.
213 The modification times are normally expressed in the format returned by
214 \function{time.ctime()}. If not specified, the strings default to blanks.
215
216 \file{Tools/scripts/diff.py} is a command-line front-end for this
Raymond Hettinger132fa372003-06-11 07:50:44 +0000217 function.
218
219 \versionadded{2.3}
Raymond Hettingere07b8352003-06-09 21:44:59 +0000220\end{funcdesc}
Fred Drake6943a292001-08-13 19:31:59 +0000221
Fred Drake7f10cce2001-10-26 03:04:23 +0000222\begin{funcdesc}{IS_LINE_JUNK}{line}
223 Return true for ignorable lines. The line \var{line} is ignorable
224 if \var{line} is blank or contains a single \character{\#},
225 otherwise it is not ignorable. Used as a default for parameter
Tim Peters81b92512002-04-29 01:37:32 +0000226 \var{linejunk} in \function{ndiff()} before Python 2.3.
Fred Drake6943a292001-08-13 19:31:59 +0000227\end{funcdesc}
228
229
Fred Drake7f10cce2001-10-26 03:04:23 +0000230\begin{funcdesc}{IS_CHARACTER_JUNK}{ch}
231 Return true for ignorable characters. The character \var{ch} is
232 ignorable if \var{ch} is a space or tab, otherwise it is not
233 ignorable. Used as a default for parameter \var{charjunk} in
Fred Drake6943a292001-08-13 19:31:59 +0000234 \function{ndiff()}.
Fred Drake6943a292001-08-13 19:31:59 +0000235\end{funcdesc}
Fred Drakebaf71422001-02-19 16:31:02 +0000236
237
Fred Drake6fda3ac2001-04-10 18:41:16 +0000238\begin{seealso}
239 \seetitle{Pattern Matching: The Gestalt Approach}{Discussion of a
240 similar algorithm by John W. Ratcliff and D. E. Metzener.
241 This was published in
242 \citetitle[http://www.ddj.com/]{Dr. Dobb's Journal} in
243 July, 1988.}
244\end{seealso}
245
246
Fred Drakebaf71422001-02-19 16:31:02 +0000247\subsection{SequenceMatcher Objects \label{sequence-matcher}}
248
Fred Drake96d7a702001-05-11 01:08:13 +0000249The \class{SequenceMatcher} class has this constructor:
250
Fred Drakebaf71422001-02-19 16:31:02 +0000251\begin{classdesc}{SequenceMatcher}{\optional{isjunk\optional{,
252 a\optional{, b}}}}
253 Optional argument \var{isjunk} must be \code{None} (the default) or
254 a one-argument function that takes a sequence element and returns
255 true if and only if the element is ``junk'' and should be ignored.
Fred Drake7f10cce2001-10-26 03:04:23 +0000256 Passing \code{None} for \var{b} is equivalent to passing
257 \code{lambda x: 0}; in other words, no elements are ignored. For
258 example, pass:
Fred Drakebaf71422001-02-19 16:31:02 +0000259
260\begin{verbatim}
Fred Drake447f5452001-02-23 19:13:07 +0000261lambda x: x in " \t"
Fred Drakebaf71422001-02-19 16:31:02 +0000262\end{verbatim}
263
264 if you're comparing lines as sequences of characters, and don't want
265 to synch up on blanks or hard tabs.
266
267 The optional arguments \var{a} and \var{b} are sequences to be
268 compared; both default to empty strings. The elements of both
269 sequences must be hashable.
270\end{classdesc}
271
272
273\class{SequenceMatcher} objects have the following methods:
274
275\begin{methoddesc}{set_seqs}{a, b}
276 Set the two sequences to be compared.
277\end{methoddesc}
278
279\class{SequenceMatcher} computes and caches detailed information about
280the second sequence, so if you want to compare one sequence against
281many sequences, use \method{set_seq2()} to set the commonly used
282sequence once and call \method{set_seq1()} repeatedly, once for each
283of the other sequences.
284
285\begin{methoddesc}{set_seq1}{a}
286 Set the first sequence to be compared. The second sequence to be
287 compared is not changed.
288\end{methoddesc}
289
290\begin{methoddesc}{set_seq2}{b}
291 Set the second sequence to be compared. The first sequence to be
292 compared is not changed.
293\end{methoddesc}
294
295\begin{methoddesc}{find_longest_match}{alo, ahi, blo, bhi}
296 Find longest matching block in \code{\var{a}[\var{alo}:\var{ahi}]}
297 and \code{\var{b}[\var{blo}:\var{bhi}]}.
298
299 If \var{isjunk} was omitted or \code{None},
300 \method{get_longest_match()} returns \code{(\var{i}, \var{j},
301 \var{k})} such that \code{\var{a}[\var{i}:\var{i}+\var{k}]} is equal
Tim Peters8a9c2842001-09-22 21:30:22 +0000302 to \code{\var{b}[\var{j}:\var{j}+\var{k}]}, where
Fred Drakebaf71422001-02-19 16:31:02 +0000303 \code{\var{alo} <= \var{i} <= \var{i}+\var{k} <= \var{ahi}} and
304 \code{\var{blo} <= \var{j} <= \var{j}+\var{k} <= \var{bhi}}.
305 For all \code{(\var{i'}, \var{j'}, \var{k'})} meeting those
306 conditions, the additional conditions
307 \code{\var{k} >= \var{k'}},
308 \code{\var{i} <= \var{i'}},
309 and if \code{\var{i} == \var{i'}}, \code{\var{j} <= \var{j'}}
310 are also met.
311 In other words, of all maximal matching blocks, return one that
312 starts earliest in \var{a}, and of all those maximal matching blocks
313 that start earliest in \var{a}, return the one that starts earliest
314 in \var{b}.
315
316\begin{verbatim}
317>>> s = SequenceMatcher(None, " abcd", "abcd abcd")
318>>> s.find_longest_match(0, 5, 0, 9)
319(0, 4, 5)
320\end{verbatim}
321
322 If \var{isjunk} was provided, first the longest matching block is
323 determined as above, but with the additional restriction that no
324 junk element appears in the block. Then that block is extended as
325 far as possible by matching (only) junk elements on both sides.
326 So the resulting block never matches on junk except as identical
327 junk happens to be adjacent to an interesting match.
328
329 Here's the same example as before, but considering blanks to be junk.
Tim Peters754ba582001-02-20 11:24:35 +0000330 That prevents \code{' abcd'} from matching the \code{' abcd'} at the
Fred Drakebaf71422001-02-19 16:31:02 +0000331 tail end of the second sequence directly. Instead only the
332 \code{'abcd'} can match, and matches the leftmost \code{'abcd'} in
333 the second sequence:
334
335\begin{verbatim}
336>>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
337>>> s.find_longest_match(0, 5, 0, 9)
338(1, 0, 4)
339\end{verbatim}
340
341 If no blocks match, this returns \code{(\var{alo}, \var{blo}, 0)}.
342\end{methoddesc}
343
344\begin{methoddesc}{get_matching_blocks}{}
345 Return list of triples describing matching subsequences.
346 Each triple is of the form \code{(\var{i}, \var{j}, \var{n})}, and
347 means that \code{\var{a}[\var{i}:\var{i}+\var{n}] ==
348 \var{b}[\var{j}:\var{j}+\var{n}]}. The triples are monotonically
349 increasing in \var{i} and \var{j}.
350
351 The last triple is a dummy, and has the value \code{(len(\var{a}),
352 len(\var{b}), 0)}. It is the only triple with \code{\var{n} == 0}.
353 % Explain why a dummy is used!
354
355\begin{verbatim}
356>>> s = SequenceMatcher(None, "abxcd", "abcd")
357>>> s.get_matching_blocks()
358[(0, 0, 2), (3, 2, 2), (5, 4, 0)]
359\end{verbatim}
360\end{methoddesc}
361
362\begin{methoddesc}{get_opcodes}{}
363 Return list of 5-tuples describing how to turn \var{a} into \var{b}.
364 Each tuple is of the form \code{(\var{tag}, \var{i1}, \var{i2},
365 \var{j1}, \var{j2})}. The first tuple has \code{\var{i1} ==
366 \var{j1} == 0}, and remaining tuples have \var{i1} equal to the
367 \var{i2} from the preceeding tuple, and, likewise, \var{j1} equal to
368 the previous \var{j2}.
369
370 The \var{tag} values are strings, with these meanings:
371
372\begin{tableii}{l|l}{code}{Value}{Meaning}
373 \lineii{'replace'}{\code{\var{a}[\var{i1}:\var{i2}]} should be
374 replaced by \code{\var{b}[\var{j1}:\var{j2}]}.}
375 \lineii{'delete'}{\code{\var{a}[\var{i1}:\var{i2}]} should be
376 deleted. Note that \code{\var{j1} == \var{j2}} in
377 this case.}
378 \lineii{'insert'}{\code{\var{b}[\var{j1}:\var{j2}]} should be
Tim Peters8a9c2842001-09-22 21:30:22 +0000379 inserted at \code{\var{a}[\var{i1}:\var{i1}]}.
Fred Drakebaf71422001-02-19 16:31:02 +0000380 Note that \code{\var{i1} == \var{i2}} in this
381 case.}
382 \lineii{'equal'}{\code{\var{a}[\var{i1}:\var{i2}] ==
383 \var{b}[\var{j1}:\var{j2}]} (the sub-sequences are
384 equal).}
385\end{tableii}
386
387For example:
388
389\begin{verbatim}
390>>> a = "qabxcd"
391>>> b = "abycdf"
392>>> s = SequenceMatcher(None, a, b)
393>>> for tag, i1, i2, j1, j2 in s.get_opcodes():
394... print ("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
395... (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))
396 delete a[0:1] (q) b[0:0] ()
397 equal a[1:3] (ab) b[0:2] (ab)
398replace a[3:4] (x) b[2:3] (y)
399 equal a[4:6] (cd) b[3:5] (cd)
400 insert a[6:6] () b[5:6] (f)
401\end{verbatim}
402\end{methoddesc}
403
Raymond Hettinger132fa372003-06-11 07:50:44 +0000404\begin{methoddesc}{get_grouped_opcodes}{\optional{n}}
405 Return a generator of groups with up to \var{n} lines of context.
406
407 Starting with the groups returned by \method{get_opcodes()},
408 this method splits out smaller change clusters and eliminates
409 intervening ranges which have no changes.
410
411 The groups are returned in the same format as \method{get_opcodes()}.
412 \versionadded{2.3}
413\end{methoddesc}
414
Fred Drakebaf71422001-02-19 16:31:02 +0000415\begin{methoddesc}{ratio}{}
416 Return a measure of the sequences' similarity as a float in the
417 range [0, 1].
418
419 Where T is the total number of elements in both sequences, and M is
Fred Drake6943a292001-08-13 19:31:59 +0000420 the number of matches, this is 2.0*M / T. Note that this is
421 \code{1.0} if the sequences are identical, and \code{0.0} if they
422 have nothing in common.
Fred Drakebaf71422001-02-19 16:31:02 +0000423
424 This is expensive to compute if \method{get_matching_blocks()} or
425 \method{get_opcodes()} hasn't already been called, in which case you
426 may want to try \method{quick_ratio()} or
427 \method{real_quick_ratio()} first to get an upper bound.
428\end{methoddesc}
429
430\begin{methoddesc}{quick_ratio}{}
431 Return an upper bound on \method{ratio()} relatively quickly.
432
433 This isn't defined beyond that it is an upper bound on
434 \method{ratio()}, and is faster to compute.
435\end{methoddesc}
436
437\begin{methoddesc}{real_quick_ratio}{}
438 Return an upper bound on \method{ratio()} very quickly.
439
440 This isn't defined beyond that it is an upper bound on
441 \method{ratio()}, and is faster to compute than either
442 \method{ratio()} or \method{quick_ratio()}.
443\end{methoddesc}
444
Tim Peters754ba582001-02-20 11:24:35 +0000445The three methods that return the ratio of matching to total characters
446can give different results due to differing levels of approximation,
447although \method{quick_ratio()} and \method{real_quick_ratio()} are always
448at least as large as \method{ratio()}:
Fred Drakebaf71422001-02-19 16:31:02 +0000449
450\begin{verbatim}
451>>> s = SequenceMatcher(None, "abcd", "bcde")
452>>> s.ratio()
4530.75
454>>> s.quick_ratio()
4550.75
456>>> s.real_quick_ratio()
4571.0
458\end{verbatim}
459
460
Fred Drake6943a292001-08-13 19:31:59 +0000461\subsection{SequenceMatcher Examples \label{sequencematcher-examples}}
Fred Drakebaf71422001-02-19 16:31:02 +0000462
463
464This example compares two strings, considering blanks to be ``junk:''
465
466\begin{verbatim}
467>>> s = SequenceMatcher(lambda x: x == " ",
468... "private Thread currentThread;",
469... "private volatile Thread currentThread;")
470\end{verbatim}
471
472\method{ratio()} returns a float in [0, 1], measuring the similarity
473of the sequences. As a rule of thumb, a \method{ratio()} value over
4740.6 means the sequences are close matches:
475
476\begin{verbatim}
477>>> print round(s.ratio(), 3)
4780.866
479\end{verbatim}
480
481If you're only interested in where the sequences match,
482\method{get_matching_blocks()} is handy:
483
484\begin{verbatim}
485>>> for block in s.get_matching_blocks():
486... print "a[%d] and b[%d] match for %d elements" % block
487a[0] and b[0] match for 8 elements
488a[8] and b[17] match for 6 elements
489a[14] and b[23] match for 15 elements
490a[29] and b[38] match for 0 elements
491\end{verbatim}
492
493Note that the last tuple returned by \method{get_matching_blocks()} is
494always a dummy, \code{(len(\var{a}), len(\var{b}), 0)}, and this is
495the only case in which the last tuple element (number of elements
496matched) is \code{0}.
497
498If you want to know how to change the first sequence into the second,
499use \method{get_opcodes()}:
500
501\begin{verbatim}
502>>> for opcode in s.get_opcodes():
503... print "%6s a[%d:%d] b[%d:%d]" % opcode
504 equal a[0:8] b[0:8]
505insert a[8:8] b[8:17]
506 equal a[8:14] b[17:23]
507 equal a[14:29] b[23:38]
508\end{verbatim}
509
Fred Drakebaf71422001-02-19 16:31:02 +0000510See also the function \function{get_close_matches()} in this module,
511which shows how simple code building on \class{SequenceMatcher} can be
512used to do useful work.
Fred Drake6943a292001-08-13 19:31:59 +0000513
514
515\subsection{Differ Objects \label{differ-objects}}
516
517Note that \class{Differ}-generated deltas make no claim to be
518\strong{minimal} diffs. To the contrary, minimal diffs are often
519counter-intuitive, because they synch up anywhere possible, sometimes
520accidental matches 100 pages apart. Restricting synch points to
521contiguous matches preserves some notion of locality, at the
522occasional cost of producing a longer diff.
523
524The \class{Differ} class has this constructor:
525
526\begin{classdesc}{Differ}{\optional{linejunk\optional{, charjunk}}}
527 Optional keyword parameters \var{linejunk} and \var{charjunk} are
528 for filter functions (or \code{None}):
529
Tim Peters81b92512002-04-29 01:37:32 +0000530 \var{linejunk}: A function that accepts a single string
531 argument, and returns true if the string is junk. The default is
532 \code{None}, meaning that no line is considered junk.
Fred Drake6943a292001-08-13 19:31:59 +0000533
Tim Peters81b92512002-04-29 01:37:32 +0000534 \var{charjunk}: A function that accepts a single character argument
535 (a string of length 1), and returns true if the character is junk.
536 The default is \code{None}, meaning that no character is
537 considered junk.
Fred Drake6943a292001-08-13 19:31:59 +0000538\end{classdesc}
539
540\class{Differ} objects are used (deltas generated) via a single
541method:
542
543\begin{methoddesc}{compare}{a, b}
Tim Peters8a9c2842001-09-22 21:30:22 +0000544 Compare two sequences of lines, and generate the delta (a sequence
545 of lines).
Fred Drake6943a292001-08-13 19:31:59 +0000546
547 Each sequence must contain individual single-line strings ending
548 with newlines. Such sequences can be obtained from the
Tim Peters8a9c2842001-09-22 21:30:22 +0000549 \method{readlines()} method of file-like objects. The delta generated
550 also consists of newline-terminated strings, ready to be printed as-is
Fred Drake389aa172001-11-29 19:04:50 +0000551 via the \method{writelines()} method of a file-like object.
Fred Drake6943a292001-08-13 19:31:59 +0000552\end{methoddesc}
553
554
555\subsection{Differ Example \label{differ-examples}}
556
557This example compares two texts. First we set up the texts, sequences
558of individual single-line strings ending with newlines (such sequences
559can also be obtained from the \method{readlines()} method of file-like
560objects):
561
562\begin{verbatim}
563>>> text1 = ''' 1. Beautiful is better than ugly.
564... 2. Explicit is better than implicit.
565... 3. Simple is better than complex.
566... 4. Complex is better than complicated.
567... '''.splitlines(1)
568>>> len(text1)
5694
570>>> text1[0][-1]
571'\n'
572>>> text2 = ''' 1. Beautiful is better than ugly.
573... 3. Simple is better than complex.
574... 4. Complicated is better than complex.
575... 5. Flat is better than nested.
576... '''.splitlines(1)
577\end{verbatim}
578
579Next we instantiate a Differ object:
580
581\begin{verbatim}
582>>> d = Differ()
583\end{verbatim}
584
585Note that when instantiating a \class{Differ} object we may pass
586functions to filter out line and character ``junk.'' See the
587\method{Differ()} constructor for details.
588
589Finally, we compare the two:
590
591\begin{verbatim}
Tim Peters8a9c2842001-09-22 21:30:22 +0000592>>> result = list(d.compare(text1, text2))
Fred Drake6943a292001-08-13 19:31:59 +0000593\end{verbatim}
594
595\code{result} is a list of strings, so let's pretty-print it:
596
597\begin{verbatim}
598>>> from pprint import pprint
599>>> pprint(result)
600[' 1. Beautiful is better than ugly.\n',
601 '- 2. Explicit is better than implicit.\n',
602 '- 3. Simple is better than complex.\n',
603 '+ 3. Simple is better than complex.\n',
604 '? ++ \n',
605 '- 4. Complex is better than complicated.\n',
606 '? ^ ---- ^ \n',
607 '+ 4. Complicated is better than complex.\n',
608 '? ++++ ^ ^ \n',
609 '+ 5. Flat is better than nested.\n']
610\end{verbatim}
611
612As a single multi-line string it looks like this:
613
614\begin{verbatim}
615>>> import sys
616>>> sys.stdout.writelines(result)
617 1. Beautiful is better than ugly.
618- 2. Explicit is better than implicit.
619- 3. Simple is better than complex.
620+ 3. Simple is better than complex.
621? ++
622- 4. Complex is better than complicated.
623? ^ ---- ^
624+ 4. Complicated is better than complex.
625? ++++ ^ ^
626+ 5. Flat is better than nested.
627\end{verbatim}