blob: 4256e877e35ce31a10ad73e282b3bf547e7b1030 [file] [log] [blame]
Fred Drakebaf71422001-02-19 16:31:02 +00001\section{\module{difflib} ---
2 Helpers for computing deltas}
3
4\declaremodule{standard}{difflib}
5\modulesynopsis{Helpers for computing differences between objects.}
Fred Drake34929f22003-12-30 16:12:27 +00006\moduleauthor{Tim Peters}{tim_one@users.sourceforge.net}
7\sectionauthor{Tim Peters}{tim_one@users.sourceforge.net}
Fred Drakebaf71422001-02-19 16:31:02 +00008% LaTeXification by Fred L. Drake, Jr. <fdrake@acm.org>.
9
Fred Drakeda00cda2001-04-10 19:56:09 +000010\versionadded{2.1}
11
12
Fred Drake6943a292001-08-13 19:31:59 +000013\begin{classdesc*}{SequenceMatcher}
14 This is a flexible class for comparing pairs of sequences of any
15 type, so long as the sequence elements are hashable. The basic
16 algorithm predates, and is a little fancier than, an algorithm
17 published in the late 1980's by Ratcliff and Obershelp under the
18 hyperbolic name ``gestalt pattern matching.'' The idea is to find
19 the longest contiguous matching subsequence that contains no
20 ``junk'' elements (the Ratcliff and Obershelp algorithm doesn't
21 address junk). The same idea is then applied recursively to the
22 pieces of the sequences to the left and to the right of the matching
23 subsequence. This does not yield minimal edit sequences, but does
24 tend to yield matches that ``look right'' to people.
25
26 \strong{Timing:} The basic Ratcliff-Obershelp algorithm is cubic
27 time in the worst case and quadratic time in the expected case.
28 \class{SequenceMatcher} is quadratic time for the worst case and has
29 expected-case behavior dependent in a complicated way on how many
30 elements the sequences have in common; best case time is linear.
31\end{classdesc*}
32
33\begin{classdesc*}{Differ}
34 This is a class for comparing sequences of lines of text, and
Tim Peters8a9c2842001-09-22 21:30:22 +000035 producing human-readable differences or deltas. Differ uses
Fred Drake6943a292001-08-13 19:31:59 +000036 \class{SequenceMatcher} both to compare sequences of lines, and to
37 compare sequences of characters within similar (near-matching)
38 lines.
39
40 Each line of a \class{Differ} delta begins with a two-letter code:
41
42\begin{tableii}{l|l}{code}{Code}{Meaning}
43 \lineii{'- '}{line unique to sequence 1}
44 \lineii{'+ '}{line unique to sequence 2}
45 \lineii{' '}{line common to both sequences}
46 \lineii{'? '}{line not present in either input sequence}
47\end{tableii}
48
49 Lines beginning with `\code{?~}' attempt to guide the eye to
50 intraline differences, and were not present in either input
51 sequence. These lines can be confusing if the sequences contain tab
52 characters.
53\end{classdesc*}
54
Martin v. Löwise064b412004-08-29 16:34:40 +000055\begin{classdesc*}{HtmlDiff}
56
57 This class can be used to create an HTML table (or a complete HTML file
58 containing the table) showing a side by side, line by line comparision
59 of text with inter-line and intra-line change highlights. The table can
60 be generated in either full or contextual difference mode.
61
62 The constructor for this class is:
63
64 \begin{funcdesc}{__init__}{
65 \optional{, tabsize
66 \optional{, wrapcolumn
67 \optional{, linejunk
68 \optional{, charjunk}}}}}
69
70 Initializes instance of \class{HtmlDiff}.
71
72 \var{tabsize} is an optional keyword argument to specify tab stop spacing
73 and defaults to \code{8}.
74
75 \var{wrapcolumn} is an optional keyword to specify column number where
76 lines are broken and wrapped, defaults to \code{None} where lines are not
77 wrapped.
78
79 \var{linejunk} and \var{charjunk} are optional keyword arguments passed
80 into \code{ndiff()} (used to by \class{HtmlDiff} to generate the
81 side by side HTML differences). See \code{ndiff()} documentation for
82 argument default values and descriptions.
83 \end{funcdesc}
84
85 The following methods are public:
86
87 \begin{funcdesc}{make_file}{fromlines, tolines
88 \optional{, fromdesc
89 \optional{, todesc
90 \optional{, context
91 \optional{, numlines}}}}}
92 Compares \var{fromlines} and \var{tolines} (lists of strings) and returns
93 a string which is a complete HTML file containing a table showing line by
94 line differences with inter-line and intra-line changes highlighted.
95
96 \var{fromdesc} and \var{todesc} are optional keyword arguments to specify
97 from/to file column header strings (both default to an empty string).
98
99 \var{context} and \var{numlines} are both optional keyword arguments.
100 Set \var{context} to \code{True} when contextual differences are to be
101 shown, else the default is \code{False} to show the full files.
102 \var{numlines} defaults to \code{5}. When \var{context} is \code{True}
103 \var{numlines} controls the number of context lines which surround the
104 difference highlights. When \var{context} is \code{False} \var{numlines}
105 controls the number of lines which are shown before a difference
106 highlight when using the "next" hyperlinks (setting to zero would cause
107 the "next" hyperlinks to place the next difference highlight at the top of
108 the browser without any leading context).
109 \end{funcdesc}
110
111 \begin{funcdesc}{make_table}{fromlines, tolines
112 \optional{, fromdesc
113 \optional{, todesc
114 \optional{, context}}}}
115 Compares \var{fromlines} and \var{tolines} (lists of strings) and returns
116 a string which is a complete HTML table showing line by line differences
117 with inter-line and intra-line changes highlighted.
118
119 The arguments of this method are a subset of those for the
120 \code{make_file} method. Refer to the \code{make_file} method
121 documentation.
122 \end{funcdesc}
123
124 \file{Tools/scripts/ndiff.py} is a command-line front-end to this class
125 and contains a good example of its use.
126\end{classdesc*}
127
Raymond Hettingere07b8352003-06-09 21:44:59 +0000128\begin{funcdesc}{context_diff}{a, b\optional{, fromfile\optional{, tofile
129 \optional{, fromfiledate\optional{, tofiledate\optional{, n
130 \optional{, lineterm}}}}}}}
Raymond Hettingere07b8352003-06-09 21:44:59 +0000131 Compare \var{a} and \var{b} (lists of strings); return a
132 delta (a generator generating the delta lines) in context diff
133 format.
134
135 Context diffs are a compact way of showing just the lines that have
136 changed plus a few lines of context. The changes are shown in a
137 before/after style. The number of context lines is set by \var{n}
138 which defaults to three.
139
140 By default, the diff control lines (those with \code{***} or \code{---})
141 are created with a trailing newline. This is helpful so that inputs created
142 from \function{file.readlines()} result in diffs that are suitable for use
143 with \function{file.writelines()} since both the inputs and outputs have
144 trailing newlines.
145
146 For inputs that do not have trailing newlines, set the \var{lineterm}
147 argument to \code{""} so that the output will be uniformly newline free.
148
149 The context diff format normally has a header for filenames and
150 modification times. Any or all of these may be specified using strings for
151 \var{fromfile}, \var{tofile}, \var{fromfiledate}, and \var{tofiledate}.
152 The modification times are normally expressed in the format returned by
153 \function{time.ctime()}. If not specified, the strings default to blanks.
154
155 \file{Tools/scripts/diff.py} is a command-line front-end for this
Raymond Hettinger132fa372003-06-11 07:50:44 +0000156 function.
157
158 \versionadded{2.3}
Raymond Hettingere07b8352003-06-09 21:44:59 +0000159\end{funcdesc}
160
Fred Drakebaf71422001-02-19 16:31:02 +0000161\begin{funcdesc}{get_close_matches}{word, possibilities\optional{,
162 n\optional{, cutoff}}}
163 Return a list of the best ``good enough'' matches. \var{word} is a
164 sequence for which close matches are desired (typically a string),
165 and \var{possibilities} is a list of sequences against which to
166 match \var{word} (typically a list of strings).
167
168 Optional argument \var{n} (default \code{3}) is the maximum number
169 of close matches to return; \var{n} must be greater than \code{0}.
170
171 Optional argument \var{cutoff} (default \code{0.6}) is a float in
172 the range [0, 1]. Possibilities that don't score at least that
173 similar to \var{word} are ignored.
174
175 The best (no more than \var{n}) matches among the possibilities are
176 returned in a list, sorted by similarity score, most similar first.
177
178\begin{verbatim}
179>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
180['apple', 'ape']
181>>> import keyword
182>>> get_close_matches('wheel', keyword.kwlist)
183['while']
184>>> get_close_matches('apple', keyword.kwlist)
185[]
186>>> get_close_matches('accept', keyword.kwlist)
187['except']
188\end{verbatim}
189\end{funcdesc}
190
Fred Drake6943a292001-08-13 19:31:59 +0000191\begin{funcdesc}{ndiff}{a, b\optional{, linejunk\optional{,
192 charjunk}}}
193 Compare \var{a} and \var{b} (lists of strings); return a
Tim Peters8a9c2842001-09-22 21:30:22 +0000194 \class{Differ}-style delta (a generator generating the delta lines).
Fred Drakebaf71422001-02-19 16:31:02 +0000195
Fred Drake6943a292001-08-13 19:31:59 +0000196 Optional keyword parameters \var{linejunk} and \var{charjunk} are
197 for filter functions (or \code{None}):
198
Tim Peters81b92512002-04-29 01:37:32 +0000199 \var{linejunk}: A function that accepts a single string
200 argument, and returns true if the string is junk, or false if not.
201 The default is (\code{None}), starting with Python 2.3. Before then,
202 the default was the module-level function
Fred Drake6943a292001-08-13 19:31:59 +0000203 \function{IS_LINE_JUNK()}, which filters out lines without visible
204 characters, except for at most one pound character (\character{\#}).
Tim Peters81b92512002-04-29 01:37:32 +0000205 As of Python 2.3, the underlying \class{SequenceMatcher} class
206 does a dynamic analysis of which lines are so frequent as to
207 constitute noise, and this usually works better than the pre-2.3
208 default.
Fred Drake6943a292001-08-13 19:31:59 +0000209
Tim Peters81b92512002-04-29 01:37:32 +0000210 \var{charjunk}: A function that accepts a character (a string of
211 length 1), and returns if the character is junk, or false if not.
Fred Drake6943a292001-08-13 19:31:59 +0000212 The default is module-level function \function{IS_CHARACTER_JUNK()},
213 which filters out whitespace characters (a blank or tab; note: bad
214 idea to include newline in this!).
215
216 \file{Tools/scripts/ndiff.py} is a command-line front-end to this
217 function.
218
219\begin{verbatim}
220>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
Fred Drakeedb635f2002-12-06 18:52:28 +0000221... 'ore\ntree\nemu\n'.splitlines(1))
Fred Drake6943a292001-08-13 19:31:59 +0000222>>> print ''.join(diff),
223- one
Tim Peters8a9c2842001-09-22 21:30:22 +0000224? ^
Fred Drake6943a292001-08-13 19:31:59 +0000225+ ore
Tim Peters8a9c2842001-09-22 21:30:22 +0000226? ^
Fred Drake6943a292001-08-13 19:31:59 +0000227- two
228- three
Tim Peters8a9c2842001-09-22 21:30:22 +0000229? -
Fred Drake6943a292001-08-13 19:31:59 +0000230+ tree
231+ emu
232\end{verbatim}
233\end{funcdesc}
234
235\begin{funcdesc}{restore}{sequence, which}
236 Return one of the two sequences that generated a delta.
237
238 Given a \var{sequence} produced by \method{Differ.compare()} or
239 \function{ndiff()}, extract lines originating from file 1 or 2
240 (parameter \var{which}), stripping off line prefixes.
241
242 Example:
243
244\begin{verbatim}
245>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
246... 'ore\ntree\nemu\n'.splitlines(1))
Tim Peters8a9c2842001-09-22 21:30:22 +0000247>>> diff = list(diff) # materialize the generated delta into a list
Fred Drake6943a292001-08-13 19:31:59 +0000248>>> print ''.join(restore(diff, 1)),
249one
250two
251three
252>>> print ''.join(restore(diff, 2)),
253ore
254tree
255emu
256\end{verbatim}
257
258\end{funcdesc}
259
Raymond Hettingere07b8352003-06-09 21:44:59 +0000260\begin{funcdesc}{unified_diff}{a, b\optional{, fromfile\optional{, tofile
261 \optional{, fromfiledate\optional{, tofiledate\optional{, n
262 \optional{, lineterm}}}}}}}
Raymond Hettingere07b8352003-06-09 21:44:59 +0000263 Compare \var{a} and \var{b} (lists of strings); return a
264 delta (a generator generating the delta lines) in unified diff
265 format.
266
267 Unified diffs are a compact way of showing just the lines that have
268 changed plus a few lines of context. The changes are shown in a
269 inline style (instead of separate before/after blocks). The number
270 of context lines is set by \var{n} which defaults to three.
271
272 By default, the diff control lines (those with \code{---}, \code{+++},
273 or \code{@@}) are created with a trailing newline. This is helpful so
274 that inputs created from \function{file.readlines()} result in diffs
275 that are suitable for use with \function{file.writelines()} since both
276 the inputs and outputs have trailing newlines.
277
278 For inputs that do not have trailing newlines, set the \var{lineterm}
279 argument to \code{""} so that the output will be uniformly newline free.
280
281 The context diff format normally has a header for filenames and
282 modification times. Any or all of these may be specified using strings for
283 \var{fromfile}, \var{tofile}, \var{fromfiledate}, and \var{tofiledate}.
284 The modification times are normally expressed in the format returned by
285 \function{time.ctime()}. If not specified, the strings default to blanks.
286
287 \file{Tools/scripts/diff.py} is a command-line front-end for this
Raymond Hettinger132fa372003-06-11 07:50:44 +0000288 function.
289
290 \versionadded{2.3}
Raymond Hettingere07b8352003-06-09 21:44:59 +0000291\end{funcdesc}
Fred Drake6943a292001-08-13 19:31:59 +0000292
Fred Drake7f10cce2001-10-26 03:04:23 +0000293\begin{funcdesc}{IS_LINE_JUNK}{line}
294 Return true for ignorable lines. The line \var{line} is ignorable
295 if \var{line} is blank or contains a single \character{\#},
296 otherwise it is not ignorable. Used as a default for parameter
Tim Peters81b92512002-04-29 01:37:32 +0000297 \var{linejunk} in \function{ndiff()} before Python 2.3.
Fred Drake6943a292001-08-13 19:31:59 +0000298\end{funcdesc}
299
300
Fred Drake7f10cce2001-10-26 03:04:23 +0000301\begin{funcdesc}{IS_CHARACTER_JUNK}{ch}
302 Return true for ignorable characters. The character \var{ch} is
303 ignorable if \var{ch} is a space or tab, otherwise it is not
304 ignorable. Used as a default for parameter \var{charjunk} in
Fred Drake6943a292001-08-13 19:31:59 +0000305 \function{ndiff()}.
Fred Drake6943a292001-08-13 19:31:59 +0000306\end{funcdesc}
Fred Drakebaf71422001-02-19 16:31:02 +0000307
308
Fred Drake6fda3ac2001-04-10 18:41:16 +0000309\begin{seealso}
Fred Drake1fe97502004-01-21 18:30:28 +0000310 \seetitle[http://www.ddj.com/documents/s=1103/ddj8807c/]
311 {Pattern Matching: The Gestalt Approach}{Discussion of a
Fred Drake6fda3ac2001-04-10 18:41:16 +0000312 similar algorithm by John W. Ratcliff and D. E. Metzener.
313 This was published in
314 \citetitle[http://www.ddj.com/]{Dr. Dobb's Journal} in
315 July, 1988.}
316\end{seealso}
317
318
Fred Drakebaf71422001-02-19 16:31:02 +0000319\subsection{SequenceMatcher Objects \label{sequence-matcher}}
320
Fred Drake96d7a702001-05-11 01:08:13 +0000321The \class{SequenceMatcher} class has this constructor:
322
Fred Drakebaf71422001-02-19 16:31:02 +0000323\begin{classdesc}{SequenceMatcher}{\optional{isjunk\optional{,
324 a\optional{, b}}}}
325 Optional argument \var{isjunk} must be \code{None} (the default) or
326 a one-argument function that takes a sequence element and returns
327 true if and only if the element is ``junk'' and should be ignored.
Fred Drake7f10cce2001-10-26 03:04:23 +0000328 Passing \code{None} for \var{b} is equivalent to passing
329 \code{lambda x: 0}; in other words, no elements are ignored. For
330 example, pass:
Fred Drakebaf71422001-02-19 16:31:02 +0000331
332\begin{verbatim}
Fred Drake447f5452001-02-23 19:13:07 +0000333lambda x: x in " \t"
Fred Drakebaf71422001-02-19 16:31:02 +0000334\end{verbatim}
335
336 if you're comparing lines as sequences of characters, and don't want
337 to synch up on blanks or hard tabs.
338
339 The optional arguments \var{a} and \var{b} are sequences to be
340 compared; both default to empty strings. The elements of both
341 sequences must be hashable.
342\end{classdesc}
343
344
345\class{SequenceMatcher} objects have the following methods:
346
347\begin{methoddesc}{set_seqs}{a, b}
348 Set the two sequences to be compared.
349\end{methoddesc}
350
351\class{SequenceMatcher} computes and caches detailed information about
352the second sequence, so if you want to compare one sequence against
353many sequences, use \method{set_seq2()} to set the commonly used
354sequence once and call \method{set_seq1()} repeatedly, once for each
355of the other sequences.
356
357\begin{methoddesc}{set_seq1}{a}
358 Set the first sequence to be compared. The second sequence to be
359 compared is not changed.
360\end{methoddesc}
361
362\begin{methoddesc}{set_seq2}{b}
363 Set the second sequence to be compared. The first sequence to be
364 compared is not changed.
365\end{methoddesc}
366
367\begin{methoddesc}{find_longest_match}{alo, ahi, blo, bhi}
368 Find longest matching block in \code{\var{a}[\var{alo}:\var{ahi}]}
369 and \code{\var{b}[\var{blo}:\var{bhi}]}.
370
371 If \var{isjunk} was omitted or \code{None},
372 \method{get_longest_match()} returns \code{(\var{i}, \var{j},
373 \var{k})} such that \code{\var{a}[\var{i}:\var{i}+\var{k}]} is equal
Tim Peters8a9c2842001-09-22 21:30:22 +0000374 to \code{\var{b}[\var{j}:\var{j}+\var{k}]}, where
Fred Drakebaf71422001-02-19 16:31:02 +0000375 \code{\var{alo} <= \var{i} <= \var{i}+\var{k} <= \var{ahi}} and
376 \code{\var{blo} <= \var{j} <= \var{j}+\var{k} <= \var{bhi}}.
377 For all \code{(\var{i'}, \var{j'}, \var{k'})} meeting those
378 conditions, the additional conditions
379 \code{\var{k} >= \var{k'}},
380 \code{\var{i} <= \var{i'}},
381 and if \code{\var{i} == \var{i'}}, \code{\var{j} <= \var{j'}}
382 are also met.
383 In other words, of all maximal matching blocks, return one that
384 starts earliest in \var{a}, and of all those maximal matching blocks
385 that start earliest in \var{a}, return the one that starts earliest
386 in \var{b}.
387
388\begin{verbatim}
389>>> s = SequenceMatcher(None, " abcd", "abcd abcd")
390>>> s.find_longest_match(0, 5, 0, 9)
391(0, 4, 5)
392\end{verbatim}
393
394 If \var{isjunk} was provided, first the longest matching block is
395 determined as above, but with the additional restriction that no
396 junk element appears in the block. Then that block is extended as
397 far as possible by matching (only) junk elements on both sides.
398 So the resulting block never matches on junk except as identical
399 junk happens to be adjacent to an interesting match.
400
401 Here's the same example as before, but considering blanks to be junk.
Tim Peters754ba582001-02-20 11:24:35 +0000402 That prevents \code{' abcd'} from matching the \code{' abcd'} at the
Fred Drakebaf71422001-02-19 16:31:02 +0000403 tail end of the second sequence directly. Instead only the
404 \code{'abcd'} can match, and matches the leftmost \code{'abcd'} in
405 the second sequence:
406
407\begin{verbatim}
408>>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
409>>> s.find_longest_match(0, 5, 0, 9)
410(1, 0, 4)
411\end{verbatim}
412
413 If no blocks match, this returns \code{(\var{alo}, \var{blo}, 0)}.
414\end{methoddesc}
415
416\begin{methoddesc}{get_matching_blocks}{}
417 Return list of triples describing matching subsequences.
418 Each triple is of the form \code{(\var{i}, \var{j}, \var{n})}, and
419 means that \code{\var{a}[\var{i}:\var{i}+\var{n}] ==
420 \var{b}[\var{j}:\var{j}+\var{n}]}. The triples are monotonically
421 increasing in \var{i} and \var{j}.
422
423 The last triple is a dummy, and has the value \code{(len(\var{a}),
424 len(\var{b}), 0)}. It is the only triple with \code{\var{n} == 0}.
425 % Explain why a dummy is used!
426
427\begin{verbatim}
428>>> s = SequenceMatcher(None, "abxcd", "abcd")
429>>> s.get_matching_blocks()
430[(0, 0, 2), (3, 2, 2), (5, 4, 0)]
431\end{verbatim}
432\end{methoddesc}
433
434\begin{methoddesc}{get_opcodes}{}
435 Return list of 5-tuples describing how to turn \var{a} into \var{b}.
436 Each tuple is of the form \code{(\var{tag}, \var{i1}, \var{i2},
437 \var{j1}, \var{j2})}. The first tuple has \code{\var{i1} ==
438 \var{j1} == 0}, and remaining tuples have \var{i1} equal to the
439 \var{i2} from the preceeding tuple, and, likewise, \var{j1} equal to
440 the previous \var{j2}.
441
442 The \var{tag} values are strings, with these meanings:
443
444\begin{tableii}{l|l}{code}{Value}{Meaning}
445 \lineii{'replace'}{\code{\var{a}[\var{i1}:\var{i2}]} should be
446 replaced by \code{\var{b}[\var{j1}:\var{j2}]}.}
447 \lineii{'delete'}{\code{\var{a}[\var{i1}:\var{i2}]} should be
448 deleted. Note that \code{\var{j1} == \var{j2}} in
449 this case.}
450 \lineii{'insert'}{\code{\var{b}[\var{j1}:\var{j2}]} should be
Tim Peters8a9c2842001-09-22 21:30:22 +0000451 inserted at \code{\var{a}[\var{i1}:\var{i1}]}.
Fred Drakebaf71422001-02-19 16:31:02 +0000452 Note that \code{\var{i1} == \var{i2}} in this
453 case.}
454 \lineii{'equal'}{\code{\var{a}[\var{i1}:\var{i2}] ==
455 \var{b}[\var{j1}:\var{j2}]} (the sub-sequences are
456 equal).}
457\end{tableii}
458
459For example:
460
461\begin{verbatim}
462>>> a = "qabxcd"
463>>> b = "abycdf"
464>>> s = SequenceMatcher(None, a, b)
465>>> for tag, i1, i2, j1, j2 in s.get_opcodes():
466... print ("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
467... (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))
468 delete a[0:1] (q) b[0:0] ()
469 equal a[1:3] (ab) b[0:2] (ab)
470replace a[3:4] (x) b[2:3] (y)
471 equal a[4:6] (cd) b[3:5] (cd)
472 insert a[6:6] () b[5:6] (f)
473\end{verbatim}
474\end{methoddesc}
475
Raymond Hettinger132fa372003-06-11 07:50:44 +0000476\begin{methoddesc}{get_grouped_opcodes}{\optional{n}}
477 Return a generator of groups with up to \var{n} lines of context.
478
479 Starting with the groups returned by \method{get_opcodes()},
480 this method splits out smaller change clusters and eliminates
481 intervening ranges which have no changes.
482
483 The groups are returned in the same format as \method{get_opcodes()}.
484 \versionadded{2.3}
485\end{methoddesc}
486
Fred Drakebaf71422001-02-19 16:31:02 +0000487\begin{methoddesc}{ratio}{}
488 Return a measure of the sequences' similarity as a float in the
489 range [0, 1].
490
491 Where T is the total number of elements in both sequences, and M is
Fred Drake6943a292001-08-13 19:31:59 +0000492 the number of matches, this is 2.0*M / T. Note that this is
493 \code{1.0} if the sequences are identical, and \code{0.0} if they
494 have nothing in common.
Fred Drakebaf71422001-02-19 16:31:02 +0000495
496 This is expensive to compute if \method{get_matching_blocks()} or
497 \method{get_opcodes()} hasn't already been called, in which case you
498 may want to try \method{quick_ratio()} or
499 \method{real_quick_ratio()} first to get an upper bound.
500\end{methoddesc}
501
502\begin{methoddesc}{quick_ratio}{}
503 Return an upper bound on \method{ratio()} relatively quickly.
504
505 This isn't defined beyond that it is an upper bound on
506 \method{ratio()}, and is faster to compute.
507\end{methoddesc}
508
509\begin{methoddesc}{real_quick_ratio}{}
510 Return an upper bound on \method{ratio()} very quickly.
511
512 This isn't defined beyond that it is an upper bound on
513 \method{ratio()}, and is faster to compute than either
514 \method{ratio()} or \method{quick_ratio()}.
515\end{methoddesc}
516
Tim Peters754ba582001-02-20 11:24:35 +0000517The three methods that return the ratio of matching to total characters
518can give different results due to differing levels of approximation,
519although \method{quick_ratio()} and \method{real_quick_ratio()} are always
520at least as large as \method{ratio()}:
Fred Drakebaf71422001-02-19 16:31:02 +0000521
522\begin{verbatim}
523>>> s = SequenceMatcher(None, "abcd", "bcde")
524>>> s.ratio()
5250.75
526>>> s.quick_ratio()
5270.75
528>>> s.real_quick_ratio()
5291.0
530\end{verbatim}
531
532
Fred Drake6943a292001-08-13 19:31:59 +0000533\subsection{SequenceMatcher Examples \label{sequencematcher-examples}}
Fred Drakebaf71422001-02-19 16:31:02 +0000534
535
536This example compares two strings, considering blanks to be ``junk:''
537
538\begin{verbatim}
539>>> s = SequenceMatcher(lambda x: x == " ",
540... "private Thread currentThread;",
541... "private volatile Thread currentThread;")
542\end{verbatim}
543
544\method{ratio()} returns a float in [0, 1], measuring the similarity
545of the sequences. As a rule of thumb, a \method{ratio()} value over
5460.6 means the sequences are close matches:
547
548\begin{verbatim}
549>>> print round(s.ratio(), 3)
5500.866
551\end{verbatim}
552
553If you're only interested in where the sequences match,
554\method{get_matching_blocks()} is handy:
555
556\begin{verbatim}
557>>> for block in s.get_matching_blocks():
558... print "a[%d] and b[%d] match for %d elements" % block
559a[0] and b[0] match for 8 elements
560a[8] and b[17] match for 6 elements
561a[14] and b[23] match for 15 elements
562a[29] and b[38] match for 0 elements
563\end{verbatim}
564
565Note that the last tuple returned by \method{get_matching_blocks()} is
566always a dummy, \code{(len(\var{a}), len(\var{b}), 0)}, and this is
567the only case in which the last tuple element (number of elements
568matched) is \code{0}.
569
570If you want to know how to change the first sequence into the second,
571use \method{get_opcodes()}:
572
573\begin{verbatim}
574>>> for opcode in s.get_opcodes():
575... print "%6s a[%d:%d] b[%d:%d]" % opcode
576 equal a[0:8] b[0:8]
577insert a[8:8] b[8:17]
578 equal a[8:14] b[17:23]
579 equal a[14:29] b[23:38]
580\end{verbatim}
581
Fred Drakebaf71422001-02-19 16:31:02 +0000582See also the function \function{get_close_matches()} in this module,
583which shows how simple code building on \class{SequenceMatcher} can be
584used to do useful work.
Fred Drake6943a292001-08-13 19:31:59 +0000585
586
587\subsection{Differ Objects \label{differ-objects}}
588
589Note that \class{Differ}-generated deltas make no claim to be
590\strong{minimal} diffs. To the contrary, minimal diffs are often
591counter-intuitive, because they synch up anywhere possible, sometimes
592accidental matches 100 pages apart. Restricting synch points to
593contiguous matches preserves some notion of locality, at the
594occasional cost of producing a longer diff.
595
596The \class{Differ} class has this constructor:
597
598\begin{classdesc}{Differ}{\optional{linejunk\optional{, charjunk}}}
599 Optional keyword parameters \var{linejunk} and \var{charjunk} are
600 for filter functions (or \code{None}):
601
Tim Peters81b92512002-04-29 01:37:32 +0000602 \var{linejunk}: A function that accepts a single string
603 argument, and returns true if the string is junk. The default is
604 \code{None}, meaning that no line is considered junk.
Fred Drake6943a292001-08-13 19:31:59 +0000605
Tim Peters81b92512002-04-29 01:37:32 +0000606 \var{charjunk}: A function that accepts a single character argument
607 (a string of length 1), and returns true if the character is junk.
608 The default is \code{None}, meaning that no character is
609 considered junk.
Fred Drake6943a292001-08-13 19:31:59 +0000610\end{classdesc}
611
612\class{Differ} objects are used (deltas generated) via a single
613method:
614
615\begin{methoddesc}{compare}{a, b}
Tim Peters8a9c2842001-09-22 21:30:22 +0000616 Compare two sequences of lines, and generate the delta (a sequence
617 of lines).
Fred Drake6943a292001-08-13 19:31:59 +0000618
619 Each sequence must contain individual single-line strings ending
620 with newlines. Such sequences can be obtained from the
Tim Peters8a9c2842001-09-22 21:30:22 +0000621 \method{readlines()} method of file-like objects. The delta generated
622 also consists of newline-terminated strings, ready to be printed as-is
Fred Drake389aa172001-11-29 19:04:50 +0000623 via the \method{writelines()} method of a file-like object.
Fred Drake6943a292001-08-13 19:31:59 +0000624\end{methoddesc}
625
626
627\subsection{Differ Example \label{differ-examples}}
628
629This example compares two texts. First we set up the texts, sequences
630of individual single-line strings ending with newlines (such sequences
631can also be obtained from the \method{readlines()} method of file-like
632objects):
633
634\begin{verbatim}
635>>> text1 = ''' 1. Beautiful is better than ugly.
636... 2. Explicit is better than implicit.
637... 3. Simple is better than complex.
638... 4. Complex is better than complicated.
639... '''.splitlines(1)
640>>> len(text1)
6414
642>>> text1[0][-1]
643'\n'
644>>> text2 = ''' 1. Beautiful is better than ugly.
645... 3. Simple is better than complex.
646... 4. Complicated is better than complex.
647... 5. Flat is better than nested.
648... '''.splitlines(1)
649\end{verbatim}
650
651Next we instantiate a Differ object:
652
653\begin{verbatim}
654>>> d = Differ()
655\end{verbatim}
656
657Note that when instantiating a \class{Differ} object we may pass
658functions to filter out line and character ``junk.'' See the
659\method{Differ()} constructor for details.
660
661Finally, we compare the two:
662
663\begin{verbatim}
Tim Peters8a9c2842001-09-22 21:30:22 +0000664>>> result = list(d.compare(text1, text2))
Fred Drake6943a292001-08-13 19:31:59 +0000665\end{verbatim}
666
667\code{result} is a list of strings, so let's pretty-print it:
668
669\begin{verbatim}
670>>> from pprint import pprint
671>>> pprint(result)
672[' 1. Beautiful is better than ugly.\n',
673 '- 2. Explicit is better than implicit.\n',
674 '- 3. Simple is better than complex.\n',
675 '+ 3. Simple is better than complex.\n',
676 '? ++ \n',
677 '- 4. Complex is better than complicated.\n',
678 '? ^ ---- ^ \n',
679 '+ 4. Complicated is better than complex.\n',
680 '? ++++ ^ ^ \n',
681 '+ 5. Flat is better than nested.\n']
682\end{verbatim}
683
684As a single multi-line string it looks like this:
685
686\begin{verbatim}
687>>> import sys
688>>> sys.stdout.writelines(result)
689 1. Beautiful is better than ugly.
690- 2. Explicit is better than implicit.
691- 3. Simple is better than complex.
692+ 3. Simple is better than complex.
693? ++
694- 4. Complex is better than complicated.
695? ^ ---- ^
696+ 4. Complicated is better than complex.
697? ++++ ^ ^
698+ 5. Flat is better than nested.
699\end{verbatim}