blob: e7d7f81e471290c560453fe5aa74e7f1b340d039 [file] [log] [blame]
Fred Drakebaf71422001-02-19 16:31:02 +00001\section{\module{difflib} ---
2 Helpers for computing deltas}
3
4\declaremodule{standard}{difflib}
5\modulesynopsis{Helpers for computing differences between objects.}
Fred Drake34929f22003-12-30 16:12:27 +00006\moduleauthor{Tim Peters}{tim_one@users.sourceforge.net}
7\sectionauthor{Tim Peters}{tim_one@users.sourceforge.net}
Fred Drakebaf71422001-02-19 16:31:02 +00008% LaTeXification by Fred L. Drake, Jr. <fdrake@acm.org>.
9
Fred Drakeda00cda2001-04-10 19:56:09 +000010\versionadded{2.1}
11
12
Fred Drake6943a292001-08-13 19:31:59 +000013\begin{classdesc*}{SequenceMatcher}
14 This is a flexible class for comparing pairs of sequences of any
15 type, so long as the sequence elements are hashable. The basic
16 algorithm predates, and is a little fancier than, an algorithm
17 published in the late 1980's by Ratcliff and Obershelp under the
18 hyperbolic name ``gestalt pattern matching.'' The idea is to find
19 the longest contiguous matching subsequence that contains no
20 ``junk'' elements (the Ratcliff and Obershelp algorithm doesn't
21 address junk). The same idea is then applied recursively to the
22 pieces of the sequences to the left and to the right of the matching
23 subsequence. This does not yield minimal edit sequences, but does
24 tend to yield matches that ``look right'' to people.
25
26 \strong{Timing:} The basic Ratcliff-Obershelp algorithm is cubic
27 time in the worst case and quadratic time in the expected case.
28 \class{SequenceMatcher} is quadratic time for the worst case and has
29 expected-case behavior dependent in a complicated way on how many
30 elements the sequences have in common; best case time is linear.
31\end{classdesc*}
32
33\begin{classdesc*}{Differ}
34 This is a class for comparing sequences of lines of text, and
Tim Peters8a9c2842001-09-22 21:30:22 +000035 producing human-readable differences or deltas. Differ uses
Fred Drake6943a292001-08-13 19:31:59 +000036 \class{SequenceMatcher} both to compare sequences of lines, and to
37 compare sequences of characters within similar (near-matching)
38 lines.
39
40 Each line of a \class{Differ} delta begins with a two-letter code:
41
42\begin{tableii}{l|l}{code}{Code}{Meaning}
43 \lineii{'- '}{line unique to sequence 1}
44 \lineii{'+ '}{line unique to sequence 2}
45 \lineii{' '}{line common to both sequences}
46 \lineii{'? '}{line not present in either input sequence}
47\end{tableii}
48
49 Lines beginning with `\code{?~}' attempt to guide the eye to
50 intraline differences, and were not present in either input
51 sequence. These lines can be confusing if the sequences contain tab
52 characters.
53\end{classdesc*}
54
Martin v. Löwise064b412004-08-29 16:34:40 +000055\begin{classdesc*}{HtmlDiff}
56
57 This class can be used to create an HTML table (or a complete HTML file
Andrew M. Kuchling55be9ea2004-09-10 12:59:54 +000058 containing the table) showing a side by side, line by line comparison
Tim Peters2ee80992004-09-12 03:21:00 +000059 of text with inter-line and intra-line change highlights. The table can
Martin v. Löwise064b412004-08-29 16:34:40 +000060 be generated in either full or contextual difference mode.
61
62 The constructor for this class is:
63
64 \begin{funcdesc}{__init__}{
65 \optional{, tabsize
66 \optional{, wrapcolumn
67 \optional{, linejunk
68 \optional{, charjunk}}}}}
69
Tim Peters2ee80992004-09-12 03:21:00 +000070 Initializes instance of \class{HtmlDiff}.
Martin v. Löwise064b412004-08-29 16:34:40 +000071
Tim Peters2ee80992004-09-12 03:21:00 +000072 \var{tabsize} is an optional keyword argument to specify tab stop spacing
Martin v. Löwise064b412004-08-29 16:34:40 +000073 and defaults to \code{8}.
74
Tim Peters2ee80992004-09-12 03:21:00 +000075 \var{wrapcolumn} is an optional keyword to specify column number where
76 lines are broken and wrapped, defaults to \code{None} where lines are not
Martin v. Löwise064b412004-08-29 16:34:40 +000077 wrapped.
78
Tim Peters2ee80992004-09-12 03:21:00 +000079 \var{linejunk} and \var{charjunk} are optional keyword arguments passed
80 into \code{ndiff()} (used by \class{HtmlDiff} to generate the
81 side by side HTML differences). See \code{ndiff()} documentation for
Martin v. Löwise064b412004-08-29 16:34:40 +000082 argument default values and descriptions.
Tim Peters2ee80992004-09-12 03:21:00 +000083
84 \end{funcdesc}
Martin v. Löwise064b412004-08-29 16:34:40 +000085
86 The following methods are public:
87
88 \begin{funcdesc}{make_file}{fromlines, tolines
89 \optional{, fromdesc
90 \optional{, todesc
91 \optional{, context
92 \optional{, numlines}}}}}
93 Compares \var{fromlines} and \var{tolines} (lists of strings) and returns
94 a string which is a complete HTML file containing a table showing line by
95 line differences with inter-line and intra-line changes highlighted.
Tim Peters2ee80992004-09-12 03:21:00 +000096
97 \var{fromdesc} and \var{todesc} are optional keyword arguments to specify
98 from/to file column header strings (both default to an empty string).
99
Martin v. Löwise064b412004-08-29 16:34:40 +0000100 \var{context} and \var{numlines} are both optional keyword arguments.
Tim Peters2ee80992004-09-12 03:21:00 +0000101 Set \var{context} to \code{True} when contextual differences are to be
102 shown, else the default is \code{False} to show the full files.
Martin v. Löwise064b412004-08-29 16:34:40 +0000103 \var{numlines} defaults to \code{5}. When \var{context} is \code{True}
Tim Peters2ee80992004-09-12 03:21:00 +0000104 \var{numlines} controls the number of context lines which surround the
Martin v. Löwise064b412004-08-29 16:34:40 +0000105 difference highlights. When \var{context} is \code{False} \var{numlines}
Tim Peters2ee80992004-09-12 03:21:00 +0000106 controls the number of lines which are shown before a difference
Martin v. Löwise064b412004-08-29 16:34:40 +0000107 highlight when using the "next" hyperlinks (setting to zero would cause
108 the "next" hyperlinks to place the next difference highlight at the top of
109 the browser without any leading context).
Tim Peters2ee80992004-09-12 03:21:00 +0000110 \end{funcdesc}
Martin v. Löwise064b412004-08-29 16:34:40 +0000111
112 \begin{funcdesc}{make_table}{fromlines, tolines
113 \optional{, fromdesc
114 \optional{, todesc
115 \optional{, context}}}}
116 Compares \var{fromlines} and \var{tolines} (lists of strings) and returns
117 a string which is a complete HTML table showing line by line differences
118 with inter-line and intra-line changes highlighted.
Martin v. Löwise064b412004-08-29 16:34:40 +0000119
Tim Peters2ee80992004-09-12 03:21:00 +0000120 The arguments of this method are a subset of those for the
121 \code{make_file} method. Refer to the \code{make_file} method
122 documentation.
123 \end{funcdesc}
124
125 \file{Tools/scripts/diff.py} is a command-line front-end to this class
Martin v. Löwise064b412004-08-29 16:34:40 +0000126 and contains a good example of its use.
Tim Peters2ee80992004-09-12 03:21:00 +0000127
128 \versionadded{2.4}
Martin v. Löwise064b412004-08-29 16:34:40 +0000129\end{classdesc*}
130
Raymond Hettingere07b8352003-06-09 21:44:59 +0000131\begin{funcdesc}{context_diff}{a, b\optional{, fromfile\optional{, tofile
132 \optional{, fromfiledate\optional{, tofiledate\optional{, n
133 \optional{, lineterm}}}}}}}
Raymond Hettingere07b8352003-06-09 21:44:59 +0000134 Compare \var{a} and \var{b} (lists of strings); return a
135 delta (a generator generating the delta lines) in context diff
136 format.
Tim Peters2ee80992004-09-12 03:21:00 +0000137
Raymond Hettingere07b8352003-06-09 21:44:59 +0000138 Context diffs are a compact way of showing just the lines that have
139 changed plus a few lines of context. The changes are shown in a
140 before/after style. The number of context lines is set by \var{n}
141 which defaults to three.
142
143 By default, the diff control lines (those with \code{***} or \code{---})
144 are created with a trailing newline. This is helpful so that inputs created
145 from \function{file.readlines()} result in diffs that are suitable for use
146 with \function{file.writelines()} since both the inputs and outputs have
147 trailing newlines.
148
149 For inputs that do not have trailing newlines, set the \var{lineterm}
150 argument to \code{""} so that the output will be uniformly newline free.
151
152 The context diff format normally has a header for filenames and
153 modification times. Any or all of these may be specified using strings for
154 \var{fromfile}, \var{tofile}, \var{fromfiledate}, and \var{tofiledate}.
155 The modification times are normally expressed in the format returned by
156 \function{time.ctime()}. If not specified, the strings default to blanks.
157
158 \file{Tools/scripts/diff.py} is a command-line front-end for this
Raymond Hettinger132fa372003-06-11 07:50:44 +0000159 function.
160
161 \versionadded{2.3}
Tim Peters2ee80992004-09-12 03:21:00 +0000162\end{funcdesc}
Raymond Hettingere07b8352003-06-09 21:44:59 +0000163
Fred Drakebaf71422001-02-19 16:31:02 +0000164\begin{funcdesc}{get_close_matches}{word, possibilities\optional{,
165 n\optional{, cutoff}}}
166 Return a list of the best ``good enough'' matches. \var{word} is a
167 sequence for which close matches are desired (typically a string),
168 and \var{possibilities} is a list of sequences against which to
169 match \var{word} (typically a list of strings).
170
171 Optional argument \var{n} (default \code{3}) is the maximum number
172 of close matches to return; \var{n} must be greater than \code{0}.
173
174 Optional argument \var{cutoff} (default \code{0.6}) is a float in
175 the range [0, 1]. Possibilities that don't score at least that
176 similar to \var{word} are ignored.
177
178 The best (no more than \var{n}) matches among the possibilities are
179 returned in a list, sorted by similarity score, most similar first.
180
181\begin{verbatim}
182>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
183['apple', 'ape']
184>>> import keyword
185>>> get_close_matches('wheel', keyword.kwlist)
186['while']
187>>> get_close_matches('apple', keyword.kwlist)
188[]
189>>> get_close_matches('accept', keyword.kwlist)
190['except']
191\end{verbatim}
192\end{funcdesc}
193
Fred Drake6943a292001-08-13 19:31:59 +0000194\begin{funcdesc}{ndiff}{a, b\optional{, linejunk\optional{,
195 charjunk}}}
196 Compare \var{a} and \var{b} (lists of strings); return a
Tim Peters8a9c2842001-09-22 21:30:22 +0000197 \class{Differ}-style delta (a generator generating the delta lines).
Fred Drakebaf71422001-02-19 16:31:02 +0000198
Fred Drake6943a292001-08-13 19:31:59 +0000199 Optional keyword parameters \var{linejunk} and \var{charjunk} are
200 for filter functions (or \code{None}):
201
Tim Peters81b92512002-04-29 01:37:32 +0000202 \var{linejunk}: A function that accepts a single string
203 argument, and returns true if the string is junk, or false if not.
204 The default is (\code{None}), starting with Python 2.3. Before then,
205 the default was the module-level function
Fred Drake6943a292001-08-13 19:31:59 +0000206 \function{IS_LINE_JUNK()}, which filters out lines without visible
207 characters, except for at most one pound character (\character{\#}).
Tim Peters81b92512002-04-29 01:37:32 +0000208 As of Python 2.3, the underlying \class{SequenceMatcher} class
209 does a dynamic analysis of which lines are so frequent as to
210 constitute noise, and this usually works better than the pre-2.3
211 default.
Fred Drake6943a292001-08-13 19:31:59 +0000212
Tim Peters81b92512002-04-29 01:37:32 +0000213 \var{charjunk}: A function that accepts a character (a string of
214 length 1), and returns if the character is junk, or false if not.
Fred Drake6943a292001-08-13 19:31:59 +0000215 The default is module-level function \function{IS_CHARACTER_JUNK()},
216 which filters out whitespace characters (a blank or tab; note: bad
217 idea to include newline in this!).
218
219 \file{Tools/scripts/ndiff.py} is a command-line front-end to this
220 function.
221
222\begin{verbatim}
223>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
Fred Drakeedb635f2002-12-06 18:52:28 +0000224... 'ore\ntree\nemu\n'.splitlines(1))
Fred Drake6943a292001-08-13 19:31:59 +0000225>>> print ''.join(diff),
226- one
Tim Peters8a9c2842001-09-22 21:30:22 +0000227? ^
Fred Drake6943a292001-08-13 19:31:59 +0000228+ ore
Tim Peters8a9c2842001-09-22 21:30:22 +0000229? ^
Fred Drake6943a292001-08-13 19:31:59 +0000230- two
231- three
Tim Peters8a9c2842001-09-22 21:30:22 +0000232? -
Fred Drake6943a292001-08-13 19:31:59 +0000233+ tree
234+ emu
235\end{verbatim}
236\end{funcdesc}
237
238\begin{funcdesc}{restore}{sequence, which}
239 Return one of the two sequences that generated a delta.
240
241 Given a \var{sequence} produced by \method{Differ.compare()} or
242 \function{ndiff()}, extract lines originating from file 1 or 2
243 (parameter \var{which}), stripping off line prefixes.
244
245 Example:
246
247\begin{verbatim}
248>>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1),
249... 'ore\ntree\nemu\n'.splitlines(1))
Tim Peters8a9c2842001-09-22 21:30:22 +0000250>>> diff = list(diff) # materialize the generated delta into a list
Fred Drake6943a292001-08-13 19:31:59 +0000251>>> print ''.join(restore(diff, 1)),
252one
253two
254three
255>>> print ''.join(restore(diff, 2)),
256ore
257tree
258emu
259\end{verbatim}
260
261\end{funcdesc}
262
Raymond Hettingere07b8352003-06-09 21:44:59 +0000263\begin{funcdesc}{unified_diff}{a, b\optional{, fromfile\optional{, tofile
264 \optional{, fromfiledate\optional{, tofiledate\optional{, n
265 \optional{, lineterm}}}}}}}
Raymond Hettingere07b8352003-06-09 21:44:59 +0000266 Compare \var{a} and \var{b} (lists of strings); return a
267 delta (a generator generating the delta lines) in unified diff
268 format.
Tim Peters2ee80992004-09-12 03:21:00 +0000269
Raymond Hettingere07b8352003-06-09 21:44:59 +0000270 Unified diffs are a compact way of showing just the lines that have
271 changed plus a few lines of context. The changes are shown in a
272 inline style (instead of separate before/after blocks). The number
273 of context lines is set by \var{n} which defaults to three.
274
275 By default, the diff control lines (those with \code{---}, \code{+++},
276 or \code{@@}) are created with a trailing newline. This is helpful so
277 that inputs created from \function{file.readlines()} result in diffs
278 that are suitable for use with \function{file.writelines()} since both
279 the inputs and outputs have trailing newlines.
280
281 For inputs that do not have trailing newlines, set the \var{lineterm}
282 argument to \code{""} so that the output will be uniformly newline free.
283
284 The context diff format normally has a header for filenames and
285 modification times. Any or all of these may be specified using strings for
286 \var{fromfile}, \var{tofile}, \var{fromfiledate}, and \var{tofiledate}.
287 The modification times are normally expressed in the format returned by
288 \function{time.ctime()}. If not specified, the strings default to blanks.
289
290 \file{Tools/scripts/diff.py} is a command-line front-end for this
Raymond Hettinger132fa372003-06-11 07:50:44 +0000291 function.
292
Tim Peters2ee80992004-09-12 03:21:00 +0000293 \versionadded{2.3}
294\end{funcdesc}
Fred Drake6943a292001-08-13 19:31:59 +0000295
Fred Drake7f10cce2001-10-26 03:04:23 +0000296\begin{funcdesc}{IS_LINE_JUNK}{line}
297 Return true for ignorable lines. The line \var{line} is ignorable
298 if \var{line} is blank or contains a single \character{\#},
299 otherwise it is not ignorable. Used as a default for parameter
Tim Peters81b92512002-04-29 01:37:32 +0000300 \var{linejunk} in \function{ndiff()} before Python 2.3.
Fred Drake6943a292001-08-13 19:31:59 +0000301\end{funcdesc}
302
303
Fred Drake7f10cce2001-10-26 03:04:23 +0000304\begin{funcdesc}{IS_CHARACTER_JUNK}{ch}
305 Return true for ignorable characters. The character \var{ch} is
306 ignorable if \var{ch} is a space or tab, otherwise it is not
307 ignorable. Used as a default for parameter \var{charjunk} in
Fred Drake6943a292001-08-13 19:31:59 +0000308 \function{ndiff()}.
Fred Drake6943a292001-08-13 19:31:59 +0000309\end{funcdesc}
Fred Drakebaf71422001-02-19 16:31:02 +0000310
311
Fred Drake6fda3ac2001-04-10 18:41:16 +0000312\begin{seealso}
Fred Drake1fe97502004-01-21 18:30:28 +0000313 \seetitle[http://www.ddj.com/documents/s=1103/ddj8807c/]
314 {Pattern Matching: The Gestalt Approach}{Discussion of a
Fred Drake6fda3ac2001-04-10 18:41:16 +0000315 similar algorithm by John W. Ratcliff and D. E. Metzener.
316 This was published in
317 \citetitle[http://www.ddj.com/]{Dr. Dobb's Journal} in
318 July, 1988.}
319\end{seealso}
320
321
Fred Drakebaf71422001-02-19 16:31:02 +0000322\subsection{SequenceMatcher Objects \label{sequence-matcher}}
323
Fred Drake96d7a702001-05-11 01:08:13 +0000324The \class{SequenceMatcher} class has this constructor:
325
Fred Drakebaf71422001-02-19 16:31:02 +0000326\begin{classdesc}{SequenceMatcher}{\optional{isjunk\optional{,
327 a\optional{, b}}}}
328 Optional argument \var{isjunk} must be \code{None} (the default) or
329 a one-argument function that takes a sequence element and returns
330 true if and only if the element is ``junk'' and should be ignored.
Fred Drake7f10cce2001-10-26 03:04:23 +0000331 Passing \code{None} for \var{b} is equivalent to passing
332 \code{lambda x: 0}; in other words, no elements are ignored. For
333 example, pass:
Fred Drakebaf71422001-02-19 16:31:02 +0000334
335\begin{verbatim}
Fred Drake447f5452001-02-23 19:13:07 +0000336lambda x: x in " \t"
Fred Drakebaf71422001-02-19 16:31:02 +0000337\end{verbatim}
338
339 if you're comparing lines as sequences of characters, and don't want
340 to synch up on blanks or hard tabs.
341
342 The optional arguments \var{a} and \var{b} are sequences to be
343 compared; both default to empty strings. The elements of both
344 sequences must be hashable.
345\end{classdesc}
346
347
348\class{SequenceMatcher} objects have the following methods:
349
350\begin{methoddesc}{set_seqs}{a, b}
351 Set the two sequences to be compared.
352\end{methoddesc}
353
354\class{SequenceMatcher} computes and caches detailed information about
355the second sequence, so if you want to compare one sequence against
356many sequences, use \method{set_seq2()} to set the commonly used
357sequence once and call \method{set_seq1()} repeatedly, once for each
358of the other sequences.
359
360\begin{methoddesc}{set_seq1}{a}
361 Set the first sequence to be compared. The second sequence to be
362 compared is not changed.
363\end{methoddesc}
364
365\begin{methoddesc}{set_seq2}{b}
366 Set the second sequence to be compared. The first sequence to be
367 compared is not changed.
368\end{methoddesc}
369
370\begin{methoddesc}{find_longest_match}{alo, ahi, blo, bhi}
371 Find longest matching block in \code{\var{a}[\var{alo}:\var{ahi}]}
372 and \code{\var{b}[\var{blo}:\var{bhi}]}.
373
374 If \var{isjunk} was omitted or \code{None},
375 \method{get_longest_match()} returns \code{(\var{i}, \var{j},
376 \var{k})} such that \code{\var{a}[\var{i}:\var{i}+\var{k}]} is equal
Tim Peters8a9c2842001-09-22 21:30:22 +0000377 to \code{\var{b}[\var{j}:\var{j}+\var{k}]}, where
Fred Drakebaf71422001-02-19 16:31:02 +0000378 \code{\var{alo} <= \var{i} <= \var{i}+\var{k} <= \var{ahi}} and
379 \code{\var{blo} <= \var{j} <= \var{j}+\var{k} <= \var{bhi}}.
380 For all \code{(\var{i'}, \var{j'}, \var{k'})} meeting those
381 conditions, the additional conditions
382 \code{\var{k} >= \var{k'}},
383 \code{\var{i} <= \var{i'}},
384 and if \code{\var{i} == \var{i'}}, \code{\var{j} <= \var{j'}}
385 are also met.
386 In other words, of all maximal matching blocks, return one that
387 starts earliest in \var{a}, and of all those maximal matching blocks
388 that start earliest in \var{a}, return the one that starts earliest
389 in \var{b}.
390
391\begin{verbatim}
392>>> s = SequenceMatcher(None, " abcd", "abcd abcd")
393>>> s.find_longest_match(0, 5, 0, 9)
394(0, 4, 5)
395\end{verbatim}
396
397 If \var{isjunk} was provided, first the longest matching block is
398 determined as above, but with the additional restriction that no
399 junk element appears in the block. Then that block is extended as
400 far as possible by matching (only) junk elements on both sides.
401 So the resulting block never matches on junk except as identical
402 junk happens to be adjacent to an interesting match.
403
404 Here's the same example as before, but considering blanks to be junk.
Tim Peters754ba582001-02-20 11:24:35 +0000405 That prevents \code{' abcd'} from matching the \code{' abcd'} at the
Fred Drakebaf71422001-02-19 16:31:02 +0000406 tail end of the second sequence directly. Instead only the
407 \code{'abcd'} can match, and matches the leftmost \code{'abcd'} in
408 the second sequence:
409
410\begin{verbatim}
411>>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd")
412>>> s.find_longest_match(0, 5, 0, 9)
413(1, 0, 4)
414\end{verbatim}
415
416 If no blocks match, this returns \code{(\var{alo}, \var{blo}, 0)}.
417\end{methoddesc}
418
419\begin{methoddesc}{get_matching_blocks}{}
420 Return list of triples describing matching subsequences.
421 Each triple is of the form \code{(\var{i}, \var{j}, \var{n})}, and
422 means that \code{\var{a}[\var{i}:\var{i}+\var{n}] ==
423 \var{b}[\var{j}:\var{j}+\var{n}]}. The triples are monotonically
424 increasing in \var{i} and \var{j}.
425
426 The last triple is a dummy, and has the value \code{(len(\var{a}),
427 len(\var{b}), 0)}. It is the only triple with \code{\var{n} == 0}.
428 % Explain why a dummy is used!
429
430\begin{verbatim}
431>>> s = SequenceMatcher(None, "abxcd", "abcd")
432>>> s.get_matching_blocks()
433[(0, 0, 2), (3, 2, 2), (5, 4, 0)]
434\end{verbatim}
435\end{methoddesc}
436
437\begin{methoddesc}{get_opcodes}{}
438 Return list of 5-tuples describing how to turn \var{a} into \var{b}.
439 Each tuple is of the form \code{(\var{tag}, \var{i1}, \var{i2},
440 \var{j1}, \var{j2})}. The first tuple has \code{\var{i1} ==
441 \var{j1} == 0}, and remaining tuples have \var{i1} equal to the
442 \var{i2} from the preceeding tuple, and, likewise, \var{j1} equal to
443 the previous \var{j2}.
444
445 The \var{tag} values are strings, with these meanings:
446
447\begin{tableii}{l|l}{code}{Value}{Meaning}
448 \lineii{'replace'}{\code{\var{a}[\var{i1}:\var{i2}]} should be
449 replaced by \code{\var{b}[\var{j1}:\var{j2}]}.}
450 \lineii{'delete'}{\code{\var{a}[\var{i1}:\var{i2}]} should be
451 deleted. Note that \code{\var{j1} == \var{j2}} in
452 this case.}
453 \lineii{'insert'}{\code{\var{b}[\var{j1}:\var{j2}]} should be
Tim Peters8a9c2842001-09-22 21:30:22 +0000454 inserted at \code{\var{a}[\var{i1}:\var{i1}]}.
Fred Drakebaf71422001-02-19 16:31:02 +0000455 Note that \code{\var{i1} == \var{i2}} in this
456 case.}
457 \lineii{'equal'}{\code{\var{a}[\var{i1}:\var{i2}] ==
458 \var{b}[\var{j1}:\var{j2}]} (the sub-sequences are
459 equal).}
460\end{tableii}
461
462For example:
463
464\begin{verbatim}
465>>> a = "qabxcd"
466>>> b = "abycdf"
467>>> s = SequenceMatcher(None, a, b)
468>>> for tag, i1, i2, j1, j2 in s.get_opcodes():
469... print ("%7s a[%d:%d] (%s) b[%d:%d] (%s)" %
470... (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2]))
471 delete a[0:1] (q) b[0:0] ()
472 equal a[1:3] (ab) b[0:2] (ab)
473replace a[3:4] (x) b[2:3] (y)
474 equal a[4:6] (cd) b[3:5] (cd)
475 insert a[6:6] () b[5:6] (f)
476\end{verbatim}
477\end{methoddesc}
478
Raymond Hettinger132fa372003-06-11 07:50:44 +0000479\begin{methoddesc}{get_grouped_opcodes}{\optional{n}}
480 Return a generator of groups with up to \var{n} lines of context.
481
482 Starting with the groups returned by \method{get_opcodes()},
483 this method splits out smaller change clusters and eliminates
484 intervening ranges which have no changes.
485
486 The groups are returned in the same format as \method{get_opcodes()}.
Tim Peters2ee80992004-09-12 03:21:00 +0000487 \versionadded{2.3}
Raymond Hettinger132fa372003-06-11 07:50:44 +0000488\end{methoddesc}
489
Fred Drakebaf71422001-02-19 16:31:02 +0000490\begin{methoddesc}{ratio}{}
491 Return a measure of the sequences' similarity as a float in the
492 range [0, 1].
493
494 Where T is the total number of elements in both sequences, and M is
Fred Drake6943a292001-08-13 19:31:59 +0000495 the number of matches, this is 2.0*M / T. Note that this is
496 \code{1.0} if the sequences are identical, and \code{0.0} if they
497 have nothing in common.
Fred Drakebaf71422001-02-19 16:31:02 +0000498
499 This is expensive to compute if \method{get_matching_blocks()} or
500 \method{get_opcodes()} hasn't already been called, in which case you
501 may want to try \method{quick_ratio()} or
502 \method{real_quick_ratio()} first to get an upper bound.
503\end{methoddesc}
504
505\begin{methoddesc}{quick_ratio}{}
506 Return an upper bound on \method{ratio()} relatively quickly.
507
508 This isn't defined beyond that it is an upper bound on
509 \method{ratio()}, and is faster to compute.
510\end{methoddesc}
511
512\begin{methoddesc}{real_quick_ratio}{}
513 Return an upper bound on \method{ratio()} very quickly.
514
515 This isn't defined beyond that it is an upper bound on
516 \method{ratio()}, and is faster to compute than either
517 \method{ratio()} or \method{quick_ratio()}.
518\end{methoddesc}
519
Tim Peters754ba582001-02-20 11:24:35 +0000520The three methods that return the ratio of matching to total characters
521can give different results due to differing levels of approximation,
522although \method{quick_ratio()} and \method{real_quick_ratio()} are always
523at least as large as \method{ratio()}:
Fred Drakebaf71422001-02-19 16:31:02 +0000524
525\begin{verbatim}
526>>> s = SequenceMatcher(None, "abcd", "bcde")
527>>> s.ratio()
5280.75
529>>> s.quick_ratio()
5300.75
531>>> s.real_quick_ratio()
5321.0
533\end{verbatim}
534
535
Fred Drake6943a292001-08-13 19:31:59 +0000536\subsection{SequenceMatcher Examples \label{sequencematcher-examples}}
Fred Drakebaf71422001-02-19 16:31:02 +0000537
538
539This example compares two strings, considering blanks to be ``junk:''
540
541\begin{verbatim}
542>>> s = SequenceMatcher(lambda x: x == " ",
543... "private Thread currentThread;",
544... "private volatile Thread currentThread;")
545\end{verbatim}
546
547\method{ratio()} returns a float in [0, 1], measuring the similarity
548of the sequences. As a rule of thumb, a \method{ratio()} value over
5490.6 means the sequences are close matches:
550
551\begin{verbatim}
552>>> print round(s.ratio(), 3)
5530.866
554\end{verbatim}
555
556If you're only interested in where the sequences match,
557\method{get_matching_blocks()} is handy:
558
559\begin{verbatim}
560>>> for block in s.get_matching_blocks():
561... print "a[%d] and b[%d] match for %d elements" % block
562a[0] and b[0] match for 8 elements
563a[8] and b[17] match for 6 elements
564a[14] and b[23] match for 15 elements
565a[29] and b[38] match for 0 elements
566\end{verbatim}
567
568Note that the last tuple returned by \method{get_matching_blocks()} is
569always a dummy, \code{(len(\var{a}), len(\var{b}), 0)}, and this is
570the only case in which the last tuple element (number of elements
571matched) is \code{0}.
572
573If you want to know how to change the first sequence into the second,
574use \method{get_opcodes()}:
575
576\begin{verbatim}
577>>> for opcode in s.get_opcodes():
578... print "%6s a[%d:%d] b[%d:%d]" % opcode
579 equal a[0:8] b[0:8]
580insert a[8:8] b[8:17]
581 equal a[8:14] b[17:23]
582 equal a[14:29] b[23:38]
583\end{verbatim}
584
Fred Drakebaf71422001-02-19 16:31:02 +0000585See also the function \function{get_close_matches()} in this module,
586which shows how simple code building on \class{SequenceMatcher} can be
587used to do useful work.
Fred Drake6943a292001-08-13 19:31:59 +0000588
589
590\subsection{Differ Objects \label{differ-objects}}
591
592Note that \class{Differ}-generated deltas make no claim to be
593\strong{minimal} diffs. To the contrary, minimal diffs are often
594counter-intuitive, because they synch up anywhere possible, sometimes
595accidental matches 100 pages apart. Restricting synch points to
596contiguous matches preserves some notion of locality, at the
597occasional cost of producing a longer diff.
598
599The \class{Differ} class has this constructor:
600
601\begin{classdesc}{Differ}{\optional{linejunk\optional{, charjunk}}}
602 Optional keyword parameters \var{linejunk} and \var{charjunk} are
603 for filter functions (or \code{None}):
604
Tim Peters81b92512002-04-29 01:37:32 +0000605 \var{linejunk}: A function that accepts a single string
606 argument, and returns true if the string is junk. The default is
607 \code{None}, meaning that no line is considered junk.
Fred Drake6943a292001-08-13 19:31:59 +0000608
Tim Peters81b92512002-04-29 01:37:32 +0000609 \var{charjunk}: A function that accepts a single character argument
610 (a string of length 1), and returns true if the character is junk.
611 The default is \code{None}, meaning that no character is
612 considered junk.
Fred Drake6943a292001-08-13 19:31:59 +0000613\end{classdesc}
614
615\class{Differ} objects are used (deltas generated) via a single
616method:
617
618\begin{methoddesc}{compare}{a, b}
Tim Peters8a9c2842001-09-22 21:30:22 +0000619 Compare two sequences of lines, and generate the delta (a sequence
620 of lines).
Fred Drake6943a292001-08-13 19:31:59 +0000621
622 Each sequence must contain individual single-line strings ending
623 with newlines. Such sequences can be obtained from the
Tim Peters8a9c2842001-09-22 21:30:22 +0000624 \method{readlines()} method of file-like objects. The delta generated
625 also consists of newline-terminated strings, ready to be printed as-is
Fred Drake389aa172001-11-29 19:04:50 +0000626 via the \method{writelines()} method of a file-like object.
Fred Drake6943a292001-08-13 19:31:59 +0000627\end{methoddesc}
628
629
630\subsection{Differ Example \label{differ-examples}}
631
632This example compares two texts. First we set up the texts, sequences
633of individual single-line strings ending with newlines (such sequences
634can also be obtained from the \method{readlines()} method of file-like
635objects):
636
637\begin{verbatim}
638>>> text1 = ''' 1. Beautiful is better than ugly.
639... 2. Explicit is better than implicit.
640... 3. Simple is better than complex.
641... 4. Complex is better than complicated.
642... '''.splitlines(1)
643>>> len(text1)
6444
645>>> text1[0][-1]
646'\n'
647>>> text2 = ''' 1. Beautiful is better than ugly.
648... 3. Simple is better than complex.
649... 4. Complicated is better than complex.
650... 5. Flat is better than nested.
651... '''.splitlines(1)
652\end{verbatim}
653
654Next we instantiate a Differ object:
655
656\begin{verbatim}
657>>> d = Differ()
658\end{verbatim}
659
660Note that when instantiating a \class{Differ} object we may pass
661functions to filter out line and character ``junk.'' See the
662\method{Differ()} constructor for details.
663
664Finally, we compare the two:
665
666\begin{verbatim}
Tim Peters8a9c2842001-09-22 21:30:22 +0000667>>> result = list(d.compare(text1, text2))
Fred Drake6943a292001-08-13 19:31:59 +0000668\end{verbatim}
669
670\code{result} is a list of strings, so let's pretty-print it:
671
672\begin{verbatim}
673>>> from pprint import pprint
674>>> pprint(result)
675[' 1. Beautiful is better than ugly.\n',
676 '- 2. Explicit is better than implicit.\n',
677 '- 3. Simple is better than complex.\n',
678 '+ 3. Simple is better than complex.\n',
679 '? ++ \n',
680 '- 4. Complex is better than complicated.\n',
681 '? ^ ---- ^ \n',
682 '+ 4. Complicated is better than complex.\n',
683 '? ++++ ^ ^ \n',
684 '+ 5. Flat is better than nested.\n']
685\end{verbatim}
686
687As a single multi-line string it looks like this:
688
689\begin{verbatim}
690>>> import sys
691>>> sys.stdout.writelines(result)
692 1. Beautiful is better than ugly.
693- 2. Explicit is better than implicit.
694- 3. Simple is better than complex.
695+ 3. Simple is better than complex.
696? ++
697- 4. Complex is better than complicated.
698? ^ ---- ^
699+ 4. Complicated is better than complex.
700? ++++ ^ ^
701+ 5. Flat is better than nested.
702\end{verbatim}