| \section{\module{difflib} --- |
| Helpers for computing deltas} |
| |
| \declaremodule{standard}{difflib} |
| \modulesynopsis{Helpers for computing differences between objects.} |
| \moduleauthor{Tim Peters}{tim.one@home.com} |
| \sectionauthor{Tim Peters}{tim.one@home.com} |
| % LaTeXification by Fred L. Drake, Jr. <fdrake@acm.org>. |
| |
| \versionadded{2.1} |
| |
| |
| \begin{classdesc*}{SequenceMatcher} |
| This is a flexible class for comparing pairs of sequences of any |
| type, so long as the sequence elements are hashable. The basic |
| algorithm predates, and is a little fancier than, an algorithm |
| published in the late 1980's by Ratcliff and Obershelp under the |
| hyperbolic name ``gestalt pattern matching.'' The idea is to find |
| the longest contiguous matching subsequence that contains no |
| ``junk'' elements (the Ratcliff and Obershelp algorithm doesn't |
| address junk). The same idea is then applied recursively to the |
| pieces of the sequences to the left and to the right of the matching |
| subsequence. This does not yield minimal edit sequences, but does |
| tend to yield matches that ``look right'' to people. |
| |
| \strong{Timing:} The basic Ratcliff-Obershelp algorithm is cubic |
| time in the worst case and quadratic time in the expected case. |
| \class{SequenceMatcher} is quadratic time for the worst case and has |
| expected-case behavior dependent in a complicated way on how many |
| elements the sequences have in common; best case time is linear. |
| \end{classdesc*} |
| |
| \begin{classdesc*}{Differ} |
| This is a class for comparing sequences of lines of text, and |
| producing human-readable differences or deltas. Differ uses |
| \class{SequenceMatcher} both to compare sequences of lines, and to |
| compare sequences of characters within similar (near-matching) |
| lines. |
| |
| Each line of a \class{Differ} delta begins with a two-letter code: |
| |
| \begin{tableii}{l|l}{code}{Code}{Meaning} |
| \lineii{'- '}{line unique to sequence 1} |
| \lineii{'+ '}{line unique to sequence 2} |
| \lineii{' '}{line common to both sequences} |
| \lineii{'? '}{line not present in either input sequence} |
| \end{tableii} |
| |
| Lines beginning with `\code{?~}' attempt to guide the eye to |
| intraline differences, and were not present in either input |
| sequence. These lines can be confusing if the sequences contain tab |
| characters. |
| \end{classdesc*} |
| |
| \begin{funcdesc}{get_close_matches}{word, possibilities\optional{, |
| n\optional{, cutoff}}} |
| Return a list of the best ``good enough'' matches. \var{word} is a |
| sequence for which close matches are desired (typically a string), |
| and \var{possibilities} is a list of sequences against which to |
| match \var{word} (typically a list of strings). |
| |
| Optional argument \var{n} (default \code{3}) is the maximum number |
| of close matches to return; \var{n} must be greater than \code{0}. |
| |
| Optional argument \var{cutoff} (default \code{0.6}) is a float in |
| the range [0, 1]. Possibilities that don't score at least that |
| similar to \var{word} are ignored. |
| |
| The best (no more than \var{n}) matches among the possibilities are |
| returned in a list, sorted by similarity score, most similar first. |
| |
| \begin{verbatim} |
| >>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy']) |
| ['apple', 'ape'] |
| >>> import keyword |
| >>> get_close_matches('wheel', keyword.kwlist) |
| ['while'] |
| >>> get_close_matches('apple', keyword.kwlist) |
| [] |
| >>> get_close_matches('accept', keyword.kwlist) |
| ['except'] |
| \end{verbatim} |
| \end{funcdesc} |
| |
| \begin{funcdesc}{ndiff}{a, b\optional{, linejunk\optional{, |
| charjunk}}} |
| Compare \var{a} and \var{b} (lists of strings); return a |
| \class{Differ}-style delta (a generator generating the delta lines). |
| |
| Optional keyword parameters \var{linejunk} and \var{charjunk} are |
| for filter functions (or \code{None}): |
| |
| \var{linejunk}: A function that should accept a single string |
| argument, and return true if the string is junk (or false if it is |
| not). The default is module-level function |
| \function{IS_LINE_JUNK()}, which filters out lines without visible |
| characters, except for at most one pound character (\character{\#}). |
| |
| \var{charjunk}: A function that should accept a string of length 1. |
| The default is module-level function \function{IS_CHARACTER_JUNK()}, |
| which filters out whitespace characters (a blank or tab; note: bad |
| idea to include newline in this!). |
| |
| \file{Tools/scripts/ndiff.py} is a command-line front-end to this |
| function. |
| |
| \begin{verbatim} |
| >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1), |
| ... 'ore\ntree\nemu\n'.splitlines(1))) |
| >>> print ''.join(diff), |
| - one |
| ? ^ |
| + ore |
| ? ^ |
| - two |
| - three |
| ? - |
| + tree |
| + emu |
| \end{verbatim} |
| \end{funcdesc} |
| |
| \begin{funcdesc}{restore}{sequence, which} |
| Return one of the two sequences that generated a delta. |
| |
| Given a \var{sequence} produced by \method{Differ.compare()} or |
| \function{ndiff()}, extract lines originating from file 1 or 2 |
| (parameter \var{which}), stripping off line prefixes. |
| |
| Example: |
| |
| \begin{verbatim} |
| >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(1), |
| ... 'ore\ntree\nemu\n'.splitlines(1)) |
| >>> diff = list(diff) # materialize the generated delta into a list |
| >>> print ''.join(restore(diff, 1)), |
| one |
| two |
| three |
| >>> print ''.join(restore(diff, 2)), |
| ore |
| tree |
| emu |
| \end{verbatim} |
| |
| \end{funcdesc} |
| |
| |
| \begin{funcdesc}{IS_LINE_JUNK}{line} |
| Return true for ignorable lines. The line \var{line} is ignorable |
| if \var{line} is blank or contains a single \character{\#}, |
| otherwise it is not ignorable. Used as a default for parameter |
| \var{linejunk} in \function{ndiff()}. |
| \end{funcdesc} |
| |
| |
| \begin{funcdesc}{IS_CHARACTER_JUNK}{ch} |
| Return true for ignorable characters. The character \var{ch} is |
| ignorable if \var{ch} is a space or tab, otherwise it is not |
| ignorable. Used as a default for parameter \var{charjunk} in |
| \function{ndiff()}. |
| \end{funcdesc} |
| |
| |
| \begin{seealso} |
| \seetitle{Pattern Matching: The Gestalt Approach}{Discussion of a |
| similar algorithm by John W. Ratcliff and D. E. Metzener. |
| This was published in |
| \citetitle[http://www.ddj.com/]{Dr. Dobb's Journal} in |
| July, 1988.} |
| \end{seealso} |
| |
| |
| \subsection{SequenceMatcher Objects \label{sequence-matcher}} |
| |
| The \class{SequenceMatcher} class has this constructor: |
| |
| \begin{classdesc}{SequenceMatcher}{\optional{isjunk\optional{, |
| a\optional{, b}}}} |
| Optional argument \var{isjunk} must be \code{None} (the default) or |
| a one-argument function that takes a sequence element and returns |
| true if and only if the element is ``junk'' and should be ignored. |
| Passing \code{None} for \var{b} is equivalent to passing |
| \code{lambda x: 0}; in other words, no elements are ignored. For |
| example, pass: |
| |
| \begin{verbatim} |
| lambda x: x in " \t" |
| \end{verbatim} |
| |
| if you're comparing lines as sequences of characters, and don't want |
| to synch up on blanks or hard tabs. |
| |
| The optional arguments \var{a} and \var{b} are sequences to be |
| compared; both default to empty strings. The elements of both |
| sequences must be hashable. |
| \end{classdesc} |
| |
| |
| \class{SequenceMatcher} objects have the following methods: |
| |
| \begin{methoddesc}{set_seqs}{a, b} |
| Set the two sequences to be compared. |
| \end{methoddesc} |
| |
| \class{SequenceMatcher} computes and caches detailed information about |
| the second sequence, so if you want to compare one sequence against |
| many sequences, use \method{set_seq2()} to set the commonly used |
| sequence once and call \method{set_seq1()} repeatedly, once for each |
| of the other sequences. |
| |
| \begin{methoddesc}{set_seq1}{a} |
| Set the first sequence to be compared. The second sequence to be |
| compared is not changed. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{set_seq2}{b} |
| Set the second sequence to be compared. The first sequence to be |
| compared is not changed. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{find_longest_match}{alo, ahi, blo, bhi} |
| Find longest matching block in \code{\var{a}[\var{alo}:\var{ahi}]} |
| and \code{\var{b}[\var{blo}:\var{bhi}]}. |
| |
| If \var{isjunk} was omitted or \code{None}, |
| \method{get_longest_match()} returns \code{(\var{i}, \var{j}, |
| \var{k})} such that \code{\var{a}[\var{i}:\var{i}+\var{k}]} is equal |
| to \code{\var{b}[\var{j}:\var{j}+\var{k}]}, where |
| \code{\var{alo} <= \var{i} <= \var{i}+\var{k} <= \var{ahi}} and |
| \code{\var{blo} <= \var{j} <= \var{j}+\var{k} <= \var{bhi}}. |
| For all \code{(\var{i'}, \var{j'}, \var{k'})} meeting those |
| conditions, the additional conditions |
| \code{\var{k} >= \var{k'}}, |
| \code{\var{i} <= \var{i'}}, |
| and if \code{\var{i} == \var{i'}}, \code{\var{j} <= \var{j'}} |
| are also met. |
| In other words, of all maximal matching blocks, return one that |
| starts earliest in \var{a}, and of all those maximal matching blocks |
| that start earliest in \var{a}, return the one that starts earliest |
| in \var{b}. |
| |
| \begin{verbatim} |
| >>> s = SequenceMatcher(None, " abcd", "abcd abcd") |
| >>> s.find_longest_match(0, 5, 0, 9) |
| (0, 4, 5) |
| \end{verbatim} |
| |
| If \var{isjunk} was provided, first the longest matching block is |
| determined as above, but with the additional restriction that no |
| junk element appears in the block. Then that block is extended as |
| far as possible by matching (only) junk elements on both sides. |
| So the resulting block never matches on junk except as identical |
| junk happens to be adjacent to an interesting match. |
| |
| Here's the same example as before, but considering blanks to be junk. |
| That prevents \code{' abcd'} from matching the \code{' abcd'} at the |
| tail end of the second sequence directly. Instead only the |
| \code{'abcd'} can match, and matches the leftmost \code{'abcd'} in |
| the second sequence: |
| |
| \begin{verbatim} |
| >>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd") |
| >>> s.find_longest_match(0, 5, 0, 9) |
| (1, 0, 4) |
| \end{verbatim} |
| |
| If no blocks match, this returns \code{(\var{alo}, \var{blo}, 0)}. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{get_matching_blocks}{} |
| Return list of triples describing matching subsequences. |
| Each triple is of the form \code{(\var{i}, \var{j}, \var{n})}, and |
| means that \code{\var{a}[\var{i}:\var{i}+\var{n}] == |
| \var{b}[\var{j}:\var{j}+\var{n}]}. The triples are monotonically |
| increasing in \var{i} and \var{j}. |
| |
| The last triple is a dummy, and has the value \code{(len(\var{a}), |
| len(\var{b}), 0)}. It is the only triple with \code{\var{n} == 0}. |
| % Explain why a dummy is used! |
| |
| \begin{verbatim} |
| >>> s = SequenceMatcher(None, "abxcd", "abcd") |
| >>> s.get_matching_blocks() |
| [(0, 0, 2), (3, 2, 2), (5, 4, 0)] |
| \end{verbatim} |
| \end{methoddesc} |
| |
| \begin{methoddesc}{get_opcodes}{} |
| Return list of 5-tuples describing how to turn \var{a} into \var{b}. |
| Each tuple is of the form \code{(\var{tag}, \var{i1}, \var{i2}, |
| \var{j1}, \var{j2})}. The first tuple has \code{\var{i1} == |
| \var{j1} == 0}, and remaining tuples have \var{i1} equal to the |
| \var{i2} from the preceeding tuple, and, likewise, \var{j1} equal to |
| the previous \var{j2}. |
| |
| The \var{tag} values are strings, with these meanings: |
| |
| \begin{tableii}{l|l}{code}{Value}{Meaning} |
| \lineii{'replace'}{\code{\var{a}[\var{i1}:\var{i2}]} should be |
| replaced by \code{\var{b}[\var{j1}:\var{j2}]}.} |
| \lineii{'delete'}{\code{\var{a}[\var{i1}:\var{i2}]} should be |
| deleted. Note that \code{\var{j1} == \var{j2}} in |
| this case.} |
| \lineii{'insert'}{\code{\var{b}[\var{j1}:\var{j2}]} should be |
| inserted at \code{\var{a}[\var{i1}:\var{i1}]}. |
| Note that \code{\var{i1} == \var{i2}} in this |
| case.} |
| \lineii{'equal'}{\code{\var{a}[\var{i1}:\var{i2}] == |
| \var{b}[\var{j1}:\var{j2}]} (the sub-sequences are |
| equal).} |
| \end{tableii} |
| |
| For example: |
| |
| \begin{verbatim} |
| >>> a = "qabxcd" |
| >>> b = "abycdf" |
| >>> s = SequenceMatcher(None, a, b) |
| >>> for tag, i1, i2, j1, j2 in s.get_opcodes(): |
| ... print ("%7s a[%d:%d] (%s) b[%d:%d] (%s)" % |
| ... (tag, i1, i2, a[i1:i2], j1, j2, b[j1:j2])) |
| delete a[0:1] (q) b[0:0] () |
| equal a[1:3] (ab) b[0:2] (ab) |
| replace a[3:4] (x) b[2:3] (y) |
| equal a[4:6] (cd) b[3:5] (cd) |
| insert a[6:6] () b[5:6] (f) |
| \end{verbatim} |
| \end{methoddesc} |
| |
| \begin{methoddesc}{ratio}{} |
| Return a measure of the sequences' similarity as a float in the |
| range [0, 1]. |
| |
| Where T is the total number of elements in both sequences, and M is |
| the number of matches, this is 2.0*M / T. Note that this is |
| \code{1.0} if the sequences are identical, and \code{0.0} if they |
| have nothing in common. |
| |
| This is expensive to compute if \method{get_matching_blocks()} or |
| \method{get_opcodes()} hasn't already been called, in which case you |
| may want to try \method{quick_ratio()} or |
| \method{real_quick_ratio()} first to get an upper bound. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{quick_ratio}{} |
| Return an upper bound on \method{ratio()} relatively quickly. |
| |
| This isn't defined beyond that it is an upper bound on |
| \method{ratio()}, and is faster to compute. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{real_quick_ratio}{} |
| Return an upper bound on \method{ratio()} very quickly. |
| |
| This isn't defined beyond that it is an upper bound on |
| \method{ratio()}, and is faster to compute than either |
| \method{ratio()} or \method{quick_ratio()}. |
| \end{methoddesc} |
| |
| The three methods that return the ratio of matching to total characters |
| can give different results due to differing levels of approximation, |
| although \method{quick_ratio()} and \method{real_quick_ratio()} are always |
| at least as large as \method{ratio()}: |
| |
| \begin{verbatim} |
| >>> s = SequenceMatcher(None, "abcd", "bcde") |
| >>> s.ratio() |
| 0.75 |
| >>> s.quick_ratio() |
| 0.75 |
| >>> s.real_quick_ratio() |
| 1.0 |
| \end{verbatim} |
| |
| |
| \subsection{SequenceMatcher Examples \label{sequencematcher-examples}} |
| |
| |
| This example compares two strings, considering blanks to be ``junk:'' |
| |
| \begin{verbatim} |
| >>> s = SequenceMatcher(lambda x: x == " ", |
| ... "private Thread currentThread;", |
| ... "private volatile Thread currentThread;") |
| \end{verbatim} |
| |
| \method{ratio()} returns a float in [0, 1], measuring the similarity |
| of the sequences. As a rule of thumb, a \method{ratio()} value over |
| 0.6 means the sequences are close matches: |
| |
| \begin{verbatim} |
| >>> print round(s.ratio(), 3) |
| 0.866 |
| \end{verbatim} |
| |
| If you're only interested in where the sequences match, |
| \method{get_matching_blocks()} is handy: |
| |
| \begin{verbatim} |
| >>> for block in s.get_matching_blocks(): |
| ... print "a[%d] and b[%d] match for %d elements" % block |
| a[0] and b[0] match for 8 elements |
| a[8] and b[17] match for 6 elements |
| a[14] and b[23] match for 15 elements |
| a[29] and b[38] match for 0 elements |
| \end{verbatim} |
| |
| Note that the last tuple returned by \method{get_matching_blocks()} is |
| always a dummy, \code{(len(\var{a}), len(\var{b}), 0)}, and this is |
| the only case in which the last tuple element (number of elements |
| matched) is \code{0}. |
| |
| If you want to know how to change the first sequence into the second, |
| use \method{get_opcodes()}: |
| |
| \begin{verbatim} |
| >>> for opcode in s.get_opcodes(): |
| ... print "%6s a[%d:%d] b[%d:%d]" % opcode |
| equal a[0:8] b[0:8] |
| insert a[8:8] b[8:17] |
| equal a[8:14] b[17:23] |
| equal a[14:29] b[23:38] |
| \end{verbatim} |
| |
| See also the function \function{get_close_matches()} in this module, |
| which shows how simple code building on \class{SequenceMatcher} can be |
| used to do useful work. |
| |
| |
| \subsection{Differ Objects \label{differ-objects}} |
| |
| Note that \class{Differ}-generated deltas make no claim to be |
| \strong{minimal} diffs. To the contrary, minimal diffs are often |
| counter-intuitive, because they synch up anywhere possible, sometimes |
| accidental matches 100 pages apart. Restricting synch points to |
| contiguous matches preserves some notion of locality, at the |
| occasional cost of producing a longer diff. |
| |
| The \class{Differ} class has this constructor: |
| |
| \begin{classdesc}{Differ}{\optional{linejunk\optional{, charjunk}}} |
| Optional keyword parameters \var{linejunk} and \var{charjunk} are |
| for filter functions (or \code{None}): |
| |
| \var{linejunk}: A function that should accept a single string |
| argument, and return true if the string is junk. The default is |
| module-level function \function{IS_LINE_JUNK()}, which filters out |
| lines without visible characters, except for at most one pound |
| character (\character{\#}). |
| |
| \var{charjunk}: A function that should accept a string of length 1. |
| The default is module-level function \function{IS_CHARACTER_JUNK()}, |
| which filters out whitespace characters (a blank or tab; note: bad |
| idea to include newline in this!). |
| \end{classdesc} |
| |
| \class{Differ} objects are used (deltas generated) via a single |
| method: |
| |
| \begin{methoddesc}{compare}{a, b} |
| Compare two sequences of lines, and generate the delta (a sequence |
| of lines). |
| |
| Each sequence must contain individual single-line strings ending |
| with newlines. Such sequences can be obtained from the |
| \method{readlines()} method of file-like objects. The delta generated |
| also consists of newline-terminated strings, ready to be printed as-is |
| via the \method{writelines()} method of a file-like object. |
| \end{methoddesc} |
| |
| |
| \subsection{Differ Example \label{differ-examples}} |
| |
| This example compares two texts. First we set up the texts, sequences |
| of individual single-line strings ending with newlines (such sequences |
| can also be obtained from the \method{readlines()} method of file-like |
| objects): |
| |
| \begin{verbatim} |
| >>> text1 = ''' 1. Beautiful is better than ugly. |
| ... 2. Explicit is better than implicit. |
| ... 3. Simple is better than complex. |
| ... 4. Complex is better than complicated. |
| ... '''.splitlines(1) |
| >>> len(text1) |
| 4 |
| >>> text1[0][-1] |
| '\n' |
| >>> text2 = ''' 1. Beautiful is better than ugly. |
| ... 3. Simple is better than complex. |
| ... 4. Complicated is better than complex. |
| ... 5. Flat is better than nested. |
| ... '''.splitlines(1) |
| \end{verbatim} |
| |
| Next we instantiate a Differ object: |
| |
| \begin{verbatim} |
| >>> d = Differ() |
| \end{verbatim} |
| |
| Note that when instantiating a \class{Differ} object we may pass |
| functions to filter out line and character ``junk.'' See the |
| \method{Differ()} constructor for details. |
| |
| Finally, we compare the two: |
| |
| \begin{verbatim} |
| >>> result = list(d.compare(text1, text2)) |
| \end{verbatim} |
| |
| \code{result} is a list of strings, so let's pretty-print it: |
| |
| \begin{verbatim} |
| >>> from pprint import pprint |
| >>> pprint(result) |
| [' 1. Beautiful is better than ugly.\n', |
| '- 2. Explicit is better than implicit.\n', |
| '- 3. Simple is better than complex.\n', |
| '+ 3. Simple is better than complex.\n', |
| '? ++ \n', |
| '- 4. Complex is better than complicated.\n', |
| '? ^ ---- ^ \n', |
| '+ 4. Complicated is better than complex.\n', |
| '? ++++ ^ ^ \n', |
| '+ 5. Flat is better than nested.\n'] |
| \end{verbatim} |
| |
| As a single multi-line string it looks like this: |
| |
| \begin{verbatim} |
| >>> import sys |
| >>> sys.stdout.writelines(result) |
| 1. Beautiful is better than ugly. |
| - 2. Explicit is better than implicit. |
| - 3. Simple is better than complex. |
| + 3. Simple is better than complex. |
| ? ++ |
| - 4. Complex is better than complicated. |
| ? ^ ---- ^ |
| + 4. Complicated is better than complex. |
| ? ++++ ^ ^ |
| + 5. Flat is better than nested. |
| \end{verbatim} |