| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 1 | :mod:`difflib` --- Helpers for computing deltas | 
 | 2 | =============================================== | 
 | 3 |  | 
 | 4 | .. module:: difflib | 
 | 5 |    :synopsis: Helpers for computing differences between objects. | 
 | 6 | .. moduleauthor:: Tim Peters <tim_one@users.sourceforge.net> | 
 | 7 | .. sectionauthor:: Tim Peters <tim_one@users.sourceforge.net> | 
| Christian Heimes | 5b5e81c | 2007-12-31 16:14:33 +0000 | [diff] [blame] | 8 | .. Markup by Fred L. Drake, Jr. <fdrake@acm.org> | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 9 |  | 
| Andrew Kuchling | 2e3743c | 2014-03-19 16:23:01 -0400 | [diff] [blame] | 10 | **Source code:** :source:`Lib/difflib.py` | 
 | 11 |  | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 12 | .. testsetup:: | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 13 |  | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 14 |    import sys | 
 | 15 |    from difflib import * | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 16 |  | 
| Georg Brandl | 9afde1c | 2007-11-01 20:32:30 +0000 | [diff] [blame] | 17 | This module provides classes and functions for comparing sequences. It | 
 | 18 | can be used for example, for comparing files, and can produce difference | 
 | 19 | information in various formats, including HTML and context and unified | 
 | 20 | diffs. For comparing directories and files, see also, the :mod:`filecmp` module. | 
 | 21 |  | 
| Terry Reedy | 99f9637 | 2010-11-25 06:12:34 +0000 | [diff] [blame] | 22 |  | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 23 | .. class:: SequenceMatcher | 
 | 24 |  | 
 | 25 |    This is a flexible class for comparing pairs of sequences of any type, so long | 
| Guido van Rossum | 2cc30da | 2007-11-02 23:46:40 +0000 | [diff] [blame] | 26 |    as the sequence elements are :term:`hashable`.  The basic algorithm predates, and is a | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 27 |    little fancier than, an algorithm published in the late 1980's by Ratcliff and | 
 | 28 |    Obershelp under the hyperbolic name "gestalt pattern matching."  The idea is to | 
 | 29 |    find the longest contiguous matching subsequence that contains no "junk" | 
| Andrew Kuchling | c51da2b | 2014-03-19 16:43:06 -0400 | [diff] [blame] | 30 |    elements; these "junk" elements are ones that are uninteresting in some | 
 | 31 |    sense, such as blank lines or whitespace.  (Handling junk is an | 
 | 32 |    extension to the Ratcliff and Obershelp algorithm.) The same | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 33 |    idea is then applied recursively to the pieces of the sequences to the left and | 
 | 34 |    to the right of the matching subsequence.  This does not yield minimal edit | 
 | 35 |    sequences, but does tend to yield matches that "look right" to people. | 
 | 36 |  | 
 | 37 |    **Timing:** The basic Ratcliff-Obershelp algorithm is cubic time in the worst | 
 | 38 |    case and quadratic time in the expected case. :class:`SequenceMatcher` is | 
 | 39 |    quadratic time for the worst case and has expected-case behavior dependent in a | 
 | 40 |    complicated way on how many elements the sequences have in common; best case | 
 | 41 |    time is linear. | 
 | 42 |  | 
| Terry Reedy | 99f9637 | 2010-11-25 06:12:34 +0000 | [diff] [blame] | 43 |    **Automatic junk heuristic:** :class:`SequenceMatcher` supports a heuristic that | 
 | 44 |    automatically treats certain sequence items as junk. The heuristic counts how many | 
 | 45 |    times each individual item appears in the sequence. If an item's duplicates (after | 
 | 46 |    the first one) account for more than 1% of the sequence and the sequence is at least | 
 | 47 |    200 items long, this item is marked as "popular" and is treated as junk for | 
 | 48 |    the purpose of sequence matching. This heuristic can be turned off by setting | 
 | 49 |    the ``autojunk`` argument to ``False`` when creating the :class:`SequenceMatcher`. | 
 | 50 |  | 
| Terry Reedy | dc9b17d | 2010-11-27 20:52:14 +0000 | [diff] [blame] | 51 |    .. versionadded:: 3.2 | 
 | 52 |       The *autojunk* parameter. | 
 | 53 |  | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 54 |  | 
 | 55 | .. class:: Differ | 
 | 56 |  | 
 | 57 |    This is a class for comparing sequences of lines of text, and producing | 
 | 58 |    human-readable differences or deltas.  Differ uses :class:`SequenceMatcher` | 
 | 59 |    both to compare sequences of lines, and to compare sequences of characters | 
 | 60 |    within similar (near-matching) lines. | 
 | 61 |  | 
 | 62 |    Each line of a :class:`Differ` delta begins with a two-letter code: | 
 | 63 |  | 
 | 64 |    +----------+-------------------------------------------+ | 
 | 65 |    | Code     | Meaning                                   | | 
 | 66 |    +==========+===========================================+ | 
 | 67 |    | ``'- '`` | line unique to sequence 1                 | | 
 | 68 |    +----------+-------------------------------------------+ | 
 | 69 |    | ``'+ '`` | line unique to sequence 2                 | | 
 | 70 |    +----------+-------------------------------------------+ | 
 | 71 |    | ``'  '`` | line common to both sequences             | | 
 | 72 |    +----------+-------------------------------------------+ | 
 | 73 |    | ``'? '`` | line not present in either input sequence | | 
 | 74 |    +----------+-------------------------------------------+ | 
 | 75 |  | 
 | 76 |    Lines beginning with '``?``' attempt to guide the eye to intraline differences, | 
 | 77 |    and were not present in either input sequence. These lines can be confusing if | 
 | 78 |    the sequences contain tab characters. | 
 | 79 |  | 
 | 80 |  | 
 | 81 | .. class:: HtmlDiff | 
 | 82 |  | 
 | 83 |    This class can be used to create an HTML table (or a complete HTML file | 
 | 84 |    containing the table) showing a side by side, line by line comparison of text | 
 | 85 |    with inter-line and intra-line change highlights.  The table can be generated in | 
 | 86 |    either full or contextual difference mode. | 
 | 87 |  | 
 | 88 |    The constructor for this class is: | 
 | 89 |  | 
 | 90 |  | 
| Georg Brandl | c2a4f4f | 2009-04-10 09:03:43 +0000 | [diff] [blame] | 91 |    .. method:: __init__(tabsize=8, wrapcolumn=None, linejunk=None, charjunk=IS_CHARACTER_JUNK) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 92 |  | 
 | 93 |       Initializes instance of :class:`HtmlDiff`. | 
 | 94 |  | 
 | 95 |       *tabsize* is an optional keyword argument to specify tab stop spacing and | 
 | 96 |       defaults to ``8``. | 
 | 97 |  | 
 | 98 |       *wrapcolumn* is an optional keyword to specify column number where lines are | 
 | 99 |       broken and wrapped, defaults to ``None`` where lines are not wrapped. | 
 | 100 |  | 
 | 101 |       *linejunk* and *charjunk* are optional keyword arguments passed into ``ndiff()`` | 
 | 102 |       (used by :class:`HtmlDiff` to generate the side by side HTML differences).  See | 
 | 103 |       ``ndiff()`` documentation for argument default values and descriptions. | 
 | 104 |  | 
 | 105 |    The following methods are public: | 
 | 106 |  | 
| Berker Peksag | 102029d | 2015-03-15 01:18:47 +0200 | [diff] [blame] | 107 |    .. method:: make_file(fromlines, tolines, fromdesc='', todesc='', context=False, \ | 
 | 108 |                          numlines=5, *, charset='utf-8') | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 109 |  | 
 | 110 |       Compares *fromlines* and *tolines* (lists of strings) and returns a string which | 
 | 111 |       is a complete HTML file containing a table showing line by line differences with | 
 | 112 |       inter-line and intra-line changes highlighted. | 
 | 113 |  | 
 | 114 |       *fromdesc* and *todesc* are optional keyword arguments to specify from/to file | 
 | 115 |       column header strings (both default to an empty string). | 
 | 116 |  | 
 | 117 |       *context* and *numlines* are both optional keyword arguments. Set *context* to | 
 | 118 |       ``True`` when contextual differences are to be shown, else the default is | 
 | 119 |       ``False`` to show the full files. *numlines* defaults to ``5``.  When *context* | 
 | 120 |       is ``True`` *numlines* controls the number of context lines which surround the | 
 | 121 |       difference highlights.  When *context* is ``False`` *numlines* controls the | 
 | 122 |       number of lines which are shown before a difference highlight when using the | 
 | 123 |       "next" hyperlinks (setting to zero would cause the "next" hyperlinks to place | 
 | 124 |       the next difference highlight at the top of the browser without any leading | 
 | 125 |       context). | 
 | 126 |  | 
| Berker Peksag | 102029d | 2015-03-15 01:18:47 +0200 | [diff] [blame] | 127 |       .. versionchanged:: 3.5 | 
 | 128 |          *charset* keyword-only argument was added.  The default charset of | 
 | 129 |          HTML document changed from ``'ISO-8859-1'`` to ``'utf-8'``. | 
 | 130 |  | 
| Georg Brandl | c2a4f4f | 2009-04-10 09:03:43 +0000 | [diff] [blame] | 131 |    .. method:: make_table(fromlines, tolines, fromdesc='', todesc='', context=False, numlines=5) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 132 |  | 
 | 133 |       Compares *fromlines* and *tolines* (lists of strings) and returns a string which | 
 | 134 |       is a complete HTML table showing line by line differences with inter-line and | 
 | 135 |       intra-line changes highlighted. | 
 | 136 |  | 
 | 137 |       The arguments for this method are the same as those for the :meth:`make_file` | 
 | 138 |       method. | 
 | 139 |  | 
 | 140 |    :file:`Tools/scripts/diff.py` is a command-line front-end to this class and | 
 | 141 |    contains a good example of its use. | 
 | 142 |  | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 143 |  | 
| Georg Brandl | c2a4f4f | 2009-04-10 09:03:43 +0000 | [diff] [blame] | 144 | .. function:: context_diff(a, b, fromfile='', tofile='', fromfiledate='', tofiledate='', n=3, lineterm='\\n') | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 145 |  | 
| Georg Brandl | 9afde1c | 2007-11-01 20:32:30 +0000 | [diff] [blame] | 146 |    Compare *a* and *b* (lists of strings); return a delta (a :term:`generator` | 
 | 147 |    generating the delta lines) in context diff format. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 148 |  | 
 | 149 |    Context diffs are a compact way of showing just the lines that have changed plus | 
 | 150 |    a few lines of context.  The changes are shown in a before/after style.  The | 
 | 151 |    number of context lines is set by *n* which defaults to three. | 
 | 152 |  | 
 | 153 |    By default, the diff control lines (those with ``***`` or ``---``) are created | 
 | 154 |    with a trailing newline.  This is helpful so that inputs created from | 
| Serhiy Storchaka | bfdcd43 | 2013-10-13 23:09:14 +0300 | [diff] [blame] | 155 |    :func:`io.IOBase.readlines` result in diffs that are suitable for use with | 
 | 156 |    :func:`io.IOBase.writelines` since both the inputs and outputs have trailing | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 157 |    newlines. | 
 | 158 |  | 
 | 159 |    For inputs that do not have trailing newlines, set the *lineterm* argument to | 
 | 160 |    ``""`` so that the output will be uniformly newline free. | 
 | 161 |  | 
 | 162 |    The context diff format normally has a header for filenames and modification | 
 | 163 |    times.  Any or all of these may be specified using strings for *fromfile*, | 
| R. David Murray | b2416e5 | 2010-04-12 16:58:02 +0000 | [diff] [blame] | 164 |    *tofile*, *fromfiledate*, and *tofiledate*.  The modification times are normally | 
 | 165 |    expressed in the ISO 8601 format. If not specified, the | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 166 |    strings default to blanks. | 
 | 167 |  | 
| Christian Heimes | 8640e74 | 2008-02-23 16:23:06 +0000 | [diff] [blame] | 168 |       >>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n'] | 
 | 169 |       >>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n'] | 
 | 170 |       >>> for line in context_diff(s1, s2, fromfile='before.py', tofile='after.py'): | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 171 |       ...     sys.stdout.write(line)  # doctest: +NORMALIZE_WHITESPACE | 
| Christian Heimes | 8640e74 | 2008-02-23 16:23:06 +0000 | [diff] [blame] | 172 |       *** before.py | 
 | 173 |       --- after.py | 
 | 174 |       *************** | 
 | 175 |       *** 1,4 **** | 
 | 176 |       ! bacon | 
 | 177 |       ! eggs | 
 | 178 |       ! ham | 
 | 179 |         guido | 
 | 180 |       --- 1,4 ---- | 
 | 181 |       ! python | 
 | 182 |       ! eggy | 
 | 183 |       ! hamster | 
 | 184 |         guido | 
 | 185 |  | 
 | 186 |    See :ref:`difflib-interface` for a more detailed example. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 187 |  | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 188 |  | 
| Georg Brandl | c2a4f4f | 2009-04-10 09:03:43 +0000 | [diff] [blame] | 189 | .. function:: get_close_matches(word, possibilities, n=3, cutoff=0.6) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 190 |  | 
 | 191 |    Return a list of the best "good enough" matches.  *word* is a sequence for which | 
 | 192 |    close matches are desired (typically a string), and *possibilities* is a list of | 
 | 193 |    sequences against which to match *word* (typically a list of strings). | 
 | 194 |  | 
 | 195 |    Optional argument *n* (default ``3``) is the maximum number of close matches to | 
 | 196 |    return; *n* must be greater than ``0``. | 
 | 197 |  | 
 | 198 |    Optional argument *cutoff* (default ``0.6``) is a float in the range [0, 1]. | 
 | 199 |    Possibilities that don't score at least that similar to *word* are ignored. | 
 | 200 |  | 
 | 201 |    The best (no more than *n*) matches among the possibilities are returned in a | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 202 |    list, sorted by similarity score, most similar first. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 203 |  | 
 | 204 |       >>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy']) | 
 | 205 |       ['apple', 'ape'] | 
 | 206 |       >>> import keyword | 
 | 207 |       >>> get_close_matches('wheel', keyword.kwlist) | 
 | 208 |       ['while'] | 
 | 209 |       >>> get_close_matches('apple', keyword.kwlist) | 
 | 210 |       [] | 
 | 211 |       >>> get_close_matches('accept', keyword.kwlist) | 
 | 212 |       ['except'] | 
 | 213 |  | 
 | 214 |  | 
| Georg Brandl | c2a4f4f | 2009-04-10 09:03:43 +0000 | [diff] [blame] | 215 | .. function:: ndiff(a, b, linejunk=None, charjunk=IS_CHARACTER_JUNK) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 216 |  | 
| Georg Brandl | 9afde1c | 2007-11-01 20:32:30 +0000 | [diff] [blame] | 217 |    Compare *a* and *b* (lists of strings); return a :class:`Differ`\ -style | 
 | 218 |    delta (a :term:`generator` generating the delta lines). | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 219 |  | 
| Andrew Kuchling | c51da2b | 2014-03-19 16:43:06 -0400 | [diff] [blame] | 220 |    Optional keyword parameters *linejunk* and *charjunk* are filtering functions | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 221 |    (or ``None``): | 
 | 222 |  | 
| Georg Brandl | e6bcc91 | 2008-05-12 18:05:20 +0000 | [diff] [blame] | 223 |    *linejunk*: A function that accepts a single string argument, and returns | 
 | 224 |    true if the string is junk, or false if not. The default is ``None``. There | 
 | 225 |    is also a module-level function :func:`IS_LINE_JUNK`, which filters out lines | 
 | 226 |    without visible characters, except for at most one pound character (``'#'``) | 
 | 227 |    -- however the underlying :class:`SequenceMatcher` class does a dynamic | 
 | 228 |    analysis of which lines are so frequent as to constitute noise, and this | 
 | 229 |    usually works better than using this function. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 230 |  | 
 | 231 |    *charjunk*: A function that accepts a character (a string of length 1), and | 
 | 232 |    returns if the character is junk, or false if not. The default is module-level | 
 | 233 |    function :func:`IS_CHARACTER_JUNK`, which filters out whitespace characters (a | 
| Andrew Kuchling | c51da2b | 2014-03-19 16:43:06 -0400 | [diff] [blame] | 234 |    blank or tab; it's a bad idea to include newline in this!). | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 235 |  | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 236 |    :file:`Tools/scripts/ndiff.py` is a command-line front-end to this function. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 237 |  | 
| Terry Jan Reedy | bddecc3 | 2014-04-18 17:00:19 -0400 | [diff] [blame] | 238 |       >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True), | 
 | 239 |       ...              'ore\ntree\nemu\n'.splitlines(keepends=True)) | 
| Georg Brandl | 6911e3c | 2007-09-04 07:15:32 +0000 | [diff] [blame] | 240 |       >>> print(''.join(diff), end="") | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 241 |       - one | 
 | 242 |       ?  ^ | 
 | 243 |       + ore | 
 | 244 |       ?  ^ | 
 | 245 |       - two | 
 | 246 |       - three | 
 | 247 |       ?  - | 
 | 248 |       + tree | 
 | 249 |       + emu | 
 | 250 |  | 
 | 251 |  | 
 | 252 | .. function:: restore(sequence, which) | 
 | 253 |  | 
 | 254 |    Return one of the two sequences that generated a delta. | 
 | 255 |  | 
 | 256 |    Given a *sequence* produced by :meth:`Differ.compare` or :func:`ndiff`, extract | 
 | 257 |    lines originating from file 1 or 2 (parameter *which*), stripping off line | 
 | 258 |    prefixes. | 
 | 259 |  | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 260 |    Example: | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 261 |  | 
| Terry Jan Reedy | bddecc3 | 2014-04-18 17:00:19 -0400 | [diff] [blame] | 262 |       >>> diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True), | 
 | 263 |       ...              'ore\ntree\nemu\n'.splitlines(keepends=True)) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 264 |       >>> diff = list(diff) # materialize the generated delta into a list | 
| Georg Brandl | 6911e3c | 2007-09-04 07:15:32 +0000 | [diff] [blame] | 265 |       >>> print(''.join(restore(diff, 1)), end="") | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 266 |       one | 
 | 267 |       two | 
 | 268 |       three | 
| Georg Brandl | 6911e3c | 2007-09-04 07:15:32 +0000 | [diff] [blame] | 269 |       >>> print(''.join(restore(diff, 2)), end="") | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 270 |       ore | 
 | 271 |       tree | 
 | 272 |       emu | 
 | 273 |  | 
 | 274 |  | 
| Georg Brandl | c2a4f4f | 2009-04-10 09:03:43 +0000 | [diff] [blame] | 275 | .. function:: unified_diff(a, b, fromfile='', tofile='', fromfiledate='', tofiledate='', n=3, lineterm='\\n') | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 276 |  | 
| Georg Brandl | 9afde1c | 2007-11-01 20:32:30 +0000 | [diff] [blame] | 277 |    Compare *a* and *b* (lists of strings); return a delta (a :term:`generator` | 
 | 278 |    generating the delta lines) in unified diff format. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 279 |  | 
 | 280 |    Unified diffs are a compact way of showing just the lines that have changed plus | 
 | 281 |    a few lines of context.  The changes are shown in a inline style (instead of | 
 | 282 |    separate before/after blocks).  The number of context lines is set by *n* which | 
 | 283 |    defaults to three. | 
 | 284 |  | 
 | 285 |    By default, the diff control lines (those with ``---``, ``+++``, or ``@@``) are | 
 | 286 |    created with a trailing newline.  This is helpful so that inputs created from | 
| Serhiy Storchaka | bfdcd43 | 2013-10-13 23:09:14 +0300 | [diff] [blame] | 287 |    :func:`io.IOBase.readlines` result in diffs that are suitable for use with | 
 | 288 |    :func:`io.IOBase.writelines` since both the inputs and outputs have trailing | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 289 |    newlines. | 
 | 290 |  | 
 | 291 |    For inputs that do not have trailing newlines, set the *lineterm* argument to | 
 | 292 |    ``""`` so that the output will be uniformly newline free. | 
 | 293 |  | 
 | 294 |    The context diff format normally has a header for filenames and modification | 
 | 295 |    times.  Any or all of these may be specified using strings for *fromfile*, | 
| R. David Murray | b2416e5 | 2010-04-12 16:58:02 +0000 | [diff] [blame] | 296 |    *tofile*, *fromfiledate*, and *tofiledate*.  The modification times are normally | 
 | 297 |    expressed in the ISO 8601 format. If not specified, the | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 298 |    strings default to blanks. | 
 | 299 |  | 
| Christian Heimes | 8640e74 | 2008-02-23 16:23:06 +0000 | [diff] [blame] | 300 |  | 
 | 301 |       >>> s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n'] | 
 | 302 |       >>> s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n'] | 
 | 303 |       >>> for line in unified_diff(s1, s2, fromfile='before.py', tofile='after.py'): | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 304 |       ...     sys.stdout.write(line)   # doctest: +NORMALIZE_WHITESPACE | 
| Christian Heimes | 8640e74 | 2008-02-23 16:23:06 +0000 | [diff] [blame] | 305 |       --- before.py | 
 | 306 |       +++ after.py | 
 | 307 |       @@ -1,4 +1,4 @@ | 
 | 308 |       -bacon | 
 | 309 |       -eggs | 
 | 310 |       -ham | 
 | 311 |       +python | 
 | 312 |       +eggy | 
 | 313 |       +hamster | 
 | 314 |        guido | 
 | 315 |  | 
 | 316 |    See :ref:`difflib-interface` for a more detailed example. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 317 |  | 
| Greg Ward | 4d9d256 | 2015-04-20 20:21:21 -0400 | [diff] [blame] | 318 | .. function:: diff_bytes(dfunc, a, b, fromfile=b'', tofile=b'', fromfiledate=b'', tofiledate=b'', n=3, lineterm=b'\\n') | 
 | 319 |  | 
 | 320 |    Compare *a* and *b* (lists of bytes objects) using *dfunc*; yield a | 
 | 321 |    sequence of delta lines (also bytes) in the format returned by *dfunc*. | 
 | 322 |    *dfunc* must be a callable, typically either :func:`unified_diff` or | 
 | 323 |    :func:`context_diff`. | 
 | 324 |  | 
 | 325 |    Allows you to compare data with unknown or inconsistent encoding. All | 
 | 326 |    inputs except *n* must be bytes objects, not str. Works by losslessly | 
 | 327 |    converting all inputs (except *n*) to str, and calling ``dfunc(a, b, | 
 | 328 |    fromfile, tofile, fromfiledate, tofiledate, n, lineterm)``. The output of | 
 | 329 |    *dfunc* is then converted back to bytes, so the delta lines that you | 
 | 330 |    receive have the same unknown/inconsistent encodings as *a* and *b*. | 
 | 331 |  | 
 | 332 |    .. versionadded:: 3.5 | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 333 |  | 
 | 334 | .. function:: IS_LINE_JUNK(line) | 
 | 335 |  | 
 | 336 |    Return true for ignorable lines.  The line *line* is ignorable if *line* is | 
 | 337 |    blank or contains a single ``'#'``, otherwise it is not ignorable.  Used as a | 
| Georg Brandl | e6bcc91 | 2008-05-12 18:05:20 +0000 | [diff] [blame] | 338 |    default for parameter *linejunk* in :func:`ndiff` in older versions. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 339 |  | 
 | 340 |  | 
 | 341 | .. function:: IS_CHARACTER_JUNK(ch) | 
 | 342 |  | 
 | 343 |    Return true for ignorable characters.  The character *ch* is ignorable if *ch* | 
 | 344 |    is a space or tab, otherwise it is not ignorable.  Used as a default for | 
 | 345 |    parameter *charjunk* in :func:`ndiff`. | 
 | 346 |  | 
 | 347 |  | 
 | 348 | .. seealso:: | 
 | 349 |  | 
| Georg Brandl | 525d355 | 2014-10-29 10:26:56 +0100 | [diff] [blame] | 350 |    `Pattern Matching: The Gestalt Approach <http://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970>`_ | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 351 |       Discussion of a similar algorithm by John W. Ratcliff and D. E. Metzener. This | 
| Georg Brandl | 525d355 | 2014-10-29 10:26:56 +0100 | [diff] [blame] | 352 |       was published in `Dr. Dobb's Journal <http://www.drdobbs.com/>`_ in July, 1988. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 353 |  | 
 | 354 |  | 
 | 355 | .. _sequence-matcher: | 
 | 356 |  | 
 | 357 | SequenceMatcher Objects | 
 | 358 | ----------------------- | 
 | 359 |  | 
 | 360 | The :class:`SequenceMatcher` class has this constructor: | 
 | 361 |  | 
 | 362 |  | 
| Terry Reedy | 99f9637 | 2010-11-25 06:12:34 +0000 | [diff] [blame] | 363 | .. class:: SequenceMatcher(isjunk=None, a='', b='', autojunk=True) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 364 |  | 
 | 365 |    Optional argument *isjunk* must be ``None`` (the default) or a one-argument | 
 | 366 |    function that takes a sequence element and returns true if and only if the | 
 | 367 |    element is "junk" and should be ignored. Passing ``None`` for *isjunk* is | 
 | 368 |    equivalent to passing ``lambda x: 0``; in other words, no elements are ignored. | 
 | 369 |    For example, pass:: | 
 | 370 |  | 
 | 371 |       lambda x: x in " \t" | 
 | 372 |  | 
 | 373 |    if you're comparing lines as sequences of characters, and don't want to synch up | 
 | 374 |    on blanks or hard tabs. | 
 | 375 |  | 
 | 376 |    The optional arguments *a* and *b* are sequences to be compared; both default to | 
| Guido van Rossum | 2cc30da | 2007-11-02 23:46:40 +0000 | [diff] [blame] | 377 |    empty strings.  The elements of both sequences must be :term:`hashable`. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 378 |  | 
| Terry Reedy | 99f9637 | 2010-11-25 06:12:34 +0000 | [diff] [blame] | 379 |    The optional argument *autojunk* can be used to disable the automatic junk | 
 | 380 |    heuristic. | 
 | 381 |  | 
| Terry Reedy | dc9b17d | 2010-11-27 20:52:14 +0000 | [diff] [blame] | 382 |    .. versionadded:: 3.2 | 
 | 383 |       The *autojunk* parameter. | 
 | 384 |  | 
| Terry Reedy | 74a7c67 | 2010-12-03 18:57:42 +0000 | [diff] [blame] | 385 |    SequenceMatcher objects get three data attributes: *bjunk* is the | 
| Serhiy Storchaka | fbc1c26 | 2013-11-29 12:17:13 +0200 | [diff] [blame] | 386 |    set of elements of *b* for which *isjunk* is ``True``; *bpopular* is the set of | 
| Terry Reedy | 17a5925 | 2010-12-15 20:18:10 +0000 | [diff] [blame] | 387 |    non-junk elements considered popular by the heuristic (if it is not | 
 | 388 |    disabled); *b2j* is a dict mapping the remaining elements of *b* to a list | 
 | 389 |    of positions where they occur. All three are reset whenever *b* is reset | 
 | 390 |    with :meth:`set_seqs` or :meth:`set_seq2`. | 
| Terry Reedy | 74a7c67 | 2010-12-03 18:57:42 +0000 | [diff] [blame] | 391 |  | 
| Georg Brandl | 500be24 | 2010-12-03 19:56:42 +0000 | [diff] [blame] | 392 |    .. versionadded:: 3.2 | 
| Terry Reedy | 74a7c67 | 2010-12-03 18:57:42 +0000 | [diff] [blame] | 393 |       The *bjunk* and *bpopular* attributes. | 
 | 394 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 395 |    :class:`SequenceMatcher` objects have the following methods: | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 396 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 397 |    .. method:: set_seqs(a, b) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 398 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 399 |       Set the two sequences to be compared. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 400 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 401 |    :class:`SequenceMatcher` computes and caches detailed information about the | 
 | 402 |    second sequence, so if you want to compare one sequence against many | 
 | 403 |    sequences, use :meth:`set_seq2` to set the commonly used sequence once and | 
 | 404 |    call :meth:`set_seq1` repeatedly, once for each of the other sequences. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 405 |  | 
 | 406 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 407 |    .. method:: set_seq1(a) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 408 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 409 |       Set the first sequence to be compared.  The second sequence to be compared | 
 | 410 |       is not changed. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 411 |  | 
 | 412 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 413 |    .. method:: set_seq2(b) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 414 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 415 |       Set the second sequence to be compared.  The first sequence to be compared | 
 | 416 |       is not changed. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 417 |  | 
 | 418 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 419 |    .. method:: find_longest_match(alo, ahi, blo, bhi) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 420 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 421 |       Find longest matching block in ``a[alo:ahi]`` and ``b[blo:bhi]``. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 422 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 423 |       If *isjunk* was omitted or ``None``, :meth:`find_longest_match` returns | 
 | 424 |       ``(i, j, k)`` such that ``a[i:i+k]`` is equal to ``b[j:j+k]``, where ``alo | 
 | 425 |       <= i <= i+k <= ahi`` and ``blo <= j <= j+k <= bhi``. For all ``(i', j', | 
 | 426 |       k')`` meeting those conditions, the additional conditions ``k >= k'``, ``i | 
 | 427 |       <= i'``, and if ``i == i'``, ``j <= j'`` are also met. In other words, of | 
 | 428 |       all maximal matching blocks, return one that starts earliest in *a*, and | 
 | 429 |       of all those maximal matching blocks that start earliest in *a*, return | 
 | 430 |       the one that starts earliest in *b*. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 431 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 432 |          >>> s = SequenceMatcher(None, " abcd", "abcd abcd") | 
 | 433 |          >>> s.find_longest_match(0, 5, 0, 9) | 
 | 434 |          Match(a=0, b=4, size=5) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 435 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 436 |       If *isjunk* was provided, first the longest matching block is determined | 
 | 437 |       as above, but with the additional restriction that no junk element appears | 
 | 438 |       in the block.  Then that block is extended as far as possible by matching | 
 | 439 |       (only) junk elements on both sides. So the resulting block never matches | 
 | 440 |       on junk except as identical junk happens to be adjacent to an interesting | 
 | 441 |       match. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 442 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 443 |       Here's the same example as before, but considering blanks to be junk. That | 
 | 444 |       prevents ``' abcd'`` from matching the ``' abcd'`` at the tail end of the | 
 | 445 |       second sequence directly.  Instead only the ``'abcd'`` can match, and | 
 | 446 |       matches the leftmost ``'abcd'`` in the second sequence: | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 447 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 448 |          >>> s = SequenceMatcher(lambda x: x==" ", " abcd", "abcd abcd") | 
 | 449 |          >>> s.find_longest_match(0, 5, 0, 9) | 
 | 450 |          Match(a=1, b=0, size=4) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 451 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 452 |       If no blocks match, this returns ``(alo, blo, 0)``. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 453 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 454 |       This method returns a :term:`named tuple` ``Match(a, b, size)``. | 
| Christian Heimes | 25bb783 | 2008-01-11 16:17:00 +0000 | [diff] [blame] | 455 |  | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 456 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 457 |    .. method:: get_matching_blocks() | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 458 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 459 |       Return list of triples describing matching subsequences. Each triple is of | 
 | 460 |       the form ``(i, j, n)``, and means that ``a[i:i+n] == b[j:j+n]``.  The | 
 | 461 |       triples are monotonically increasing in *i* and *j*. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 462 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 463 |       The last triple is a dummy, and has the value ``(len(a), len(b), 0)``.  It | 
 | 464 |       is the only triple with ``n == 0``.  If ``(i, j, n)`` and ``(i', j', n')`` | 
 | 465 |       are adjacent triples in the list, and the second is not the last triple in | 
 | 466 |       the list, then ``i+n != i'`` or ``j+n != j'``; in other words, adjacent | 
 | 467 |       triples always describe non-adjacent equal blocks. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 468 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 469 |       .. XXX Explain why a dummy is used! | 
| Christian Heimes | 5b5e81c | 2007-12-31 16:14:33 +0000 | [diff] [blame] | 470 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 471 |       .. doctest:: | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 472 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 473 |          >>> s = SequenceMatcher(None, "abxcd", "abcd") | 
 | 474 |          >>> s.get_matching_blocks() | 
 | 475 |          [Match(a=0, b=0, size=2), Match(a=3, b=2, size=2), Match(a=5, b=4, size=0)] | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 476 |  | 
 | 477 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 478 |    .. method:: get_opcodes() | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 479 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 480 |       Return list of 5-tuples describing how to turn *a* into *b*. Each tuple is | 
 | 481 |       of the form ``(tag, i1, i2, j1, j2)``.  The first tuple has ``i1 == j1 == | 
 | 482 |       0``, and remaining tuples have *i1* equal to the *i2* from the preceding | 
 | 483 |       tuple, and, likewise, *j1* equal to the previous *j2*. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 484 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 485 |       The *tag* values are strings, with these meanings: | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 486 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 487 |       +---------------+---------------------------------------------+ | 
 | 488 |       | Value         | Meaning                                     | | 
 | 489 |       +===============+=============================================+ | 
 | 490 |       | ``'replace'`` | ``a[i1:i2]`` should be replaced by          | | 
 | 491 |       |               | ``b[j1:j2]``.                               | | 
 | 492 |       +---------------+---------------------------------------------+ | 
 | 493 |       | ``'delete'``  | ``a[i1:i2]`` should be deleted.  Note that  | | 
 | 494 |       |               | ``j1 == j2`` in this case.                  | | 
 | 495 |       +---------------+---------------------------------------------+ | 
 | 496 |       | ``'insert'``  | ``b[j1:j2]`` should be inserted at          | | 
 | 497 |       |               | ``a[i1:i1]``. Note that ``i1 == i2`` in     | | 
 | 498 |       |               | this case.                                  | | 
 | 499 |       +---------------+---------------------------------------------+ | 
 | 500 |       | ``'equal'``   | ``a[i1:i2] == b[j1:j2]`` (the sub-sequences | | 
 | 501 |       |               | are equal).                                 | | 
 | 502 |       +---------------+---------------------------------------------+ | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 503 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 504 |       For example: | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 505 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 506 |         >>> a = "qabxcd" | 
 | 507 |         >>> b = "abycdf" | 
 | 508 |         >>> s = SequenceMatcher(None, a, b) | 
 | 509 |         >>> for tag, i1, i2, j1, j2 in s.get_opcodes(): | 
| Raymond Hettinger | dbb677a | 2011-04-09 19:41:00 -0700 | [diff] [blame] | 510 |             print('{:7}   a[{}:{}] --> b[{}:{}] {!r:>8} --> {!r}'.format( | 
 | 511 |                 tag, i1, i2, j1, j2, a[i1:i2], b[j1:j2])) | 
 | 512 |  | 
 | 513 |  | 
 | 514 |         delete    a[0:1] --> b[0:0]      'q' --> '' | 
 | 515 |         equal     a[1:3] --> b[0:2]     'ab' --> 'ab' | 
 | 516 |         replace   a[3:4] --> b[2:3]      'x' --> 'y' | 
 | 517 |         equal     a[4:6] --> b[3:5]     'cd' --> 'cd' | 
 | 518 |         insert    a[6:6] --> b[5:6]       '' --> 'f' | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 519 |  | 
 | 520 |  | 
| Georg Brandl | c2a4f4f | 2009-04-10 09:03:43 +0000 | [diff] [blame] | 521 |    .. method:: get_grouped_opcodes(n=3) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 522 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 523 |       Return a :term:`generator` of groups with up to *n* lines of context. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 524 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 525 |       Starting with the groups returned by :meth:`get_opcodes`, this method | 
 | 526 |       splits out smaller change clusters and eliminates intervening ranges which | 
 | 527 |       have no changes. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 528 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 529 |       The groups are returned in the same format as :meth:`get_opcodes`. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 530 |  | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 531 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 532 |    .. method:: ratio() | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 533 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 534 |       Return a measure of the sequences' similarity as a float in the range [0, | 
 | 535 |       1]. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 536 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 537 |       Where T is the total number of elements in both sequences, and M is the | 
 | 538 |       number of matches, this is 2.0\*M / T. Note that this is ``1.0`` if the | 
 | 539 |       sequences are identical, and ``0.0`` if they have nothing in common. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 540 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 541 |       This is expensive to compute if :meth:`get_matching_blocks` or | 
 | 542 |       :meth:`get_opcodes` hasn't already been called, in which case you may want | 
 | 543 |       to try :meth:`quick_ratio` or :meth:`real_quick_ratio` first to get an | 
 | 544 |       upper bound. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 545 |  | 
 | 546 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 547 |    .. method:: quick_ratio() | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 548 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 549 |       Return an upper bound on :meth:`ratio` relatively quickly. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 550 |  | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 551 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 552 |    .. method:: real_quick_ratio() | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 553 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 554 |       Return an upper bound on :meth:`ratio` very quickly. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 555 |  | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 556 |  | 
 | 557 | The three methods that return the ratio of matching to total characters can give | 
 | 558 | different results due to differing levels of approximation, although | 
 | 559 | :meth:`quick_ratio` and :meth:`real_quick_ratio` are always at least as large as | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 560 | :meth:`ratio`: | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 561 |  | 
 | 562 |    >>> s = SequenceMatcher(None, "abcd", "bcde") | 
 | 563 |    >>> s.ratio() | 
 | 564 |    0.75 | 
 | 565 |    >>> s.quick_ratio() | 
 | 566 |    0.75 | 
 | 567 |    >>> s.real_quick_ratio() | 
 | 568 |    1.0 | 
 | 569 |  | 
 | 570 |  | 
 | 571 | .. _sequencematcher-examples: | 
 | 572 |  | 
 | 573 | SequenceMatcher Examples | 
 | 574 | ------------------------ | 
 | 575 |  | 
| Terry Reedy | 74a7c67 | 2010-12-03 18:57:42 +0000 | [diff] [blame] | 576 | This example compares two strings, considering blanks to be "junk": | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 577 |  | 
 | 578 |    >>> s = SequenceMatcher(lambda x: x == " ", | 
 | 579 |    ...                     "private Thread currentThread;", | 
 | 580 |    ...                     "private volatile Thread currentThread;") | 
 | 581 |  | 
 | 582 | :meth:`ratio` returns a float in [0, 1], measuring the similarity of the | 
 | 583 | sequences.  As a rule of thumb, a :meth:`ratio` value over 0.6 means the | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 584 | sequences are close matches: | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 585 |  | 
| Georg Brandl | 6911e3c | 2007-09-04 07:15:32 +0000 | [diff] [blame] | 586 |    >>> print(round(s.ratio(), 3)) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 587 |    0.866 | 
 | 588 |  | 
 | 589 | If you're only interested in where the sequences match, | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 590 | :meth:`get_matching_blocks` is handy: | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 591 |  | 
 | 592 |    >>> for block in s.get_matching_blocks(): | 
| Georg Brandl | 6911e3c | 2007-09-04 07:15:32 +0000 | [diff] [blame] | 593 |    ...     print("a[%d] and b[%d] match for %d elements" % block) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 594 |    a[0] and b[0] match for 8 elements | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 595 |    a[8] and b[17] match for 21 elements | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 596 |    a[29] and b[38] match for 0 elements | 
 | 597 |  | 
 | 598 | Note that the last tuple returned by :meth:`get_matching_blocks` is always a | 
 | 599 | dummy, ``(len(a), len(b), 0)``, and this is the only case in which the last | 
 | 600 | tuple element (number of elements matched) is ``0``. | 
 | 601 |  | 
 | 602 | If you want to know how to change the first sequence into the second, use | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 603 | :meth:`get_opcodes`: | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 604 |  | 
 | 605 |    >>> for opcode in s.get_opcodes(): | 
| Georg Brandl | 6911e3c | 2007-09-04 07:15:32 +0000 | [diff] [blame] | 606 |    ...     print("%6s a[%d:%d] b[%d:%d]" % opcode) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 607 |     equal a[0:8] b[0:8] | 
 | 608 |    insert a[8:8] b[8:17] | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 609 |     equal a[8:29] b[17:38] | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 610 |  | 
| Raymond Hettinger | 58c8c26 | 2009-04-27 21:01:21 +0000 | [diff] [blame] | 611 | .. seealso:: | 
 | 612 |  | 
 | 613 |    * The :func:`get_close_matches` function in this module which shows how | 
 | 614 |      simple code building on :class:`SequenceMatcher` can be used to do useful | 
 | 615 |      work. | 
 | 616 |  | 
 | 617 |    * `Simple version control recipe | 
 | 618 |      <http://code.activestate.com/recipes/576729/>`_ for a small application | 
 | 619 |      built with :class:`SequenceMatcher`. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 620 |  | 
 | 621 |  | 
 | 622 | .. _differ-objects: | 
 | 623 |  | 
 | 624 | Differ Objects | 
 | 625 | -------------- | 
 | 626 |  | 
 | 627 | Note that :class:`Differ`\ -generated deltas make no claim to be **minimal** | 
 | 628 | diffs. To the contrary, minimal diffs are often counter-intuitive, because they | 
 | 629 | synch up anywhere possible, sometimes accidental matches 100 pages apart. | 
 | 630 | Restricting synch points to contiguous matches preserves some notion of | 
 | 631 | locality, at the occasional cost of producing a longer diff. | 
 | 632 |  | 
 | 633 | The :class:`Differ` class has this constructor: | 
 | 634 |  | 
 | 635 |  | 
| Georg Brandl | c2a4f4f | 2009-04-10 09:03:43 +0000 | [diff] [blame] | 636 | .. class:: Differ(linejunk=None, charjunk=None) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 637 |  | 
 | 638 |    Optional keyword parameters *linejunk* and *charjunk* are for filter functions | 
 | 639 |    (or ``None``): | 
 | 640 |  | 
 | 641 |    *linejunk*: A function that accepts a single string argument, and returns true | 
 | 642 |    if the string is junk.  The default is ``None``, meaning that no line is | 
 | 643 |    considered junk. | 
 | 644 |  | 
 | 645 |    *charjunk*: A function that accepts a single character argument (a string of | 
 | 646 |    length 1), and returns true if the character is junk. The default is ``None``, | 
 | 647 |    meaning that no character is considered junk. | 
 | 648 |  | 
| Andrew Kuchling | c51da2b | 2014-03-19 16:43:06 -0400 | [diff] [blame] | 649 |    These junk-filtering functions speed up matching to find | 
 | 650 |    differences and do not cause any differing lines or characters to | 
 | 651 |    be ignored.  Read the description of the | 
 | 652 |    :meth:`~SequenceMatcher.find_longest_match` method's *isjunk* | 
 | 653 |    parameter for an explanation. | 
 | 654 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 655 |    :class:`Differ` objects are used (deltas generated) via a single method: | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 656 |  | 
 | 657 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 658 |    .. method:: Differ.compare(a, b) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 659 |  | 
| Benjamin Peterson | e41251e | 2008-04-25 01:59:09 +0000 | [diff] [blame] | 660 |       Compare two sequences of lines, and generate the delta (a sequence of lines). | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 661 |  | 
| Serhiy Storchaka | bfdcd43 | 2013-10-13 23:09:14 +0300 | [diff] [blame] | 662 |       Each sequence must contain individual single-line strings ending with | 
 | 663 |       newlines.  Such sequences can be obtained from the | 
 | 664 |       :meth:`~io.IOBase.readlines` method of file-like objects.  The delta | 
 | 665 |       generated also consists of newline-terminated strings, ready to be | 
 | 666 |       printed as-is via the :meth:`~io.IOBase.writelines` method of a | 
 | 667 |       file-like object. | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 668 |  | 
 | 669 |  | 
 | 670 | .. _differ-examples: | 
 | 671 |  | 
 | 672 | Differ Example | 
 | 673 | -------------- | 
 | 674 |  | 
 | 675 | This example compares two texts. First we set up the texts, sequences of | 
 | 676 | individual single-line strings ending with newlines (such sequences can also be | 
| Serhiy Storchaka | bfdcd43 | 2013-10-13 23:09:14 +0300 | [diff] [blame] | 677 | obtained from the :meth:`~io.BaseIO.readlines` method of file-like objects): | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 678 |  | 
 | 679 |    >>> text1 = '''  1. Beautiful is better than ugly. | 
 | 680 |    ...   2. Explicit is better than implicit. | 
 | 681 |    ...   3. Simple is better than complex. | 
 | 682 |    ...   4. Complex is better than complicated. | 
| Terry Jan Reedy | bddecc3 | 2014-04-18 17:00:19 -0400 | [diff] [blame] | 683 |    ... '''.splitlines(keepends=True) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 684 |    >>> len(text1) | 
 | 685 |    4 | 
 | 686 |    >>> text1[0][-1] | 
 | 687 |    '\n' | 
 | 688 |    >>> text2 = '''  1. Beautiful is better than ugly. | 
 | 689 |    ...   3.   Simple is better than complex. | 
 | 690 |    ...   4. Complicated is better than complex. | 
 | 691 |    ...   5. Flat is better than nested. | 
| Terry Jan Reedy | bddecc3 | 2014-04-18 17:00:19 -0400 | [diff] [blame] | 692 |    ... '''.splitlines(keepends=True) | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 693 |  | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 694 | Next we instantiate a Differ object: | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 695 |  | 
 | 696 |    >>> d = Differ() | 
 | 697 |  | 
 | 698 | Note that when instantiating a :class:`Differ` object we may pass functions to | 
 | 699 | filter out line and character "junk."  See the :meth:`Differ` constructor for | 
 | 700 | details. | 
 | 701 |  | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 702 | Finally, we compare the two: | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 703 |  | 
 | 704 |    >>> result = list(d.compare(text1, text2)) | 
 | 705 |  | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 706 | ``result`` is a list of strings, so let's pretty-print it: | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 707 |  | 
 | 708 |    >>> from pprint import pprint | 
 | 709 |    >>> pprint(result) | 
 | 710 |    ['    1. Beautiful is better than ugly.\n', | 
 | 711 |     '-   2. Explicit is better than implicit.\n', | 
 | 712 |     '-   3. Simple is better than complex.\n', | 
 | 713 |     '+   3.   Simple is better than complex.\n', | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 714 |     '?     ++\n', | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 715 |     '-   4. Complex is better than complicated.\n', | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 716 |     '?            ^                     ---- ^\n', | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 717 |     '+   4. Complicated is better than complex.\n', | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 718 |     '?           ++++ ^                      ^\n', | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 719 |     '+   5. Flat is better than nested.\n'] | 
 | 720 |  | 
| Christian Heimes | fe337bf | 2008-03-23 21:54:12 +0000 | [diff] [blame] | 721 | As a single multi-line string it looks like this: | 
| Georg Brandl | 116aa62 | 2007-08-15 14:28:22 +0000 | [diff] [blame] | 722 |  | 
 | 723 |    >>> import sys | 
 | 724 |    >>> sys.stdout.writelines(result) | 
 | 725 |        1. Beautiful is better than ugly. | 
 | 726 |    -   2. Explicit is better than implicit. | 
 | 727 |    -   3. Simple is better than complex. | 
 | 728 |    +   3.   Simple is better than complex. | 
 | 729 |    ?     ++ | 
 | 730 |    -   4. Complex is better than complicated. | 
 | 731 |    ?            ^                     ---- ^ | 
 | 732 |    +   4. Complicated is better than complex. | 
 | 733 |    ?           ++++ ^                      ^ | 
 | 734 |    +   5. Flat is better than nested. | 
 | 735 |  | 
| Christian Heimes | 8640e74 | 2008-02-23 16:23:06 +0000 | [diff] [blame] | 736 |  | 
 | 737 | .. _difflib-interface: | 
 | 738 |  | 
 | 739 | A command-line interface to difflib | 
 | 740 | ----------------------------------- | 
 | 741 |  | 
 | 742 | This example shows how to use difflib to create a ``diff``-like utility. | 
 | 743 | It is also contained in the Python source distribution, as | 
 | 744 | :file:`Tools/scripts/diff.py`. | 
 | 745 |  | 
| Berker Peksag | 707deb9 | 2015-07-30 00:03:48 +0300 | [diff] [blame] | 746 | .. literalinclude:: ../../Tools/scripts/diff.py |