Mostly in SequenceMatcher.{__chain_b, find_longest_match}:
This now does a dynamic analysis of which elements are so frequently
repeated as to constitute noise.  The primary benefit is an enormous
speedup in find_longest_match, as the innermost loop can have
hundreds of times fewer potential matches to worry about in cases
where the sequences have many duplicate elements.  In effect,
matching now zooms in on runs of non-ubiquitous elements.

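A minimal sketch of the pruning idea (the 200-element cutoff and the
1% popularity threshold here are illustrative assumptions, not
necessarily the exact numbers used): build the element-to-indices
map for the second sequence, then drop any element so frequent that
scanning its index list would swamp the inner loop.

    from collections import defaultdict

    def index_non_popular(b, min_len=200):
        # Map each element of b to the list of indices where it
        # occurs, omitting "popular" elements (those filling more
        # than 1% of a long sequence).  A matcher consulting this
        # map simply never sees the noise elements.
        b2j = defaultdict(list)
        for i, elt in enumerate(b):
            b2j[elt].append(i)
        n = len(b)
        if n >= min_len:
            b2j = {elt: idxs for elt, idxs in b2j.items()
                   if len(idxs) * 100 <= n}
        return dict(b2j)

For C source compared line by line, it is this kind of pruning that
keeps ubiquitous lines like "}" and "return NULL;" out of the match
search.
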
While I like what I've seen of the effects so far, I still consider
this experimental.  Please give it a try!
diff --git a/Misc/NEWS b/Misc/NEWS
index 5f7e942..903e5b0 100644
--- a/Misc/NEWS
+++ b/Misc/NEWS
@@ -72,9 +72,9 @@
 
 Extension modules
 
-- The bsddb.*open functions can now take 'None' as a filename. 
+- The bsddb.*open functions can now take 'None' as a filename.
   This will create a temporary in-memory bsddb that won't be
-  written to disk. 
+  written to disk.
 
 - posix.mknod was added.
 
@@ -99,6 +99,15 @@
 
 Library
 
+- difflib's SequenceMatcher class now does a dynamic analysis of
+  which elements are so frequent as to constitute noise.  For
+  comparing files as sequences of lines, this generally works better
+  than the IS_LINE_JUNK function, and function ndiff's linejunk
+  argument defaults to None now as a result.  A happy benefit is
+  that SequenceMatcher may run much faster now when applied
+  to large files with many duplicate lines (for example, C program
+  text with lots of repeated "}" and "return NULL;" lines).
+
 - New Text.dump() method in Tkinter module.
 
 - New distutils commands for building packagers were added to