Issue 21469: Mitigate risk of false positives with robotparser.

* Repair the broken link to norobots-rfc.txt.

* HTTP response codes >= 500 are treated as a failed read rather than as a
  not found. Not found means that we can assume the entire site is allowed.
  A 5xx server error tells us nothing.

* A successful read() or parse() updates the mtime (which is defined to be
  "the time the robots.txt file was last fetched").

* The can_fetch() method returns False unless we've had a read() with a 2xx
  or 4xx response. This avoids false positives in the case where a user
  calls can_fetch() before calling read().

* I don't see any easy way to test this patch without hitting internet
  resources that might change or without use of mock objects that wouldn't
  provide much reassurance.
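
The decision rules in the message above can be modelled without any network access. The following is only a rough sketch under stated assumptions, not the patched module: the class `RobotsStateSketch`, its method `record_fetch()`, and the prefix-based matching are hypothetical names and simplifications introduced here for illustration. Only `can_fetch()`, the mtime idea, and the 2xx/4xx/5xx distinction come from the commit message.

```python
import time


class RobotsStateSketch:
    """Hypothetical model of the fetch-state rules in the commit message.

    This is not the robotparser implementation; it only mirrors the
    2xx / 4xx / 5xx handling described there.
    """

    def __init__(self):
        self.allow_all = False
        self.last_checked = 0          # 0 means "never successfully fetched"
        self.disallowed_prefixes = []  # simplified stand-in for parsed rules

    def record_fetch(self, status, disallowed_prefixes=()):
        """Record the outcome of one attempt to fetch robots.txt."""
        if 200 <= status < 300:
            # Successful read: remember the rules and update the mtime.
            self.disallowed_prefixes = list(disallowed_prefixes)
            self.last_checked = time.time()
        elif 400 <= status < 500:
            # Not found: assume the entire site is allowed.
            # (Simplification: every 4xx code is treated alike here.)
            self.allow_all = True
        # 5xx (or anything else): a failed read that tells us nothing,
        # so the state is left unchanged.

    def can_fetch(self, path):
        """Return True only if a prior 2xx or 4xx response permits it."""
        if self.allow_all:
            return True
        if not self.last_checked:
            # Nothing has been read yet: refuse rather than risk a
            # false positive.
            return False
        return not any(path.startswith(p) for p in self.disallowed_prefixes)


# can_fetch() before any fetch, and after a 5xx, stays False.
unread = RobotsStateSketch()
before_read = unread.can_fetch("/page")
unread.record_fetch(503)
after_server_error = unread.can_fetch("/page")

# A 4xx response means the whole site is assumed to be allowed.
missing = RobotsStateSketch()
missing.record_fetch(404)
after_not_found = missing.can_fetch("/page")

# A 2xx response applies the (simplified) rules and sets the mtime.
fetched = RobotsStateSketch()
fetched.record_fetch(200, disallowed_prefixes=["/private"])
allowed_public = fetched.can_fetch("/public")
allowed_private = fetched.can_fetch("/private/data")
```

In this model a 5xx response leaves the object exactly as it was, so a caller who retries later and receives a 2xx or 4xx gets a definite answer at that point.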

commit: a5413c499702a74fdc50e4bc8e7e6a480856a1f9
author: Raymond Hettinger <python@rcn.com> Mon May 12 22:18:50 2014 -0700
committer: Raymond Hettinger <python@rcn.com> Mon May 12 22:18:50 2014 -0700
tree: 079266511a220614fb6b33699dc27e5102695ae1
parent: c5945966aee2fb3ddd96d7521b245cdb9968afcb
diff --git a/Misc/NEWS b/Misc/NEWS
index 5d3209d..2bda726 100644
--- a/Misc/NEWS
+++ b/Misc/NEWS

@@ -52,6 +52,10 @@
 - Issue #21306: Backport hmac.compare_digest from Python 3. This is part of PEP
   466.
 
+- Issue #21469:  Reduced the risk of false positives in robotparser by
+  checking to make sure that robots.txt has been read or does not exist
+  prior to returning True in can_fetch().
+
 - Issue #21321: itertools.islice() now releases the reference to the source
   iterator when the slice is exhausted.  Patch by Anton Afanasyev.