blob: e8d0ed746fe0932c3e7b4a65fe8718d2a936c77f [file] [log] [blame]
Guido van Rossum272b37d1997-01-30 02:44:48 +00001#! /usr/bin/env python
2
Guido van Rossume284b211999-11-17 15:40:08 +00003# Original code by Guido van Rossum; extensive changes by Sam Bayer,
4# including code to check URL fragments.
5
Guido van Rossum272b37d1997-01-30 02:44:48 +00006"""Web tree checker.
7
8This utility is handy to check a subweb of the world-wide web for
9errors. A subweb is specified by giving one or more ``root URLs''; a
10page belongs to the subweb if one of the root URLs is an initial
11prefix of it.
12
13File URL extension:
14
15In order to easy the checking of subwebs via the local file system,
16the interpretation of ``file:'' URLs is extended to mimic the behavior
17of your average HTTP daemon: if a directory pathname is given, the
18file index.html in that directory is returned if it exists, otherwise
19a directory listing is returned. Now, you can point webchecker to the
20document tree in the local file system of your HTTP daemon, and have
21most of it checked. In fact the default works this way if your local
22web tree is located at /usr/local/etc/httpd/htdpcs (the default for
23the NCSA HTTP daemon and probably others).
24
Guido van Rossumaf310c11997-02-02 23:30:32 +000025Report printed:
Guido van Rossum272b37d1997-01-30 02:44:48 +000026
Guido van Rossumaf310c11997-02-02 23:30:32 +000027When done, it reports pages with bad links within the subweb. When
28interrupted, it reports for the pages that it has checked already.
Guido van Rossum272b37d1997-01-30 02:44:48 +000029
30In verbose mode, additional messages are printed during the
31information gathering phase. By default, it prints a summary of its
32work status every 50 URLs (adjustable with the -r option), and it
33reports errors as they are encountered. Use the -q option to disable
34this output.
35
36Checkpoint feature:
37
38Whether interrupted or not, it dumps its state (a Python pickle) to a
39checkpoint file and the -R option allows it to restart from the
40checkpoint (assuming that the pages on the subweb that were already
41processed haven't changed). Even when it has run till completion, -R
42can still be useful -- it will print the reports again, and -Rq prints
43the errors only. In this case, the checkpoint file is not written
44again. The checkpoint file can be set with the -d option.
45
46The checkpoint file is written as a Python pickle. Remember that
47Python's pickle module is currently quite slow. Give it the time it
48needs to load and save the checkpoint file. When interrupted while
49writing the checkpoint file, the old checkpoint file is not
50overwritten, but all work done in the current run is lost.
51
52Miscellaneous:
53
Guido van Rossumaf310c11997-02-02 23:30:32 +000054- You may find the (Tk-based) GUI version easier to use. See wcgui.py.
55
Guido van Rossum3edbb351997-01-30 03:19:41 +000056- Webchecker honors the "robots.txt" convention. Thanks to Skip
57Montanaro for his robotparser.py module (included in this directory)!
58The agent name is hardwired to "webchecker". URLs that are disallowed
59by the robots.txt file are reported as external URLs.
60
Guido van Rossumaf310c11997-02-02 23:30:32 +000061- Because the SGML parser is a bit slow, very large SGML files are
Guido van Rossum3edbb351997-01-30 03:19:41 +000062skipped. The size limit can be set with the -m option.
Guido van Rossum272b37d1997-01-30 02:44:48 +000063
Guido van Rossumaf310c11997-02-02 23:30:32 +000064- When the server or protocol does not tell us a file's type, we guess
65it based on the URL's suffix. The mimetypes.py module (also in this
66directory) has a built-in table mapping most currently known suffixes,
67and in addition attempts to read the mime.types configuration files in
68the default locations of Netscape and the NCSA HTTP daemon.
Guido van Rossum272b37d1997-01-30 02:44:48 +000069
Guido van Rossume284b211999-11-17 15:40:08 +000070- We follow links indicated by <A>, <FRAME> and <IMG> tags. We also
Guido van Rossumaf310c11997-02-02 23:30:32 +000071honor the <BASE> tag.
Guido van Rossum272b37d1997-01-30 02:44:48 +000072
Guido van Rossume284b211999-11-17 15:40:08 +000073- We now check internal NAME anchor links, as well as toplevel links.
74
Guido van Rossumaf310c11997-02-02 23:30:32 +000075- Checking external links is now done by default; use -x to *disable*
76this feature. External links are now checked during normal
77processing. (XXX The status of a checked link could be categorized
78better. Later...)
Guido van Rossum272b37d1997-01-30 02:44:48 +000079
Guido van Rossume284b211999-11-17 15:40:08 +000080- If external links are not checked, you can use the -t flag to
81provide specific overrides to -x.
Guido van Rossum272b37d1997-01-30 02:44:48 +000082
83Usage: webchecker.py [option] ... [rooturl] ...
84
85Options:
86
87-R -- restart from checkpoint file
88-d file -- checkpoint filename (default %(DUMPFILE)s)
89-m bytes -- skip HTML pages larger than this size (default %(MAXPAGE)d)
Guido van Rossume5605ba1997-01-31 14:43:15 +000090-n -- reports only, no checking (use with -R)
Guido van Rossum272b37d1997-01-30 02:44:48 +000091-q -- quiet operation (also suppresses external links report)
92-r number -- number of links processed per round (default %(ROUNDSIZE)d)
Guido van Rossume284b211999-11-17 15:40:08 +000093-t root -- specify root dir which should be treated as internal (can repeat)
Guido van Rossum272b37d1997-01-30 02:44:48 +000094-v -- verbose operation; repeating -v will increase verbosity
Guido van Rossumaf310c11997-02-02 23:30:32 +000095-x -- don't check external links (these are often slow to check)
Guido van Rossume284b211999-11-17 15:40:08 +000096-a -- don't check name anchors
Guido van Rossum272b37d1997-01-30 02:44:48 +000097
98Arguments:
99
100rooturl -- URL to start checking
101 (default %(DEFROOT)s)
102
103"""
104
Guido van Rossume5605ba1997-01-31 14:43:15 +0000105
Guido van Rossum00756bd1998-02-21 20:02:09 +0000106__version__ = "$Revision$"
Guido van Rossum325a64f1997-01-30 03:30:20 +0000107
Guido van Rossum272b37d1997-01-30 02:44:48 +0000108
109import sys
110import os
111from types import *
Guido van Rossum272b37d1997-01-30 02:44:48 +0000112import StringIO
113import getopt
114import pickle
115
116import urllib
117import urlparse
Guido van Rossume5605ba1997-01-31 14:43:15 +0000118import sgmllib
Walter Dörwald88a20ba2002-06-06 17:01:21 +0000119import cgi
Guido van Rossum272b37d1997-01-30 02:44:48 +0000120
121import mimetypes
Guido van Rossum3edbb351997-01-30 03:19:41 +0000122import robotparser
Guido van Rossum272b37d1997-01-30 02:44:48 +0000123
Guido van Rossum00756bd1998-02-21 20:02:09 +0000124# Extract real version number if necessary
125if __version__[0] == '$':
Walter Dörwaldaaab30e2002-09-11 20:36:02 +0000126 _v = __version__.split()
Guido van Rossum00756bd1998-02-21 20:02:09 +0000127 if len(_v) == 3:
Guido van Rossum986abac1998-04-06 14:29:28 +0000128 __version__ = _v[1]
Guido van Rossum00756bd1998-02-21 20:02:09 +0000129
Guido van Rossum272b37d1997-01-30 02:44:48 +0000130
131# Tunable parameters
Guido van Rossum986abac1998-04-06 14:29:28 +0000132DEFROOT = "file:/usr/local/etc/httpd/htdocs/" # Default root URL
133CHECKEXT = 1 # Check external references (1 deep)
134VERBOSE = 1 # Verbosity level (0-3)
135MAXPAGE = 150000 # Ignore files bigger than this
136ROUNDSIZE = 50 # Number of links processed per round
137DUMPFILE = "@webchecker.pickle" # Pickled checkpoint
138AGENTNAME = "webchecker" # Agent name for robots.txt parser
Guido van Rossume284b211999-11-17 15:40:08 +0000139NONAMES = 0 # Force name anchor checking
Guido van Rossum272b37d1997-01-30 02:44:48 +0000140
141
142# Global variables
Guido van Rossum272b37d1997-01-30 02:44:48 +0000143
144
145def main():
Guido van Rossum00756bd1998-02-21 20:02:09 +0000146 checkext = CHECKEXT
147 verbose = VERBOSE
148 maxpage = MAXPAGE
149 roundsize = ROUNDSIZE
Guido van Rossum272b37d1997-01-30 02:44:48 +0000150 dumpfile = DUMPFILE
151 restart = 0
Guido van Rossume5605ba1997-01-31 14:43:15 +0000152 norun = 0
Guido van Rossum272b37d1997-01-30 02:44:48 +0000153
154 try:
Guido van Rossume284b211999-11-17 15:40:08 +0000155 opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:t:vxa')
Guido van Rossum272b37d1997-01-30 02:44:48 +0000156 except getopt.error, msg:
Guido van Rossum986abac1998-04-06 14:29:28 +0000157 sys.stdout = sys.stderr
158 print msg
159 print __doc__%globals()
160 sys.exit(2)
Guido van Rossume284b211999-11-17 15:40:08 +0000161
162 # The extra_roots variable collects extra roots.
163 extra_roots = []
164 nonames = NONAMES
165
Guido van Rossum272b37d1997-01-30 02:44:48 +0000166 for o, a in opts:
Guido van Rossum986abac1998-04-06 14:29:28 +0000167 if o == '-R':
168 restart = 1
169 if o == '-d':
170 dumpfile = a
171 if o == '-m':
Walter Dörwaldaaab30e2002-09-11 20:36:02 +0000172 maxpage = int(a)
Guido van Rossum986abac1998-04-06 14:29:28 +0000173 if o == '-n':
174 norun = 1
175 if o == '-q':
176 verbose = 0
177 if o == '-r':
Walter Dörwaldaaab30e2002-09-11 20:36:02 +0000178 roundsize = int(a)
Guido van Rossume284b211999-11-17 15:40:08 +0000179 if o == '-t':
180 extra_roots.append(a)
181 if o == '-a':
182 nonames = not nonames
Guido van Rossum986abac1998-04-06 14:29:28 +0000183 if o == '-v':
184 verbose = verbose + 1
185 if o == '-x':
186 checkext = not checkext
Guido van Rossum272b37d1997-01-30 02:44:48 +0000187
Guido van Rossume5605ba1997-01-31 14:43:15 +0000188 if verbose > 0:
Guido van Rossum986abac1998-04-06 14:29:28 +0000189 print AGENTNAME, "version", __version__
Guido van Rossum325a64f1997-01-30 03:30:20 +0000190
Guido van Rossum272b37d1997-01-30 02:44:48 +0000191 if restart:
Guido van Rossum986abac1998-04-06 14:29:28 +0000192 c = load_pickle(dumpfile=dumpfile, verbose=verbose)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000193 else:
Guido van Rossum986abac1998-04-06 14:29:28 +0000194 c = Checker()
Guido van Rossum00756bd1998-02-21 20:02:09 +0000195
196 c.setflags(checkext=checkext, verbose=verbose,
Guido van Rossume284b211999-11-17 15:40:08 +0000197 maxpage=maxpage, roundsize=roundsize,
198 nonames=nonames
199 )
Guido van Rossum00756bd1998-02-21 20:02:09 +0000200
201 if not restart and not args:
Guido van Rossum986abac1998-04-06 14:29:28 +0000202 args.append(DEFROOT)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000203
204 for arg in args:
Guido van Rossum986abac1998-04-06 14:29:28 +0000205 c.addroot(arg)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000206
Guido van Rossume284b211999-11-17 15:40:08 +0000207 # The -t flag is only needed if external links are not to be
208 # checked. So -t values are ignored unless -x was specified.
209 if not checkext:
210 for root in extra_roots:
211 # Make sure it's terminated by a slash,
212 # so that addroot doesn't discard the last
213 # directory component.
214 if root[-1] != "/":
215 root = root + "/"
216 c.addroot(root, add_to_do = 0)
217
Guido van Rossumbee64531998-04-27 19:35:15 +0000218 try:
219
220 if not norun:
221 try:
222 c.run()
223 except KeyboardInterrupt:
224 if verbose > 0:
225 print "[run interrupted]"
226
Guido van Rossum986abac1998-04-06 14:29:28 +0000227 try:
Guido van Rossumbee64531998-04-27 19:35:15 +0000228 c.report()
Guido van Rossum986abac1998-04-06 14:29:28 +0000229 except KeyboardInterrupt:
230 if verbose > 0:
Guido van Rossumbee64531998-04-27 19:35:15 +0000231 print "[report interrupted]"
Guido van Rossume5605ba1997-01-31 14:43:15 +0000232
Guido van Rossumbee64531998-04-27 19:35:15 +0000233 finally:
234 if c.save_pickle(dumpfile):
235 if dumpfile == DUMPFILE:
236 print "Use ``%s -R'' to restart." % sys.argv[0]
237 else:
238 print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
239 dumpfile)
Guido van Rossum00756bd1998-02-21 20:02:09 +0000240
241
242def load_pickle(dumpfile=DUMPFILE, verbose=VERBOSE):
243 if verbose > 0:
Guido van Rossum986abac1998-04-06 14:29:28 +0000244 print "Loading checkpoint from %s ..." % dumpfile
Guido van Rossum00756bd1998-02-21 20:02:09 +0000245 f = open(dumpfile, "rb")
246 c = pickle.load(f)
247 f.close()
248 if verbose > 0:
Guido van Rossum986abac1998-04-06 14:29:28 +0000249 print "Done."
Walter Dörwaldaaab30e2002-09-11 20:36:02 +0000250 print "Root:", "\n ".join(c.roots)
Guido van Rossum00756bd1998-02-21 20:02:09 +0000251 return c
Guido van Rossum272b37d1997-01-30 02:44:48 +0000252
253
254class Checker:
255
Guido van Rossum00756bd1998-02-21 20:02:09 +0000256 checkext = CHECKEXT
257 verbose = VERBOSE
258 maxpage = MAXPAGE
259 roundsize = ROUNDSIZE
Guido van Rossume284b211999-11-17 15:40:08 +0000260 nonames = NONAMES
Guido van Rossum00756bd1998-02-21 20:02:09 +0000261
262 validflags = tuple(dir())
263
264 def __init__(self):
Guido van Rossum986abac1998-04-06 14:29:28 +0000265 self.reset()
Guido van Rossum00756bd1998-02-21 20:02:09 +0000266
267 def setflags(self, **kw):
Guido van Rossum986abac1998-04-06 14:29:28 +0000268 for key in kw.keys():
269 if key not in self.validflags:
270 raise NameError, "invalid keyword argument: %s" % str(key)
271 for key, value in kw.items():
272 setattr(self, key, value)
Guido van Rossumaf310c11997-02-02 23:30:32 +0000273
274 def reset(self):
Guido van Rossum986abac1998-04-06 14:29:28 +0000275 self.roots = []
276 self.todo = {}
277 self.done = {}
278 self.bad = {}
Guido van Rossume284b211999-11-17 15:40:08 +0000279
280 # Add a name table, so that the name URLs can be checked. Also
281 # serves as an implicit cache for which URLs are done.
282 self.name_table = {}
283
Guido van Rossum986abac1998-04-06 14:29:28 +0000284 self.round = 0
285 # The following are not pickled:
286 self.robots = {}
287 self.errors = {}
288 self.urlopener = MyURLopener()
289 self.changed = 0
Guido van Rossume284b211999-11-17 15:40:08 +0000290
Guido van Rossum125700a1998-07-08 03:04:39 +0000291 def note(self, level, format, *args):
292 if self.verbose > level:
293 if args:
294 format = format%args
295 self.message(format)
Guido van Rossume284b211999-11-17 15:40:08 +0000296
Guido van Rossum125700a1998-07-08 03:04:39 +0000297 def message(self, format, *args):
298 if args:
299 format = format%args
300 print format
Guido van Rossum3edbb351997-01-30 03:19:41 +0000301
302 def __getstate__(self):
Guido van Rossum986abac1998-04-06 14:29:28 +0000303 return (self.roots, self.todo, self.done, self.bad, self.round)
Guido van Rossum3edbb351997-01-30 03:19:41 +0000304
305 def __setstate__(self, state):
Guido van Rossum986abac1998-04-06 14:29:28 +0000306 self.reset()
307 (self.roots, self.todo, self.done, self.bad, self.round) = state
308 for root in self.roots:
309 self.addrobot(root)
310 for url in self.bad.keys():
311 self.markerror(url)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000312
Guido van Rossume284b211999-11-17 15:40:08 +0000313 def addroot(self, root, add_to_do = 1):
Guido van Rossum986abac1998-04-06 14:29:28 +0000314 if root not in self.roots:
315 troot = root
316 scheme, netloc, path, params, query, fragment = \
317 urlparse.urlparse(root)
Walter Dörwaldaaab30e2002-09-11 20:36:02 +0000318 i = path.rfind("/") + 1
Guido van Rossum986abac1998-04-06 14:29:28 +0000319 if 0 < i < len(path):
320 path = path[:i]
321 troot = urlparse.urlunparse((scheme, netloc, path,
322 params, query, fragment))
323 self.roots.append(troot)
324 self.addrobot(root)
Guido van Rossume284b211999-11-17 15:40:08 +0000325 if add_to_do:
326 self.newlink((root, ""), ("<root>", root))
Guido van Rossum3edbb351997-01-30 03:19:41 +0000327
328 def addrobot(self, root):
Guido van Rossum986abac1998-04-06 14:29:28 +0000329 root = urlparse.urljoin(root, "/")
330 if self.robots.has_key(root): return
331 url = urlparse.urljoin(root, "/robots.txt")
332 self.robots[root] = rp = robotparser.RobotFileParser()
Guido van Rossum125700a1998-07-08 03:04:39 +0000333 self.note(2, "Parsing %s", url)
334 rp.debug = self.verbose > 3
Guido van Rossum986abac1998-04-06 14:29:28 +0000335 rp.set_url(url)
336 try:
337 rp.read()
Guido van Rossumf0953b92001-12-11 22:41:24 +0000338 except (OSError, IOError), msg:
Guido van Rossum125700a1998-07-08 03:04:39 +0000339 self.note(1, "I/O error parsing %s: %s", url, msg)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000340
341 def run(self):
Guido van Rossum986abac1998-04-06 14:29:28 +0000342 while self.todo:
343 self.round = self.round + 1
Guido van Rossum125700a1998-07-08 03:04:39 +0000344 self.note(0, "\nRound %d (%s)\n", self.round, self.status())
Guido van Rossum6eb9d321998-06-15 12:33:02 +0000345 urls = self.todo.keys()
346 urls.sort()
347 del urls[self.roundsize:]
Guido van Rossum986abac1998-04-06 14:29:28 +0000348 for url in urls:
349 self.dopage(url)
Guido van Rossume5605ba1997-01-31 14:43:15 +0000350
351 def status(self):
Guido van Rossum986abac1998-04-06 14:29:28 +0000352 return "%d total, %d to do, %d done, %d bad" % (
353 len(self.todo)+len(self.done),
354 len(self.todo), len(self.done),
355 len(self.bad))
Guido van Rossum272b37d1997-01-30 02:44:48 +0000356
Guido van Rossumaf310c11997-02-02 23:30:32 +0000357 def report(self):
Guido van Rossum125700a1998-07-08 03:04:39 +0000358 self.message("")
359 if not self.todo: s = "Final"
360 else: s = "Interim"
361 self.message("%s Report (%s)", s, self.status())
Guido van Rossum986abac1998-04-06 14:29:28 +0000362 self.report_errors()
Guido van Rossum272b37d1997-01-30 02:44:48 +0000363
Guido van Rossum272b37d1997-01-30 02:44:48 +0000364 def report_errors(self):
Guido van Rossum986abac1998-04-06 14:29:28 +0000365 if not self.bad:
Guido van Rossum125700a1998-07-08 03:04:39 +0000366 self.message("\nNo errors")
Guido van Rossum986abac1998-04-06 14:29:28 +0000367 return
Guido van Rossum125700a1998-07-08 03:04:39 +0000368 self.message("\nError Report:")
Guido van Rossum986abac1998-04-06 14:29:28 +0000369 sources = self.errors.keys()
370 sources.sort()
371 for source in sources:
372 triples = self.errors[source]
Guido van Rossum125700a1998-07-08 03:04:39 +0000373 self.message("")
Guido van Rossum986abac1998-04-06 14:29:28 +0000374 if len(triples) > 1:
Guido van Rossum125700a1998-07-08 03:04:39 +0000375 self.message("%d Errors in %s", len(triples), source)
Guido van Rossum986abac1998-04-06 14:29:28 +0000376 else:
Guido van Rossum125700a1998-07-08 03:04:39 +0000377 self.message("Error in %s", source)
Guido van Rossume284b211999-11-17 15:40:08 +0000378 # Call self.format_url() instead of referring
379 # to the URL directly, since the URLs in these
380 # triples is now a (URL, fragment) pair. The value
381 # of the "source" variable comes from the list of
382 # origins, and is a URL, not a pair.
383 for url, rawlink, msg in triples:
384 if rawlink != self.format_url(url): s = " (%s)" % rawlink
Guido van Rossum125700a1998-07-08 03:04:39 +0000385 else: s = ""
Guido van Rossume284b211999-11-17 15:40:08 +0000386 self.message(" HREF %s%s\n msg %s",
387 self.format_url(url), s, msg)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000388
Guido van Rossume284b211999-11-17 15:40:08 +0000389 def dopage(self, url_pair):
390
391 # All printing of URLs uses format_url(); argument changed to
392 # url_pair for clarity.
Guido van Rossum986abac1998-04-06 14:29:28 +0000393 if self.verbose > 1:
394 if self.verbose > 2:
Guido van Rossume284b211999-11-17 15:40:08 +0000395 self.show("Check ", self.format_url(url_pair),
396 " from", self.todo[url_pair])
Guido van Rossum986abac1998-04-06 14:29:28 +0000397 else:
Guido van Rossume284b211999-11-17 15:40:08 +0000398 self.message("Check %s", self.format_url(url_pair))
399 url, local_fragment = url_pair
400 if local_fragment and self.nonames:
401 self.markdone(url_pair)
402 return
403 page = self.getpage(url_pair)
Guido van Rossum986abac1998-04-06 14:29:28 +0000404 if page:
Guido van Rossume284b211999-11-17 15:40:08 +0000405 # Store the page which corresponds to this URL.
406 self.name_table[url] = page
407 # If there is a fragment in this url_pair, and it's not
408 # in the list of names for the page, call setbad(), since
409 # it's a missing anchor.
410 if local_fragment and local_fragment not in page.getnames():
411 self.setbad(url_pair, ("Missing name anchor `%s'" % local_fragment))
Guido van Rossum986abac1998-04-06 14:29:28 +0000412 for info in page.getlinkinfos():
Guido van Rossume284b211999-11-17 15:40:08 +0000413 # getlinkinfos() now returns the fragment as well,
414 # and we store that fragment here in the "todo" dictionary.
415 link, rawlink, fragment = info
416 # However, we don't want the fragment as the origin, since
417 # the origin is logically a page.
Guido van Rossum986abac1998-04-06 14:29:28 +0000418 origin = url, rawlink
Guido van Rossume284b211999-11-17 15:40:08 +0000419 self.newlink((link, fragment), origin)
420 else:
421 # If no page has been created yet, we want to
422 # record that fact.
423 self.name_table[url_pair[0]] = None
424 self.markdone(url_pair)
Guido van Rossume5605ba1997-01-31 14:43:15 +0000425
Guido van Rossumaf310c11997-02-02 23:30:32 +0000426 def newlink(self, url, origin):
Guido van Rossum986abac1998-04-06 14:29:28 +0000427 if self.done.has_key(url):
428 self.newdonelink(url, origin)
429 else:
430 self.newtodolink(url, origin)
Guido van Rossume5605ba1997-01-31 14:43:15 +0000431
432 def newdonelink(self, url, origin):
Guido van Rossume284b211999-11-17 15:40:08 +0000433 if origin not in self.done[url]:
434 self.done[url].append(origin)
435
436 # Call self.format_url(), since the URL here
437 # is now a (URL, fragment) pair.
438 self.note(3, " Done link %s", self.format_url(url))
439
440 # Make sure that if it's bad, that the origin gets added.
Guido van Rossumdbd5c3e1999-11-17 15:00:14 +0000441 if self.bad.has_key(url):
442 source, rawlink = origin
443 triple = url, rawlink, self.bad[url]
444 self.seterror(source, triple)
Guido van Rossume5605ba1997-01-31 14:43:15 +0000445
446 def newtodolink(self, url, origin):
Guido van Rossume284b211999-11-17 15:40:08 +0000447 # Call self.format_url(), since the URL here
448 # is now a (URL, fragment) pair.
Guido van Rossum986abac1998-04-06 14:29:28 +0000449 if self.todo.has_key(url):
Guido van Rossumdbd5c3e1999-11-17 15:00:14 +0000450 if origin not in self.todo[url]:
451 self.todo[url].append(origin)
Guido van Rossume284b211999-11-17 15:40:08 +0000452 self.note(3, " Seen todo link %s", self.format_url(url))
Guido van Rossum986abac1998-04-06 14:29:28 +0000453 else:
454 self.todo[url] = [origin]
Guido van Rossume284b211999-11-17 15:40:08 +0000455 self.note(3, " New todo link %s", self.format_url(url))
456
457 def format_url(self, url):
458 link, fragment = url
459 if fragment: return link + "#" + fragment
460 else: return link
Guido van Rossume5605ba1997-01-31 14:43:15 +0000461
462 def markdone(self, url):
Guido van Rossum986abac1998-04-06 14:29:28 +0000463 self.done[url] = self.todo[url]
464 del self.todo[url]
465 self.changed = 1
Guido van Rossum272b37d1997-01-30 02:44:48 +0000466
467 def inroots(self, url):
Guido van Rossum986abac1998-04-06 14:29:28 +0000468 for root in self.roots:
469 if url[:len(root)] == root:
Guido van Rossum125700a1998-07-08 03:04:39 +0000470 return self.isallowed(root, url)
Guido van Rossum986abac1998-04-06 14:29:28 +0000471 return 0
Guido van Rossume284b211999-11-17 15:40:08 +0000472
Guido van Rossum125700a1998-07-08 03:04:39 +0000473 def isallowed(self, root, url):
474 root = urlparse.urljoin(root, "/")
475 return self.robots[root].can_fetch(AGENTNAME, url)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000476
Guido van Rossume284b211999-11-17 15:40:08 +0000477 def getpage(self, url_pair):
478 # Incoming argument name is a (URL, fragment) pair.
479 # The page may have been cached in the name_table variable.
480 url, fragment = url_pair
481 if self.name_table.has_key(url):
482 return self.name_table[url]
483
Andrew M. Kuchling566c0c72002-03-08 17:19:10 +0000484 scheme, path = urllib.splittype(url)
Fred Drakef3186e82001-04-04 17:47:25 +0000485 if scheme in ('mailto', 'news', 'javascript', 'telnet'):
486 self.note(1, " Not checking %s URL" % scheme)
Guido van Rossum986abac1998-04-06 14:29:28 +0000487 return None
488 isint = self.inroots(url)
Guido van Rossume284b211999-11-17 15:40:08 +0000489
490 # Ensure that openpage gets the URL pair to
491 # print out its error message and record the error pair
492 # correctly.
Guido van Rossum986abac1998-04-06 14:29:28 +0000493 if not isint:
494 if not self.checkext:
Guido van Rossum125700a1998-07-08 03:04:39 +0000495 self.note(1, " Not checking ext link")
Guido van Rossum986abac1998-04-06 14:29:28 +0000496 return None
Guido van Rossume284b211999-11-17 15:40:08 +0000497 f = self.openpage(url_pair)
Guido van Rossum986abac1998-04-06 14:29:28 +0000498 if f:
499 self.safeclose(f)
500 return None
Guido van Rossume284b211999-11-17 15:40:08 +0000501 text, nurl = self.readhtml(url_pair)
502
Guido van Rossum986abac1998-04-06 14:29:28 +0000503 if nurl != url:
Guido van Rossum125700a1998-07-08 03:04:39 +0000504 self.note(1, " Redirected to %s", nurl)
Guido van Rossum986abac1998-04-06 14:29:28 +0000505 url = nurl
506 if text:
Guido van Rossum125700a1998-07-08 03:04:39 +0000507 return Page(text, url, maxpage=self.maxpage, checker=self)
Guido van Rossum00756bd1998-02-21 20:02:09 +0000508
Guido van Rossume284b211999-11-17 15:40:08 +0000509 # These next three functions take (URL, fragment) pairs as
510 # arguments, so that openpage() receives the appropriate tuple to
511 # record error messages.
512 def readhtml(self, url_pair):
513 url, fragment = url_pair
Guido van Rossum986abac1998-04-06 14:29:28 +0000514 text = None
Guido van Rossume284b211999-11-17 15:40:08 +0000515 f, url = self.openhtml(url_pair)
Guido van Rossum986abac1998-04-06 14:29:28 +0000516 if f:
517 text = f.read()
518 f.close()
519 return text, url
Guido van Rossum00756bd1998-02-21 20:02:09 +0000520
Guido van Rossume284b211999-11-17 15:40:08 +0000521 def openhtml(self, url_pair):
522 url, fragment = url_pair
523 f = self.openpage(url_pair)
Guido van Rossum986abac1998-04-06 14:29:28 +0000524 if f:
525 url = f.geturl()
526 info = f.info()
527 if not self.checkforhtml(info, url):
528 self.safeclose(f)
529 f = None
530 return f, url
Guido van Rossum00756bd1998-02-21 20:02:09 +0000531
Guido van Rossume284b211999-11-17 15:40:08 +0000532 def openpage(self, url_pair):
533 url, fragment = url_pair
Guido van Rossum986abac1998-04-06 14:29:28 +0000534 try:
535 return self.urlopener.open(url)
Guido van Rossumf0953b92001-12-11 22:41:24 +0000536 except (OSError, IOError), msg:
Guido van Rossum986abac1998-04-06 14:29:28 +0000537 msg = self.sanitize(msg)
Guido van Rossum125700a1998-07-08 03:04:39 +0000538 self.note(0, "Error %s", msg)
Guido van Rossum986abac1998-04-06 14:29:28 +0000539 if self.verbose > 0:
Guido van Rossume284b211999-11-17 15:40:08 +0000540 self.show(" HREF ", url, " from", self.todo[url_pair])
541 self.setbad(url_pair, msg)
Guido van Rossum986abac1998-04-06 14:29:28 +0000542 return None
Guido van Rossum00756bd1998-02-21 20:02:09 +0000543
544 def checkforhtml(self, info, url):
Guido van Rossum986abac1998-04-06 14:29:28 +0000545 if info.has_key('content-type'):
Walter Dörwaldaaab30e2002-09-11 20:36:02 +0000546 ctype = cgi.parse_header(info['content-type'])[0].lower()
Fred Drake0b9e3f72002-11-12 22:19:34 +0000547 if ';' in ctype:
548 # handle content-type: text/html; charset=iso8859-1 :
549 ctype = ctype.split(';', 1)[0].strip()
Guido van Rossum986abac1998-04-06 14:29:28 +0000550 else:
551 if url[-1:] == "/":
552 return 1
553 ctype, encoding = mimetypes.guess_type(url)
554 if ctype == 'text/html':
555 return 1
556 else:
Guido van Rossum125700a1998-07-08 03:04:39 +0000557 self.note(1, " Not HTML, mime type %s", ctype)
Guido van Rossum986abac1998-04-06 14:29:28 +0000558 return 0
Guido van Rossum272b37d1997-01-30 02:44:48 +0000559
Guido van Rossume5605ba1997-01-31 14:43:15 +0000560 def setgood(self, url):
Guido van Rossum986abac1998-04-06 14:29:28 +0000561 if self.bad.has_key(url):
562 del self.bad[url]
563 self.changed = 1
Guido van Rossum125700a1998-07-08 03:04:39 +0000564 self.note(0, "(Clear previously seen error)")
Guido van Rossume5605ba1997-01-31 14:43:15 +0000565
566 def setbad(self, url, msg):
Guido van Rossum986abac1998-04-06 14:29:28 +0000567 if self.bad.has_key(url) and self.bad[url] == msg:
Guido van Rossum125700a1998-07-08 03:04:39 +0000568 self.note(0, "(Seen this error before)")
Guido van Rossum986abac1998-04-06 14:29:28 +0000569 return
570 self.bad[url] = msg
571 self.changed = 1
572 self.markerror(url)
Guido van Rossume284b211999-11-17 15:40:08 +0000573
Guido van Rossumaf310c11997-02-02 23:30:32 +0000574 def markerror(self, url):
Guido van Rossum986abac1998-04-06 14:29:28 +0000575 try:
576 origins = self.todo[url]
577 except KeyError:
578 origins = self.done[url]
579 for source, rawlink in origins:
580 triple = url, rawlink, self.bad[url]
581 self.seterror(source, triple)
Guido van Rossumaf310c11997-02-02 23:30:32 +0000582
583 def seterror(self, url, triple):
Guido van Rossum986abac1998-04-06 14:29:28 +0000584 try:
Guido van Rossume284b211999-11-17 15:40:08 +0000585 # Because of the way the URLs are now processed, I need to
586 # check to make sure the URL hasn't been entered in the
587 # error list. The first element of the triple here is a
588 # (URL, fragment) pair, but the URL key is not, since it's
589 # from the list of origins.
590 if triple not in self.errors[url]:
591 self.errors[url].append(triple)
Guido van Rossum986abac1998-04-06 14:29:28 +0000592 except KeyError:
593 self.errors[url] = [triple]
Guido van Rossume5605ba1997-01-31 14:43:15 +0000594
Guido van Rossum00756bd1998-02-21 20:02:09 +0000595 # The following used to be toplevel functions; they have been
596 # changed into methods so they can be overridden in subclasses.
597
598 def show(self, p1, link, p2, origins):
Guido van Rossum125700a1998-07-08 03:04:39 +0000599 self.message("%s %s", p1, link)
Guido van Rossum986abac1998-04-06 14:29:28 +0000600 i = 0
601 for source, rawlink in origins:
602 i = i+1
603 if i == 2:
604 p2 = ' '*len(p2)
Guido van Rossum125700a1998-07-08 03:04:39 +0000605 if rawlink != link: s = " (%s)" % rawlink
606 else: s = ""
607 self.message("%s %s%s", p2, source, s)
Guido van Rossum00756bd1998-02-21 20:02:09 +0000608
609 def sanitize(self, msg):
Guido van Rossum986abac1998-04-06 14:29:28 +0000610 if isinstance(IOError, ClassType) and isinstance(msg, IOError):
611 # Do the other branch recursively
612 msg.args = self.sanitize(msg.args)
613 elif isinstance(msg, TupleType):
614 if len(msg) >= 4 and msg[0] == 'http error' and \
615 isinstance(msg[3], InstanceType):
616 # Remove the Message instance -- it may contain
617 # a file object which prevents pickling.
618 msg = msg[:3] + msg[4:]
619 return msg
Guido van Rossum00756bd1998-02-21 20:02:09 +0000620
621 def safeclose(self, f):
Guido van Rossum986abac1998-04-06 14:29:28 +0000622 try:
623 url = f.geturl()
624 except AttributeError:
625 pass
626 else:
627 if url[:4] == 'ftp:' or url[:7] == 'file://':
628 # Apparently ftp connections don't like to be closed
629 # prematurely...
630 text = f.read()
631 f.close()
Guido van Rossum00756bd1998-02-21 20:02:09 +0000632
633 def save_pickle(self, dumpfile=DUMPFILE):
Guido van Rossum986abac1998-04-06 14:29:28 +0000634 if not self.changed:
Guido van Rossum125700a1998-07-08 03:04:39 +0000635 self.note(0, "\nNo need to save checkpoint")
Guido van Rossum986abac1998-04-06 14:29:28 +0000636 elif not dumpfile:
Guido van Rossum125700a1998-07-08 03:04:39 +0000637 self.note(0, "No dumpfile, won't save checkpoint")
Guido van Rossum986abac1998-04-06 14:29:28 +0000638 else:
Guido van Rossum125700a1998-07-08 03:04:39 +0000639 self.note(0, "\nSaving checkpoint to %s ...", dumpfile)
Guido van Rossum986abac1998-04-06 14:29:28 +0000640 newfile = dumpfile + ".new"
641 f = open(newfile, "wb")
642 pickle.dump(self, f)
643 f.close()
644 try:
645 os.unlink(dumpfile)
646 except os.error:
647 pass
648 os.rename(newfile, dumpfile)
Guido van Rossum125700a1998-07-08 03:04:39 +0000649 self.note(0, "Done.")
Guido van Rossum986abac1998-04-06 14:29:28 +0000650 return 1
Guido van Rossum00756bd1998-02-21 20:02:09 +0000651
Guido van Rossum272b37d1997-01-30 02:44:48 +0000652
653class Page:
654
Guido van Rossum125700a1998-07-08 03:04:39 +0000655 def __init__(self, text, url, verbose=VERBOSE, maxpage=MAXPAGE, checker=None):
Guido van Rossum986abac1998-04-06 14:29:28 +0000656 self.text = text
657 self.url = url
658 self.verbose = verbose
659 self.maxpage = maxpage
Guido van Rossum125700a1998-07-08 03:04:39 +0000660 self.checker = checker
Guido van Rossum272b37d1997-01-30 02:44:48 +0000661
Guido van Rossume284b211999-11-17 15:40:08 +0000662 # The parsing of the page is done in the __init__() routine in
663 # order to initialize the list of names the file
664 # contains. Stored the parser in an instance variable. Passed
665 # the URL to MyHTMLParser().
666 size = len(self.text)
667 if size > self.maxpage:
668 self.note(0, "Skip huge file %s (%.0f Kbytes)", self.url, (size*0.001))
669 self.parser = None
670 return
671 self.checker.note(2, " Parsing %s (%d bytes)", self.url, size)
672 self.parser = MyHTMLParser(url, verbose=self.verbose,
673 checker=self.checker)
674 self.parser.feed(self.text)
675 self.parser.close()
676
Guido van Rossuma42c1ee1998-08-06 21:31:13 +0000677 def note(self, level, msg, *args):
678 if self.checker:
679 apply(self.checker.note, (level, msg) + args)
680 else:
681 if self.verbose >= level:
682 if args:
683 msg = msg%args
684 print msg
685
Guido van Rossume284b211999-11-17 15:40:08 +0000686 # Method to retrieve names.
687 def getnames(self):
Guido van Rossum84306242000-03-28 20:10:39 +0000688 if self.parser:
689 return self.parser.names
690 else:
691 return []
Guido van Rossume284b211999-11-17 15:40:08 +0000692
Guido van Rossum272b37d1997-01-30 02:44:48 +0000693 def getlinkinfos(self):
Guido van Rossume284b211999-11-17 15:40:08 +0000694 # File reading is done in __init__() routine. Store parser in
695 # local variable to indicate success of parsing.
696
697 # If no parser was stored, fail.
698 if not self.parser: return []
699
700 rawlinks = self.parser.getlinks()
701 base = urlparse.urljoin(self.url, self.parser.getbase() or "")
Guido van Rossum986abac1998-04-06 14:29:28 +0000702 infos = []
703 for rawlink in rawlinks:
704 t = urlparse.urlparse(rawlink)
Guido van Rossume284b211999-11-17 15:40:08 +0000705 # DON'T DISCARD THE FRAGMENT! Instead, include
706 # it in the tuples which are returned. See Checker.dopage().
707 fragment = t[-1]
Guido van Rossum986abac1998-04-06 14:29:28 +0000708 t = t[:-1] + ('',)
709 rawlink = urlparse.urlunparse(t)
710 link = urlparse.urljoin(base, rawlink)
Guido van Rossume284b211999-11-17 15:40:08 +0000711 infos.append((link, rawlink, fragment))
712
Guido van Rossum986abac1998-04-06 14:29:28 +0000713 return infos
Guido van Rossum272b37d1997-01-30 02:44:48 +0000714
715
716class MyStringIO(StringIO.StringIO):
717
718 def __init__(self, url, info):
Guido van Rossum986abac1998-04-06 14:29:28 +0000719 self.__url = url
720 self.__info = info
721 StringIO.StringIO.__init__(self)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000722
723 def info(self):
Guido van Rossum986abac1998-04-06 14:29:28 +0000724 return self.__info
Guido van Rossum272b37d1997-01-30 02:44:48 +0000725
726 def geturl(self):
Guido van Rossum986abac1998-04-06 14:29:28 +0000727 return self.__url
Guido van Rossum272b37d1997-01-30 02:44:48 +0000728
729
730class MyURLopener(urllib.FancyURLopener):
731
732 http_error_default = urllib.URLopener.http_error_default
733
Guido van Rossumc59a5d41997-01-30 06:04:00 +0000734 def __init__(*args):
Guido van Rossum986abac1998-04-06 14:29:28 +0000735 self = args[0]
736 apply(urllib.FancyURLopener.__init__, args)
737 self.addheaders = [
738 ('User-agent', 'Python-webchecker/%s' % __version__),
739 ]
Guido van Rossum89efda31997-05-07 15:00:56 +0000740
741 def http_error_401(self, url, fp, errcode, errmsg, headers):
742 return None
Guido van Rossumc59a5d41997-01-30 06:04:00 +0000743
Guido van Rossum272b37d1997-01-30 02:44:48 +0000744 def open_file(self, url):
Guido van Rossum986abac1998-04-06 14:29:28 +0000745 path = urllib.url2pathname(urllib.unquote(url))
Guido van Rossum986abac1998-04-06 14:29:28 +0000746 if os.path.isdir(path):
Guido van Rossum0ec14931999-04-26 23:11:46 +0000747 if path[-1] != os.sep:
748 url = url + '/'
Guido van Rossum986abac1998-04-06 14:29:28 +0000749 indexpath = os.path.join(path, "index.html")
750 if os.path.exists(indexpath):
751 return self.open_file(url + "index.html")
752 try:
753 names = os.listdir(path)
754 except os.error, msg:
755 raise IOError, msg, sys.exc_traceback
756 names.sort()
757 s = MyStringIO("file:"+url, {'content-type': 'text/html'})
758 s.write('<BASE HREF="file:%s">\n' %
759 urllib.quote(os.path.join(path, "")))
760 for name in names:
761 q = urllib.quote(name)
762 s.write('<A HREF="%s">%s</A>\n' % (q, q))
763 s.seek(0)
764 return s
Guido van Rossum0ec14931999-04-26 23:11:46 +0000765 return urllib.FancyURLopener.open_file(self, url)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000766
767
Guido van Rossume5605ba1997-01-31 14:43:15 +0000768class MyHTMLParser(sgmllib.SGMLParser):
Guido van Rossum272b37d1997-01-30 02:44:48 +0000769
Guido van Rossume284b211999-11-17 15:40:08 +0000770 def __init__(self, url, verbose=VERBOSE, checker=None):
Guido van Rossum125700a1998-07-08 03:04:39 +0000771 self.myverbose = verbose # now unused
772 self.checker = checker
Guido van Rossum986abac1998-04-06 14:29:28 +0000773 self.base = None
774 self.links = {}
Guido van Rossume284b211999-11-17 15:40:08 +0000775 self.names = []
776 self.url = url
Guido van Rossum986abac1998-04-06 14:29:28 +0000777 sgmllib.SGMLParser.__init__(self)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000778
779 def start_a(self, attributes):
Guido van Rossum986abac1998-04-06 14:29:28 +0000780 self.link_attr(attributes, 'href')
Guido van Rossum6133ec61997-02-01 05:16:08 +0000781
Guido van Rossume284b211999-11-17 15:40:08 +0000782 # We must rescue the NAME
783 # attributes from the anchor, in order to
784 # cache the internal anchors which are made
785 # available in the page.
786 for name, value in attributes:
787 if name == "name":
788 if value in self.names:
789 self.checker.message("WARNING: duplicate name %s in %s",
790 value, self.url)
791 else: self.names.append(value)
792 break
793
Guido van Rossum6133ec61997-02-01 05:16:08 +0000794 def end_a(self): pass
795
Guido van Rossum2237b731997-10-06 18:54:01 +0000796 def do_area(self, attributes):
Guido van Rossum986abac1998-04-06 14:29:28 +0000797 self.link_attr(attributes, 'href')
Guido van Rossum2237b731997-10-06 18:54:01 +0000798
Fred Drakef3186e82001-04-04 17:47:25 +0000799 def do_body(self, attributes):
Fred Draked34a9c92001-04-05 18:14:50 +0000800 self.link_attr(attributes, 'background', 'bgsound')
Fred Drakef3186e82001-04-04 17:47:25 +0000801
Guido van Rossum6133ec61997-02-01 05:16:08 +0000802 def do_img(self, attributes):
Guido van Rossum986abac1998-04-06 14:29:28 +0000803 self.link_attr(attributes, 'src', 'lowsrc')
Guido van Rossum6133ec61997-02-01 05:16:08 +0000804
805 def do_frame(self, attributes):
Fred Drakef3186e82001-04-04 17:47:25 +0000806 self.link_attr(attributes, 'src', 'longdesc')
807
808 def do_iframe(self, attributes):
809 self.link_attr(attributes, 'src', 'longdesc')
810
811 def do_link(self, attributes):
812 for name, value in attributes:
813 if name == "rel":
Walter Dörwaldaaab30e2002-09-11 20:36:02 +0000814 parts = value.lower().split()
Fred Drakef3186e82001-04-04 17:47:25 +0000815 if ( parts == ["stylesheet"]
816 or parts == ["alternate", "stylesheet"]):
817 self.link_attr(attributes, "href")
818 break
819
820 def do_object(self, attributes):
821 self.link_attr(attributes, 'data', 'usemap')
822
823 def do_script(self, attributes):
Guido van Rossum986abac1998-04-06 14:29:28 +0000824 self.link_attr(attributes, 'src')
Guido van Rossum6133ec61997-02-01 05:16:08 +0000825
Fred Draked34a9c92001-04-05 18:14:50 +0000826 def do_table(self, attributes):
827 self.link_attr(attributes, 'background')
828
829 def do_td(self, attributes):
830 self.link_attr(attributes, 'background')
831
832 def do_th(self, attributes):
833 self.link_attr(attributes, 'background')
834
835 def do_tr(self, attributes):
836 self.link_attr(attributes, 'background')
837
Guido van Rossum6133ec61997-02-01 05:16:08 +0000838 def link_attr(self, attributes, *args):
Guido van Rossum986abac1998-04-06 14:29:28 +0000839 for name, value in attributes:
840 if name in args:
Walter Dörwaldaaab30e2002-09-11 20:36:02 +0000841 if value: value = value.strip()
Guido van Rossum986abac1998-04-06 14:29:28 +0000842 if value: self.links[value] = None
Guido van Rossum272b37d1997-01-30 02:44:48 +0000843
844 def do_base(self, attributes):
Guido van Rossum986abac1998-04-06 14:29:28 +0000845 for name, value in attributes:
846 if name == 'href':
Walter Dörwaldaaab30e2002-09-11 20:36:02 +0000847 if value: value = value.strip()
Guido van Rossum986abac1998-04-06 14:29:28 +0000848 if value:
Guido van Rossum125700a1998-07-08 03:04:39 +0000849 if self.checker:
850 self.checker.note(1, " Base %s", value)
Guido van Rossum986abac1998-04-06 14:29:28 +0000851 self.base = value
Guido van Rossum272b37d1997-01-30 02:44:48 +0000852
853 def getlinks(self):
Guido van Rossum986abac1998-04-06 14:29:28 +0000854 return self.links.keys()
Guido van Rossum272b37d1997-01-30 02:44:48 +0000855
856 def getbase(self):
Guido van Rossum986abac1998-04-06 14:29:28 +0000857 return self.base
Guido van Rossum272b37d1997-01-30 02:44:48 +0000858
859
Guido van Rossum272b37d1997-01-30 02:44:48 +0000860if __name__ == '__main__':
861 main()