#! /usr/bin/env python

# Original code by Guido van Rossum; extensive changes by Sam Bayer,
# including code to check URL fragments.

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors.  A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.

File URL extension:

In order to ease the checking of subwebs via the local file system,
the interpretation of ``file:'' URLs is extended to mimic the behavior
of your average HTTP daemon: if a directory pathname is given, the
file index.html in that directory is returned if it exists, otherwise
a directory listing is returned.  Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked.  In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).

Report printed:

When done, it reports pages with bad links within the subweb.  When
interrupted, it reports on the pages that it has checked so far.

In verbose mode, additional messages are printed during the
information gathering phase.  By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered.  Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file and the -R option allows it to restart from the
checkpoint (assuming that the pages on the subweb that were already
processed haven't changed).  Even when it has run till completion, -R
can still be useful -- it will print the reports again, and -Rq prints
the errors only.  In this case, the checkpoint file is not written
again.  The checkpoint file can be set with the -d option.

The checkpoint file is written as a Python pickle.  Remember that
Python's pickle module is currently quite slow.  Give it the time it
needs to load and save the checkpoint file.  When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.

Miscellaneous:

- You may find the (Tk-based) GUI version easier to use.  See wcgui.py.

- Webchecker honors the "robots.txt" convention.  Thanks to Skip
Montanaro for his robotparser.py module (included in this directory)!
The agent name is hardwired to "webchecker".  URLs that are disallowed
by the robots.txt file are reported as external URLs.

- Because the SGML parser is a bit slow, very large SGML files are
skipped.  The size limit can be set with the -m option.

- When the server or protocol does not tell us a file's type, we guess
it based on the URL's suffix.  The mimetypes.py module (also in this
directory) has a built-in table mapping most currently known suffixes,
and in addition attempts to read the mime.types configuration files in
the default locations of Netscape and the NCSA HTTP daemon.

- We follow links indicated by <A>, <FRAME> and <IMG> tags.  We also
honor the <BASE> tag.

- We now check internal NAME anchor links, as well as toplevel links.

- Checking external links is now done by default; use -x to *disable*
this feature.  External links are now checked during normal
processing.  (XXX The status of a checked link could be categorized
better.  Later...)

- If external links are not checked, you can use the -t flag to
provide specific overrides to -x.

Usage: webchecker.py [option] ... [rooturl] ...

Options:

-R        -- restart from checkpoint file
-d file   -- checkpoint filename (default %(DUMPFILE)s)
-m bytes  -- skip HTML pages larger than this size (default %(MAXPAGE)d)
-n        -- reports only, no checking (use with -R)
-q        -- quiet operation (also suppresses external links report)
-r number -- number of links processed per round (default %(ROUNDSIZE)d)
-t root   -- specify root dir which should be treated as internal (can repeat)
-v        -- verbose operation; repeating -v will increase verbosity
-x        -- don't check external links (these are often slow to check)
-a        -- don't check name anchors

Arguments:

rooturl   -- URL to start checking
             (default %(DEFROOT)s)

"""

__version__ = "$Revision$"


import sys
import os
from types import *
import string
import StringIO
import getopt
import pickle

import urllib
import urlparse
import sgmllib
import cgi

import mimetypes
import robotparser

# Extract real version number if necessary
if __version__[0] == '$':
    _v = string.split(__version__)
    if len(_v) == 3:
        __version__ = _v[1]


# Tunable parameters
DEFROOT = "file:/usr/local/etc/httpd/htdocs/"   # Default root URL
CHECKEXT = 1                    # Check external references (1 deep)
VERBOSE = 1                     # Verbosity level (0-3)
MAXPAGE = 150000                # Ignore files bigger than this
ROUNDSIZE = 50                  # Number of links processed per round
DUMPFILE = "@webchecker.pickle" # Pickled checkpoint
AGENTNAME = "webchecker"        # Agent name for robots.txt parser
NONAMES = 0                     # Skip name anchor checking if true (-a)


# Global variables


def main():
    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    dumpfile = DUMPFILE
    restart = 0
    norun = 0

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:t:vxa')
    except getopt.error, msg:
        sys.stdout = sys.stderr
        print msg
        print __doc__%globals()
        sys.exit(2)

    # The extra_roots variable collects extra roots.
    extra_roots = []
    nonames = NONAMES

    for o, a in opts:
        if o == '-R':
            restart = 1
        if o == '-d':
            dumpfile = a
        if o == '-m':
            maxpage = string.atoi(a)
        if o == '-n':
            norun = 1
        if o == '-q':
            verbose = 0
        if o == '-r':
            roundsize = string.atoi(a)
        if o == '-t':
            extra_roots.append(a)
        if o == '-a':
            nonames = not nonames
        if o == '-v':
            verbose = verbose + 1
        if o == '-x':
            checkext = not checkext

    if verbose > 0:
        print AGENTNAME, "version", __version__

    if restart:
        c = load_pickle(dumpfile=dumpfile, verbose=verbose)
    else:
        c = Checker()

    c.setflags(checkext=checkext, verbose=verbose,
               maxpage=maxpage, roundsize=roundsize,
               nonames=nonames
               )

    if not restart and not args:
        args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    # The -t flag is only needed if external links are not to be
    # checked.  So -t values are ignored unless -x was specified.
    if not checkext:
        for root in extra_roots:
            # Make sure it's terminated by a slash,
            # so that addroot doesn't discard the last
            # directory component.
            if root[-1] != "/":
                root = root + "/"
            c.addroot(root, add_to_do = 0)

    try:

        if not norun:
            try:
                c.run()
            except KeyboardInterrupt:
                if verbose > 0:
                    print "[run interrupted]"

        try:
            c.report()
        except KeyboardInterrupt:
            if verbose > 0:
                print "[report interrupted]"

    finally:
        if c.save_pickle(dumpfile):
            if dumpfile == DUMPFILE:
                print "Use ``%s -R'' to restart." % sys.argv[0]
            else:
                print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
                                                           dumpfile)


def load_pickle(dumpfile=DUMPFILE, verbose=VERBOSE):
    if verbose > 0:
        print "Loading checkpoint from %s ..." % dumpfile
    f = open(dumpfile, "rb")
    c = pickle.load(f)
    f.close()
    if verbose > 0:
        print "Done."
        print "Root:", string.join(c.roots, "\n      ")
    return c


class Checker:

    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    nonames = NONAMES

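    # At this point in the class body, dir() returns the names defined
    # so far (the flag defaults above), so validflags records the
    # keyword arguments that setflags() will accept.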
    validflags = tuple(dir())

    def __init__(self):
        self.reset()

    def setflags(self, **kw):
        for key in kw.keys():
            if key not in self.validflags:
                raise NameError, "invalid keyword argument: %s" % str(key)
        for key, value in kw.items():
            setattr(self, key, value)

    def reset(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.bad = {}

        # Add a name table, so that the name URLs can be checked.  Also
        # serves as an implicit cache for which URLs are done.
        self.name_table = {}

        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.errors = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def note(self, level, format, *args):
        if self.verbose > level:
            if args:
                format = format%args
            self.message(format)

    def message(self, format, *args):
        if args:
            format = format%args
        print format

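    # Only the crawl state (roots, todo, done, bad, round) is pickled;
    # __setstate__() calls reset() first and then rebuilds the robots
    # and error tables from that state.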
    def __getstate__(self):
        return (self.roots, self.todo, self.done, self.bad, self.round)

    def __setstate__(self, state):
        self.reset()
        (self.roots, self.todo, self.done, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)
        for url in self.bad.keys():
            self.markerror(url)

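    # addroot() truncates the given URL to its directory (everything up
    # to the final "/") before adding it to self.roots, so sibling pages
    # under the same directory count as internal; it also fetches the
    # corresponding robots.txt via addrobot().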
    def addroot(self, root, add_to_do = 1):
        if root not in self.roots:
            troot = root
            scheme, netloc, path, params, query, fragment = \
                    urlparse.urlparse(root)
            i = string.rfind(path, "/") + 1
            if 0 < i < len(path):
                path = path[:i]
                troot = urlparse.urlunparse((scheme, netloc, path,
                                             params, query, fragment))
            self.roots.append(troot)
            self.addrobot(root)
            if add_to_do:
                self.newlink((root, ""), ("<root>", root))

    def addrobot(self, root):
        root = urlparse.urljoin(root, "/")
        if self.robots.has_key(root): return
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        self.note(2, "Parsing %s", url)
        rp.debug = self.verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except (OSError, IOError), msg:
            self.note(1, "I/O error parsing %s: %s", url, msg)

    def run(self):
        while self.todo:
            self.round = self.round + 1
            self.note(0, "\nRound %d (%s)\n", self.round, self.status())
            urls = self.todo.keys()
            urls.sort()
            del urls[self.roundsize:]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "%d total, %d to do, %d done, %d bad" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.bad))

    def report(self):
        self.message("")
        if not self.todo: s = "Final"
        else: s = "Interim"
        self.message("%s Report (%s)", s, self.status())
        self.report_errors()

    def report_errors(self):
        if not self.bad:
            self.message("\nNo errors")
            return
        self.message("\nError Report:")
        sources = self.errors.keys()
        sources.sort()
        for source in sources:
            triples = self.errors[source]
            self.message("")
            if len(triples) > 1:
                self.message("%d Errors in %s", len(triples), source)
            else:
                self.message("Error in %s", source)
            # Call self.format_url() instead of referring to the URL
            # directly, since the URLs in these triples are now
            # (URL, fragment) pairs.  The value of the "source" variable
            # comes from the list of origins, and is a URL, not a pair.
            for url, rawlink, msg in triples:
                if rawlink != self.format_url(url): s = " (%s)" % rawlink
                else: s = ""
                self.message("  HREF %s%s\n    msg %s",
                             self.format_url(url), s, msg)

    def dopage(self, url_pair):

        # All printing of URLs uses format_url(); argument changed to
        # url_pair for clarity.
        if self.verbose > 1:
            if self.verbose > 2:
                self.show("Check ", self.format_url(url_pair),
                          "  from", self.todo[url_pair])
            else:
                self.message("Check %s", self.format_url(url_pair))
        url, local_fragment = url_pair
        if local_fragment and self.nonames:
            self.markdone(url_pair)
            return
        page = self.getpage(url_pair)
        if page:
            # Store the page which corresponds to this URL.
            self.name_table[url] = page
            # If there is a fragment in this url_pair, and it's not
            # in the list of names for the page, call setbad(), since
            # it's a missing anchor.
            if local_fragment and local_fragment not in page.getnames():
                self.setbad(url_pair, ("Missing name anchor `%s'" % local_fragment))
            for info in page.getlinkinfos():
                # getlinkinfos() now returns the fragment as well,
                # and we store that fragment here in the "todo" dictionary.
                link, rawlink, fragment = info
                # However, we don't want the fragment as the origin, since
                # the origin is logically a page.
                origin = url, rawlink
                self.newlink((link, fragment), origin)
        else:
            # If no page has been created yet, we want to
            # record that fact.
            self.name_table[url_pair[0]] = None
        self.markdone(url_pair)

    def newlink(self, url, origin):
        if self.done.has_key(url):
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        if origin not in self.done[url]:
            self.done[url].append(origin)

        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        self.note(3, "  Done link %s", self.format_url(url))

        # Make sure that if it's bad, the origin gets added.
        if self.bad.has_key(url):
            source, rawlink = origin
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def newtodolink(self, url, origin):
        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        if self.todo.has_key(url):
            if origin not in self.todo[url]:
                self.todo[url].append(origin)
            self.note(3, "  Seen todo link %s", self.format_url(url))
        else:
            self.todo[url] = [origin]
            self.note(3, "  New todo link %s", self.format_url(url))

    def format_url(self, url):
        link, fragment = url
        if fragment: return link + "#" + fragment
        else: return link

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

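    # A URL counts as internal if it starts with one of the roots and
    # the corresponding robots.txt allows this agent to fetch it.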
    def inroots(self, url):
        for root in self.roots:
            if url[:len(root)] == root:
                return self.isallowed(root, url)
        return 0

    def isallowed(self, root, url):
        root = urlparse.urljoin(root, "/")
        return self.robots[root].can_fetch(AGENTNAME, url)

    def getpage(self, url_pair):
        # Incoming argument name is a (URL, fragment) pair.
        # The page may have been cached in the name_table variable.
        url, fragment = url_pair
        if self.name_table.has_key(url):
            return self.name_table[url]

        scheme, path = urllib.splittype(url)
        if scheme in ('mailto', 'news', 'javascript', 'telnet'):
            self.note(1, " Not checking %s URL" % scheme)
            return None
        isint = self.inroots(url)

        # Ensure that openpage gets the URL pair to
        # print out its error message and record the error pair
        # correctly.
        if not isint:
            if not self.checkext:
                self.note(1, " Not checking ext link")
                return None
            f = self.openpage(url_pair)
            if f:
                self.safeclose(f)
            return None
        text, nurl = self.readhtml(url_pair)

        if nurl != url:
            self.note(1, " Redirected to %s", nurl)
            url = nurl
        if text:
            return Page(text, url, maxpage=self.maxpage, checker=self)

    # These next three functions take (URL, fragment) pairs as
    # arguments, so that openpage() receives the appropriate tuple to
    # record error messages.
    def readhtml(self, url_pair):
        url, fragment = url_pair
        text = None
        f, url = self.openhtml(url_pair)
        if f:
            text = f.read()
            f.close()
        return text, url

    def openhtml(self, url_pair):
        url, fragment = url_pair
        f = self.openpage(url_pair)
        if f:
            url = f.geturl()
            info = f.info()
            if not self.checkforhtml(info, url):
                self.safeclose(f)
                f = None
        return f, url

    def openpage(self, url_pair):
        url, fragment = url_pair
        try:
            return self.urlopener.open(url)
        except (OSError, IOError), msg:
            msg = self.sanitize(msg)
            self.note(0, "Error %s", msg)
            if self.verbose > 0:
                self.show(" HREF ", url, "  from", self.todo[url_pair])
            self.setbad(url_pair, msg)
            return None

    def checkforhtml(self, info, url):
        if info.has_key('content-type'):
            ctype = string.lower(cgi.parse_header(info['content-type'])[0])
        else:
            if url[-1:] == "/":
                return 1
            ctype, encoding = mimetypes.guess_type(url)
        if ctype == 'text/html':
            return 1
        else:
            self.note(1, " Not HTML, mime type %s", ctype)
            return 0

    def setgood(self, url):
        if self.bad.has_key(url):
            del self.bad[url]
            self.changed = 1
            self.note(0, "(Clear previously seen error)")

    def setbad(self, url, msg):
        if self.bad.has_key(url) and self.bad[url] == msg:
            self.note(0, "(Seen this error before)")
            return
        self.bad[url] = msg
        self.changed = 1
        self.markerror(url)

    def markerror(self, url):
        try:
            origins = self.todo[url]
        except KeyError:
            origins = self.done[url]
        for source, rawlink in origins:
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def seterror(self, url, triple):
        try:
            # Because of the way the URLs are now processed, I need to
            # check to make sure the URL hasn't been entered in the
            # error list.  The first element of the triple here is a
            # (URL, fragment) pair, but the URL key is not, since it's
            # from the list of origins.
            if triple not in self.errors[url]:
                self.errors[url].append(triple)
        except KeyError:
            self.errors[url] = [triple]

    # The following used to be toplevel functions; they have been
    # changed into methods so they can be overridden in subclasses.

    def show(self, p1, link, p2, origins):
        self.message("%s %s", p1, link)
        i = 0
        for source, rawlink in origins:
            i = i+1
            if i == 2:
                p2 = ' '*len(p2)
            if rawlink != link: s = " (%s)" % rawlink
            else: s = ""
            self.message("%s %s%s", p2, source, s)

    def sanitize(self, msg):
        if isinstance(IOError, ClassType) and isinstance(msg, IOError):
            # Do the other branch recursively
            msg.args = self.sanitize(msg.args)
        elif isinstance(msg, TupleType):
            if len(msg) >= 4 and msg[0] == 'http error' and \
               isinstance(msg[3], InstanceType):
                # Remove the Message instance -- it may contain
                # a file object which prevents pickling.
                msg = msg[:3] + msg[4:]
        return msg

    def safeclose(self, f):
        try:
            url = f.geturl()
        except AttributeError:
            pass
        else:
            if url[:4] == 'ftp:' or url[:7] == 'file://':
                # Apparently ftp connections don't like to be closed
                # prematurely...
                text = f.read()
        f.close()

    def save_pickle(self, dumpfile=DUMPFILE):
        if not self.changed:
            self.note(0, "\nNo need to save checkpoint")
        elif not dumpfile:
            self.note(0, "No dumpfile, won't save checkpoint")
        else:
            self.note(0, "\nSaving checkpoint to %s ...", dumpfile)
            newfile = dumpfile + ".new"
            f = open(newfile, "wb")
            pickle.dump(self, f)
            f.close()
            try:
                os.unlink(dumpfile)
            except os.error:
                pass
            os.rename(newfile, dumpfile)
            self.note(0, "Done.")
            return 1


class Page:

    def __init__(self, text, url, verbose=VERBOSE, maxpage=MAXPAGE, checker=None):
        self.text = text
        self.url = url
        self.verbose = verbose
        self.maxpage = maxpage
        self.checker = checker

        # The page is parsed here in __init__() in order to initialize
        # the list of names the file contains.  The parser is stored in
        # an instance variable, and the URL is passed to MyHTMLParser().
        size = len(self.text)
        if size > self.maxpage:
            self.note(0, "Skip huge file %s (%.0f Kbytes)", self.url, (size*0.001))
            self.parser = None
            return
        self.checker.note(2, "  Parsing %s (%d bytes)", self.url, size)
        self.parser = MyHTMLParser(url, verbose=self.verbose,
                                   checker=self.checker)
        self.parser.feed(self.text)
        self.parser.close()

    def note(self, level, msg, *args):
        if self.checker:
            apply(self.checker.note, (level, msg) + args)
        else:
            if self.verbose >= level:
                if args:
                    msg = msg%args
                print msg

    # Method to retrieve names.
    def getnames(self):
        if self.parser:
            return self.parser.names
        else:
            return []

    def getlinkinfos(self):
        # File reading is done in the __init__() routine.  The parser
        # stored there (or None) indicates whether parsing succeeded.

        # If no parser was stored, fail.
        if not self.parser: return []

        rawlinks = self.parser.getlinks()
        base = urlparse.urljoin(self.url, self.parser.getbase() or "")
        infos = []
        for rawlink in rawlinks:
            t = urlparse.urlparse(rawlink)
            # DON'T DISCARD THE FRAGMENT! Instead, include
            # it in the tuples which are returned.  See Checker.dopage().
            fragment = t[-1]
            t = t[:-1] + ('',)
            rawlink = urlparse.urlunparse(t)
            link = urlparse.urljoin(base, rawlink)
            infos.append((link, rawlink, fragment))

        return infos


class MyStringIO(StringIO.StringIO):

    def __init__(self, url, info):
        self.__url = url
        self.__info = info
        StringIO.StringIO.__init__(self)

    def info(self):
        return self.__info

    def geturl(self):
        return self.__url


class MyURLopener(urllib.FancyURLopener):

    http_error_default = urllib.URLopener.http_error_default

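    # __init__ takes *args and extracts self by hand so that the whole
    # argument tuple can be forwarded unchanged to the base class
    # initializer with apply().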
    def __init__(*args):
        self = args[0]
        apply(urllib.FancyURLopener.__init__, args)
        self.addheaders = [
            ('User-agent', 'Python-webchecker/%s' % __version__),
            ]

    def http_error_401(self, url, fp, errcode, errmsg, headers):
        return None

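    # Mimic an HTTP daemon for file: URLs, as described in the module
    # docstring: serve index.html for a directory if it exists,
    # otherwise generate a simple HTML directory listing.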
    def open_file(self, url):
        path = urllib.url2pathname(urllib.unquote(url))
        if os.path.isdir(path):
            if path[-1] != os.sep:
                url = url + '/'
            indexpath = os.path.join(path, "index.html")
            if os.path.exists(indexpath):
                return self.open_file(url + "index.html")
            try:
                names = os.listdir(path)
            except os.error, msg:
                raise IOError, msg, sys.exc_traceback
            names.sort()
            s = MyStringIO("file:"+url, {'content-type': 'text/html'})
            s.write('<BASE HREF="file:%s">\n' %
                    urllib.quote(os.path.join(path, "")))
            for name in names:
                q = urllib.quote(name)
                s.write('<A HREF="%s">%s</A>\n' % (q, q))
            s.seek(0)
            return s
        return urllib.FancyURLopener.open_file(self, url)


class MyHTMLParser(sgmllib.SGMLParser):

    def __init__(self, url, verbose=VERBOSE, checker=None):
        self.myverbose = verbose # now unused
        self.checker = checker
        self.base = None
        self.links = {}
        self.names = []
        self.url = url
        sgmllib.SGMLParser.__init__(self)

    def start_a(self, attributes):
        self.link_attr(attributes, 'href')

        # We must rescue the NAME attributes from the anchor,
        # in order to cache the internal anchors which are made
        # available in the page.
        for name, value in attributes:
            if name == "name":
                if value in self.names:
                    self.checker.message("WARNING: duplicate name %s in %s",
                                         value, self.url)
                else: self.names.append(value)
                break

    def end_a(self): pass

    def do_area(self, attributes):
        self.link_attr(attributes, 'href')

    def do_body(self, attributes):
        self.link_attr(attributes, 'background', 'bgsound')

    def do_img(self, attributes):
        self.link_attr(attributes, 'src', 'lowsrc')

    def do_frame(self, attributes):
        self.link_attr(attributes, 'src', 'longdesc')

    def do_iframe(self, attributes):
        self.link_attr(attributes, 'src', 'longdesc')

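    # Only follow <LINK> tags that point to stylesheets
    # (rel="stylesheet" or rel="alternate stylesheet").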
    def do_link(self, attributes):
        for name, value in attributes:
            if name == "rel":
                parts = string.split(string.lower(value))
                if (parts == ["stylesheet"] or
                        parts == ["alternate", "stylesheet"]):
                    self.link_attr(attributes, "href")
                break

    def do_object(self, attributes):
        self.link_attr(attributes, 'data', 'usemap')

    def do_script(self, attributes):
        self.link_attr(attributes, 'src')

    def do_table(self, attributes):
        self.link_attr(attributes, 'background')

    def do_td(self, attributes):
        self.link_attr(attributes, 'background')

    def do_th(self, attributes):
        self.link_attr(attributes, 'background')

    def do_tr(self, attributes):
        self.link_attr(attributes, 'background')

    def link_attr(self, attributes, *args):
        for name, value in attributes:
            if name in args:
                if value: value = string.strip(value)
                if value: self.links[value] = None

    def do_base(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = string.strip(value)
                if value:
                    if self.checker:
                        self.checker.note(1, "  Base %s", value)
                    self.base = value

    def getlinks(self):
        return self.links.keys()

    def getbase(self):
        return self.base


if __name__ == '__main__':
    main()