#! /usr/bin/env python

# Original code by Guido van Rossum; extensive changes by Sam Bayer,
# including code to check URL fragments.

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors. A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.

File URL extension:

In order to ease the checking of subwebs via the local file system,
the interpretation of ``file:'' URLs is extended to mimic the behavior
of your average HTTP daemon: if a directory pathname is given, the
file index.html in that directory is returned if it exists, otherwise
a directory listing is returned. Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked. In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).

Report printed:

When done, it reports pages with bad links within the subweb. When
interrupted, it reports on the pages that it has already checked.

In verbose mode, additional messages are printed during the
information gathering phase. By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered. Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file; the -R option allows it to restart from the
checkpoint (assuming that the pages on the subweb that were already
processed haven't changed). Even when it has run till completion, -R
can still be useful -- it will print the reports again, and -Rq prints
the errors only. In this case, the checkpoint file is not written
again. The checkpoint file can be set with the -d option.

The checkpoint file is written as a Python pickle. Remember that
Python's pickle module is currently quite slow. Give it the time it
needs to load and save the checkpoint file. When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.

Miscellaneous:

- You may find the (Tk-based) GUI version easier to use. See wcgui.py.

- Webchecker honors the "robots.txt" convention. Thanks to Skip
Montanaro for his robotparser.py module (included in this directory)!
The agent name is hardwired to "webchecker". URLs that are disallowed
by the robots.txt file are reported as external URLs.

- Because the SGML parser is a bit slow, very large SGML files are
skipped. The size limit can be set with the -m option.

- When the server or protocol does not tell us a file's type, we guess
it based on the URL's suffix. The mimetypes.py module (also in this
directory) has a built-in table mapping most currently known suffixes,
and in addition attempts to read the mime.types configuration files in
the default locations of Netscape and the NCSA HTTP daemon.

- We follow links indicated by <A>, <FRAME> and <IMG> tags. We also
honor the <BASE> tag.

- We now check internal NAME anchor links, as well as toplevel links.

- Checking external links is now done by default; use -x to *disable*
this feature. External links are now checked during normal
processing. (XXX The status of a checked link could be categorized
better. Later...)

- If external links are not checked, you can use the -t flag to
provide specific overrides to -x.

Usage: webchecker.py [option] ... [rooturl] ...

Options:

-R        -- restart from checkpoint file
-d file   -- checkpoint filename (default %(DUMPFILE)s)
-m bytes  -- skip HTML pages larger than this size (default %(MAXPAGE)d)
-n        -- reports only, no checking (use with -R)
-q        -- quiet operation (also suppresses external links report)
-r number -- number of links processed per round (default %(ROUNDSIZE)d)
-t root   -- specify root dir which should be treated as internal (can repeat)
-v        -- verbose operation; repeating -v will increase verbosity
-x        -- don't check external links (these are often slow to check)
-a        -- don't check name anchors

Arguments:

rooturl   -- URL to start checking
             (default %(DEFROOT)s)

"""


__version__ = "$Revision$"


import sys
import os
from types import *
import StringIO
import getopt
import pickle

import urllib
import urlparse
import sgmllib
import cgi

import mimetypes
import robotparser

# Extract real version number if necessary
if __version__[0] == '$':
    _v = __version__.split()
    if len(_v) == 3:
        __version__ = _v[1]


# Tunable parameters
DEFROOT = "file:/usr/local/etc/httpd/htdocs/"   # Default root URL
CHECKEXT = 1                            # Check external references (1 deep)
VERBOSE = 1                             # Verbosity level (0-3)
MAXPAGE = 150000                        # Ignore files bigger than this
ROUNDSIZE = 50                          # Number of links processed per round
DUMPFILE = "@webchecker.pickle"         # Pickled checkpoint
AGENTNAME = "webchecker"                # Agent name for robots.txt parser
NONAMES = 0                             # Non-zero skips name anchor checking (see -a)

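# Illustrative sketch, not part of the original tool: the module docstring
# says a page belongs to the subweb if one of the root URLs is an initial
# prefix of its URL. The helper below shows only that prefix test. Its
# name, _example_in_subweb, is hypothetical, and the robots.txt check that
# the checker also applies to internal URLs is deliberately left out.
def _example_in_subweb(url, roots):
    # Return True if any root URL is an initial prefix of url.
    for root in roots:
        if url[:len(root)] == root:
            return True
    return False

# With roots ["http://example.org/docs/"], a page such as
# http://example.org/docs/a.html counts as internal, while anything under
# http://example.org/other/ would be treated as an external link.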

# Global variables


def main():
    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    dumpfile = DUMPFILE
    restart = 0
    norun = 0

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:t:vxa')
    except getopt.error, msg:
        sys.stdout = sys.stderr
        print msg
        print __doc__%globals()
        sys.exit(2)

    # The extra_roots variable collects extra roots.
    extra_roots = []
    nonames = NONAMES

    for o, a in opts:
        if o == '-R':
            restart = 1
        if o == '-d':
            dumpfile = a
        if o == '-m':
            maxpage = int(a)
        if o == '-n':
            norun = 1
        if o == '-q':
            verbose = 0
        if o == '-r':
            roundsize = int(a)
        if o == '-t':
            extra_roots.append(a)
        if o == '-a':
            nonames = not nonames
        if o == '-v':
            verbose = verbose + 1
        if o == '-x':
            checkext = not checkext

    if verbose > 0:
        print AGENTNAME, "version", __version__

    if restart:
        c = load_pickle(dumpfile=dumpfile, verbose=verbose)
    else:
        c = Checker()

    c.setflags(checkext=checkext, verbose=verbose,
               maxpage=maxpage, roundsize=roundsize,
               nonames=nonames
               )

    if not restart and not args:
        args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    # The -t flag is only needed if external links are not to be
    # checked. So -t values are ignored unless -x was specified.
    if not checkext:
        for root in extra_roots:
            # Make sure it's terminated by a slash,
            # so that addroot doesn't discard the last
            # directory component.
            if root[-1] != "/":
                root = root + "/"
            c.addroot(root, add_to_do = 0)

    try:

        if not norun:
            try:
                c.run()
            except KeyboardInterrupt:
                if verbose > 0:
                    print "[run interrupted]"

        try:
            c.report()
        except KeyboardInterrupt:
            if verbose > 0:
                print "[report interrupted]"

    finally:
        if c.save_pickle(dumpfile):
            if dumpfile == DUMPFILE:
                print "Use ``%s -R'' to restart." % sys.argv[0]
            else:
                print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
                                                           dumpfile)


def load_pickle(dumpfile=DUMPFILE, verbose=VERBOSE):
    if verbose > 0:
        print "Loading checkpoint from %s ..." % dumpfile
    f = open(dumpfile, "rb")
    c = pickle.load(f)
    f.close()
    if verbose > 0:
        print "Done."
        print "Root:", "\n ".join(c.roots)
    return c

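# Illustrative sketch, not part of the original tool: the docstring notes
# that an interrupted checkpoint write never clobbers the old checkpoint.
# One way to get that property is to pickle to a sibling ".new" file and
# only then move it into place, which is also what Checker.save_pickle()
# does. The helper name and the plain "state" argument are hypothetical;
# the real tool pickles the Checker instance itself.
def _example_checkpoint_roundtrip(state, dumpfile):
    import os
    import pickle
    newfile = dumpfile + ".new"
    f = open(newfile, "wb")
    try:
        pickle.dump(state, f)       # an interruption here leaves dumpfile intact
    finally:
        f.close()
    if os.path.exists(dumpfile):
        os.unlink(dumpfile)         # mirrors save_pickle(): remove, then rename
    os.rename(newfile, dumpfile)
    f = open(dumpfile, "rb")
    try:
        return pickle.load(f)       # what load_pickle() would read back
    finally:
        f.close()

# The value returned should compare equal to the state that was passed in,
# and no ".new" file should be left behind after a successful save.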

class Checker:

    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    nonames = NONAMES

    validflags = tuple(dir())

    def __init__(self):
        self.reset()

    def setflags(self, **kw):
        for key in kw.keys():
            if key not in self.validflags:
                raise NameError, "invalid keyword argument: %s" % str(key)
        for key, value in kw.items():
            setattr(self, key, value)

    def reset(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.bad = {}

        # Add a name table, so that the name URLs can be checked. Also
        # serves as an implicit cache for which URLs are done.
        self.name_table = {}

        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.errors = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def note(self, level, format, *args):
        if self.verbose > level:
            if args:
                format = format%args
            self.message(format)

    def message(self, format, *args):
        if args:
            format = format%args
        print format

    def __getstate__(self):
        return (self.roots, self.todo, self.done, self.bad, self.round)

    def __setstate__(self, state):
        self.reset()
        (self.roots, self.todo, self.done, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)
        for url in self.bad.keys():
            self.markerror(url)

    def addroot(self, root, add_to_do = 1):
        if root not in self.roots:
            troot = root
            scheme, netloc, path, params, query, fragment = \
                    urlparse.urlparse(root)
            i = path.rfind("/") + 1
            if 0 < i < len(path):
                path = path[:i]
                troot = urlparse.urlunparse((scheme, netloc, path,
                                             params, query, fragment))
            self.roots.append(troot)
            self.addrobot(root)
            if add_to_do:
                self.newlink((root, ""), ("<root>", root))

    def addrobot(self, root):
        root = urlparse.urljoin(root, "/")
        if self.robots.has_key(root): return
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        self.note(2, "Parsing %s", url)
        rp.debug = self.verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except (OSError, IOError), msg:
            self.note(1, "I/O error parsing %s: %s", url, msg)

    def run(self):
        while self.todo:
            self.round = self.round + 1
            self.note(0, "\nRound %d (%s)\n", self.round, self.status())
            urls = self.todo.keys()
            urls.sort()
            del urls[self.roundsize:]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "%d total, %d to do, %d done, %d bad" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.bad))

    def report(self):
        self.message("")
        if not self.todo: s = "Final"
        else: s = "Interim"
        self.message("%s Report (%s)", s, self.status())
        self.report_errors()

    def report_errors(self):
        if not self.bad:
            self.message("\nNo errors")
            return
        self.message("\nError Report:")
        sources = self.errors.keys()
        sources.sort()
        for source in sources:
            triples = self.errors[source]
            self.message("")
            if len(triples) > 1:
                self.message("%d Errors in %s", len(triples), source)
            else:
                self.message("Error in %s", source)
            # Call self.format_url() instead of referring
            # to the URL directly, since the URLs in these
            # triples are now (URL, fragment) pairs. The value
            # of the "source" variable comes from the list of
            # origins, and is a URL, not a pair.
            for url, rawlink, msg in triples:
                if rawlink != self.format_url(url): s = " (%s)" % rawlink
                else: s = ""
                self.message(" HREF %s%s\n msg %s",
                             self.format_url(url), s, msg)

    def dopage(self, url_pair):

        # All printing of URLs uses format_url(); argument changed to
        # url_pair for clarity.
        if self.verbose > 1:
            if self.verbose > 2:
                self.show("Check ", self.format_url(url_pair),
                          " from", self.todo[url_pair])
            else:
                self.message("Check %s", self.format_url(url_pair))
        url, local_fragment = url_pair
        if local_fragment and self.nonames:
            self.markdone(url_pair)
            return
        try:
            page = self.getpage(url_pair)
        except sgmllib.SGMLParseError, msg:
            msg = self.sanitize(msg)
            self.note(0, "Error parsing %s: %s",
                      self.format_url(url_pair), msg)
            # Don't actually mark the URL as bad: it exists, we just
            # can't parse it!
            page = None
        if page:
            # Store the page which corresponds to this URL.
            self.name_table[url] = page
            # If there is a fragment in this url_pair, and it's not
            # in the list of names for the page, call setbad(), since
            # it's a missing anchor.
            if local_fragment and local_fragment not in page.getnames():
                self.setbad(url_pair, ("Missing name anchor `%s'" % local_fragment))
            for info in page.getlinkinfos():
                # getlinkinfos() now returns the fragment as well,
                # and we store that fragment here in the "todo" dictionary.
                link, rawlink, fragment = info
                # However, we don't want the fragment as the origin, since
                # the origin is logically a page.
                origin = url, rawlink
                self.newlink((link, fragment), origin)
        else:
            # If no page has been created yet, we want to
            # record that fact.
            self.name_table[url_pair[0]] = None
        self.markdone(url_pair)

    def newlink(self, url, origin):
        if self.done.has_key(url):
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        if origin not in self.done[url]:
            self.done[url].append(origin)

        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        self.note(3, " Done link %s", self.format_url(url))

        # Make sure that if it's bad, that the origin gets added.
        if self.bad.has_key(url):
            source, rawlink = origin
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def newtodolink(self, url, origin):
        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        if self.todo.has_key(url):
            if origin not in self.todo[url]:
                self.todo[url].append(origin)
            self.note(3, " Seen todo link %s", self.format_url(url))
        else:
            self.todo[url] = [origin]
            self.note(3, " New todo link %s", self.format_url(url))

    def format_url(self, url):
        link, fragment = url
        if fragment: return link + "#" + fragment
        else: return link

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

    def inroots(self, url):
        for root in self.roots:
            if url[:len(root)] == root:
                return self.isallowed(root, url)
        return 0

    def isallowed(self, root, url):
        root = urlparse.urljoin(root, "/")
        return self.robots[root].can_fetch(AGENTNAME, url)

    def getpage(self, url_pair):
        # Incoming argument name is a (URL, fragment) pair.
        # The page may have been cached in the name_table variable.
        url, fragment = url_pair
        if self.name_table.has_key(url):
            return self.name_table[url]

        scheme, path = urllib.splittype(url)
        if scheme in ('mailto', 'news', 'javascript', 'telnet'):
            self.note(1, " Not checking %s URL" % scheme)
            return None
        isint = self.inroots(url)

        # Ensure that openpage gets the URL pair to
        # print out its error message and record the error pair
        # correctly.
        if not isint:
            if not self.checkext:
                self.note(1, " Not checking ext link")
                return None
            f = self.openpage(url_pair)
            if f:
                self.safeclose(f)
            return None
        text, nurl = self.readhtml(url_pair)

        if nurl != url:
            self.note(1, " Redirected to %s", nurl)
            url = nurl
        if text:
            return Page(text, url, maxpage=self.maxpage, checker=self)

    # These next three functions take (URL, fragment) pairs as
    # arguments, so that openpage() receives the appropriate tuple to
    # record error messages.
    def readhtml(self, url_pair):
        url, fragment = url_pair
        text = None
        f, url = self.openhtml(url_pair)
        if f:
            text = f.read()
            f.close()
        return text, url

    def openhtml(self, url_pair):
        url, fragment = url_pair
        f = self.openpage(url_pair)
        if f:
            url = f.geturl()
            info = f.info()
            if not self.checkforhtml(info, url):
                self.safeclose(f)
                f = None
        return f, url

    def openpage(self, url_pair):
        url, fragment = url_pair
        try:
            return self.urlopener.open(url)
        except (OSError, IOError), msg:
            msg = self.sanitize(msg)
            self.note(0, "Error %s", msg)
            if self.verbose > 0:
                self.show(" HREF ", url, " from", self.todo[url_pair])
            self.setbad(url_pair, msg)
            return None

    def checkforhtml(self, info, url):
        if info.has_key('content-type'):
            ctype = cgi.parse_header(info['content-type'])[0].lower()
            if ';' in ctype:
                # handle content-type: text/html; charset=iso8859-1 :
                ctype = ctype.split(';', 1)[0].strip()
        else:
            if url[-1:] == "/":
                return 1
            ctype, encoding = mimetypes.guess_type(url)
        if ctype == 'text/html':
            return 1
        else:
            self.note(1, " Not HTML, mime type %s", ctype)
            return 0

    def setgood(self, url):
        if self.bad.has_key(url):
            del self.bad[url]
            self.changed = 1
            self.note(0, "(Clear previously seen error)")

    def setbad(self, url, msg):
        if self.bad.has_key(url) and self.bad[url] == msg:
            self.note(0, "(Seen this error before)")
            return
        self.bad[url] = msg
        self.changed = 1
        self.markerror(url)

    def markerror(self, url):
        try:
            origins = self.todo[url]
        except KeyError:
            origins = self.done[url]
        for source, rawlink in origins:
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def seterror(self, url, triple):
        try:
            # Because of the way the URLs are now processed, I need to
            # check to make sure the URL hasn't been entered in the
            # error list. The first element of the triple here is a
            # (URL, fragment) pair, but the URL key is not, since it's
            # from the list of origins.
            if triple not in self.errors[url]:
                self.errors[url].append(triple)
        except KeyError:
            self.errors[url] = [triple]

Guido van Rossum00756bd1998-02-21 20:02:09 +0000603 # The following used to be toplevel functions; they have been
604 # changed into methods so they can be overridden in subclasses.
605
606 def show(self, p1, link, p2, origins):
Guido van Rossum125700a1998-07-08 03:04:39 +0000607 self.message("%s %s", p1, link)
Guido van Rossum986abac1998-04-06 14:29:28 +0000608 i = 0
609 for source, rawlink in origins:
610 i = i+1
611 if i == 2:
612 p2 = ' '*len(p2)
Guido van Rossum125700a1998-07-08 03:04:39 +0000613 if rawlink != link: s = " (%s)" % rawlink
614 else: s = ""
615 self.message("%s %s%s", p2, source, s)
Guido van Rossum00756bd1998-02-21 20:02:09 +0000616
617 def sanitize(self, msg):
Guido van Rossum986abac1998-04-06 14:29:28 +0000618 if isinstance(IOError, ClassType) and isinstance(msg, IOError):
619 # Do the other branch recursively
620 msg.args = self.sanitize(msg.args)
621 elif isinstance(msg, TupleType):
622 if len(msg) >= 4 and msg[0] == 'http error' and \
623 isinstance(msg[3], InstanceType):
624 # Remove the Message instance -- it may contain
625 # a file object which prevents pickling.
626 msg = msg[:3] + msg[4:]
627 return msg
Guido van Rossum00756bd1998-02-21 20:02:09 +0000628
629 def safeclose(self, f):
Guido van Rossum986abac1998-04-06 14:29:28 +0000630 try:
631 url = f.geturl()
632 except AttributeError:
633 pass
634 else:
635 if url[:4] == 'ftp:' or url[:7] == 'file://':
636 # Apparently ftp connections don't like to be closed
637 # prematurely...
638 text = f.read()
639 f.close()
Guido van Rossum00756bd1998-02-21 20:02:09 +0000640
    def save_pickle(self, dumpfile=DUMPFILE):
        if not self.changed:
            self.note(0, "\nNo need to save checkpoint")
        elif not dumpfile:
            self.note(0, "No dumpfile, won't save checkpoint")
        else:
            self.note(0, "\nSaving checkpoint to %s ...", dumpfile)
            newfile = dumpfile + ".new"
            f = open(newfile, "wb")
            pickle.dump(self, f)
            f.close()
            try:
                os.unlink(dumpfile)
            except os.error:
                pass
            os.rename(newfile, dumpfile)
            self.note(0, "Done.")
            return 1

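# Aside (illustrative sketch, not part of webchecker): save_pickle()
# writes the checkpoint to a ".new" file and only then renames it over
# the old one, so an interruption in mid-write should not destroy the
# previous checkpoint.  The same idea might look like this in Python 3.
# The helper name save_checkpoint is hypothetical, and os.replace() is
# swapped in for the unlink-then-rename pair because it overwrites an
# existing target in one call.

```python
import os
import pickle


def save_checkpoint(state, dumpfile):
    """Hypothetical helper: pickle `state` to `dumpfile` via a temporary file."""
    newfile = dumpfile + ".new"
    # Write the complete pickle to a side file first.
    with open(newfile, "wb") as f:
        pickle.dump(state, f)
    # Only a fully written file ever takes the checkpoint's name.
    os.replace(newfile, dumpfile)
    return True
```

# If the write fails part-way, only the ".new" file is affected and the
# old checkpoint can still be loaded.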

class Page:

    def __init__(self, text, url, verbose=VERBOSE, maxpage=MAXPAGE, checker=None):
        self.text = text
        self.url = url
        self.verbose = verbose
        self.maxpage = maxpage
        self.checker = checker

        # Parse the page here, in __init__(), so that the list of
        # names the page contains is available right away.  The
        # parser is kept in an instance variable (None if the page
        # was skipped), and the URL is passed to MyHTMLParser().
        size = len(self.text)
        if size > self.maxpage:
            self.note(0, "Skip huge file %s (%.0f Kbytes)", self.url, (size*0.001))
            self.parser = None
            return
        # Use self.note() rather than self.checker.note(): checker
        # defaults to None, and note() handles that case.
        self.note(2, " Parsing %s (%d bytes)", self.url, size)
        self.parser = MyHTMLParser(url, verbose=self.verbose,
                                   checker=self.checker)
        self.parser.feed(self.text)
        self.parser.close()

    def note(self, level, msg, *args):
        if self.checker:
            apply(self.checker.note, (level, msg) + args)
        else:
            if self.verbose >= level:
                if args:
                    msg = msg%args
                print msg

    # Method to retrieve names.
    def getnames(self):
        if self.parser:
            return self.parser.names
        else:
            return []

    def getlinkinfos(self):
        # Parsing is done in __init__().  self.parser is None when the
        # page was skipped, in which case there are no links to report.
        if not self.parser: return []

        rawlinks = self.parser.getlinks()
        base = urlparse.urljoin(self.url, self.parser.getbase() or "")
        infos = []
        for rawlink in rawlinks:
            t = urlparse.urlparse(rawlink)
            # DON'T DISCARD THE FRAGMENT! Instead, include
            # it in the tuples which are returned. See Checker.dopage().
            fragment = t[-1]
            t = t[:-1] + ('',)
            rawlink = urlparse.urlunparse(t)
            link = urlparse.urljoin(base, rawlink)
            infos.append((link, rawlink, fragment))

        return infos

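# Aside (illustrative sketch, not part of webchecker): getlinkinfos()
# splits each raw link into a fragment-free URL plus the fragment, then
# resolves the URL against the page's base.  A minimal Python 3 version
# of that one step, assuming urllib.parse in place of the Python 2
# urlparse module; the function name link_info and the example.com
# URLs are made up for illustration.

```python
from urllib.parse import urljoin, urlparse, urlunparse


def link_info(base, rawlink):
    """Return (link, rawlink without fragment, fragment) for one raw link."""
    t = urlparse(rawlink)
    # The fragment is the last of the six components; keep it separately.
    fragment = t[-1]
    stripped = urlunparse(t[:-1] + ("",))
    # Resolve the fragment-free link against the page's base URL.
    link = urljoin(base, stripped)
    return link, stripped, fragment


print(link_info("http://example.com/docs/index.html", "intro.html#setup"))
# ('http://example.com/docs/intro.html', 'intro.html', 'setup')
```

# Keeping the fragment separate lets the caller fetch the target page
# once and then check the fragment against that page's anchor names.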

class MyStringIO(StringIO.StringIO):

    def __init__(self, url, info):
        self.__url = url
        self.__info = info
        StringIO.StringIO.__init__(self)

    def info(self):
        return self.__info

    def geturl(self):
        return self.__url


class MyURLopener(urllib.FancyURLopener):

    http_error_default = urllib.URLopener.http_error_default

    def __init__(*args):
        self = args[0]
        apply(urllib.FancyURLopener.__init__, args)
        self.addheaders = [
            ('User-agent', 'Python-webchecker/%s' % __version__),
            ]

    def http_error_401(self, url, fp, errcode, errmsg, headers):
        return None

    def open_file(self, url):
        path = urllib.url2pathname(urllib.unquote(url))
        if os.path.isdir(path):
            if path[-1] != os.sep:
                url = url + '/'
            indexpath = os.path.join(path, "index.html")
            if os.path.exists(indexpath):
                return self.open_file(url + "index.html")
            try:
                names = os.listdir(path)
            except os.error, msg:
                raise IOError, msg, sys.exc_traceback
            names.sort()
            s = MyStringIO("file:"+url, {'content-type': 'text/html'})
            s.write('<BASE HREF="file:%s">\n' %
                    urllib.quote(os.path.join(path, "")))
            for name in names:
                q = urllib.quote(name)
                s.write('<A HREF="%s">%s</A>\n' % (q, q))
            s.seek(0)
            return s
        return urllib.FancyURLopener.open_file(self, url)

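# Aside (illustrative sketch, not part of webchecker): when a file: URL
# names a directory without an index.html, open_file() builds a small
# HTML listing in memory so the directory's entries get checked like
# ordinary links.  A rough Python 3 version of the listing step alone;
# the function name directory_listing is hypothetical, and
# urllib.parse.quote stands in for the Python 2 urllib.quote.

```python
import io
import os
from urllib.parse import quote


def directory_listing(path):
    """Return an HTML page with one anchor per entry in directory `path`."""
    names = sorted(os.listdir(path))
    out = io.StringIO()
    # The BASE element makes the relative entry names resolve inside `path`.
    out.write('<BASE HREF="file:%s">\n' % quote(os.path.join(path, "")))
    for name in names:
        q = quote(name)  # percent-encode spaces and other unsafe characters
        out.write('<A HREF="%s">%s</A>\n' % (q, q))
    return out.getvalue()
```

# The result can be fed to the same HTML parser as any fetched page.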

class MyHTMLParser(sgmllib.SGMLParser):

    def __init__(self, url, verbose=VERBOSE, checker=None):
        self.myverbose = verbose # now unused
        self.checker = checker
        self.base = None
        self.links = {}
        self.names = []
        self.url = url
        sgmllib.SGMLParser.__init__(self)

    def start_a(self, attributes):
        self.link_attr(attributes, 'href')

        # We must rescue the NAME attributes from the anchor, in
        # order to cache the internal anchors which are made
        # available in the page.
        for name, value in attributes:
            if name == "name":
                if value in self.names:
                    self.checker.message("WARNING: duplicate name %s in %s",
                                         value, self.url)
                else: self.names.append(value)
                break

    def end_a(self): pass

    def do_area(self, attributes):
        self.link_attr(attributes, 'href')

    def do_body(self, attributes):
        self.link_attr(attributes, 'background', 'bgsound')

    def do_img(self, attributes):
        self.link_attr(attributes, 'src', 'lowsrc')

    def do_frame(self, attributes):
        self.link_attr(attributes, 'src', 'longdesc')

    def do_iframe(self, attributes):
        self.link_attr(attributes, 'src', 'longdesc')

    def do_link(self, attributes):
        for name, value in attributes:
            if name == "rel":
                parts = value.lower().split()
                if (parts == ["stylesheet"]
                    or parts == ["alternate", "stylesheet"]):
                    self.link_attr(attributes, "href")
                    break

    def do_object(self, attributes):
        self.link_attr(attributes, 'data', 'usemap')

    def do_script(self, attributes):
        self.link_attr(attributes, 'src')

    def do_table(self, attributes):
        self.link_attr(attributes, 'background')

    def do_td(self, attributes):
        self.link_attr(attributes, 'background')

    def do_th(self, attributes):
        self.link_attr(attributes, 'background')

    def do_tr(self, attributes):
        self.link_attr(attributes, 'background')

    def link_attr(self, attributes, *args):
        for name, value in attributes:
            if name in args:
                if value: value = value.strip()
                if value: self.links[value] = None

    def do_base(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = value.strip()
                if value:
                    if self.checker:
                        self.checker.note(1, " Base %s", value)
                    self.base = value

    def getlinks(self):
        return self.links.keys()

    def getbase(self):
        return self.base

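# Aside (illustrative sketch, not part of webchecker): MyHTMLParser
# collects link-bearing attributes per tag and remembers anchor names
# so fragments can be checked later.  sgmllib no longer exists in
# Python 3, so this sketch swaps in html.parser.HTMLParser.  The class
# name LinkCollector and its LINK_ATTRS table are hypothetical, and
# only a few of the tags handled by MyHTMLParser are covered.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical stand-in for MyHTMLParser, reduced to a few tags."""

    # Tag name -> attributes whose values are treated as links.
    LINK_ATTRS = {
        "a": ("href",),
        "area": ("href",),
        "img": ("src", "lowsrc"),
        "script": ("src",),
    }

    def __init__(self):
        super().__init__()
        self.links = {}   # used as an ordered set, like self.links in MyHTMLParser
        self.names = []   # anchor names defined in the page

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        wanted = self.LINK_ATTRS.get(tag, ())
        for name, value in attrs:
            if name in wanted and value and value.strip():
                self.links[value.strip()] = None
            if tag == "a" and name == "name" and value:
                if value not in self.names:
                    self.names.append(value)


collector = LinkCollector()
collector.feed('<A HREF=" page.html ">x</A><a name="top"></a>'
               '<img src="pic.png"><img src="pic.png">')
collector.close()
print(sorted(collector.links))  # ['page.html', 'pic.png']
print(collector.names)          # ['top']
```

# Storing links as dictionary keys removes duplicates, which is why the
# repeated image appears only once.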

if __name__ == '__main__':
    main()