#! /usr/bin/env python

# Original code by Guido van Rossum; extensive changes by Sam Bayer,
# including code to check URL fragments.

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors. A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.

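As a rough illustration of the prefix rule above (a sketch in
present-day Python 3, not part of this script; in_subweb is a
hypothetical name, and the real check additionally consults
robots.txt):

```python
def in_subweb(url, roots):
    # A page belongs to the subweb if some root URL is an initial
    # prefix of the page's URL.
    return any(url.startswith(root) for root in roots)

roots = ["http://www.example.com/docs/"]
inside = in_subweb("http://www.example.com/docs/a/b.html", roots)
outside = in_subweb("http://www.example.com/other.html", roots)
```
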
File URL extension:

In order to ease the checking of subwebs via the local file system,
the interpretation of ``file:'' URLs is extended to mimic the behavior
of your average HTTP daemon: if a directory pathname is given, the
file index.html in that directory is returned if it exists, otherwise
a directory listing is returned. Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked. In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).

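The directory rule described above could be sketched roughly as
follows (present-day Python 3, for illustration only;
resolve_directory is a hypothetical helper, not the code this script
uses to open ``file:'' URLs):

```python
import os

def resolve_directory(path):
    # Mimic an HTTP daemon: serve index.html if the directory has one.
    index = os.path.join(path, "index.html")
    if os.path.exists(index):
        return ("file", index)
    # Otherwise fall back to a listing of the directory entries.
    return ("listing", sorted(os.listdir(path)))
```
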
Report printed:

When done, it reports pages with bad links within the subweb. When
interrupted, it reports on the pages that it has already checked.

In verbose mode, additional messages are printed during the
information gathering phase. By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered. Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file, and the -R option allows it to restart from the
checkpoint (assuming that the pages on the subweb that were already
processed haven't changed). Even when it has run till completion, -R
can still be useful -- it will print the reports again, and -Rq prints
the errors only. In this case, the checkpoint file is not written
again. The checkpoint file can be set with the -d option.

The checkpoint file is written as a Python pickle. Remember that
Python's pickle module is currently quite slow. Give it the time it
needs to load and save the checkpoint file. When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.

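A minimal sketch of why an interrupted write leaves the old
checkpoint intact (present-day Python 3; save_checkpoint and
load_checkpoint are hypothetical names, and the state here is a
plain dictionary rather than the checker object):

```python
import os
import pickle

def save_checkpoint(state, dumpfile):
    # Write to a temporary name first; the old checkpoint is only
    # replaced once the new one has been written completely.
    newfile = dumpfile + ".new"
    with open(newfile, "wb") as f:
        pickle.dump(state, f)
    os.replace(newfile, dumpfile)

def load_checkpoint(dumpfile):
    with open(dumpfile, "rb") as f:
        return pickle.load(f)
```
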
Miscellaneous:

- You may find the (Tk-based) GUI version easier to use. See wcgui.py.

- Webchecker honors the "robots.txt" convention. Thanks to Skip
  Montanaro for his robotparser.py module (included in this directory)!
  The agent name is hardwired to "webchecker". URLs that are disallowed
  by the robots.txt file are reported as external URLs.

- Because the SGML parser is a bit slow, very large SGML files are
  skipped. The size limit can be set with the -m option.

- When the server or protocol does not tell us a file's type, we guess
  it based on the URL's suffix. The mimetypes.py module (also in this
  directory) has a built-in table mapping most currently known suffixes,
  and in addition attempts to read the mime.types configuration files in
  the default locations of Netscape and the NCSA HTTP daemon.

- We follow links indicated by <A>, <FRAME> and <IMG> tags. We also
  honor the <BASE> tag.

- We now check internal NAME anchor links, as well as toplevel links.

- Checking external links is now done by default; use -x to *disable*
  this feature. External links are now checked during normal
  processing. (XXX The status of a checked link could be categorized
  better. Later...)

- If external links are not checked, you can use the -t flag to
  provide specific overrides to -x.

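The suffix-based type guessing mentioned in the list above might look
roughly like this in present-day Python 3 (a sketch only;
looks_like_html is a hypothetical name, and stripping parameters such
as a charset from the reported type is an assumption of this sketch):

```python
import mimetypes

def looks_like_html(url, content_type=None):
    # Prefer the type reported by the server or protocol, if any.
    if content_type is None:
        if url.endswith("/"):
            return True  # assumption: a directory URL yields HTML
        # Otherwise guess from the URL's suffix.
        content_type, _encoding = mimetypes.guess_type(url)
    if content_type is None:
        return False
    return content_type.split(";")[0].strip().lower() == "text/html"
```
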
Usage: webchecker.py [option] ... [rooturl] ...

Options:

-R -- restart from checkpoint file
-d file -- checkpoint filename (default %(DUMPFILE)s)
-m bytes -- skip HTML pages larger than this size (default %(MAXPAGE)d)
-n -- reports only, no checking (use with -R)
-q -- quiet operation (also suppresses external links report)
-r number -- number of links processed per round (default %(ROUNDSIZE)d)
-t root -- specify root dir which should be treated as internal (can repeat)
-v -- verbose operation; repeating -v will increase verbosity
-x -- don't check external links (these are often slow to check)
-a -- don't check name anchors

Arguments:

rooturl -- URL to start checking
           (default %(DEFROOT)s)

"""


__version__ = "$Revision$"


import sys
import os
from types import *
import string
import StringIO
import getopt
import pickle

import urllib
import urlparse
import sgmllib

import mimetypes
import robotparser

# Extract real version number if necessary
if __version__[0] == '$':
    _v = string.split(__version__)
    if len(_v) == 3:
        __version__ = _v[1]


# Tunable parameters
DEFROOT = "file:/usr/local/etc/httpd/htdocs/"   # Default root URL
CHECKEXT = 1                            # Check external references (1 deep)
VERBOSE = 1                             # Verbosity level (0-3)
MAXPAGE = 150000                        # Ignore files bigger than this
ROUNDSIZE = 50                          # Number of links processed per round
DUMPFILE = "@webchecker.pickle"         # Pickled checkpoint
AGENTNAME = "webchecker"                # Agent name for robots.txt parser
NONAMES = 0                             # Force name anchor checking


# Global variables


def main():
    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    dumpfile = DUMPFILE
    restart = 0
    norun = 0

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:t:vxa')
    except getopt.error, msg:
        sys.stdout = sys.stderr
        print msg
        print __doc__%globals()
        sys.exit(2)

    # The extra_roots variable collects extra roots.
    extra_roots = []
    nonames = NONAMES

    for o, a in opts:
        if o == '-R':
            restart = 1
        if o == '-d':
            dumpfile = a
        if o == '-m':
            maxpage = string.atoi(a)
        if o == '-n':
            norun = 1
        if o == '-q':
            verbose = 0
        if o == '-r':
            roundsize = string.atoi(a)
        if o == '-t':
            extra_roots.append(a)
        if o == '-a':
            nonames = not nonames
        if o == '-v':
            verbose = verbose + 1
        if o == '-x':
            checkext = not checkext

    if verbose > 0:
        print AGENTNAME, "version", __version__

    if restart:
        c = load_pickle(dumpfile=dumpfile, verbose=verbose)
    else:
        c = Checker()

    c.setflags(checkext=checkext, verbose=verbose,
               maxpage=maxpage, roundsize=roundsize,
               nonames=nonames
               )

    if not restart and not args:
        args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    # The -t flag is only needed if external links are not to be
    # checked. So -t values are ignored unless -x was specified.
    if not checkext:
        for root in extra_roots:
            # Make sure it's terminated by a slash,
            # so that addroot doesn't discard the last
            # directory component.
            if root[-1] != "/":
                root = root + "/"
            c.addroot(root, add_to_do = 0)

    try:

        if not norun:
            try:
                c.run()
            except KeyboardInterrupt:
                if verbose > 0:
                    print "[run interrupted]"

        try:
            c.report()
        except KeyboardInterrupt:
            if verbose > 0:
                print "[report interrupted]"

    finally:
        if c.save_pickle(dumpfile):
            if dumpfile == DUMPFILE:
                print "Use ``%s -R'' to restart." % sys.argv[0]
            else:
                print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
                                                           dumpfile)


def load_pickle(dumpfile=DUMPFILE, verbose=VERBOSE):
    if verbose > 0:
        print "Loading checkpoint from %s ..." % dumpfile
    f = open(dumpfile, "rb")
    c = pickle.load(f)
    f.close()
    if verbose > 0:
        print "Done."
        print "Root:", string.join(c.roots, "\n ")
    return c


class Checker:

    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    nonames = NONAMES

    validflags = tuple(dir())

    def __init__(self):
        self.reset()

    def setflags(self, **kw):
        for key in kw.keys():
            if key not in self.validflags:
                raise NameError, "invalid keyword argument: %s" % str(key)
        for key, value in kw.items():
            setattr(self, key, value)

    def reset(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.bad = {}

        # Add a name table, so that the name URLs can be checked. Also
        # serves as an implicit cache for which URLs are done.
        self.name_table = {}

        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.errors = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def note(self, level, format, *args):
        if self.verbose > level:
            if args:
                format = format%args
            self.message(format)

    def message(self, format, *args):
        if args:
            format = format%args
        print format

    def __getstate__(self):
        return (self.roots, self.todo, self.done, self.bad, self.round)

    def __setstate__(self, state):
        self.reset()
        (self.roots, self.todo, self.done, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)
        for url in self.bad.keys():
            self.markerror(url)

    def addroot(self, root, add_to_do = 1):
        if root not in self.roots:
            troot = root
            scheme, netloc, path, params, query, fragment = \
                    urlparse.urlparse(root)
            i = string.rfind(path, "/") + 1
            if 0 < i < len(path):
                path = path[:i]
                troot = urlparse.urlunparse((scheme, netloc, path,
                                             params, query, fragment))
            self.roots.append(troot)
            self.addrobot(root)
            if add_to_do:
                self.newlink((root, ""), ("<root>", root))

    def addrobot(self, root):
        root = urlparse.urljoin(root, "/")
        if self.robots.has_key(root): return
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        self.note(2, "Parsing %s", url)
        rp.debug = self.verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except IOError, msg:
            self.note(1, "I/O error parsing %s: %s", url, msg)

    def run(self):
        while self.todo:
            self.round = self.round + 1
            self.note(0, "\nRound %d (%s)\n", self.round, self.status())
            urls = self.todo.keys()
            urls.sort()
            del urls[self.roundsize:]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "%d total, %d to do, %d done, %d bad" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.bad))

    def report(self):
        self.message("")
        if not self.todo: s = "Final"
        else: s = "Interim"
        self.message("%s Report (%s)", s, self.status())
        self.report_errors()

    def report_errors(self):
        if not self.bad:
            self.message("\nNo errors")
            return
        self.message("\nError Report:")
        sources = self.errors.keys()
        sources.sort()
        for source in sources:
            triples = self.errors[source]
            self.message("")
            if len(triples) > 1:
                self.message("%d Errors in %s", len(triples), source)
            else:
                self.message("Error in %s", source)
            # Call self.format_url() instead of referring
            # to the URL directly, since the URL in each of these
            # triples is a (URL, fragment) pair. The value
            # of the "source" variable comes from the list of
            # origins, and is a URL, not a pair.
            for url, rawlink, msg in triples:
                if rawlink != self.format_url(url): s = " (%s)" % rawlink
                else: s = ""
                self.message(" HREF %s%s\n msg %s",
                             self.format_url(url), s, msg)

    def dopage(self, url_pair):

        # All printing of URLs uses format_url(); argument changed to
        # url_pair for clarity.
        if self.verbose > 1:
            if self.verbose > 2:
                self.show("Check ", self.format_url(url_pair),
                          " from", self.todo[url_pair])
            else:
                self.message("Check %s", self.format_url(url_pair))
        url, local_fragment = url_pair
        if local_fragment and self.nonames:
            self.markdone(url_pair)
            return
        page = self.getpage(url_pair)
        if page:
            # Store the page which corresponds to this URL.
            self.name_table[url] = page
            # If there is a fragment in this url_pair, and it's not
            # in the list of names for the page, call setbad(), since
            # it's a missing anchor.
            if local_fragment and local_fragment not in page.getnames():
                self.setbad(url_pair, ("Missing name anchor `%s'" % local_fragment))
            for info in page.getlinkinfos():
                # getlinkinfos() now returns the fragment as well,
                # and we store that fragment here in the "todo" dictionary.
                link, rawlink, fragment = info
                # However, we don't want the fragment as the origin, since
                # the origin is logically a page.
                origin = url, rawlink
                self.newlink((link, fragment), origin)
        else:
            # If no page has been created yet, we want to
            # record that fact.
            self.name_table[url_pair[0]] = None
        self.markdone(url_pair)

    def newlink(self, url, origin):
        if self.done.has_key(url):
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        if origin not in self.done[url]:
            self.done[url].append(origin)

        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        self.note(3, " Done link %s", self.format_url(url))

        # If the link is known to be bad, make sure the new origin
        # is recorded in the error list as well.
        if self.bad.has_key(url):
            source, rawlink = origin
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def newtodolink(self, url, origin):
        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        if self.todo.has_key(url):
            if origin not in self.todo[url]:
                self.todo[url].append(origin)
            self.note(3, " Seen todo link %s", self.format_url(url))
        else:
            self.todo[url] = [origin]
            self.note(3, " New todo link %s", self.format_url(url))

    def format_url(self, url):
        link, fragment = url
        if fragment: return link + "#" + fragment
        else: return link

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

    def inroots(self, url):
        for root in self.roots:
            if url[:len(root)] == root:
                return self.isallowed(root, url)
        return 0

    def isallowed(self, root, url):
        root = urlparse.urljoin(root, "/")
        return self.robots[root].can_fetch(AGENTNAME, url)

    def getpage(self, url_pair):
        # The incoming argument is a (URL, fragment) pair.
        # The page may have been cached in the name_table variable.
        url, fragment = url_pair
        if self.name_table.has_key(url):
            return self.name_table[url]

        if url[:7] == 'mailto:' or url[:5] == 'news:':
            self.note(1, " Not checking mailto/news URL")
            return None
        isint = self.inroots(url)

        # Ensure that openpage gets the URL pair to
        # print out its error message and record the error pair
        # correctly.
        if not isint:
            if not self.checkext:
                self.note(1, " Not checking ext link")
                return None
            f = self.openpage(url_pair)
            if f:
                self.safeclose(f)
            return None
        text, nurl = self.readhtml(url_pair)

        if nurl != url:
            self.note(1, " Redirected to %s", nurl)
            url = nurl
        if text:
            return Page(text, url, maxpage=self.maxpage, checker=self)

    # These next three functions take (URL, fragment) pairs as
    # arguments, so that openpage() receives the appropriate tuple to
    # record error messages.
    def readhtml(self, url_pair):
        url, fragment = url_pair
        text = None
        f, url = self.openhtml(url_pair)
        if f:
            text = f.read()
            f.close()
        return text, url

    def openhtml(self, url_pair):
        url, fragment = url_pair
        f = self.openpage(url_pair)
        if f:
            url = f.geturl()
            info = f.info()
            if not self.checkforhtml(info, url):
                self.safeclose(f)
                f = None
        return f, url

    def openpage(self, url_pair):
        url, fragment = url_pair
        try:
            return self.urlopener.open(url)
        except IOError, msg:
            msg = self.sanitize(msg)
            self.note(0, "Error %s", msg)
            if self.verbose > 0:
                self.show(" HREF ", url, " from", self.todo[url_pair])
            self.setbad(url_pair, msg)
            return None

    def checkforhtml(self, info, url):
        if info.has_key('content-type'):
            ctype = string.lower(info['content-type'])
        else:
            if url[-1:] == "/":
                return 1
            ctype, encoding = mimetypes.guess_type(url)
        if ctype == 'text/html':
            return 1
        else:
            self.note(1, " Not HTML, mime type %s", ctype)
            return 0

    def setgood(self, url):
        if self.bad.has_key(url):
            del self.bad[url]
            self.changed = 1
            self.note(0, "(Clear previously seen error)")

    def setbad(self, url, msg):
        if self.bad.has_key(url) and self.bad[url] == msg:
            self.note(0, "(Seen this error before)")
            return
        self.bad[url] = msg
        self.changed = 1
        self.markerror(url)

    def markerror(self, url):
        try:
            origins = self.todo[url]
        except KeyError:
            origins = self.done[url]
        for source, rawlink in origins:
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def seterror(self, url, triple):
        try:
            # Because of the way the URLs are now processed, I need to
            # check to make sure the URL hasn't been entered in the
            # error list. The first element of the triple here is a
            # (URL, fragment) pair, but the URL key is not, since it's
            # from the list of origins.
            if triple not in self.errors[url]:
                self.errors[url].append(triple)
        except KeyError:
            self.errors[url] = [triple]

    # The following used to be toplevel functions; they have been
    # changed into methods so they can be overridden in subclasses.

    def show(self, p1, link, p2, origins):
        self.message("%s %s", p1, link)
        i = 0
        for source, rawlink in origins:
            i = i+1
            if i == 2:
                p2 = ' '*len(p2)
            if rawlink != link: s = " (%s)" % rawlink
            else: s = ""
            self.message("%s %s%s", p2, source, s)

    def sanitize(self, msg):
        if isinstance(IOError, ClassType) and isinstance(msg, IOError):
            # Do the other branch recursively
            msg.args = self.sanitize(msg.args)
        elif isinstance(msg, TupleType):
            if len(msg) >= 4 and msg[0] == 'http error' and \
               isinstance(msg[3], InstanceType):
                # Remove the Message instance -- it may contain
                # a file object which prevents pickling.
                msg = msg[:3] + msg[4:]
        return msg

    def safeclose(self, f):
        try:
            url = f.geturl()
        except AttributeError:
            pass
        else:
            if url[:4] == 'ftp:' or url[:7] == 'file://':
                # Apparently ftp connections don't like to be closed
                # prematurely...
                text = f.read()
        f.close()

    def save_pickle(self, dumpfile=DUMPFILE):
        if not self.changed:
            self.note(0, "\nNo need to save checkpoint")
        elif not dumpfile:
            self.note(0, "No dumpfile, won't save checkpoint")
        else:
            self.note(0, "\nSaving checkpoint to %s ...", dumpfile)
            newfile = dumpfile + ".new"
            f = open(newfile, "wb")
            pickle.dump(self, f)
            f.close()
            try:
                os.unlink(dumpfile)
            except os.error:
                pass
            os.rename(newfile, dumpfile)
            self.note(0, "Done.")
            return 1


class Page:

    def __init__(self, text, url, verbose=VERBOSE, maxpage=MAXPAGE, checker=None):
        self.text = text
        self.url = url
        self.verbose = verbose
        self.maxpage = maxpage
        self.checker = checker

        # The page is parsed in __init__() in order to initialize
        # the list of names the file contains.  The parser is stored
        # in an instance variable, and the URL is passed to
        # MyHTMLParser().
        size = len(self.text)
        if size > self.maxpage:
            self.note(0, "Skip huge file %s (%.0f Kbytes)", self.url, (size*0.001))
            self.parser = None
            return
        # Use self.note() rather than self.checker.note() so that this
        # also works when no checker was passed in (checker=None).
        self.note(2, "  Parsing %s (%d bytes)", self.url, size)
        self.parser = MyHTMLParser(url, verbose=self.verbose,
                                   checker=self.checker)
        self.parser.feed(self.text)
        self.parser.close()

    def note(self, level, msg, *args):
        if self.checker:
            apply(self.checker.note, (level, msg) + args)
        else:
            if self.verbose >= level:
                if args:
                    msg = msg%args
                print msg

    # Method to retrieve names.
    def getnames(self):
        if self.parser:
            return self.parser.names
        else:
            return []

    def getlinkinfos(self):
        # Parsing is done in __init__().  The parser is stored in an
        # instance variable; it is None if the page was skipped.

        # If no parser was stored, fail.
        if not self.parser: return []

        rawlinks = self.parser.getlinks()
        base = urlparse.urljoin(self.url, self.parser.getbase() or "")
        infos = []
        for rawlink in rawlinks:
            t = urlparse.urlparse(rawlink)
            # DON'T DISCARD THE FRAGMENT! Instead, include
            # it in the tuples which are returned. See Checker.dopage().
            fragment = t[-1]
            t = t[:-1] + ('',)
            rawlink = urlparse.urlunparse(t)
            link = urlparse.urljoin(base, rawlink)
            infos.append((link, rawlink, fragment))

        return infos


class MyStringIO(StringIO.StringIO):

    def __init__(self, url, info):
        self.__url = url
        self.__info = info
        StringIO.StringIO.__init__(self)

    def info(self):
        return self.__info

    def geturl(self):
        return self.__url


class MyURLopener(urllib.FancyURLopener):

    http_error_default = urllib.URLopener.http_error_default

    def __init__(*args):
        self = args[0]
        apply(urllib.FancyURLopener.__init__, args)
        self.addheaders = [
            ('User-agent', 'Python-webchecker/%s' % __version__),
            ]

    def http_error_401(self, url, fp, errcode, errmsg, headers):
        return None

    def open_file(self, url):
        path = urllib.url2pathname(urllib.unquote(url))
        if os.path.isdir(path):
            if path[-1] != os.sep:
                url = url + '/'
            indexpath = os.path.join(path, "index.html")
            if os.path.exists(indexpath):
                return self.open_file(url + "index.html")
            try:
                names = os.listdir(path)
            except os.error, msg:
                raise IOError, msg, sys.exc_traceback
            names.sort()
            s = MyStringIO("file:"+url, {'content-type': 'text/html'})
            s.write('<BASE HREF="file:%s">\n' %
                    urllib.quote(os.path.join(path, "")))
            for name in names:
                q = urllib.quote(name)
                s.write('<A HREF="%s">%s</A>\n' % (q, q))
            s.seek(0)
            return s
        return urllib.FancyURLopener.open_file(self, url)


class MyHTMLParser(sgmllib.SGMLParser):

    def __init__(self, url, verbose=VERBOSE, checker=None):
        self.myverbose = verbose # now unused
        self.checker = checker
        self.base = None
        self.links = {}
        self.names = []
        self.url = url
        sgmllib.SGMLParser.__init__(self)

    def start_a(self, attributes):
        self.link_attr(attributes, 'href')

        # We must rescue the NAME attribute from the anchor, in
        # order to cache the internal anchors which are made
        # available in the page.
        for name, value in attributes:
            if name == "name":
                if value in self.names:
                    # The checker is optional (see __init__), so only
                    # report the duplicate when one is available.
                    if self.checker:
                        self.checker.message("WARNING: duplicate name %s in %s",
                                             value, self.url)
                else: self.names.append(value)
                break

    def end_a(self): pass

    def do_area(self, attributes):
        self.link_attr(attributes, 'href')

    def do_img(self, attributes):
        self.link_attr(attributes, 'src', 'lowsrc')

    def do_frame(self, attributes):
        self.link_attr(attributes, 'src')

    def link_attr(self, attributes, *args):
        for name, value in attributes:
            if name in args:
                if value: value = string.strip(value)
                if value: self.links[value] = None

    def do_base(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = string.strip(value)
                if value:
                    if self.checker:
                        self.checker.note(1, "  Base %s", value)
                    self.base = value

    def getlinks(self):
        return self.links.keys()

    def getbase(self):
        return self.base


if __name__ == '__main__':
    main()