#! /usr/bin/env python

# Original code by Guido van Rossum; extensive changes by Sam Bayer,
# including code to check URL fragments.

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors. A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.
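
A minimal sketch of the prefix rule, in modern Python and not taken
from webchecker's own code (the helper name in_subweb and the
example.com URLs are assumptions made for illustration):

```python
# Hypothetical helper: a page is in the subweb when some root URL
# is an initial prefix of the page URL.
def in_subweb(url, roots):
    return any(url.startswith(root) for root in roots)


roots = ["http://www.example.com/docs/"]
print(in_subweb("http://www.example.com/docs/intro.html", roots))
print(in_subweb("http://www.example.com/other.html", roots))
```

Webchecker itself also consults robots.txt before treating a matching
URL as internal.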

File URL extension:

In order to ease the checking of subwebs via the local file system,
the interpretation of ``file:'' URLs is extended to mimic the behavior
of your average HTTP daemon: if a directory pathname is given, the
file index.html in that directory is returned if it exists, otherwise
a directory listing is returned. Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked. In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).
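
The directory rule above can be sketched as follows. This is an
illustration in modern Python under stated assumptions, not the code
webchecker uses to open file: URLs; the helper name
resolve_local_path and its return values are invented here:

```python
import os
import tempfile


def resolve_local_path(path):
    # Hypothetical helper mirroring the rule above: for a directory,
    # prefer its index.html; otherwise fall back to a listing.
    if os.path.isdir(path):
        index = os.path.join(path, "index.html")
        if os.path.exists(index):
            return ("file", index)
        return ("listing", sorted(os.listdir(path)))
    return ("file", path)


root = tempfile.mkdtemp()
with open(os.path.join(root, "a.html"), "w") as f:
    f.write("<html></html>")
# No index.html yet, so the directory resolves to a listing.
print(resolve_local_path(root)[0])

with open(os.path.join(root, "index.html"), "w") as f:
    f.write("<html></html>")
# Now index.html exists and is preferred.
print(resolve_local_path(root)[0])
```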

Report printed:

When done, it reports pages with bad links within the subweb. When
interrupted, it reports on the pages that it has already checked.

In verbose mode, additional messages are printed during the
information gathering phase. By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered. Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file. The -R option restarts from the checkpoint (assuming
that the pages on the subweb that were already processed haven't
changed). Even when it has run to completion, -R can still be useful:
it prints the reports again, and -Rq prints the errors only. In this
case, the checkpoint file is not written again. The checkpoint file
can be set with the -d option.

The checkpoint file is written as a Python pickle. Remember that
Python's pickle module is currently quite slow. Give it the time it
needs to load and save the checkpoint file. When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.
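
The save strategy described above can be sketched as follows. This is
a minimal illustration in modern Python under stated assumptions, not
webchecker's own code: the function names and the sample state are
invented here, and os.replace stands in for an unlink-then-rename
sequence:

```python
import os
import pickle
import tempfile


def save_checkpoint(state, dumpfile):
    # Write the pickle to a sibling file first, so an interrupt
    # during the dump leaves the old checkpoint untouched.
    newfile = dumpfile + ".new"
    with open(newfile, "wb") as f:
        pickle.dump(state, f)
    # Only after the dump succeeded, move the new file into place.
    os.replace(newfile, dumpfile)


def load_checkpoint(dumpfile):
    with open(dumpfile, "rb") as f:
        return pickle.load(f)


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "checkpoint.pickle")
save_checkpoint({"round": 3, "todo": ["http://www.example.com/"]}, path)
restored = load_checkpoint(path)
print(restored["round"])
```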

Miscellaneous:

- You may find the (Tk-based) GUI version easier to use. See wcgui.py.

- Webchecker honors the "robots.txt" convention. Thanks to Skip
Montanaro for his robotparser.py module (included in this directory)!
The agent name is hardwired to "webchecker". URLs that are disallowed
by the robots.txt file are reported as external URLs.

- Because the SGML parser is a bit slow, very large SGML files are
skipped. The size limit can be set with the -m option.

- When the server or protocol does not tell us a file's type, we guess
it based on the URL's suffix. The mimetypes.py module (also in this
directory) has a built-in table mapping most currently known suffixes,
and in addition attempts to read the mime.types configuration files in
the default locations of Netscape and the NCSA HTTP daemon.

- We follow links indicated by <A>, <FRAME> and <IMG> tags. We also
honor the <BASE> tag.

- We now check internal NAME anchor links, as well as toplevel links.

- Checking external links is now done by default; use -x to *disable*
this feature. External links are now checked during normal
processing. (XXX The status of a checked link could be categorized
better. Later...)

- If external links are not checked, you can use the -t flag to
provide specific overrides to -x.
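
To illustrate the robots.txt convention mentioned in the list above,
here is a small sketch using the standard urllib.robotparser module of
modern Python (the successor of robotparser.py). The rules and the
example.com URLs are made up, and the rules are parsed from memory so
that no network access is needed:

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content, supplied inline for illustration.
rules = [
    "User-agent: webchecker",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# Ask whether the "webchecker" agent may fetch each URL.
allowed = rp.can_fetch("webchecker", "http://www.example.com/index.html")
blocked = rp.can_fetch("webchecker", "http://www.example.com/private/a.html")
print(allowed, blocked)
```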

Usage: webchecker.py [option] ... [rooturl] ...

Options:

-R        -- restart from checkpoint file
-d file   -- checkpoint filename (default %(DUMPFILE)s)
-m bytes  -- skip HTML pages larger than this size (default %(MAXPAGE)d)
-n        -- reports only, no checking (use with -R)
-q        -- quiet operation (also suppresses external links report)
-r number -- number of links processed per round (default %(ROUNDSIZE)d)
-t root   -- specify root dir which should be treated as internal (can repeat)
-v        -- verbose operation; repeating -v will increase verbosity
-x        -- don't check external links (these are often slow to check)
-a        -- don't check name anchors

Arguments:

rooturl   -- URL to start checking
             (default %(DEFROOT)s)

"""


__version__ = "$Revision$"


import sys
import os
from types import *
import string
import StringIO
import getopt
import pickle

import urllib
import urlparse
import sgmllib

import mimetypes
import robotparser

# Extract real version number if necessary
if __version__[0] == '$':
    _v = string.split(__version__)
    if len(_v) == 3:
        __version__ = _v[1]

Guido van Rossum272b37d1997-01-30 02:44:48 +0000130
131# Tunable parameters
Guido van Rossum986abac1998-04-06 14:29:28 +0000132DEFROOT = "file:/usr/local/etc/httpd/htdocs/" # Default root URL
133CHECKEXT = 1 # Check external references (1 deep)
134VERBOSE = 1 # Verbosity level (0-3)
135MAXPAGE = 150000 # Ignore files bigger than this
136ROUNDSIZE = 50 # Number of links processed per round
137DUMPFILE = "@webchecker.pickle" # Pickled checkpoint
138AGENTNAME = "webchecker" # Agent name for robots.txt parser
Guido van Rossume284b211999-11-17 15:40:08 +0000139NONAMES = 0 # Force name anchor checking
Guido van Rossum272b37d1997-01-30 02:44:48 +0000140
141
142# Global variables
Guido van Rossum272b37d1997-01-30 02:44:48 +0000143
144
def main():
    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    dumpfile = DUMPFILE
    restart = 0
    norun = 0

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:t:vxa')
    except getopt.error, msg:
        sys.stdout = sys.stderr
        print msg
        print __doc__%globals()
        sys.exit(2)

    # The extra_roots variable collects extra roots.
    extra_roots = []
    nonames = NONAMES

    for o, a in opts:
        if o == '-R':
            restart = 1
        if o == '-d':
            dumpfile = a
        if o == '-m':
            maxpage = string.atoi(a)
        if o == '-n':
            norun = 1
        if o == '-q':
            verbose = 0
        if o == '-r':
            roundsize = string.atoi(a)
        if o == '-t':
            extra_roots.append(a)
        if o == '-a':
            nonames = not nonames
        if o == '-v':
            verbose = verbose + 1
        if o == '-x':
            checkext = not checkext

    if verbose > 0:
        print AGENTNAME, "version", __version__

    if restart:
        c = load_pickle(dumpfile=dumpfile, verbose=verbose)
    else:
        c = Checker()

    c.setflags(checkext=checkext, verbose=verbose,
               maxpage=maxpage, roundsize=roundsize,
               nonames=nonames
               )

    if not restart and not args:
        args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    # The -t flag is only needed if external links are not to be
    # checked. So -t values are ignored unless -x was specified.
    if not checkext:
        for root in extra_roots:
            # Make sure it's terminated by a slash,
            # so that addroot doesn't discard the last
            # directory component.
            if root[-1] != "/":
                root = root + "/"
            c.addroot(root, add_to_do = 0)

    try:

        if not norun:
            try:
                c.run()
            except KeyboardInterrupt:
                if verbose > 0:
                    print "[run interrupted]"

        try:
            c.report()
        except KeyboardInterrupt:
            if verbose > 0:
                print "[report interrupted]"

    finally:
        if c.save_pickle(dumpfile):
            if dumpfile == DUMPFILE:
                print "Use ``%s -R'' to restart." % sys.argv[0]
            else:
                print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
                                                           dumpfile)


def load_pickle(dumpfile=DUMPFILE, verbose=VERBOSE):
    if verbose > 0:
        print "Loading checkpoint from %s ..." % dumpfile
    f = open(dumpfile, "rb")
    c = pickle.load(f)
    f.close()
    if verbose > 0:
        print "Done."
        print "Root:", string.join(c.roots, "\n ")
    return c


class Checker:

    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    nonames = NONAMES

    validflags = tuple(dir())

    def __init__(self):
        self.reset()

    def setflags(self, **kw):
        for key in kw.keys():
            if key not in self.validflags:
                raise NameError, "invalid keyword argument: %s" % str(key)
        for key, value in kw.items():
            setattr(self, key, value)

    def reset(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.bad = {}

        # Add a name table, so that the name URLs can be checked. Also
        # serves as an implicit cache for which URLs are done.
        self.name_table = {}

        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.errors = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def note(self, level, format, *args):
        if self.verbose > level:
            if args:
                format = format%args
            self.message(format)

    def message(self, format, *args):
        if args:
            format = format%args
        print format

    def __getstate__(self):
        return (self.roots, self.todo, self.done, self.bad, self.round)

    def __setstate__(self, state):
        self.reset()
        (self.roots, self.todo, self.done, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)
        for url in self.bad.keys():
            self.markerror(url)

    def addroot(self, root, add_to_do = 1):
        if root not in self.roots:
            troot = root
            scheme, netloc, path, params, query, fragment = \
                    urlparse.urlparse(root)
            i = string.rfind(path, "/") + 1
            if 0 < i < len(path):
                path = path[:i]
                troot = urlparse.urlunparse((scheme, netloc, path,
                                             params, query, fragment))
            self.roots.append(troot)
            self.addrobot(root)
            if add_to_do:
                self.newlink((root, ""), ("<root>", root))

    def addrobot(self, root):
        root = urlparse.urljoin(root, "/")
        if self.robots.has_key(root): return
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        self.note(2, "Parsing %s", url)
        rp.debug = self.verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except (OSError, IOError), msg:
            self.note(1, "I/O error parsing %s: %s", url, msg)

    def run(self):
        while self.todo:
            self.round = self.round + 1
            self.note(0, "\nRound %d (%s)\n", self.round, self.status())
            urls = self.todo.keys()
            urls.sort()
            del urls[self.roundsize:]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "%d total, %d to do, %d done, %d bad" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.bad))

    def report(self):
        self.message("")
        if not self.todo: s = "Final"
        else: s = "Interim"
        self.message("%s Report (%s)", s, self.status())
        self.report_errors()

    def report_errors(self):
        if not self.bad:
            self.message("\nNo errors")
            return
        self.message("\nError Report:")
        sources = self.errors.keys()
        sources.sort()
        for source in sources:
            triples = self.errors[source]
            self.message("")
            if len(triples) > 1:
                self.message("%d Errors in %s", len(triples), source)
            else:
                self.message("Error in %s", source)
            # Call self.format_url() instead of referring
            # to the URL directly, since the URLs in these
            # triples are now (URL, fragment) pairs. The value
            # of the "source" variable comes from the list of
            # origins, and is a URL, not a pair.
            for url, rawlink, msg in triples:
                if rawlink != self.format_url(url): s = " (%s)" % rawlink
                else: s = ""
                self.message(" HREF %s%s\n msg %s",
                             self.format_url(url), s, msg)

    def dopage(self, url_pair):

        # All printing of URLs uses format_url(); argument changed to
        # url_pair for clarity.
        if self.verbose > 1:
            if self.verbose > 2:
                self.show("Check ", self.format_url(url_pair),
                          " from", self.todo[url_pair])
            else:
                self.message("Check %s", self.format_url(url_pair))
        url, local_fragment = url_pair
        if local_fragment and self.nonames:
            self.markdone(url_pair)
            return
        page = self.getpage(url_pair)
        if page:
            # Store the page which corresponds to this URL.
            self.name_table[url] = page
            # If there is a fragment in this url_pair, and it's not
            # in the list of names for the page, call setbad(), since
            # it's a missing anchor.
            if local_fragment and local_fragment not in page.getnames():
                self.setbad(url_pair, ("Missing name anchor `%s'" % local_fragment))
            for info in page.getlinkinfos():
                # getlinkinfos() now returns the fragment as well,
                # and we store that fragment here in the "todo" dictionary.
                link, rawlink, fragment = info
                # However, we don't want the fragment as the origin, since
                # the origin is logically a page.
                origin = url, rawlink
                self.newlink((link, fragment), origin)
        else:
            # If no page has been created yet, we want to
            # record that fact.
            self.name_table[url_pair[0]] = None
        self.markdone(url_pair)

    def newlink(self, url, origin):
        if self.done.has_key(url):
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        if origin not in self.done[url]:
            self.done[url].append(origin)

        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        self.note(3, " Done link %s", self.format_url(url))

        # If the URL is bad, make sure the origin gets added.
        if self.bad.has_key(url):
            source, rawlink = origin
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def newtodolink(self, url, origin):
        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        if self.todo.has_key(url):
            if origin not in self.todo[url]:
                self.todo[url].append(origin)
            self.note(3, " Seen todo link %s", self.format_url(url))
        else:
            self.todo[url] = [origin]
            self.note(3, " New todo link %s", self.format_url(url))

    def format_url(self, url):
        link, fragment = url
        if fragment: return link + "#" + fragment
        else: return link

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

    def inroots(self, url):
        for root in self.roots:
            if url[:len(root)] == root:
                return self.isallowed(root, url)
        return 0

    def isallowed(self, root, url):
        root = urlparse.urljoin(root, "/")
        return self.robots[root].can_fetch(AGENTNAME, url)

    def getpage(self, url_pair):
        # The incoming argument is a (URL, fragment) pair.
        # The page may have been cached in the name_table variable.
        url, fragment = url_pair
        if self.name_table.has_key(url):
            return self.name_table[url]

        scheme, path = urllib.splittype(url)
        if scheme in ('mailto', 'news', 'javascript', 'telnet'):
            self.note(1, " Not checking %s URL", scheme)
            return None
        isint = self.inroots(url)

        # Ensure that openpage gets the URL pair to
        # print out its error message and record the error pair
        # correctly.
        if not isint:
            if not self.checkext:
                self.note(1, " Not checking ext link")
                return None
            f = self.openpage(url_pair)
            if f:
                self.safeclose(f)
            return None
        text, nurl = self.readhtml(url_pair)

        if nurl != url:
            self.note(1, " Redirected to %s", nurl)
            url = nurl
        if text:
            return Page(text, url, maxpage=self.maxpage, checker=self)

    # These next three functions take (URL, fragment) pairs as
    # arguments, so that openpage() receives the appropriate tuple to
    # record error messages.
    def readhtml(self, url_pair):
        url, fragment = url_pair
        text = None
        f, url = self.openhtml(url_pair)
        if f:
            text = f.read()
            f.close()
        return text, url

    def openhtml(self, url_pair):
        url, fragment = url_pair
        f = self.openpage(url_pair)
        if f:
            url = f.geturl()
            info = f.info()
            if not self.checkforhtml(info, url):
                self.safeclose(f)
                f = None
        return f, url

    def openpage(self, url_pair):
        url, fragment = url_pair
        try:
            return self.urlopener.open(url)
        except (OSError, IOError), msg:
            msg = self.sanitize(msg)
            self.note(0, "Error %s", msg)
            if self.verbose > 0:
                self.show(" HREF ", url, " from", self.todo[url_pair])
            self.setbad(url_pair, msg)
            return None

    def checkforhtml(self, info, url):
        if info.has_key('content-type'):
            # Drop any parameters such as "; charset=..." before comparing.
            ctype = string.lower(info['content-type'])
            ctype = string.strip(string.split(ctype, ';')[0])
        else:
            if url[-1:] == "/":
                return 1
            ctype, encoding = mimetypes.guess_type(url)
        if ctype == 'text/html':
            return 1
        else:
            self.note(1, " Not HTML, mime type %s", ctype)
            return 0

    def setgood(self, url):
        if self.bad.has_key(url):
            del self.bad[url]
            self.changed = 1
            self.note(0, "(Clear previously seen error)")

    def setbad(self, url, msg):
        if self.bad.has_key(url) and self.bad[url] == msg:
            self.note(0, "(Seen this error before)")
            return
        self.bad[url] = msg
        self.changed = 1
        self.markerror(url)

    def markerror(self, url):
        try:
            origins = self.todo[url]
        except KeyError:
            origins = self.done[url]
        for source, rawlink in origins:
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def seterror(self, url, triple):
        try:
            # Because of the way the URLs are now processed, I need to
            # check to make sure the URL hasn't been entered in the
            # error list. The first element of the triple here is a
            # (URL, fragment) pair, but the URL key is not, since it's
            # from the list of origins.
            if triple not in self.errors[url]:
                self.errors[url].append(triple)
        except KeyError:
            self.errors[url] = [triple]

    # The following used to be toplevel functions; they have been
    # changed into methods so they can be overridden in subclasses.

    def show(self, p1, link, p2, origins):
        self.message("%s %s", p1, link)
        i = 0
        for source, rawlink in origins:
            i = i+1
            if i == 2:
                p2 = ' '*len(p2)
            if rawlink != link: s = " (%s)" % rawlink
            else: s = ""
            self.message("%s %s%s", p2, source, s)

    def sanitize(self, msg):
        if isinstance(IOError, ClassType) and isinstance(msg, IOError):
            # Do the other branch recursively
            msg.args = self.sanitize(msg.args)
        elif isinstance(msg, TupleType):
            if len(msg) >= 4 and msg[0] == 'http error' and \
               isinstance(msg[3], InstanceType):
                # Remove the Message instance -- it may contain
                # a file object which prevents pickling.
                msg = msg[:3] + msg[4:]
        return msg

    def safeclose(self, f):
        try:
            url = f.geturl()
        except AttributeError:
            pass
        else:
            if url[:4] == 'ftp:' or url[:7] == 'file://':
                # Apparently ftp connections don't like to be closed
                # prematurely...
                text = f.read()
        f.close()

    def save_pickle(self, dumpfile=DUMPFILE):
        if not self.changed:
            self.note(0, "\nNo need to save checkpoint")
        elif not dumpfile:
            self.note(0, "No dumpfile, won't save checkpoint")
        else:
            self.note(0, "\nSaving checkpoint to %s ...", dumpfile)
            newfile = dumpfile + ".new"
            f = open(newfile, "wb")
            pickle.dump(self, f)
            f.close()
            try:
                os.unlink(dumpfile)
            except os.error:
                pass
            os.rename(newfile, dumpfile)
            self.note(0, "Done.")
            return 1


class Page:

    def __init__(self, text, url, verbose=VERBOSE, maxpage=MAXPAGE, checker=None):
        self.text = text
        self.url = url
        self.verbose = verbose
        self.maxpage = maxpage
        self.checker = checker

        # The page is parsed here in __init__() so that the list of
        # names the file contains is initialized. The parser is stored
        # in an instance variable, and the URL is passed to
        # MyHTMLParser().
        size = len(self.text)
        if size > self.maxpage:
            self.note(0, "Skip huge file %s (%.0f Kbytes)", self.url, (size*0.001))
            self.parser = None
            return
        # Use self.note() so that this also works when checker is None.
        self.note(2, " Parsing %s (%d bytes)", self.url, size)
        self.parser = MyHTMLParser(url, verbose=self.verbose,
                                   checker=self.checker)
        self.parser.feed(self.text)
        self.parser.close()

    def note(self, level, msg, *args):
        if self.checker:
            self.checker.note(level, msg, *args)
        else:
            if self.verbose >= level:
                if args:
                    msg = msg%args
                print msg

    # Method to retrieve names.
    def getnames(self):
        if self.parser:
            return self.parser.names
        else:
            return []

    def getlinkinfos(self):
        # The page text was already parsed in __init__(), which stores
        # the parser in self.parser; it is None if the page was
        # skipped.

        # If no parser was stored, fail.
        if not self.parser: return []

        rawlinks = self.parser.getlinks()
        base = urlparse.urljoin(self.url, self.parser.getbase() or "")
        infos = []
        for rawlink in rawlinks:
            t = urlparse.urlparse(rawlink)
            # DON'T DISCARD THE FRAGMENT! Instead, include
            # it in the tuples which are returned. See Checker.dopage().
            fragment = t[-1]
            t = t[:-1] + ('',)
            rawlink = urlparse.urlunparse(t)
            link = urlparse.urljoin(base, rawlink)
            infos.append((link, rawlink, fragment))

        return infos


class MyStringIO(StringIO.StringIO):

    def __init__(self, url, info):
        self.__url = url
        self.__info = info
        StringIO.StringIO.__init__(self)

    def info(self):
        return self.__info

    def geturl(self):
        return self.__url


class MyURLopener(urllib.FancyURLopener):

    http_error_default = urllib.URLopener.http_error_default

    def __init__(*args):
        self = args[0]
        urllib.FancyURLopener.__init__(*args)
        self.addheaders = [
            ('User-agent', 'Python-webchecker/%s' % __version__),
            ]

    def http_error_401(self, url, fp, errcode, errmsg, headers):
        return None

    def open_file(self, url):
        path = urllib.url2pathname(urllib.unquote(url))
        if os.path.isdir(path):
            if path[-1] != os.sep:
                url = url + '/'
            indexpath = os.path.join(path, "index.html")
            if os.path.exists(indexpath):
                return self.open_file(url + "index.html")
            try:
                names = os.listdir(path)
            except os.error, msg:
                # sys.exc_info()[2] replaces the deprecated
                # sys.exc_traceback.
                raise IOError, msg, sys.exc_info()[2]
            names.sort()
            # Synthesize an HTML directory listing.
            s = MyStringIO("file:"+url, {'content-type': 'text/html'})
            s.write('<BASE HREF="file:%s">\n' %
                    urllib.quote(os.path.join(path, "")))
            for name in names:
                q = urllib.quote(name)
                s.write('<A HREF="%s">%s</A>\n' % (q, q))
            s.seek(0)
            return s
        return urllib.FancyURLopener.open_file(self, url)


class MyHTMLParser(sgmllib.SGMLParser):

    def __init__(self, url, verbose=VERBOSE, checker=None):
        self.myverbose = verbose # now unused
        self.checker = checker
        self.base = None
        self.links = {}
        self.names = []
        self.url = url
        sgmllib.SGMLParser.__init__(self)

    def start_a(self, attributes):
        self.link_attr(attributes, 'href')

        # Collect the NAME attribute of the anchor, in order to
        # cache the internal anchors which are made available
        # in the page.
        for name, value in attributes:
            if name == "name":
                if value in self.names:
                    # The checker is optional; only warn if there is one.
                    if self.checker:
                        self.checker.message("WARNING: duplicate name %s in %s",
                                             value, self.url)
                else: self.names.append(value)
                break

    def end_a(self): pass

    def do_area(self, attributes):
        self.link_attr(attributes, 'href')

    def do_body(self, attributes):
        self.link_attr(attributes, 'background', 'bgsound')

    def do_img(self, attributes):
        self.link_attr(attributes, 'src', 'lowsrc')

    def do_frame(self, attributes):
        self.link_attr(attributes, 'src', 'longdesc')

    def do_iframe(self, attributes):
        self.link_attr(attributes, 'src', 'longdesc')

    def do_link(self, attributes):
        for name, value in attributes:
            if name == "rel":
                parts = string.split(string.lower(value))
                if (  parts == ["stylesheet"]
                      or parts == ["alternate", "stylesheet"]):
                    self.link_attr(attributes, "href")
                    break

    def do_object(self, attributes):
        self.link_attr(attributes, 'data', 'usemap')

    def do_script(self, attributes):
        self.link_attr(attributes, 'src')

    def do_table(self, attributes):
        self.link_attr(attributes, 'background')

    def do_td(self, attributes):
        self.link_attr(attributes, 'background')

    def do_th(self, attributes):
        self.link_attr(attributes, 'background')

    def do_tr(self, attributes):
        self.link_attr(attributes, 'background')

    def link_attr(self, attributes, *args):
        for name, value in attributes:
            if name in args:
                if value: value = string.strip(value)
                if value: self.links[value] = None

    def do_base(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = string.strip(value)
                if value:
                    if self.checker:
                        self.checker.note(1, " Base %s", value)
                    self.base = value

    def getlinks(self):
        return self.links.keys()

    def getbase(self):
        return self.base


if __name__ == '__main__':
    main()