#! /usr/bin/env python

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors. A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.
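
The prefix rule can be illustrated with a small sketch in modern
Python 3. The helper name in_subweb and the example.com URLs are
hypothetical and not part of webchecker itself:

```python
def in_subweb(url, roots):
    # A page belongs to the subweb if any root URL is an initial prefix of it.
    return any(url.startswith(root) for root in roots)

roots = ["http://www.example.com/docs/"]
print(in_subweb("http://www.example.com/docs/intro.html", roots))  # True
print(in_subweb("http://www.example.com/other/page.html", roots))  # False
```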

File URL extension:

In order to ease the checking of subwebs via the local file system,
the interpretation of ``file:'' URLs is extended to mimic the behavior
of your average HTTP daemon: if a directory pathname is given, the
file index.html in that directory is returned if it exists, otherwise
a directory listing is returned. Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked. In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).
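
A rough sketch of this directory rule in modern Python 3 follows.
resolve_directory is a hypothetical helper used only for illustration;
webchecker's own logic lives in MyURLopener.open_file and returns
file-like objects rather than tuples:

```python
import os
import tempfile

def resolve_directory(path):
    # Prefer index.html when the directory has one ...
    index = os.path.join(path, "index.html")
    if os.path.exists(index):
        return ("index", index)
    # ... otherwise fall back to a sorted directory listing.
    return ("listing", sorted(os.listdir(path)))

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "a.txt"), "w").close()
    print(resolve_directory(d)[0])   # listing
    open(os.path.join(d, "index.html"), "w").close()
    print(resolve_directory(d)[0])   # index
```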

Report printed:

When done, it reports pages with bad links within the subweb. When
interrupted, it reports on the pages that it has checked already.

In verbose mode, additional messages are printed during the
information gathering phase. By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered. Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file, and the -R option allows it to restart from the
checkpoint (assuming that the pages on the subweb that were already
processed haven't changed). Even when it has run to completion, -R
can still be useful -- it will print the reports again, and -Rq prints
the errors only. In this case, the checkpoint file is not written
again. The checkpoint file can be set with the -d option.

The checkpoint file is written as a Python pickle. Remember that
Python's pickle module is currently quite slow. Give it the time it
needs to load and save the checkpoint file. When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.
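
The reason the old checkpoint survives an interrupted save is that the
new pickle is first written to a side file and only then moved into
place. A minimal Python 3 sketch of that idea is shown here; the names
save_checkpoint and load_checkpoint are hypothetical, and os.replace
is swapped in for the unlink-then-rename sequence used by save_pickle:

```python
import os
import pickle
import tempfile

def save_checkpoint(state, dumpfile):
    # Write to a side file first, so an interrupted dump leaves
    # the previous checkpoint untouched.
    newfile = dumpfile + ".new"
    with open(newfile, "wb") as f:
        pickle.dump(state, f)
    # Move the finished file over the old checkpoint.
    os.replace(newfile, dumpfile)

def load_checkpoint(dumpfile):
    with open(dumpfile, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.mkdtemp(), "demo.pickle")
save_checkpoint({"todo": ["a"], "done": ["b"]}, path)
print(load_checkpoint(path))
```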

Miscellaneous:

- You may find the (Tk-based) GUI version easier to use. See wcgui.py.

- Webchecker honors the "robots.txt" convention. Thanks to Skip
Montanaro for his robotparser.py module (included in this directory)!
The agent name is hardwired to "webchecker". URLs that are disallowed
by the robots.txt file are reported as external URLs.

- Because the SGML parser is a bit slow, very large SGML files are
skipped. The size limit can be set with the -m option.

- When the server or protocol does not tell us a file's type, we guess
it based on the URL's suffix. The mimetypes.py module (also in this
directory) has a built-in table mapping most currently known suffixes,
and in addition attempts to read the mime.types configuration files in
the default locations of Netscape and the NCSA HTTP daemon.

- We follow links indicated by <A>, <AREA>, <FRAME> and <IMG> tags. We
also honor the <BASE> tag.

- Checking external links is now done by default; use -x to *disable*
this feature. External links are now checked during normal
processing. (XXX The status of a checked link could be categorized
better. Later...)
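
Two of the conventions listed above, suffix-based type guessing and
robots.txt handling, can be tried in isolation. The sketch below uses
the Python 3 standard library (mimetypes and urllib.robotparser, the
modern home of robotparser) rather than the modules shipped in this
directory; the example.com URLs and the robots.txt rules are made up:

```python
import mimetypes
import urllib.robotparser

# Guess a type from the URL suffix alone.
ctype, encoding = mimetypes.guess_type("http://www.example.com/docs/intro.html")
print(ctype)  # text/html

# Apply robots.txt rules for the "webchecker" agent without network access.
rp = urllib.robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])
print(rp.can_fetch("webchecker", "http://www.example.com/private/x.html"))   # False
print(rp.can_fetch("webchecker", "http://www.example.com/docs/intro.html"))  # True
```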


Usage: webchecker.py [option] ... [rooturl] ...

Options:

-R        -- restart from checkpoint file
-d file   -- checkpoint filename (default %(DUMPFILE)s)
-m bytes  -- skip HTML pages larger than this size (default %(MAXPAGE)d)
-n        -- reports only, no checking (use with -R)
-q        -- quiet operation (also suppresses external links report)
-r number -- number of links processed per round (default %(ROUNDSIZE)d)
-v        -- verbose operation; repeating -v will increase verbosity
-x        -- don't check external links (these are often slow to check)

Arguments:

rooturl   -- URL to start checking
             (default %(DEFROOT)s)

"""


__version__ = "$Revision$"


import sys
import os
from types import *
import string
import StringIO
import getopt
import pickle

import urllib
import urlparse
import sgmllib

import mimetypes
import robotparser

# Extract real version number if necessary
if __version__[0] == '$':
    _v = string.split(__version__)
    if len(_v) == 3:
        __version__ = _v[1]


# Tunable parameters
DEFROOT = "file:/usr/local/etc/httpd/htdocs/"   # Default root URL
CHECKEXT = 1                            # Check external references (1 deep)
VERBOSE = 1                             # Verbosity level (0-3)
MAXPAGE = 150000                        # Ignore files bigger than this
ROUNDSIZE = 50                          # Number of links processed per round
DUMPFILE = "@webchecker.pickle"         # Pickled checkpoint
AGENTNAME = "webchecker"                # Agent name for robots.txt parser


# Global variables


def main():
    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    dumpfile = DUMPFILE
    restart = 0
    norun = 0

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:vx')
    except getopt.error, msg:
        sys.stdout = sys.stderr
        print msg
        print __doc__%globals()
        sys.exit(2)
    for o, a in opts:
        if o == '-R':
            restart = 1
        if o == '-d':
            dumpfile = a
        if o == '-m':
            maxpage = string.atoi(a)
        if o == '-n':
            norun = 1
        if o == '-q':
            verbose = 0
        if o == '-r':
            roundsize = string.atoi(a)
        if o == '-v':
            verbose = verbose + 1
        if o == '-x':
            checkext = not checkext

    if verbose > 0:
        print AGENTNAME, "version", __version__

    if restart:
        c = load_pickle(dumpfile=dumpfile, verbose=verbose)
    else:
        c = Checker()

    c.setflags(checkext=checkext, verbose=verbose,
               maxpage=maxpage, roundsize=roundsize)

    if not restart and not args:
        args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    try:

        if not norun:
            try:
                c.run()
            except KeyboardInterrupt:
                if verbose > 0:
                    print "[run interrupted]"

        try:
            c.report()
        except KeyboardInterrupt:
            if verbose > 0:
                print "[report interrupted]"

    finally:
        if c.save_pickle(dumpfile):
            if dumpfile == DUMPFILE:
                print "Use ``%s -R'' to restart." % sys.argv[0]
            else:
                print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
                                                           dumpfile)


def load_pickle(dumpfile=DUMPFILE, verbose=VERBOSE):
    if verbose > 0:
        print "Loading checkpoint from %s ..." % dumpfile
    f = open(dumpfile, "rb")
    c = pickle.load(f)
    f.close()
    if verbose > 0:
        print "Done."
        print "Root:", string.join(c.roots, "\n ")
    return c


class Checker:

    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE

    validflags = tuple(dir())

    def __init__(self):
        self.reset()

    def setflags(self, **kw):
        for key in kw.keys():
            if key not in self.validflags:
                raise NameError, "invalid keyword argument: %s" % str(key)
        for key, value in kw.items():
            setattr(self, key, value)

    def reset(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.bad = {}
        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.errors = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def note(self, level, format, *args):
        if self.verbose > level:
            if args:
                format = format%args
            self.message(format)

    def message(self, format, *args):
        if args:
            format = format%args
        print format

    def __getstate__(self):
        return (self.roots, self.todo, self.done, self.bad, self.round)

    def __setstate__(self, state):
        self.reset()
        (self.roots, self.todo, self.done, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)
        for url in self.bad.keys():
            self.markerror(url)

    def addroot(self, root):
        if root not in self.roots:
            troot = root
            scheme, netloc, path, params, query, fragment = \
                    urlparse.urlparse(root)
            i = string.rfind(path, "/") + 1
            if 0 < i < len(path):
                path = path[:i]
                troot = urlparse.urlunparse((scheme, netloc, path,
                                             params, query, fragment))
            self.roots.append(troot)
            self.addrobot(root)
            self.newlink(root, ("<root>", root))

    def addrobot(self, root):
        root = urlparse.urljoin(root, "/")
        if self.robots.has_key(root): return
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        self.note(2, "Parsing %s", url)
        rp.debug = self.verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except IOError, msg:
            self.note(1, "I/O error parsing %s: %s", url, msg)

    def run(self):
        while self.todo:
            self.round = self.round + 1
            self.note(0, "\nRound %d (%s)\n", self.round, self.status())
            # Process at most roundsize URLs per round, in sorted order
            urls = self.todo.keys()
            urls.sort()
            del urls[self.roundsize:]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "%d total, %d to do, %d done, %d bad" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.bad))

    def report(self):
        self.message("")
        if not self.todo: s = "Final"
        else: s = "Interim"
        self.message("%s Report (%s)", s, self.status())
        self.report_errors()

    def report_errors(self):
        if not self.bad:
            self.message("\nNo errors")
            return
        self.message("\nError Report:")
        sources = self.errors.keys()
        sources.sort()
        for source in sources:
            triples = self.errors[source]
            self.message("")
            if len(triples) > 1:
                self.message("%d Errors in %s", len(triples), source)
            else:
                self.message("Error in %s", source)
            for url, rawlink, msg in triples:
                if rawlink != url: s = " (%s)" % rawlink
                else: s = ""
                self.message(" HREF %s%s\n msg %s", url, s, msg)

    def dopage(self, url):
        if self.verbose > 1:
            if self.verbose > 2:
                self.show("Check ", url, " from", self.todo[url])
            else:
                self.message("Check %s", url)
        page = self.getpage(url)
        if page:
            for info in page.getlinkinfos():
                link, rawlink = info
                origin = url, rawlink
                self.newlink(link, origin)
        self.markdone(url)

    def newlink(self, url, origin):
        if self.done.has_key(url):
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        self.done[url].append(origin)
        self.note(3, " Done link %s", url)

    def newtodolink(self, url, origin):
        if self.todo.has_key(url):
            self.todo[url].append(origin)
            self.note(3, " Seen todo link %s", url)
        else:
            self.todo[url] = [origin]
            self.note(3, " New todo link %s", url)

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

    def inroots(self, url):
        for root in self.roots:
            if url[:len(root)] == root:
                return self.isallowed(root, url)
        return 0

    def isallowed(self, root, url):
        root = urlparse.urljoin(root, "/")
        return self.robots[root].can_fetch(AGENTNAME, url)

    def getpage(self, url):
        if url[:7] == 'mailto:' or url[:5] == 'news:':
            self.note(1, " Not checking mailto/news URL")
            return None
        isint = self.inroots(url)
        if not isint:
            if not self.checkext:
                self.note(1, " Not checking ext link")
                return None
            # External link: only check that it can be opened
            f = self.openpage(url)
            if f:
                self.safeclose(f)
            return None
        text, nurl = self.readhtml(url)
        if nurl != url:
            self.note(1, " Redirected to %s", nurl)
            url = nurl
        if text:
            return Page(text, url, maxpage=self.maxpage, checker=self)

    def readhtml(self, url):
        text = None
        f, url = self.openhtml(url)
        if f:
            text = f.read()
            f.close()
        return text, url

    def openhtml(self, url):
        f = self.openpage(url)
        if f:
            url = f.geturl()
            info = f.info()
            if not self.checkforhtml(info, url):
                self.safeclose(f)
                f = None
        return f, url

    def openpage(self, url):
        try:
            return self.urlopener.open(url)
        except IOError, msg:
            msg = self.sanitize(msg)
            self.note(0, "Error %s", msg)
            if self.verbose > 0:
                self.show(" HREF ", url, " from", self.todo[url])
            self.setbad(url, msg)
            return None

    def checkforhtml(self, info, url):
        if info.has_key('content-type'):
            ctype = string.lower(info['content-type'])
        else:
            if url[-1:] == "/":
                return 1
            ctype, encoding = mimetypes.guess_type(url)
        if ctype == 'text/html':
            return 1
        else:
            self.note(1, " Not HTML, mime type %s", ctype)
            return 0

    def setgood(self, url):
        if self.bad.has_key(url):
            del self.bad[url]
            self.changed = 1
            self.note(0, "(Clear previously seen error)")

    def setbad(self, url, msg):
        if self.bad.has_key(url) and self.bad[url] == msg:
            self.note(0, "(Seen this error before)")
            return
        self.bad[url] = msg
        self.changed = 1
        self.markerror(url)

    def markerror(self, url):
        try:
            origins = self.todo[url]
        except KeyError:
            origins = self.done[url]
        for source, rawlink in origins:
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def seterror(self, url, triple):
        try:
            self.errors[url].append(triple)
        except KeyError:
            self.errors[url] = [triple]

    # The following used to be toplevel functions; they have been
    # changed into methods so they can be overridden in subclasses.

    def show(self, p1, link, p2, origins):
        self.message("%s %s", p1, link)
        i = 0
        for source, rawlink in origins:
            i = i+1
            if i == 2:
                p2 = ' '*len(p2)
            if rawlink != link: s = " (%s)" % rawlink
            else: s = ""
            self.message("%s %s%s", p2, source, s)

    def sanitize(self, msg):
        if isinstance(IOError, ClassType) and isinstance(msg, IOError):
            # Do the other branch recursively
            msg.args = self.sanitize(msg.args)
        elif isinstance(msg, TupleType):
            if len(msg) >= 4 and msg[0] == 'http error' and \
               isinstance(msg[3], InstanceType):
                # Remove the Message instance -- it may contain
                # a file object which prevents pickling.
                msg = msg[:3] + msg[4:]
        return msg

    def safeclose(self, f):
        try:
            url = f.geturl()
        except AttributeError:
            pass
        else:
            if url[:4] == 'ftp:' or url[:7] == 'file://':
                # Apparently ftp connections don't like to be closed
                # prematurely...
                text = f.read()
        f.close()

    def save_pickle(self, dumpfile=DUMPFILE):
        if not self.changed:
            self.note(0, "\nNo need to save checkpoint")
        elif not dumpfile:
            self.note(0, "No dumpfile, won't save checkpoint")
        else:
            self.note(0, "\nSaving checkpoint to %s ...", dumpfile)
            # Write to a side file first so that an interrupted dump
            # does not clobber the old checkpoint.
            newfile = dumpfile + ".new"
            f = open(newfile, "wb")
            pickle.dump(self, f)
            f.close()
            try:
                os.unlink(dumpfile)
            except os.error:
                pass
            os.rename(newfile, dumpfile)
            self.note(0, "Done.")
            return 1


class Page:

    def __init__(self, text, url, verbose=VERBOSE, maxpage=MAXPAGE, checker=None):
        self.text = text
        self.url = url
        self.verbose = verbose
        self.maxpage = maxpage
        self.checker = checker

    def note(self, level, msg, *args):
        if self.checker:
            apply(self.checker.note, (level, msg) + args)
        else:
            if self.verbose >= level:
                if args:
                    msg = msg%args
                print msg

    def getlinkinfos(self):
        size = len(self.text)
        if size > self.maxpage:
            self.note(0, "Skip huge file %s (%.0f Kbytes)", self.url, (size*0.001))
            return []
        # Use self.note so this also works when no checker was given
        self.note(2, " Parsing %s (%d bytes)", self.url, size)
        parser = MyHTMLParser(verbose=self.verbose, checker=self.checker)
        parser.feed(self.text)
        parser.close()
        rawlinks = parser.getlinks()
        base = urlparse.urljoin(self.url, parser.getbase() or "")
        infos = []
        for rawlink in rawlinks:
            # Drop the fragment before resolving against the base
            t = urlparse.urlparse(rawlink)
            t = t[:-1] + ('',)
            rawlink = urlparse.urlunparse(t)
            link = urlparse.urljoin(base, rawlink)
            infos.append((link, rawlink))
        return infos


class MyStringIO(StringIO.StringIO):

    def __init__(self, url, info):
        self.__url = url
        self.__info = info
        StringIO.StringIO.__init__(self)

    def info(self):
        return self.__info

    def geturl(self):
        return self.__url


class MyURLopener(urllib.FancyURLopener):

    http_error_default = urllib.URLopener.http_error_default

    def __init__(*args):
        self = args[0]
        apply(urllib.FancyURLopener.__init__, args)
        self.addheaders = [
            ('User-agent', 'Python-webchecker/%s' % __version__),
            ]

    def http_error_401(self, url, fp, errcode, errmsg, headers):
        return None

    def open_file(self, url):
        path = urllib.url2pathname(urllib.unquote(url))
        if os.path.isdir(path):
            if path[-1] != os.sep:
                url = url + '/'
            indexpath = os.path.join(path, "index.html")
            if os.path.exists(indexpath):
                return self.open_file(url + "index.html")
            try:
                names = os.listdir(path)
            except os.error, msg:
                raise IOError, msg, sys.exc_traceback
            names.sort()
            s = MyStringIO("file:"+url, {'content-type': 'text/html'})
            s.write('<BASE HREF="file:%s">\n' %
                    urllib.quote(os.path.join(path, "")))
            for name in names:
                q = urllib.quote(name)
                s.write('<A HREF="%s">%s</A>\n' % (q, q))
            s.seek(0)
            return s
        return urllib.FancyURLopener.open_file(self, url)


class MyHTMLParser(sgmllib.SGMLParser):

    def __init__(self, verbose=VERBOSE, checker=None):
        self.myverbose = verbose # now unused
        self.checker = checker
        self.base = None
        self.links = {}
        sgmllib.SGMLParser.__init__(self)

    def start_a(self, attributes):
        self.link_attr(attributes, 'href')

    def end_a(self): pass

    def do_area(self, attributes):
        self.link_attr(attributes, 'href')

    def do_img(self, attributes):
        self.link_attr(attributes, 'src', 'lowsrc')

    def do_frame(self, attributes):
        self.link_attr(attributes, 'src')

    def link_attr(self, attributes, *args):
        for name, value in attributes:
            if name in args:
                if value: value = string.strip(value)
                if value: self.links[value] = None

    def do_base(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = string.strip(value)
                if value:
                    if self.checker:
                        self.checker.note(1, " Base %s", value)
                    self.base = value

    def getlinks(self):
        return self.links.keys()

    def getbase(self):
        return self.base


if __name__ == '__main__':
    main()