#! /usr/bin/env python

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors. A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.
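
As an illustration only, the prefix rule can be sketched in a few
lines of present-day Python. The helper name in_subweb and the example
URLs are assumptions made for this sketch; the checker's own inroots()
method additionally consults robots.txt before it treats a page as
internal.

```python
def in_subweb(url, roots):
    # A page belongs to the subweb when some root URL is an
    # initial prefix of the page's URL (plain string comparison).
    for root in roots:
        if url.startswith(root):
            return True
    return False


roots = ["http://www.example.com/docs/"]
inside = in_subweb("http://www.example.com/docs/intro.html", roots)
outside = in_subweb("http://www.example.com/news/today.html", roots)
```

Because the test is a plain string comparison, a root without a
trailing slash also matches sibling paths that merely share the prefix.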

File URL extension:

In order to ease the checking of subwebs via the local file system,
the interpretation of ``file:'' URLs is extended to mimic the behavior
of your average HTTP daemon: if a directory pathname is given, the
file index.html in that directory is returned if it exists, otherwise
a directory listing is returned. Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked. In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).
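
The directory rule above can be sketched as follows in present-day
Python. This is not the code webchecker runs (see
MyURLopener.open_file for that); resolve_directory is a hypothetical
helper invented for this illustration, and it only reports which of
the cases applies.

```python
import os
import tempfile


def resolve_directory(path):
    # Mimic a typical HTTP daemon: a directory is served as its
    # index.html when that file exists, otherwise as a listing.
    if not os.path.isdir(path):
        return ("file", path)
    index = os.path.join(path, "index.html")
    if os.path.exists(index):
        return ("file", index)
    return ("listing", sorted(os.listdir(path)))


# Small demonstration in a scratch directory.
top = tempfile.mkdtemp()
with open(os.path.join(top, "a.html"), "w") as f:
    f.write("<A HREF=b.html>b</A>")
before = resolve_directory(top)   # no index.html yet: a listing
with open(os.path.join(top, "index.html"), "w") as f:
    f.write("<A HREF=a.html>a</A>")
after = resolve_directory(top)    # index.html now exists
```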

Report printed:

When done, it reports pages with bad links within the subweb. When
interrupted, it reports on the pages that it has checked already.

In verbose mode, additional messages are printed during the
information gathering phase. By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered. Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file and the -R option allows it to restart from the
checkpoint (assuming that the pages on the subweb that were already
processed haven't changed). Even when it has run till completion, -R
can still be useful -- it will print the reports again, and -Rq prints
the errors only. In this case, the checkpoint file is not written
again. The checkpoint file can be set with the -d option.

The checkpoint file is written as a Python pickle. Remember that
Python's pickle module is currently quite slow. Give it the time it
needs to load and save the checkpoint file. When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.
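
The write-then-rename behavior described above might look like this in
present-day Python. It is a sketch, not the checker's save_pickle()
method: save_checkpoint and load_checkpoint are hypothetical names,
the state is a plain dictionary rather than a Checker instance, and
os.replace stands in for the unlink-and-rename pair used by the real
code.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, dumpfile):
    # Write to a temporary name first; the old checkpoint is only
    # replaced once the new one has been written completely.
    newfile = dumpfile + ".new"
    with open(newfile, "wb") as f:
        pickle.dump(state, f)
    os.replace(newfile, dumpfile)


def load_checkpoint(dumpfile):
    # Read the pickled state back, as the -R option would.
    with open(dumpfile, "rb") as f:
        return pickle.load(f)


dumpfile = os.path.join(tempfile.mkdtemp(), "checkpoint.pickle")
state = {"todo": {"http://www.example.com/": []}, "done": {}, "round": 3}
save_checkpoint(state, dumpfile)
restored = load_checkpoint(dumpfile)
```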

Miscellaneous:

- You may find the (Tk-based) GUI version easier to use. See wcgui.py.

- Webchecker honors the "robots.txt" convention. Thanks to Skip
Montanaro for his robotparser.py module (included in this directory)!
The agent name is hardwired to "webchecker". URLs that are disallowed
by the robots.txt file are reported as external URLs.

- Because the SGML parser is a bit slow, very large SGML files are
skipped. The size limit can be set with the -m option.

- When the server or protocol does not tell us a file's type, we guess
it based on the URL's suffix. The mimetypes.py module (also in this
directory) has a built-in table mapping most currently known suffixes,
and in addition attempts to read the mime.types configuration files in
the default locations of Netscape and the NCSA HTTP daemon.

- We follow links indicated by <A>, <AREA>, <FRAME> and <IMG> tags. We
also honor the <BASE> tag.

- Checking external links is now done by default; use -x to *disable*
this feature. External links are now checked during normal
processing. (XXX The status of a checked link could be categorized
better. Later...)
Guido van Rossum272b37d1997-01-30 02:44:48 +000074
75
76Usage: webchecker.py [option] ... [rooturl] ...
77
78Options:
79
80-R -- restart from checkpoint file
81-d file -- checkpoint filename (default %(DUMPFILE)s)
82-m bytes -- skip HTML pages larger than this size (default %(MAXPAGE)d)
Guido van Rossume5605ba1997-01-31 14:43:15 +000083-n -- reports only, no checking (use with -R)
Guido van Rossum272b37d1997-01-30 02:44:48 +000084-q -- quiet operation (also suppresses external links report)
85-r number -- number of links processed per round (default %(ROUNDSIZE)d)
86-v -- verbose operation; repeating -v will increase verbosity
Guido van Rossumaf310c11997-02-02 23:30:32 +000087-x -- don't check external links (these are often slow to check)
Guido van Rossum272b37d1997-01-30 02:44:48 +000088
89Arguments:
90
91rooturl -- URL to start checking
92 (default %(DEFROOT)s)
93
94"""


__version__ = "$Revision$"


import sys
import os
from types import *
import string
import StringIO
import getopt
import pickle

import urllib
import urlparse
import sgmllib

import mimetypes
import robotparser

# Extract real version number if necessary
if __version__[0] == '$':
    _v = string.split(__version__)
    if len(_v) == 3:
        __version__ = _v[1]


# Tunable parameters
DEFROOT = "file:/usr/local/etc/httpd/htdocs/"   # Default root URL
CHECKEXT = 1                            # Check external references (1 deep)
VERBOSE = 1                             # Verbosity level (0-3)
MAXPAGE = 150000                        # Ignore files bigger than this
ROUNDSIZE = 50                          # Number of links processed per round
DUMPFILE = "@webchecker.pickle"         # Pickled checkpoint
AGENTNAME = "webchecker"                # Agent name for robots.txt parser


# Global variables


def main():
    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    dumpfile = DUMPFILE
    restart = 0
    norun = 0

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:vx')
    except getopt.error, msg:
        sys.stdout = sys.stderr
        print msg
        print __doc__%globals()
        sys.exit(2)
    for o, a in opts:
        if o == '-R':
            restart = 1
        if o == '-d':
            dumpfile = a
        if o == '-m':
            maxpage = string.atoi(a)
        if o == '-n':
            norun = 1
        if o == '-q':
            verbose = 0
        if o == '-r':
            roundsize = string.atoi(a)
        if o == '-v':
            verbose = verbose + 1
        if o == '-x':
            checkext = not checkext

    if verbose > 0:
        print AGENTNAME, "version", __version__

    if restart:
        c = load_pickle(dumpfile=dumpfile, verbose=verbose)
    else:
        c = Checker()

    c.setflags(checkext=checkext, verbose=verbose,
               maxpage=maxpage, roundsize=roundsize)

    if not restart and not args:
        args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    try:

        if not norun:
            try:
                c.run()
            except KeyboardInterrupt:
                if verbose > 0:
                    print "[run interrupted]"

        try:
            c.report()
        except KeyboardInterrupt:
            if verbose > 0:
                print "[report interrupted]"

    finally:
        if c.save_pickle(dumpfile):
            if dumpfile == DUMPFILE:
                print "Use ``%s -R'' to restart." % sys.argv[0]
            else:
                print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
                                                           dumpfile)


def load_pickle(dumpfile=DUMPFILE, verbose=VERBOSE):
    if verbose > 0:
        print "Loading checkpoint from %s ..." % dumpfile
    f = open(dumpfile, "rb")
    c = pickle.load(f)
    f.close()
    if verbose > 0:
        print "Done."
        print "Root:", string.join(c.roots, "\n ")
    return c


class Checker:

    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE

    validflags = tuple(dir())

    def __init__(self):
        self.reset()

    def setflags(self, **kw):
        for key in kw.keys():
            if key not in self.validflags:
                raise NameError, "invalid keyword argument: %s" % str(key)
        for key, value in kw.items():
            setattr(self, key, value)

    def reset(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.bad = {}
        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.errors = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def __getstate__(self):
        return (self.roots, self.todo, self.done, self.bad, self.round)

    def __setstate__(self, state):
        self.reset()
        (self.roots, self.todo, self.done, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)
        for url in self.bad.keys():
            self.markerror(url)

    def addroot(self, root):
        if root not in self.roots:
            troot = root
            scheme, netloc, path, params, query, fragment = \
                    urlparse.urlparse(root)
            # Use the directory part of the URL (everything up to and
            # including the last slash) as the subweb prefix.
            i = string.rfind(path, "/") + 1
            if 0 < i < len(path):
                path = path[:i]
                troot = urlparse.urlunparse((scheme, netloc, path,
                                             params, query, fragment))
            self.roots.append(troot)
            self.addrobot(root)
            self.newlink(root, ("<root>", root))

    def addrobot(self, root):
        root = urlparse.urljoin(root, "/")
        if self.robots.has_key(root): return
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        if self.verbose > 2:
            print "Parsing", url
            rp.debug = self.verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except IOError, msg:
            if self.verbose > 1:
                print "I/O error parsing", url, ":", msg

    def run(self):
        while self.todo:
            self.round = self.round + 1
            if self.verbose > 0:
                print
                print "Round %d (%s)" % (self.round, self.status())
                print
            urls = self.todo.keys()[:self.roundsize]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "%d total, %d to do, %d done, %d bad" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.bad))

    def report(self):
        print
        if not self.todo: print "Final",
        else: print "Interim",
        print "Report (%s)" % self.status()
        self.report_errors()

    def report_errors(self):
        if not self.bad:
            print
            print "No errors"
            return
        print
        print "Error Report:"
        sources = self.errors.keys()
        sources.sort()
        for source in sources:
            triples = self.errors[source]
            print
            if len(triples) > 1:
                print len(triples), "Errors in", source
            else:
                print "Error in", source
            for url, rawlink, msg in triples:
                print " HREF", url,
                if rawlink != url: print "(%s)" % rawlink,
                print
                print " msg", msg

    def dopage(self, url):
        if self.verbose > 1:
            if self.verbose > 2:
                self.show("Check ", url, " from", self.todo[url])
            else:
                print "Check ", url
        page = self.getpage(url)
        if page:
            for info in page.getlinkinfos():
                link, rawlink = info
                origin = url, rawlink
                self.newlink(link, origin)
        self.markdone(url)

    def newlink(self, url, origin):
        if self.done.has_key(url):
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        self.done[url].append(origin)
        if self.verbose > 3:
            print " Done link", url

    def newtodolink(self, url, origin):
        if self.todo.has_key(url):
            self.todo[url].append(origin)
            if self.verbose > 3:
                print " Seen todo link", url
        else:
            self.todo[url] = [origin]
            if self.verbose > 3:
                print " New todo link", url

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

    def inroots(self, url):
        for root in self.roots:
            if url[:len(root)] == root:
                root = urlparse.urljoin(root, "/")
                return self.robots[root].can_fetch(AGENTNAME, url)
        return 0

    def getpage(self, url):
        if url[:7] == 'mailto:' or url[:5] == 'news:':
            if self.verbose > 1: print " Not checking mailto/news URL"
            return None
        isint = self.inroots(url)
        if not isint:
            # External link: only verify that it can be opened;
            # its contents are never parsed for further links.
            if not self.checkext:
                if self.verbose > 1: print " Not checking ext link"
                return None
            f = self.openpage(url)
            if f:
                self.safeclose(f)
            return None
        # Internal page: fetch it and, if it is HTML, return a Page.
        text, nurl = self.readhtml(url)
        if nurl != url:
            if self.verbose > 1:
                print " Redirected to", nurl
            url = nurl
        if text:
            return Page(text, url, verbose=self.verbose, maxpage=self.maxpage)

    def readhtml(self, url):
        text = None
        f, url = self.openhtml(url)
        if f:
            text = f.read()
            f.close()
        return text, url

    def openhtml(self, url):
        f = self.openpage(url)
        if f:
            url = f.geturl()
            info = f.info()
            if not self.checkforhtml(info, url):
                self.safeclose(f)
                f = None
        return f, url

    def openpage(self, url):
        try:
            return self.urlopener.open(url)
        except IOError, msg:
            msg = self.sanitize(msg)
            if self.verbose > 0:
                print "Error ", msg
            if self.verbose > 0:
                self.show(" HREF ", url, " from", self.todo[url])
            self.setbad(url, msg)
            return None

    def checkforhtml(self, info, url):
        if info.has_key('content-type'):
            ctype = string.lower(info['content-type'])
        else:
            if url[-1:] == "/":
                return 1
            ctype, encoding = mimetypes.guess_type(url)
        if ctype == 'text/html':
            return 1
        else:
            if self.verbose > 1:
                print " Not HTML, mime type", ctype
            return 0

    def setgood(self, url):
        if self.bad.has_key(url):
            del self.bad[url]
            self.changed = 1
            if self.verbose > 0:
                print "(Clear previously seen error)"

    def setbad(self, url, msg):
        if self.bad.has_key(url) and self.bad[url] == msg:
            if self.verbose > 0:
                print "(Seen this error before)"
            return
        self.bad[url] = msg
        self.changed = 1
        self.markerror(url)

    def markerror(self, url):
        try:
            origins = self.todo[url]
        except KeyError:
            origins = self.done[url]
        for source, rawlink in origins:
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def seterror(self, url, triple):
        try:
            self.errors[url].append(triple)
        except KeyError:
            self.errors[url] = [triple]

    # The following used to be toplevel functions; they have been
    # changed into methods so they can be overridden in subclasses.

    def show(self, p1, link, p2, origins):
        print p1, link
        i = 0
        for source, rawlink in origins:
            i = i+1
            if i == 2:
                p2 = ' '*len(p2)
            print p2, source,
            if rawlink != link: print "(%s)" % rawlink,
            print

    def sanitize(self, msg):
        if isinstance(IOError, ClassType) and isinstance(msg, IOError):
            # Do the other branch recursively
            msg.args = self.sanitize(msg.args)
        elif isinstance(msg, TupleType):
            if len(msg) >= 4 and msg[0] == 'http error' and \
               isinstance(msg[3], InstanceType):
                # Remove the Message instance -- it may contain
                # a file object which prevents pickling.
                msg = msg[:3] + msg[4:]
        return msg

    def safeclose(self, f):
        try:
            url = f.geturl()
        except AttributeError:
            pass
        else:
            if url[:4] == 'ftp:' or url[:7] == 'file://':
                # Apparently ftp connections don't like to be closed
                # prematurely...
                text = f.read()
        f.close()

    def save_pickle(self, dumpfile=DUMPFILE):
        if not self.changed:
            if self.verbose > 0:
                print
                print "No need to save checkpoint"
        elif not dumpfile:
            if self.verbose > 0:
                print "No dumpfile, won't save checkpoint"
        else:
            if self.verbose > 0:
                print
                print "Saving checkpoint to %s ..." % dumpfile
            # Write to a temporary file first, so that an interrupted
            # save leaves the old checkpoint intact.
            newfile = dumpfile + ".new"
            f = open(newfile, "wb")
            pickle.dump(self, f)
            f.close()
            try:
                os.unlink(dumpfile)
            except os.error:
                pass
            os.rename(newfile, dumpfile)
            if self.verbose > 0:
                print "Done."
            return 1


class Page:

    def __init__(self, text, url, verbose=VERBOSE, maxpage=MAXPAGE):
        self.text = text
        self.url = url
        self.verbose = verbose
        self.maxpage = maxpage

    def getlinkinfos(self):
        size = len(self.text)
        if size > self.maxpage:
            if self.verbose > 0:
                print "Skip huge file", self.url
                print " (%.0f Kbytes)" % (size*0.001)
            return []
        if self.verbose > 2:
            print " Parsing", self.url, "(%d bytes)" % size
        parser = MyHTMLParser(verbose=self.verbose)
        parser.feed(self.text)
        parser.close()
        rawlinks = parser.getlinks()
        base = urlparse.urljoin(self.url, parser.getbase() or "")
        infos = []
        for rawlink in rawlinks:
            t = urlparse.urlparse(rawlink)
            # Drop the fragment identifier before resolving the link
            t = t[:-1] + ('',)
            rawlink = urlparse.urlunparse(t)
            link = urlparse.urljoin(base, rawlink)
            infos.append((link, rawlink))
        return infos


class MyStringIO(StringIO.StringIO):

    def __init__(self, url, info):
        self.__url = url
        self.__info = info
        StringIO.StringIO.__init__(self)

    def info(self):
        return self.__info

    def geturl(self):
        return self.__url


class MyURLopener(urllib.FancyURLopener):

    http_error_default = urllib.URLopener.http_error_default

    def __init__(*args):
        self = args[0]
        apply(urllib.FancyURLopener.__init__, args)
        self.addheaders = [
            ('User-agent', 'Python-webchecker/%s' % __version__),
            ]

    def http_error_401(self, url, fp, errcode, errmsg, headers):
        return None

    def open_file(self, url):
        path = urllib.url2pathname(urllib.unquote(url))
        if path[-1] != os.sep:
            url = url + '/'
        if os.path.isdir(path):
            indexpath = os.path.join(path, "index.html")
            if os.path.exists(indexpath):
                return self.open_file(url + "index.html")
            # No index.html: synthesize an HTML directory listing
            try:
                names = os.listdir(path)
            except os.error, msg:
                raise IOError, msg, sys.exc_traceback
            names.sort()
            s = MyStringIO("file:"+url, {'content-type': 'text/html'})
            s.write('<BASE HREF="file:%s">\n' %
                    urllib.quote(os.path.join(path, "")))
            for name in names:
                q = urllib.quote(name)
                s.write('<A HREF="%s">%s</A>\n' % (q, q))
            s.seek(0)
            return s
        return urllib.FancyURLopener.open_file(self, path)


class MyHTMLParser(sgmllib.SGMLParser):

    def __init__(self, verbose=VERBOSE):
        self.base = None
        self.links = {}
        self.myverbose = verbose
        sgmllib.SGMLParser.__init__(self)

    def start_a(self, attributes):
        self.link_attr(attributes, 'href')

    def end_a(self): pass

    def do_area(self, attributes):
        self.link_attr(attributes, 'href')

    def do_img(self, attributes):
        self.link_attr(attributes, 'src', 'lowsrc')

    def do_frame(self, attributes):
        self.link_attr(attributes, 'src')

    def link_attr(self, attributes, *args):
        for name, value in attributes:
            if name in args:
                if value: value = string.strip(value)
                if value: self.links[value] = None

    def do_base(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = string.strip(value)
                if value:
                    if self.myverbose > 1:
                        print " Base", value
                    self.base = value

    def getlinks(self):
        return self.links.keys()

    def getbase(self):
        return self.base


if __name__ == '__main__':
    main()