#! /usr/bin/env python
# Blame (earliest entry): Guido van Rossum, 272b37d, 1997-01-30 02:44:48 +0000

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors. A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.

File URL extension:

In order to ease the checking of subwebs via the local file system,
the interpretation of ``file:'' URLs is extended to mimic the behavior
of your average HTTP daemon: if a directory pathname is given, the
file index.html in that directory is returned if it exists, otherwise
a directory listing is returned. Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked. In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).

Report printed:

When done, it reports pages with bad links within the subweb. When
interrupted, it reports on the pages that it has checked so far.

In verbose mode, additional messages are printed during the
information gathering phase. By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered. Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file. The -R option allows it to restart from the
checkpoint (assuming that the pages on the subweb that were already
processed haven't changed). Even when it has run to completion, -R
can still be useful: it will print the reports again, and -Rq prints
the errors only. In this case, the checkpoint file is not written
again. The checkpoint file can be set with the -d option.

The checkpoint file is written as a Python pickle. Remember that
Python's pickle module is currently quite slow. Give it the time it
needs to load and save the checkpoint file. When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.

Miscellaneous:

- You may find the (Tk-based) GUI version easier to use. See wcgui.py.

- Webchecker honors the "robots.txt" convention. Thanks to Skip
  Montanaro for his robotparser.py module (included in this directory)!
  The agent name is hardwired to "webchecker". URLs that are disallowed
  by the robots.txt file are reported as external URLs.

- Because the SGML parser is a bit slow, very large SGML files are
  skipped. The size limit can be set with the -m option.

- When the server or protocol does not tell us a file's type, we guess
  it based on the URL's suffix. The mimetypes.py module (also in this
  directory) has a built-in table mapping most currently known suffixes,
  and in addition attempts to read the mime.types configuration files in
  the default locations of Netscape and the NCSA HTTP daemon.

- We follow links indicated by <A>, <FRAME> and <IMG> tags. We also
  honor the <BASE> tag.

- Checking external links is now done by default; use -x to *disable*
  this feature. External links are now checked during normal
  processing. (XXX The status of a checked link could be categorized
  better. Later...)


Usage: webchecker.py [option] ... [rooturl] ...

Options:

-R        -- restart from checkpoint file
-d file   -- checkpoint filename (default %(DUMPFILE)s)
-m bytes  -- skip HTML pages larger than this size (default %(MAXPAGE)d)
-n        -- reports only, no checking (use with -R)
-q        -- quiet operation (also suppresses external links report)
-r number -- number of links processed per round (default %(ROUNDSIZE)d)
-v        -- verbose operation; repeating -v will increase verbosity
-x        -- don't check external links (these are often slow to check)

Arguments:

rooturl   -- URL to start checking
             (default %(DEFROOT)s)

"""


__version__ = "0.5"


import sys
import os
from types import *
import string
import StringIO
import getopt
import pickle

import urllib
import urlparse
import sgmllib

import mimetypes
import robotparser


# Tunable parameters
DEFROOT = "file:/usr/local/etc/httpd/htdocs/"  # Default root URL
MAXPAGE = 150000                    # Ignore files bigger than this
ROUNDSIZE = 50                      # Number of links processed per round
DUMPFILE = "@webchecker.pickle"     # Pickled checkpoint
AGENTNAME = "webchecker"            # Agent name for robots.txt parser


# Global variables
verbose = 1
maxpage = MAXPAGE
roundsize = ROUNDSIZE


def main():
    global verbose, maxpage, roundsize
    dumpfile = DUMPFILE
    restart = 0
    checkext = 1
    norun = 0

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:vx')
    except getopt.error, msg:
        sys.stdout = sys.stderr
        print msg
        print __doc__%globals()
        sys.exit(2)
    for o, a in opts:
        if o == '-R':
            restart = 1
        if o == '-d':
            dumpfile = a
        if o == '-m':
            maxpage = string.atoi(a)
        if o == '-n':
            norun = 1
        if o == '-q':
            verbose = 0
        if o == '-r':
            roundsize = string.atoi(a)
        if o == '-v':
            verbose = verbose + 1
        if o == '-x':
            checkext = not checkext

    if verbose > 0:
        print AGENTNAME, "version", __version__

    if restart:
        if verbose > 0:
            print "Loading checkpoint from %s ..." % dumpfile
        f = open(dumpfile, "rb")
        c = pickle.load(f)
        f.close()
        # checkext is not part of the pickled state, so set it again
        # from the command line options.
        c.checkext = checkext
        if verbose > 0:
            print "Done."
            print "Root:", string.join(c.roots, "\n ")
    else:
        c = Checker(checkext)
        if not args:
            args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    if not norun:
        try:
            c.run()
        except KeyboardInterrupt:
            if verbose > 0:
                print "[run interrupted]"

    try:
        c.report()
    except KeyboardInterrupt:
        if verbose > 0:
            print "[report interrupted]"

    if not c.changed:
        if verbose > 0:
            print
            print "No need to save checkpoint"
    elif not dumpfile:
        if verbose > 0:
            print "No dumpfile, won't save checkpoint"
    else:
        if verbose > 0:
            print
            print "Saving checkpoint to %s ..." % dumpfile
        # Write to a new file first, so that an interrupt while
        # pickling leaves the old checkpoint untouched.
        newfile = dumpfile + ".new"
        f = open(newfile, "wb")
        pickle.dump(c, f)
        f.close()
        try:
            os.unlink(dumpfile)
        except os.error:
            pass
        os.rename(newfile, dumpfile)
        if verbose > 0:
            print "Done."
            if dumpfile == DUMPFILE:
                print "Use ``%s -R'' to restart." % sys.argv[0]
            else:
                print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
                                                           dumpfile)


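# Illustrative sketch only; not part of the original webchecker code.
# It isolates the checkpoint-saving pattern used at the end of main():
# pickle to "<dumpfile>.new", then replace the old file, so that an
# interrupted write never damages the previous checkpoint. The helper
# name _example_save_checkpoint and its parameters are hypothetical.
def _example_save_checkpoint(obj, dumpfile):
    import os
    import pickle
    newfile = dumpfile + ".new"
    f = open(newfile, "wb")
    try:
        pickle.dump(obj, f)
    finally:
        # Close the temporary file even if pickling fails.
        f.close()
    try:
        # Remove the old checkpoint; it may not exist on a first run.
        os.unlink(dumpfile)
    except OSError:
        pass
    os.rename(newfile, dumpfile)

# Usage (assumed file name):
#   _example_save_checkpoint({"round": 3}, "example.pickle")

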
class Checker:

    def __init__(self, checkext=1):
        self.reset()
        self.checkext = checkext

    def reset(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.bad = {}
        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.errors = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def __getstate__(self):
        return (self.roots, self.todo, self.done, self.bad, self.round)

    def __setstate__(self, state):
        # Unpickling does not call __init__, so recreate the attributes
        # that are not pickled (robots, errors, urlopener, changed)
        # before restoring the saved state.
        self.reset()
        (self.roots, self.todo, self.done, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)
        for url in self.bad.keys():
            self.markerror(url)

    def addroot(self, root):
        if root not in self.roots:
            self.roots.append(root)
            self.addrobot(root)
            self.newlink(root, ("<root>", root))

    def addrobot(self, root):
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        if verbose > 2:
            print "Parsing", url
            rp.debug = verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except IOError, msg:
            if verbose > 1:
                print "I/O error parsing", url, ":", msg

    def run(self):
        while self.todo:
            self.round = self.round + 1
            if verbose > 0:
                print
                print "Round %d (%s)" % (self.round, self.status())
                print
            urls = self.todo.keys()[:roundsize]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "%d total, %d to do, %d done, %d bad" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.bad))

    def report(self):
        print
        if not self.todo: print "Final",
        else: print "Interim",
        print "Report (%s)" % self.status()
        self.report_errors()

    def report_errors(self):
        if not self.bad:
            print
            print "No errors"
            return
        print
        print "Error Report:"
        sources = self.errors.keys()
        sources.sort()
        for source in sources:
            triples = self.errors[source]
            print
            if len(triples) > 1:
                print len(triples), "Errors in", source
            else:
                print "Error in", source
            for url, rawlink, msg in triples:
                print " HREF", url,
                if rawlink != url: print "(%s)" % rawlink,
                print
                print " msg", msg

    def dopage(self, url):
        if verbose > 1:
            if verbose > 2:
                show("Check ", url, " from", self.todo[url])
            else:
                print "Check ", url
        page = self.getpage(url)
        if page:
            for info in page.getlinkinfos():
                link, rawlink = info
                origin = url, rawlink
                self.newlink(link, origin)
        self.markdone(url)

    def newlink(self, url, origin):
        if self.done.has_key(url):
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        self.done[url].append(origin)
        if verbose > 3:
            print " Done link", url

    def newtodolink(self, url, origin):
        if self.todo.has_key(url):
            self.todo[url].append(origin)
            if verbose > 3:
                print " Seen todo link", url
        else:
            self.todo[url] = [origin]
            if verbose > 3:
                print " New todo link", url

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

    def inroots(self, url):
        for root in self.roots:
            if url[:len(root)] == root:
                return self.robots[root].can_fetch(AGENTNAME, url)
        return 0

    def getpage(self, url):
        if url[:7] == 'mailto:' or url[:5] == 'news:':
            if verbose > 1: print " Not checking mailto/news URL"
            return None
        isint = self.inroots(url)
        if not isint and not self.checkext:
            if verbose > 1: print " Not checking ext link"
            return None
        try:
            f = self.urlopener.open(url)
        except IOError, msg:
            msg = sanitize(msg)
            if verbose > 0:
                print "Error ", msg
            if verbose > 0:
                show(" HREF ", url, " from", self.todo[url])
            self.setbad(url, msg)
            return None
        if not isint:
            if verbose > 1: print " Not gathering links from ext URL"
            safeclose(f)
            return None
        nurl = f.geturl()
        info = f.info()
        if info.has_key('content-type'):
            ctype = string.lower(info['content-type'])
        else:
            ctype = None
        if nurl != url:
            if verbose > 1:
                print " Redirected to", nurl
        if not ctype:
            ctype, encoding = mimetypes.guess_type(nurl)
        if ctype != 'text/html':
            safeclose(f)
            if verbose > 1:
                print " Not HTML, mime type", ctype
            return None
        text = f.read()
        f.close()
        return Page(text, nurl)

    def setgood(self, url):
        if self.bad.has_key(url):
            del self.bad[url]
            self.changed = 1
            if verbose > 0:
                print "(Clear previously seen error)"

    def setbad(self, url, msg):
        if self.bad.has_key(url) and self.bad[url] == msg:
            if verbose > 0:
                print "(Seen this error before)"
            return
        self.bad[url] = msg
        self.changed = 1
        self.markerror(url)

    def markerror(self, url):
        try:
            origins = self.todo[url]
        except KeyError:
            origins = self.done[url]
        for source, rawlink in origins:
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def seterror(self, url, triple):
        try:
            self.errors[url].append(triple)
        except KeyError:
            self.errors[url] = [triple]


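# Illustrative sketch only; not part of the original webchecker code.
# It restates the subweb rule from the module docstring: a URL belongs
# to the subweb if one of the root URLs is an initial prefix of it.
# The helper name _example_in_subweb is hypothetical, and unlike
# Checker.inroots() it deliberately ignores robots.txt.
def _example_in_subweb(url, roots):
    for root in roots:
        # Same slice comparison that Checker.inroots() uses.
        if url[:len(root)] == root:
            return 1
    return 0

# Usage (assumed URLs):
#   _example_in_subweb("http://host/docs/a.html", ["http://host/docs/"])
#   gives 1, while a URL on a different host gives 0.

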
class Page:

    def __init__(self, text, url):
        self.text = text
        self.url = url

    def getlinkinfos(self):
        size = len(self.text)
        if size > maxpage:
            if verbose > 0:
                print "Skip huge file", self.url
                print " (%.0f Kbytes)" % (size*0.001)
            return []
        if verbose > 2:
            print " Parsing", self.url, "(%d bytes)" % size
        parser = MyHTMLParser()
        parser.feed(self.text)
        parser.close()
        rawlinks = parser.getlinks()
        base = urlparse.urljoin(self.url, parser.getbase() or "")
        infos = []
        for rawlink in rawlinks:
            # Drop the fragment identifier before resolving the link
            t = urlparse.urlparse(rawlink)
            t = t[:-1] + ('',)
            rawlink = urlparse.urlunparse(t)
            link = urlparse.urljoin(base, rawlink)
            infos.append((link, rawlink))
        return infos


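# Illustrative sketch only; not part of the original webchecker code.
# It isolates the link normalization done in Page.getlinkinfos(): the
# fragment identifier is dropped from the raw link, and the result is
# resolved against the page's base URL. The helper name
# _example_resolve_link is hypothetical. The import fallback is an
# assumption added so the sketch also runs where the urlparse module
# has been renamed to urllib.parse.
def _example_resolve_link(base, rawlink):
    try:
        import urlparse as _up          # module name used in this file
    except ImportError:
        import urllib.parse as _up      # assumed newer location
    t = _up.urlparse(rawlink)
    # The last of the six components is the fragment; blank it out.
    t = tuple(t)[:-1] + ('',)
    rawlink = _up.urlunparse(t)
    return _up.urljoin(base, rawlink), rawlink

# Usage (assumed URLs):
#   _example_resolve_link("http://h/a/b.html", "c.html#sec")
#   returns the pair ("http://h/a/c.html", "c.html").

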
class MyStringIO(StringIO.StringIO):

    def __init__(self, url, info):
        self.__url = url
        self.__info = info
        StringIO.StringIO.__init__(self)

    def info(self):
        return self.__info

    def geturl(self):
        return self.__url


class MyURLopener(urllib.FancyURLopener):

    http_error_default = urllib.URLopener.http_error_default

    def __init__(*args):
        self = args[0]
        apply(urllib.FancyURLopener.__init__, args)
        self.addheaders = [
            ('User-agent', 'Python-webchecker/%s' % __version__),
            ]

    def http_error_401(self, url, fp, errcode, errmsg, headers):
        return None

    def open_file(self, url):
        path = urllib.url2pathname(urllib.unquote(url))
        if path[-1] != os.sep:
            url = url + '/'
        if os.path.isdir(path):
            indexpath = os.path.join(path, "index.html")
            if os.path.exists(indexpath):
                return self.open_file(url + "index.html")
            try:
                names = os.listdir(path)
            except os.error, msg:
                raise IOError, msg, sys.exc_traceback
            names.sort()
            s = MyStringIO("file:"+url, {'content-type': 'text/html'})
            s.write('<BASE HREF="file:%s">\n' %
                    urllib.quote(os.path.join(path, "")))
            for name in names:
                q = urllib.quote(name)
                s.write('<A HREF="%s">%s</A>\n' % (q, q))
            s.seek(0)
            return s
        return urllib.FancyURLopener.open_file(self, path)


class MyHTMLParser(sgmllib.SGMLParser):

    def __init__(self):
        self.base = None
        self.links = {}
        sgmllib.SGMLParser.__init__(self)

    def start_a(self, attributes):
        self.link_attr(attributes, 'href')

    def end_a(self): pass

    def do_img(self, attributes):
        self.link_attr(attributes, 'src', 'lowsrc')

    def do_frame(self, attributes):
        self.link_attr(attributes, 'src')

    def link_attr(self, attributes, *args):
        for name, value in attributes:
            if name in args:
                if value: value = string.strip(value)
                if value: self.links[value] = None

    def do_base(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = string.strip(value)
                if value:
                    if verbose > 1:
                        print " Base", value
                    self.base = value

    def getlinks(self):
        return self.links.keys()

    def getbase(self):
        return self.base


def show(p1, link, p2, origins):
    print p1, link
    i = 0
    for source, rawlink in origins:
        i = i+1
        if i == 2:
            p2 = ' '*len(p2)
        print p2, source,
        if rawlink != link: print "(%s)" % rawlink,
        print


def sanitize(msg):
    if (type(msg) == TupleType and
        len(msg) >= 4 and
        msg[0] == 'http error' and
        type(msg[3]) == InstanceType):
        # Remove the Message instance -- it may contain
        # a file object which prevents pickling.
        msg = msg[:3] + msg[4:]
    return msg


def safeclose(f):
    url = f.geturl()
    if url[:4] == 'ftp:' or url[:7] == 'file://':
        # Apparently ftp connections don't like to be closed
        # prematurely...
        text = f.read()
    f.close()


if __name__ == '__main__':
    main()