#! /usr/bin/env python

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors. A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.

File URL extension:

To ease the checking of subwebs via the local file system,
the interpretation of ``file:'' URLs is extended to mimic the behavior
of your average HTTP daemon: if a directory pathname is given, the
file index.html in that directory is returned if it exists, otherwise
a directory listing is returned. Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked. In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).

Report printed:

When done, it reports pages with bad links within the subweb. When
interrupted, it reports on the pages that it has already checked.

In verbose mode, additional messages are printed during the
information gathering phase. By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered. Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file and the -R option allows it to restart from the
checkpoint (assuming that the pages on the subweb that were already
processed haven't changed). Even when it has run till completion, -R
can still be useful -- it will print the reports again, and -Rq prints
the errors only. In this case, the checkpoint file is not written
again. The checkpoint file can be set with the -d option.

The checkpoint file is written as a Python pickle. Remember that
Python's pickle module is currently quite slow. Give it the time it
needs to load and save the checkpoint file. When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.

Miscellaneous:

- You may find the (Tk-based) GUI version easier to use. See wcgui.py.

- Webchecker honors the "robots.txt" convention. Thanks to Skip
Montanaro for his robotparser.py module (included in this directory)!
The agent name is hardwired to "webchecker". URLs that are disallowed
by the robots.txt file are reported as external URLs.

- Because the SGML parser is a bit slow, very large SGML files are
skipped. The size limit can be set with the -m option.

- When the server or protocol does not tell us a file's type, we guess
it based on the URL's suffix. The mimetypes.py module (also in this
directory) has a built-in table mapping most currently known suffixes,
and in addition attempts to read the mime.types configuration files in
the default locations of Netscape and the NCSA HTTP daemon.

- We follow links indicated by <A>, <FRAME> and <IMG> tags. We also
honor the <BASE> tag.

- Checking external links is now done by default; use -x to *disable*
this feature. External links are now checked during normal
processing. (XXX The status of a checked link could be categorized
better. Later...)


Usage: webchecker.py [option] ... [rooturl] ...

Options:

-R        -- restart from checkpoint file
-d file   -- checkpoint filename (default %(DUMPFILE)s)
-m bytes  -- skip HTML pages larger than this size (default %(MAXPAGE)d)
-n        -- reports only, no checking (use with -R)
-q        -- quiet operation (also suppresses external links report)
-r number -- number of links processed per round (default %(ROUNDSIZE)d)
-v        -- verbose operation; repeating -v will increase verbosity
-x        -- don't check external links (these are often slow to check)

Arguments:

rooturl   -- URL to start checking
             (default %(DEFROOT)s)

"""

# ' Emacs bait


__version__ = "0.4"


import sys
import os
from types import *
import string
import StringIO
import getopt
import pickle

import urllib
import urlparse
import sgmllib

import mimetypes
import robotparser


# Tunable parameters
DEFROOT = "file:/usr/local/etc/httpd/htdocs/"   # Default root URL
MAXPAGE = 150000                        # Ignore files bigger than this
ROUNDSIZE = 50                          # Number of links processed per round
DUMPFILE = "@webchecker.pickle"         # Pickled checkpoint
AGENTNAME = "webchecker"                # Agent name for robots.txt parser


# Global variables
verbose = 1
maxpage = MAXPAGE
roundsize = ROUNDSIZE


def main():
    global verbose, maxpage, roundsize
    dumpfile = DUMPFILE
    restart = 0
    checkext = 1
    norun = 0

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:vx')
    except getopt.error, msg:
        sys.stdout = sys.stderr
        print msg
        print __doc__%globals()
        sys.exit(2)
    for o, a in opts:
        if o == '-R':
            restart = 1
        if o == '-d':
            dumpfile = a
        if o == '-m':
            maxpage = string.atoi(a)
        if o == '-n':
            norun = 1
        if o == '-q':
            verbose = 0
        if o == '-r':
            roundsize = string.atoi(a)
        if o == '-v':
            verbose = verbose + 1
        if o == '-x':
            checkext = not checkext

    if verbose > 0:
        print AGENTNAME, "version", __version__

    if restart:
        if verbose > 0:
            print "Loading checkpoint from %s ..." % dumpfile
        f = open(dumpfile, "rb")
        c = pickle.load(f)
        f.close()
        if verbose > 0:
            print "Done."
            print "Root:", string.join(c.roots, "\n ")
    else:
        c = Checker(checkext)
        if not args:
            args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    if not norun:
        try:
            c.run()
        except KeyboardInterrupt:
            if verbose > 0:
                print "[run interrupted]"

    try:
        c.report()
    except KeyboardInterrupt:
        if verbose > 0:
            print "[report interrupted]"

    if not c.changed:
        if verbose > 0:
            print
            print "No need to save checkpoint"
    elif not dumpfile:
        if verbose > 0:
            print "No dumpfile, won't save checkpoint"
    else:
        if verbose > 0:
            print
            print "Saving checkpoint to %s ..." % dumpfile
        newfile = dumpfile + ".new"
        f = open(newfile, "wb")
        pickle.dump(c, f)
        f.close()
        try:
            os.unlink(dumpfile)
        except os.error:
            pass
        os.rename(newfile, dumpfile)
        if verbose > 0:
            print "Done."
            if dumpfile == DUMPFILE:
                print "Use ``%s -R'' to restart." % sys.argv[0]
            else:
                print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
                                                           dumpfile)


class Checker:

    def __init__(self, checkext=1):
        self.reset()
        self.checkext = checkext

    def reset(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.bad = {}
        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.errors = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def __getstate__(self):
        return (self.roots, self.todo, self.done, self.bad, self.round)

    def __setstate__(self, state):
        (self.roots, self.todo, self.done, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)
        for url in self.bad.keys():
            self.markerror(url)

    def addroot(self, root):
        if root not in self.roots:
            self.roots.append(root)
            self.addrobot(root)
            self.newlink(root, ("<root>", root))

    def addrobot(self, root):
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        if verbose > 2:
            print "Parsing", url
        rp.debug = verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except IOError, msg:
            if verbose > 1:
                print "I/O error parsing", url, ":", msg

    def run(self):
        while self.todo:
            self.round = self.round + 1
            if verbose > 0:
                print
                print "Round %d (%s)" % (self.round, self.status())
                print
            urls = self.todo.keys()[:roundsize]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "%d total, %d to do, %d done, %d bad" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.bad))

    def report(self):
        print
        if not self.todo: print "Final",
        else: print "Interim",
        print "Report (%s)" % self.status()
        self.report_errors()

    def report_errors(self):
        if not self.bad:
            print
            print "No errors"
            return
        print
        print "Error Report:"
        sources = self.errors.keys()
        sources.sort()
        for source in sources:
            triples = self.errors[source]
            print
            if len(triples) > 1:
                print len(triples), "Errors in", source
            else:
                print "Error in", source
            for url, rawlink, msg in triples:
                print " HREF", url,
                if rawlink != url: print "(%s)" % rawlink,
                print
                print " msg", msg

    def dopage(self, url):
        if verbose > 1:
            if verbose > 2:
                show("Check ", url, " from", self.todo[url])
            else:
                print "Check ", url
        page = self.getpage(url)
        if page:
            for info in page.getlinkinfos():
                link, rawlink = info
                origin = url, rawlink
                self.newlink(link, origin)
        self.markdone(url)

    def newlink(self, url, origin):
        if self.done.has_key(url):
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        self.done[url].append(origin)
        if verbose > 3:
            print " Done link", url

    def newtodolink(self, url, origin):
        if self.todo.has_key(url):
            self.todo[url].append(origin)
            if verbose > 3:
                print " Seen todo link", url
        else:
            self.todo[url] = [origin]
            if verbose > 3:
                print " New todo link", url

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

    def inroots(self, url):
        for root in self.roots:
            if url[:len(root)] == root:
                return self.robots[root].can_fetch(AGENTNAME, url)
        return 0

    def getpage(self, url):
        if url[:7] == 'mailto:' or url[:5] == 'news:':
            if verbose > 1: print " Not checking mailto/news URL"
            return None
        isint = self.inroots(url)
        if not isint and not self.checkext:
            if verbose > 1: print " Not checking ext link"
            return None
        try:
            f = self.urlopener.open(url)
        except IOError, msg:
            msg = sanitize(msg)
            if verbose > 0:
                print "Error ", msg
            if verbose > 0:
                show(" HREF ", url, " from", self.todo[url])
            self.setbad(url, msg)
            return None
        if not isint:
            if verbose > 1: print " Not gathering links from ext URL"
            safeclose(f)
            return None
        nurl = f.geturl()
        info = f.info()
        if info.has_key('content-type'):
            ctype = string.lower(info['content-type'])
        else:
            ctype = None
        if nurl != url:
            if verbose > 1:
                print " Redirected to", nurl
        if not ctype:
            ctype, encoding = mimetypes.guess_type(nurl)
        if ctype != 'text/html':
            safeclose(f)
            if verbose > 1:
                print " Not HTML, mime type", ctype
            return None
        text = f.read()
        f.close()
        return Page(text, nurl)

    def setgood(self, url):
        if self.bad.has_key(url):
            del self.bad[url]
            self.changed = 1
            if verbose > 0:
                print "(Clear previously seen error)"

    def setbad(self, url, msg):
        if self.bad.has_key(url) and self.bad[url] == msg:
            if verbose > 0:
                print "(Seen this error before)"
            return
        self.bad[url] = msg
        self.changed = 1
        self.markerror(url)

    def markerror(self, url):
        try:
            origins = self.todo[url]
        except KeyError:
            origins = self.done[url]
        for source, rawlink in origins:
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def seterror(self, url, triple):
        try:
            self.errors[url].append(triple)
        except KeyError:
            self.errors[url] = [triple]


class Page:

    def __init__(self, text, url):
        self.text = text
        self.url = url

    def getlinkinfos(self):
        size = len(self.text)
        if size > maxpage:
            if verbose > 0:
                print "Skip huge file", self.url
                print " (%.0f Kbytes)" % (size*0.001)
            return []
        if verbose > 2:
            print " Parsing", self.url, "(%d bytes)" % size
        parser = MyHTMLParser()
        parser.feed(self.text)
        parser.close()
        rawlinks = parser.getlinks()
        base = urlparse.urljoin(self.url, parser.getbase() or "")
        infos = []
        for rawlink in rawlinks:
            t = urlparse.urlparse(rawlink)
            t = t[:-1] + ('',)
            rawlink = urlparse.urlunparse(t)
            link = urlparse.urljoin(base, rawlink)
            infos.append((link, rawlink))
        return infos


class MyStringIO(StringIO.StringIO):

    def __init__(self, url, info):
        self.__url = url
        self.__info = info
        StringIO.StringIO.__init__(self)

    def info(self):
        return self.__info

    def geturl(self):
        return self.__url


class MyURLopener(urllib.FancyURLopener):

    http_error_default = urllib.URLopener.http_error_default

    def __init__(*args):
        self = args[0]
        apply(urllib.FancyURLopener.__init__, args)
        self.addheaders = [('User-agent', 'Python-webchecker/%s' % __version__)]

    def open_file(self, url):
        path = urllib.url2pathname(urllib.unquote(url))
        if path[-1] != os.sep:
            url = url + '/'
        if os.path.isdir(path):
            indexpath = os.path.join(path, "index.html")
            if os.path.exists(indexpath):
                return self.open_file(url + "index.html")
            try:
                names = os.listdir(path)
            except os.error, msg:
                raise IOError, msg, sys.exc_traceback
            names.sort()
            s = MyStringIO("file:"+url, {'content-type': 'text/html'})
            s.write('<BASE HREF="file:%s">\n' %
                    urllib.quote(os.path.join(path, "")))
            for name in names:
                q = urllib.quote(name)
                s.write('<A HREF="%s">%s</A>\n' % (q, q))
            s.seek(0)
            return s
        return urllib.FancyURLopener.open_file(self, path)


class MyHTMLParser(sgmllib.SGMLParser):

    def __init__(self):
        self.base = None
        self.links = {}
        sgmllib.SGMLParser.__init__ (self)

    def start_a(self, attributes):
        self.link_attr(attributes, 'href')

    def end_a(self): pass

    def do_img(self, attributes):
        self.link_attr(attributes, 'src', 'lowsrc')

    def do_frame(self, attributes):
        self.link_attr(attributes, 'src')

    def link_attr(self, attributes, *args):
        for name, value in attributes:
            if name in args:
                if value: value = string.strip(value)
                if value: self.links[value] = None

    def do_base(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = string.strip(value)
                if value:
                    if verbose > 1:
                        print " Base", value
                    self.base = value

    def getlinks(self):
        return self.links.keys()

    def getbase(self):
        return self.base


def show(p1, link, p2, origins):
    print p1, link
    i = 0
    for source, rawlink in origins:
        i = i+1
        if i == 2:
            p2 = ' '*len(p2)
        print p2, source,
        if rawlink != link: print "(%s)" % rawlink,
        print


def sanitize(msg):
    if (type(msg) == TupleType and
        len(msg) >= 4 and
        msg[0] == 'http error' and
        type(msg[3]) == InstanceType):
        # Remove the Message instance -- it may contain
        # a file object which prevents pickling.
        msg = msg[:3] + msg[4:]
    return msg


def safeclose(f):
    url = f.geturl()
    if url[:4] == 'ftp:' or url[:7] == 'file://':
        # Apparently ftp connections don't like to be closed
        # prematurely...
        text = f.read()
    f.close()


if __name__ == '__main__':
    main()
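
The link-gathering step performed by Page.getlinkinfos() and MyHTMLParser can be sketched in present-day Python 3. This is an illustration only, not part of webchecker: the names LinkCollector and link_infos are hypothetical, the example.com URLs are placeholders, and the standard html.parser and urllib.parse modules stand in for sgmllib and urlparse. As in the original, raw links are de-duplicated before their fragments are stripped, and each link is resolved against the <BASE> href when one is present.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect link targets from <a>, <img> and <frame>, plus any <base href>."""

    # Link-carrying attributes per tag (mirrors start_a, do_img, do_frame).
    LINK_ATTRS = {"a": ("href",), "img": ("src", "lowsrc"), "frame": ("src",)}

    def __init__(self):
        super().__init__()
        self.base = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        for name, value in attrs:
            value = (value or "").strip()
            if not value:
                continue
            if tag == "base" and name == "href":
                self.base = value
            elif name in self.LINK_ATTRS.get(tag, ()) and value not in self.links:
                self.links.append(value)


def link_infos(page_url, html_text):
    """Return (absolute_link, raw_link) pairs with fragments removed."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    # An empty base leaves page_url unchanged, as in the original code.
    base = urljoin(page_url, parser.base or "")
    infos = []
    for raw in parser.links:
        raw, _fragment = urldefrag(raw)
        infos.append((urljoin(base, raw), raw))
    return infos
```

For example, link_infos("http://example.com/index.html", '<a href="a.html#top">A</a>') returns [("http://example.com/a.html", "a.html")].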