#! /usr/bin/env python

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors. A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.

File URL extension:

In order to ease the checking of subwebs via the local file system,
the interpretation of ``file:'' URLs is extended to mimic the behavior
of your average HTTP daemon: if a directory pathname is given, the
file index.html in that directory is returned if it exists, otherwise
a directory listing is returned. Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked. In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).

Reports printed:

When done, it reports links to pages outside the web (unless -q is
specified), and pages with bad links within the subweb. When
interrupted, it prints those same reports for the pages that it has
checked already.

In verbose mode, additional messages are printed during the
information gathering phase. By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered. Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file and the -R option allows it to restart from the
checkpoint (assuming that the pages on the subweb that were already
processed haven't changed). Even when it has run till completion, -R
can still be useful -- it will print the reports again, and -Rq prints
the errors only. In this case, the checkpoint file is not written
again. The checkpoint file can be set with the -d option.

The checkpoint file is written as a Python pickle. Remember that
Python's pickle module is currently quite slow. Give it the time it
needs to load and save the checkpoint file. When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.

Miscellaneous:

- Webchecker honors the "robots.txt" convention. Thanks to Skip
Montanaro for his robotparser.py module (included in this directory)!
The agent name is hardwired to "webchecker". URLs that are disallowed
by the robots.txt file are reported as external URLs.

- Because the HTML parser is a bit slow, very large HTML files are
skipped. The size limit can be set with the -m option.

- Before fetching a page, it guesses its type based on its extension.
If it is a known extension and the type is not text/html, the page is
not fetched. This is a huge optimization but occasionally it means
links can be missed, and such links aren't checked for validity
(XXX!). The mimetypes.py module (also in this directory) has a
built-in table mapping most currently known suffixes, and in addition
attempts to read the mime.types configuration files in the default
locations of Netscape and the NCSA HTTP daemon.

- It only follows links indicated by <A> tags. It doesn't follow
links in <FORM> or <IMG> or whatever other tags might contain
hyperlinks. It does honor the <BASE> tag.

- Checking external links is not done by default; use -x to enable
this feature. This is done because checking external links usually
takes a lot of time. When enabled, this check is executed during the
report generation phase (even when the report is silent).


Usage: webchecker.py [option] ... [rooturl] ...

Options:

-R        -- restart from checkpoint file
-d file   -- checkpoint filename (default %(DUMPFILE)s)
-m bytes  -- skip HTML pages larger than this size (default %(MAXPAGE)d)
-n        -- reports only, no checking (use with -R)
-q        -- quiet operation (also suppresses external links report)
-r number -- number of links processed per round (default %(ROUNDSIZE)d)
-v        -- verbose operation; repeating -v will increase verbosity
-x        -- check external links (during report phase)

Arguments:

rooturl   -- URL to start checking
             (default %(DEFROOT)s)

"""
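
# Example invocations (a sketch; the hostname is hypothetical, the
# options are documented in the docstring above):
#
#   webchecker.py http://www.example.com/
#   webchecker.py -m 300000 -v http://www.example.com/
#   webchecker.py -R -q     # reprint only the errors from the last run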

# ' Emacs bait


__version__ = "0.3"


import sys
import os
from types import *
import string
import StringIO
import getopt
import pickle

import urllib
import urlparse
import sgmllib

import mimetypes
import robotparser


# Tunable parameters
DEFROOT = "file:/usr/local/etc/httpd/htdocs/"   # Default root URL
MAXPAGE = 150000                # Ignore files bigger than this
ROUNDSIZE = 50                  # Number of links processed per round
DUMPFILE = "@webchecker.pickle" # Pickled checkpoint
AGENTNAME = "webchecker"        # Agent name for robots.txt parser


# Global variables
verbose = 1
maxpage = MAXPAGE
roundsize = ROUNDSIZE


def main():
    global verbose, maxpage, roundsize
    dumpfile = DUMPFILE
    restart = 0
    checkext = 0
    norun = 0

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:vx')
    except getopt.error, msg:
        sys.stdout = sys.stderr
        print msg
        print __doc__ % globals()
        sys.exit(2)
    for o, a in opts:
        if o == '-R':
            restart = 1
        if o == '-d':
            dumpfile = a
        if o == '-m':
            maxpage = string.atoi(a)
        if o == '-n':
            norun = 1
        if o == '-q':
            verbose = 0
        if o == '-r':
            roundsize = string.atoi(a)
        if o == '-v':
            verbose = verbose + 1
        if o == '-x':
            checkext = 1

    if verbose > 0:
        print AGENTNAME, "version", __version__

    if restart:
        if verbose > 0:
            print "Loading checkpoint from %s ..." % dumpfile
        f = open(dumpfile, "rb")
        c = pickle.load(f)
        f.close()
        if verbose > 0:
            print "Done."
            print "Root:", string.join(c.roots, "\n      ")
    else:
        c = Checker()
        if not args:
            args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    if not norun:
        try:
            c.run()
        except KeyboardInterrupt:
            if verbose > 0:
                print "[run interrupted]"

    try:
        c.report(checkext)
    except KeyboardInterrupt:
        if verbose > 0:
            print "[report interrupted]"

    if not c.changed:
        if verbose > 0:
            print
            print "No need to save checkpoint"
    elif not dumpfile:
        if verbose > 0:
            print "No dumpfile, won't save checkpoint"
    else:
        if verbose > 0:
            print
            print "Saving checkpoint to %s ..." % dumpfile
        newfile = dumpfile + ".new"
        f = open(newfile, "wb")
        pickle.dump(c, f)
        f.close()
        try:
            os.unlink(dumpfile)
        except os.error:
            pass
        os.rename(newfile, dumpfile)
        if verbose > 0:
            print "Done."
            if dumpfile == DUMPFILE:
                print "Use ``%s -R'' to restart." % sys.argv[0]
            else:
                print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
                                                           dumpfile)


class Checker:

    def __init__(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.ext = {}
        self.bad = {}
        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def __getstate__(self):
        return (self.roots, self.todo, self.done,
                self.ext, self.bad, self.round)

    def __setstate__(self, state):
        (self.roots, self.todo, self.done,
         self.ext, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)

    def addroot(self, root):
        if root not in self.roots:
            self.roots.append(root)
            self.addrobot(root)
            self.newintlink(root, ("<root>", root))

    def addrobot(self, root):
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        if verbose > 2:
            print "Parsing", url
        rp.debug = verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except IOError, msg:
            if verbose > 1:
                print "I/O error parsing", url, ":", msg

    def run(self):
        while self.todo:
            self.round = self.round + 1
            if verbose > 0:
                print
                print "Round", self.round, self.status()
                print
            urls = self.todo.keys()[:roundsize]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "(%d total, %d to do, %d done, %d external, %d bad)" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.ext), len(self.bad))

    def report(self, checkext=0):
        print
        if not self.todo: print "Final",
        else: print "Interim",
        print "Report", self.status()
        if verbose > 0 or checkext:
            self.report_extrefs(checkext)
        # Report errors last because the output may get truncated
        self.report_errors()

    def report_extrefs(self, checkext=0):
        if not self.ext:
            if verbose > 0:
                print
                print "No external URLs"
            return
        if verbose > 0:
            print
            if checkext:
                print "External URLs (checking validity):"
            else:
                print "External URLs (not checked):"
            print
        urls = self.ext.keys()
        urls.sort()
        for url in urls:
            if verbose > 0:
                show("HREF ", url, " from", self.ext[url])
            if checkext:
                self.checkextpage(url)

    def checkextpage(self, url):
        if url[:7] == 'mailto:' or url[:5] == 'news:':
            if verbose > 2: print "Not checking", url
            return
        if verbose > 2: print "Checking", url, "..."
        try:
            f = self.urlopener.open(url)
            safeclose(f)
            if verbose > 3: print "OK"
            if self.bad.has_key(url):
                self.setgood(url)
        except IOError, msg:
            msg = sanitize(msg)
            if verbose > 0: print "Error", msg
            self.setbad(url, msg)

    def report_errors(self):
        if not self.bad:
            print
            print "No errors"
            return
        print
        print "Error Report:"
        urls = self.bad.keys()
        urls.sort()
        bysource = {}
        for url in urls:
            try:
                origins = self.done[url]
            except KeyError:
                try:
                    origins = self.todo[url]
                except KeyError:
                    origins = self.ext[url]
            for source, rawlink in origins:
                triple = url, rawlink, self.bad[url]
                try:
                    bysource[source].append(triple)
                except KeyError:
                    bysource[source] = [triple]
        sources = bysource.keys()
        sources.sort()
        for source in sources:
            triples = bysource[source]
            print
            if len(triples) > 1:
                print len(triples), "Errors in", source
            else:
                print "Error in", source
            for url, rawlink, msg in triples:
                print "  HREF", url,
                if rawlink != url: print "(%s)" % rawlink,
                print
                print "   msg", msg

    def dopage(self, url):
        if verbose > 1:
            if verbose > 2:
                show("Page  ", url, " from", self.todo[url])
            else:
                print "Page ", url
        page = self.getpage(url)
        if page:
            for info in page.getlinkinfos():
                link, rawlink = info
                origin = url, rawlink
                if not self.inroots(link):
                    self.newextlink(link, origin)
                else:
                    self.newintlink(link, origin)
        self.markdone(url)

    def newextlink(self, url, origin):
        try:
            # Appending to an existing entry means we've seen it before
            self.ext[url].append(origin)
            if verbose > 3:
                print "  Seen ext link", url
        except KeyError:
            self.ext[url] = [origin]
            if verbose > 3:
                print "  New ext link", url

    def newintlink(self, url, origin):
        if self.done.has_key(url):
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        self.done[url].append(origin)
        if verbose > 3:
            print "  Done link", url

    def newtodolink(self, url, origin):
        if self.todo.has_key(url):
            self.todo[url].append(origin)
            if verbose > 3:
                print "  Seen todo link", url
        else:
            self.todo[url] = [origin]
            if verbose > 3:
                print "  New todo link", url

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

    def inroots(self, url):
        for root in self.roots:
            if url[:len(root)] == root:
                return self.robots[root].can_fetch(AGENTNAME, url)
        return 0

    def getpage(self, url):
        try:
            f = self.urlopener.open(url)
        except IOError, msg:
            msg = sanitize(msg)
            if verbose > 0:
                print "Error ", msg
                show(" HREF ", url, " from", self.todo[url])
            self.setbad(url, msg)
            return None
        nurl = f.geturl()
        info = f.info()
        if info.has_key('content-type'):
            ctype = string.lower(info['content-type'])
        else:
            ctype = None
        if nurl != url:
            if verbose > 1:
                print " Redirected to", nurl
        if not ctype:
            ctype, encoding = mimetypes.guess_type(nurl)
        if ctype != 'text/html':
            safeclose(f)
            if verbose > 1:
                print " Not HTML, mime type", ctype
            return None
        text = f.read()
        f.close()
        return Page(text, nurl)

    def setgood(self, url):
        if self.bad.has_key(url):
            del self.bad[url]
            self.changed = 1
            if verbose > 0:
                print "(Clear previously seen error)"

    def setbad(self, url, msg):
        if self.bad.has_key(url) and self.bad[url] == msg:
            if verbose > 0:
                print "(Seen this error before)"
            return
        self.bad[url] = msg
        self.changed = 1


class Page:

    def __init__(self, text, url):
        self.text = text
        self.url = url

    def getlinkinfos(self):
        size = len(self.text)
        if size > maxpage:
            if verbose > 0:
                print "Skip huge file", self.url
                print "  (%.0f Kbytes)" % (size*0.001)
            return []
        if verbose > 2:
            print " Parsing", self.url, "(%d bytes)" % size
        parser = MyHTMLParser()
        parser.feed(self.text)
        parser.close()
        rawlinks = parser.getlinks()
        base = urlparse.urljoin(self.url, parser.getbase() or "")
        infos = []
        for rawlink in rawlinks:
            t = urlparse.urlparse(rawlink)
            t = t[:-1] + ('',)
            rawlink = urlparse.urlunparse(t)
            link = urlparse.urljoin(base, rawlink)
            infos.append((link, rawlink))
        return infos


class MyStringIO(StringIO.StringIO):

    def __init__(self, url, info):
        self.__url = url
        self.__info = info
        StringIO.StringIO.__init__(self)

    def info(self):
        return self.__info

    def geturl(self):
        return self.__url


class MyURLopener(urllib.FancyURLopener):

    http_error_default = urllib.URLopener.http_error_default

    def __init__(*args):
        self = args[0]
        apply(urllib.FancyURLopener.__init__, args)
        self.addheaders = [('User-agent', 'Python-webchecker/%s' % __version__)]

    def open_file(self, url):
        path = urllib.url2pathname(urllib.unquote(url))
        if os.path.isdir(path):
            # Only directory URLs need the trailing slash appended
            if path[-1] != os.sep:
                url = url + '/'
            indexpath = os.path.join(path, "index.html")
            if os.path.exists(indexpath):
                return self.open_file(url + "index.html")
            try:
                names = os.listdir(path)
            except os.error, msg:
                raise IOError, msg, sys.exc_traceback
            names.sort()
            s = MyStringIO("file:"+url, {'content-type': 'text/html'})
            s.write('<BASE HREF="file:%s">\n' %
                    urllib.quote(os.path.join(path, "")))
            for name in names:
                q = urllib.quote(name)
                s.write('<A HREF="%s">%s</A>\n' % (q, q))
            s.seek(0)
            return s
        return urllib.FancyURLopener.open_file(self, path)


class MyHTMLParser(sgmllib.SGMLParser):

    def __init__(self):
        self.base = None
        self.links = {}
        sgmllib.SGMLParser.__init__(self)

    def start_a(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = string.strip(value)
                if value: self.links[value] = None
                return          # match only first href

    def do_base(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = string.strip(value)
                if value:
                    if verbose > 1:
                        print "  Base", value
                    self.base = value
                return          # match only first href

    def getlinks(self):
        return self.links.keys()

    def getbase(self):
        return self.base


def show(p1, link, p2, origins):
    print p1, link
    i = 0
    for source, rawlink in origins:
        i = i+1
        if i == 2:
            p2 = ' '*len(p2)
        print p2, source,
        if rawlink != link: print "(%s)" % rawlink,
        print


def sanitize(msg):
    if (type(msg) == TupleType and
        len(msg) >= 4 and
        msg[0] == 'http error' and
        type(msg[3]) == InstanceType):
        # Remove the Message instance -- it may contain
        # a file object which prevents pickling.
        msg = msg[:3] + msg[4:]
    return msg


def safeclose(f):
    url = f.geturl()
    if url[:4] == 'ftp:' or url[:7] == 'file://':
        # Apparently ftp connections don't like to be closed
        # prematurely...
        text = f.read()
    f.close()


if __name__ == '__main__':
    main()
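
# Example (a hypothetical usage sketch, not part of the tool): the
# Checker class can also be driven from another script instead of the
# command line, using the public methods defined above:
#
#   import webchecker
#   webchecker.verbose = 0          # module-level verbosity flag
#   c = webchecker.Checker()
#   c.addroot("http://www.example.com/")
#   c.run()                         # gather links round by round
#   c.report()                      # print external-link and error reports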