blob: c4548616848024653985bfd701b993acda9bf2e1 [file] [log] [blame]
Guido van Rossum272b37d1997-01-30 02:44:48 +00001#! /usr/bin/env python
2
3"""Web tree checker.
4
5This utility is handy to check a subweb of the world-wide web for
6errors. A subweb is specified by giving one or more ``root URLs''; a
7page belongs to the subweb if one of the root URLs is an initial
8prefix of it.
9
10File URL extension:
11
12In order to easy the checking of subwebs via the local file system,
13the interpretation of ``file:'' URLs is extended to mimic the behavior
14of your average HTTP daemon: if a directory pathname is given, the
15file index.html in that directory is returned if it exists, otherwise
16a directory listing is returned. Now, you can point webchecker to the
17document tree in the local file system of your HTTP daemon, and have
18most of it checked. In fact the default works this way if your local
19web tree is located at /usr/local/etc/httpd/htdpcs (the default for
20the NCSA HTTP daemon and probably others).
21
22Reports printed:
23
24When done, it reports links to pages outside the web (unless -q is
25specified), and pages with bad links within the subweb. When
26interrupted, it print those same reports for the pages that it has
27checked already.
28
29In verbose mode, additional messages are printed during the
30information gathering phase. By default, it prints a summary of its
31work status every 50 URLs (adjustable with the -r option), and it
32reports errors as they are encountered. Use the -q option to disable
33this output.
34
35Checkpoint feature:
36
37Whether interrupted or not, it dumps its state (a Python pickle) to a
38checkpoint file and the -R option allows it to restart from the
39checkpoint (assuming that the pages on the subweb that were already
40processed haven't changed). Even when it has run till completion, -R
41can still be useful -- it will print the reports again, and -Rq prints
42the errors only. In this case, the checkpoint file is not written
43again. The checkpoint file can be set with the -d option.
44
45The checkpoint file is written as a Python pickle. Remember that
46Python's pickle module is currently quite slow. Give it the time it
47needs to load and save the checkpoint file. When interrupted while
48writing the checkpoint file, the old checkpoint file is not
49overwritten, but all work done in the current run is lost.
50
51Miscellaneous:
52
Guido van Rossum3edbb351997-01-30 03:19:41 +000053- Webchecker honors the "robots.txt" convention. Thanks to Skip
54Montanaro for his robotparser.py module (included in this directory)!
55The agent name is hardwired to "webchecker". URLs that are disallowed
56by the robots.txt file are reported as external URLs.
57
Guido van Rossum272b37d1997-01-30 02:44:48 +000058- Because the HTML parser is a bit slow, very large HTML files are
Guido van Rossum3edbb351997-01-30 03:19:41 +000059skipped. The size limit can be set with the -m option.
Guido van Rossum272b37d1997-01-30 02:44:48 +000060
61- Before fetching a page, it guesses its type based on its extension.
Guido van Rossume5605ba1997-01-31 14:43:15 +000062If it is a known extension and the type is not text/html, the page is
Guido van Rossum272b37d1997-01-30 02:44:48 +000063not fetched. This is a huge optimization but occasionally it means
Guido van Rossume5605ba1997-01-31 14:43:15 +000064links can be missed, and such links aren't checked for validity
65(XXX!). The mimetypes.py module (also in this directory) has a
66built-in table mapping most currently known suffixes, and in addition
67attempts to read the mime.types configuration files in the default
68locations of Netscape and the NCSA HTTP daemon.
Guido van Rossum272b37d1997-01-30 02:44:48 +000069
70- It only follows links indicated by <A> tags. It doesn't follow
71links in <FORM> or <IMG> or whatever other tags might contain
72hyperlinks. It does honor the <BASE> tag.
73
Guido van Rossumde662681997-01-30 03:58:21 +000074- Checking external links is not done by default; use -x to enable
75this feature. This is done because checking external links usually
76takes a lot of time. When enabled, this check is executed during the
Guido van Rossumc59a5d41997-01-30 06:04:00 +000077report generation phase (even when the report is silent).
Guido van Rossum272b37d1997-01-30 02:44:48 +000078
79
80Usage: webchecker.py [option] ... [rooturl] ...
81
82Options:
83
84-R -- restart from checkpoint file
85-d file -- checkpoint filename (default %(DUMPFILE)s)
86-m bytes -- skip HTML pages larger than this size (default %(MAXPAGE)d)
Guido van Rossume5605ba1997-01-31 14:43:15 +000087-n -- reports only, no checking (use with -R)
Guido van Rossum272b37d1997-01-30 02:44:48 +000088-q -- quiet operation (also suppresses external links report)
89-r number -- number of links processed per round (default %(ROUNDSIZE)d)
90-v -- verbose operation; repeating -v will increase verbosity
Guido van Rossumde662681997-01-30 03:58:21 +000091-x -- check external links (during report phase)
Guido van Rossum272b37d1997-01-30 02:44:48 +000092
93Arguments:
94
95rooturl -- URL to start checking
96 (default %(DEFROOT)s)
97
98"""
99
Guido van Rossume5605ba1997-01-31 14:43:15 +0000100# ' Emacs bait
101
102
103__version__ = "0.3"
Guido van Rossum325a64f1997-01-30 03:30:20 +0000104
Guido van Rossum272b37d1997-01-30 02:44:48 +0000105
106import sys
107import os
108from types import *
109import string
110import StringIO
111import getopt
112import pickle
113
114import urllib
115import urlparse
Guido van Rossume5605ba1997-01-31 14:43:15 +0000116import sgmllib
Guido van Rossum272b37d1997-01-30 02:44:48 +0000117
118import mimetypes
Guido van Rossum3edbb351997-01-30 03:19:41 +0000119import robotparser
Guido van Rossum272b37d1997-01-30 02:44:48 +0000120
121
122# Tunable parameters
123DEFROOT = "file:/usr/local/etc/httpd/htdocs/" # Default root URL
124MAXPAGE = 50000 # Ignore files bigger than this
125ROUNDSIZE = 50 # Number of links processed per round
126DUMPFILE = "@webchecker.pickle" # Pickled checkpoint
Guido van Rossum3edbb351997-01-30 03:19:41 +0000127AGENTNAME = "webchecker" # Agent name for robots.txt parser
Guido van Rossum272b37d1997-01-30 02:44:48 +0000128
129
130# Global variables
131verbose = 1
132maxpage = MAXPAGE
133roundsize = ROUNDSIZE
134
135
136def main():
137 global verbose, maxpage, roundsize
138 dumpfile = DUMPFILE
139 restart = 0
Guido van Rossumde662681997-01-30 03:58:21 +0000140 checkext = 0
Guido van Rossume5605ba1997-01-31 14:43:15 +0000141 norun = 0
Guido van Rossum272b37d1997-01-30 02:44:48 +0000142
143 try:
Guido van Rossume5605ba1997-01-31 14:43:15 +0000144 opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:vx')
Guido van Rossum272b37d1997-01-30 02:44:48 +0000145 except getopt.error, msg:
146 sys.stdout = sys.stderr
147 print msg
Guido van Rossum272b37d1997-01-30 02:44:48 +0000148 sys.exit(2)
149 for o, a in opts:
150 if o == '-R':
151 restart = 1
152 if o == '-d':
153 dumpfile = a
154 if o == '-m':
155 maxpage = string.atoi(a)
Guido van Rossume5605ba1997-01-31 14:43:15 +0000156 if o == '-n':
157 norun = 1
Guido van Rossum272b37d1997-01-30 02:44:48 +0000158 if o == '-q':
159 verbose = 0
160 if o == '-r':
161 roundsize = string.atoi(a)
162 if o == '-v':
163 verbose = verbose + 1
Guido van Rossumde662681997-01-30 03:58:21 +0000164 if o == '-x':
165 checkext = 1
Guido van Rossum272b37d1997-01-30 02:44:48 +0000166
Guido van Rossume5605ba1997-01-31 14:43:15 +0000167 if verbose > 0:
Guido van Rossum325a64f1997-01-30 03:30:20 +0000168 print AGENTNAME, "version", __version__
169
Guido van Rossum272b37d1997-01-30 02:44:48 +0000170 if restart:
171 if verbose > 0:
172 print "Loading checkpoint from %s ..." % dumpfile
173 f = open(dumpfile, "rb")
174 c = pickle.load(f)
175 f.close()
176 if verbose > 0:
177 print "Done."
178 print "Root:", string.join(c.roots, "\n ")
179 else:
180 c = Checker()
181 if not args:
182 args.append(DEFROOT)
183
184 for arg in args:
185 c.addroot(arg)
186
Guido van Rossume5605ba1997-01-31 14:43:15 +0000187 if not norun:
188 try:
189 c.run()
190 except KeyboardInterrupt:
191 if verbose > 0:
192 print "[run interrupted]"
193
Guido van Rossumde662681997-01-30 03:58:21 +0000194 try:
195 c.report(checkext)
196 except KeyboardInterrupt:
197 if verbose > 0:
198 print "[report interrupted]"
Guido van Rossume5605ba1997-01-31 14:43:15 +0000199
200 if not c.changed:
Guido van Rossum272b37d1997-01-30 02:44:48 +0000201 if verbose > 0:
202 print
203 print "No need to save checkpoint"
Guido van Rossume5605ba1997-01-31 14:43:15 +0000204 elif not dumpfile:
205 if verbose > 0:
206 print "No dumpfile, won't save checkpoint"
207 else:
Guido van Rossum272b37d1997-01-30 02:44:48 +0000208 if verbose > 0:
209 print
210 print "Saving checkpoint to %s ..." % dumpfile
211 newfile = dumpfile + ".new"
212 f = open(newfile, "wb")
213 pickle.dump(c, f)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000214 f.close()
215 try:
216 os.unlink(dumpfile)
217 except os.error:
218 pass
219 os.rename(newfile, dumpfile)
220 if verbose > 0:
221 print "Done."
222 if dumpfile == DUMPFILE:
223 print "Use ``%s -R'' to restart." % sys.argv[0]
224 else:
225 print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
226 dumpfile)
227
228
229class Checker:
230
231 def __init__(self):
232 self.roots = []
233 self.todo = {}
234 self.done = {}
235 self.ext = {}
236 self.bad = {}
Guido van Rossum272b37d1997-01-30 02:44:48 +0000237 self.round = 0
Guido van Rossume5605ba1997-01-31 14:43:15 +0000238 # The following are not pickled:
Guido van Rossum3edbb351997-01-30 03:19:41 +0000239 self.robots = {}
Guido van Rossume5605ba1997-01-31 14:43:15 +0000240 self.urlopener = MyURLopener()
241 self.changed = 0
Guido van Rossum3edbb351997-01-30 03:19:41 +0000242
243 def __getstate__(self):
244 return (self.roots, self.todo, self.done,
245 self.ext, self.bad, self.round)
246
247 def __setstate__(self, state):
248 (self.roots, self.todo, self.done,
249 self.ext, self.bad, self.round) = state
250 for root in self.roots:
251 self.addrobot(root)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000252
253 def addroot(self, root):
254 if root not in self.roots:
255 self.roots.append(root)
Guido van Rossum3edbb351997-01-30 03:19:41 +0000256 self.addrobot(root)
Guido van Rossume5605ba1997-01-31 14:43:15 +0000257 self.newintlink(root, ("<root>", root))
Guido van Rossum3edbb351997-01-30 03:19:41 +0000258
259 def addrobot(self, root):
Guido van Rossum3edbb351997-01-30 03:19:41 +0000260 url = urlparse.urljoin(root, "/robots.txt")
Guido van Rossum325a64f1997-01-30 03:30:20 +0000261 self.robots[root] = rp = robotparser.RobotFileParser()
262 if verbose > 2:
263 print "Parsing", url
Guido van Rossume5605ba1997-01-31 14:43:15 +0000264 rp.debug = verbose > 3
Guido van Rossum3edbb351997-01-30 03:19:41 +0000265 rp.set_url(url)
Guido van Rossum325a64f1997-01-30 03:30:20 +0000266 try:
267 rp.read()
268 except IOError, msg:
269 if verbose > 1:
270 print "I/O error parsing", url, ":", msg
Guido van Rossum272b37d1997-01-30 02:44:48 +0000271
272 def run(self):
273 while self.todo:
274 self.round = self.round + 1
275 if verbose > 0:
276 print
Guido van Rossume5605ba1997-01-31 14:43:15 +0000277 print "Round", self.round, self.status()
278 print
Guido van Rossum272b37d1997-01-30 02:44:48 +0000279 urls = self.todo.keys()[:roundsize]
280 for url in urls:
281 self.dopage(url)
Guido van Rossume5605ba1997-01-31 14:43:15 +0000282
283 def status(self):
284 return "(%d total, %d to do, %d done, %d external, %d bad)" % (
285 len(self.todo)+len(self.done),
286 len(self.todo), len(self.done),
287 len(self.ext), len(self.bad))
Guido van Rossum272b37d1997-01-30 02:44:48 +0000288
Guido van Rossumde662681997-01-30 03:58:21 +0000289 def report(self, checkext=0):
Guido van Rossum272b37d1997-01-30 02:44:48 +0000290 print
291 if not self.todo: print "Final",
292 else: print "Interim",
Guido van Rossume5605ba1997-01-31 14:43:15 +0000293 print "Report", self.status()
Guido van Rossumc59a5d41997-01-30 06:04:00 +0000294 if verbose > 0 or checkext:
Guido van Rossumde662681997-01-30 03:58:21 +0000295 self.report_extrefs(checkext)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000296 # Report errors last because the output may get truncated
297 self.report_errors()
298
Guido van Rossumde662681997-01-30 03:58:21 +0000299 def report_extrefs(self, checkext=0):
Guido van Rossum272b37d1997-01-30 02:44:48 +0000300 if not self.ext:
Guido van Rossumc59a5d41997-01-30 06:04:00 +0000301 if verbose > 0:
302 print
303 print "No external URLs"
Guido van Rossum272b37d1997-01-30 02:44:48 +0000304 return
Guido van Rossumc59a5d41997-01-30 06:04:00 +0000305 if verbose > 0:
306 print
307 if checkext:
308 print "External URLs (checking validity):"
309 else:
310 print "External URLs (not checked):"
311 print
Guido van Rossum272b37d1997-01-30 02:44:48 +0000312 urls = self.ext.keys()
313 urls.sort()
314 for url in urls:
Guido van Rossumc59a5d41997-01-30 06:04:00 +0000315 if verbose > 0:
316 show("HREF ", url, " from", self.ext[url])
Guido van Rossumde662681997-01-30 03:58:21 +0000317 if not checkext:
318 continue
Guido van Rossum2739cd71997-01-30 04:26:57 +0000319 if url[:7] == 'mailto:':
320 if verbose > 2: print "Not checking", url
321 continue
Guido van Rossumde662681997-01-30 03:58:21 +0000322 if verbose > 2: print "Checking", url, "..."
323 try:
324 f = self.urlopener.open(url)
Guido van Rossume5605ba1997-01-31 14:43:15 +0000325 safeclose(f)
Guido van Rossumde662681997-01-30 03:58:21 +0000326 if verbose > 3: print "OK"
Guido van Rossume5605ba1997-01-31 14:43:15 +0000327 if self.bad.has_key(url):
328 self.setgood(url)
Guido van Rossumde662681997-01-30 03:58:21 +0000329 except IOError, msg:
Guido van Rossum2739cd71997-01-30 04:26:57 +0000330 msg = sanitize(msg)
Guido van Rossumc59a5d41997-01-30 06:04:00 +0000331 if verbose > 0: print "Error", msg
Guido van Rossume5605ba1997-01-31 14:43:15 +0000332 self.setbad(url, msg)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000333
334 def report_errors(self):
335 if not self.bad:
336 print
337 print "No errors"
338 return
339 print
340 print "Error Report:"
341 urls = self.bad.keys()
342 urls.sort()
343 bysource = {}
344 for url in urls:
345 try:
346 origins = self.done[url]
347 except KeyError:
Guido van Rossum2739cd71997-01-30 04:26:57 +0000348 try:
349 origins = self.todo[url]
350 except KeyError:
351 origins = self.ext[url]
Guido van Rossum272b37d1997-01-30 02:44:48 +0000352 for source, rawlink in origins:
353 triple = url, rawlink, self.bad[url]
354 try:
355 bysource[source].append(triple)
356 except KeyError:
357 bysource[source] = [triple]
358 sources = bysource.keys()
359 sources.sort()
360 for source in sources:
361 triples = bysource[source]
362 print
363 if len(triples) > 1:
364 print len(triples), "Errors in", source
365 else:
366 print "Error in", source
367 for url, rawlink, msg in triples:
368 print " HREF", url,
369 if rawlink != url: print "(%s)" % rawlink,
370 print
371 print " msg", msg
372
373 def dopage(self, url):
374 if verbose > 1:
375 if verbose > 2:
376 show("Page ", url, " from", self.todo[url])
377 else:
378 print "Page ", url
379 page = self.getpage(url)
Guido van Rossume5605ba1997-01-31 14:43:15 +0000380 if page:
381 for info in page.getlinkinfos():
382 link, rawlink = info
383 origin = url, rawlink
384 if not self.inroots(link):
385 self.newextlink(link, origin)
386 else:
387 self.newintlink(link, origin)
388 self.markdone(url)
389
390 def newextlink(self, url, origin):
391 try:
392 self.ext[url].append(origin)
393 if verbose > 3:
394 print " New ext link", url
395 except KeyError:
396 self.ext[url] = [origin]
397 if verbose > 3:
398 print " Seen ext link", url
399
400 def newintlink(self, url, origin):
401 if self.done.has_key(url):
402 self.newdonelink(url, origin)
403 else:
404 self.newtodolink(url, origin)
405
406 def newdonelink(self, url, origin):
407 self.done[url].append(origin)
408 if verbose > 3:
409 print " Done link", url
410
411 def newtodolink(self, url, origin):
412 if self.todo.has_key(url):
413 self.todo[url].append(origin)
414 if verbose > 3:
415 print " Seen todo link", url
416 else:
417 self.todo[url] = [origin]
418 if verbose > 3:
419 print " New todo link", url
420
421 def markdone(self, url):
422 self.done[url] = self.todo[url]
423 del self.todo[url]
424 self.changed = 1
Guido van Rossum272b37d1997-01-30 02:44:48 +0000425
426 def inroots(self, url):
427 for root in self.roots:
428 if url[:len(root)] == root:
Guido van Rossum3edbb351997-01-30 03:19:41 +0000429 return self.robots[root].can_fetch(AGENTNAME, url)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000430 return 0
431
432 def getpage(self, url):
Guido van Rossum272b37d1997-01-30 02:44:48 +0000433 try:
434 f = self.urlopener.open(url)
435 except IOError, msg:
Guido van Rossum2739cd71997-01-30 04:26:57 +0000436 msg = sanitize(msg)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000437 if verbose > 0:
438 print "Error ", msg
439 if verbose > 0:
440 show(" HREF ", url, " from", self.todo[url])
Guido van Rossume5605ba1997-01-31 14:43:15 +0000441 self.setbad(url, msg)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000442 return None
443 nurl = f.geturl()
444 info = f.info()
445 if info.has_key('content-type'):
446 ctype = string.lower(info['content-type'])
Guido van Rossume5605ba1997-01-31 14:43:15 +0000447 else:
448 ctype = None
Guido van Rossum272b37d1997-01-30 02:44:48 +0000449 if nurl != url:
450 if verbose > 1:
Guido van Rossum3edbb351997-01-30 03:19:41 +0000451 print " Redirected to", nurl
Guido van Rossume5605ba1997-01-31 14:43:15 +0000452 if not ctype:
453 ctype, encoding = mimetypes.guess_type(nurl)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000454 if ctype != 'text/html':
Guido van Rossume5605ba1997-01-31 14:43:15 +0000455 safeclose(f)
456 if verbose > 1:
457 print " Not HTML, mime type", ctype
Guido van Rossum272b37d1997-01-30 02:44:48 +0000458 return None
459 text = f.read()
460 f.close()
461 return Page(text, nurl)
462
Guido van Rossume5605ba1997-01-31 14:43:15 +0000463 def setgood(self, url):
464 if self.bad.has_key(url):
465 del self.bad[url]
466 self.changed = 1
467 if verbose > 0:
468 print "(Clear previously seen error)"
469
470 def setbad(self, url, msg):
471 if self.bad.has_key(url) and self.bad[url] == msg:
472 if verbose > 0:
473 print "(Seen this error before)"
474 return
475 self.bad[url] = msg
476 self.changed = 1
477
Guido van Rossum272b37d1997-01-30 02:44:48 +0000478
479class Page:
480
481 def __init__(self, text, url):
482 self.text = text
483 self.url = url
484
485 def getlinkinfos(self):
486 size = len(self.text)
487 if size > maxpage:
488 if verbose > 0:
489 print "Skip huge file", self.url
490 print " (%.0f Kbytes)" % (size*0.001)
491 return []
492 if verbose > 2:
493 print " Parsing", self.url, "(%d bytes)" % size
Guido van Rossume5605ba1997-01-31 14:43:15 +0000494 parser = MyHTMLParser()
Guido van Rossum272b37d1997-01-30 02:44:48 +0000495 parser.feed(self.text)
496 parser.close()
497 rawlinks = parser.getlinks()
498 base = urlparse.urljoin(self.url, parser.getbase() or "")
499 infos = []
500 for rawlink in rawlinks:
501 t = urlparse.urlparse(rawlink)
502 t = t[:-1] + ('',)
503 rawlink = urlparse.urlunparse(t)
504 link = urlparse.urljoin(base, rawlink)
505 infos.append((link, rawlink))
506 return infos
507
508
509class MyStringIO(StringIO.StringIO):
510
511 def __init__(self, url, info):
512 self.__url = url
513 self.__info = info
514 StringIO.StringIO.__init__(self)
515
516 def info(self):
517 return self.__info
518
519 def geturl(self):
520 return self.__url
521
522
523class MyURLopener(urllib.FancyURLopener):
524
525 http_error_default = urllib.URLopener.http_error_default
526
Guido van Rossumc59a5d41997-01-30 06:04:00 +0000527 def __init__(*args):
528 self = args[0]
529 apply(urllib.FancyURLopener.__init__, args)
530 self.addheaders = [('User-agent', 'Python-webchecker/%s' % __version__)]
531
Guido van Rossum272b37d1997-01-30 02:44:48 +0000532 def open_file(self, url):
533 path = urllib.url2pathname(urllib.unquote(url))
534 if path[-1] != os.sep:
535 url = url + '/'
536 if os.path.isdir(path):
537 indexpath = os.path.join(path, "index.html")
538 if os.path.exists(indexpath):
539 return self.open_file(url + "index.html")
540 try:
541 names = os.listdir(path)
542 except os.error, msg:
543 raise IOError, msg, sys.exc_traceback
544 names.sort()
545 s = MyStringIO("file:"+url, {'content-type': 'text/html'})
546 s.write('<BASE HREF="file:%s">\n' %
547 urllib.quote(os.path.join(path, "")))
548 for name in names:
549 q = urllib.quote(name)
550 s.write('<A HREF="%s">%s</A>\n' % (q, q))
551 s.seek(0)
552 return s
553 return urllib.FancyURLopener.open_file(self, path)
554
555
Guido van Rossume5605ba1997-01-31 14:43:15 +0000556class MyHTMLParser(sgmllib.SGMLParser):
Guido van Rossum272b37d1997-01-30 02:44:48 +0000557
Guido van Rossume5605ba1997-01-31 14:43:15 +0000558 def __init__(self):
Guido van Rossum272b37d1997-01-30 02:44:48 +0000559 self.base = None
Guido van Rossume5605ba1997-01-31 14:43:15 +0000560 self.links = {}
561 sgmllib.SGMLParser.__init__ (self)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000562
563 def start_a(self, attributes):
564 for name, value in attributes:
Guido van Rossume5605ba1997-01-31 14:43:15 +0000565 if name == 'href':
566 if value: value = string.strip(value)
567 if value: self.links[value] = None
568 return # match only first href
Guido van Rossum272b37d1997-01-30 02:44:48 +0000569
570 def do_base(self, attributes):
571 for name, value in attributes:
Guido van Rossume5605ba1997-01-31 14:43:15 +0000572 if name == 'href':
573 if value: value = string.strip(value)
574 if value:
575 if verbose > 1:
576 print " Base", value
577 self.base = value
578 return # match only first href
Guido van Rossum272b37d1997-01-30 02:44:48 +0000579
580 def getlinks(self):
Guido van Rossume5605ba1997-01-31 14:43:15 +0000581 return self.links.keys()
Guido van Rossum272b37d1997-01-30 02:44:48 +0000582
583 def getbase(self):
584 return self.base
585
586
587def show(p1, link, p2, origins):
588 print p1, link
589 i = 0
590 for source, rawlink in origins:
591 i = i+1
592 if i == 2:
593 p2 = ' '*len(p2)
594 print p2, source,
595 if rawlink != link: print "(%s)" % rawlink,
596 print
597
598
Guido van Rossum2739cd71997-01-30 04:26:57 +0000599def sanitize(msg):
600 if (type(msg) == TupleType and
601 len(msg) >= 4 and
602 msg[0] == 'http error' and
603 type(msg[3]) == InstanceType):
604 # Remove the Message instance -- it may contain
605 # a file object which prevents pickling.
606 msg = msg[:3] + msg[4:]
607 return msg
608
609
Guido van Rossume5605ba1997-01-31 14:43:15 +0000610def safeclose(f):
611 url = f.geturl()
612 if url[:4] == 'ftp:' or url[:7] == 'file://':
613 # Apparently ftp connections don't like to be closed
614 # prematurely...
615 text = f.read()
616 f.close()
617
618
Guido van Rossum272b37d1997-01-30 02:44:48 +0000619if __name__ == '__main__':
620 main()