#! /usr/bin/env python

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors. A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.

File URL extension:

In order to ease the checking of subwebs via the local file system,
the interpretation of ``file:'' URLs is extended to mimic the behavior
of your average HTTP daemon: if a directory pathname is given, the
file index.html in that directory is returned if it exists, otherwise
a directory listing is returned. Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked. In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).

Report printed:

When done, it reports pages with bad links within the subweb. When
interrupted, it reports on the pages that it has checked so far.

In verbose mode, additional messages are printed during the
information gathering phase. By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered. Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file and the -R option allows it to restart from the
checkpoint (assuming that the pages on the subweb that were already
processed haven't changed). Even when it has run till completion, -R
can still be useful -- it will print the reports again, and -Rq prints
the errors only. In this case, the checkpoint file is not written
again. The checkpoint file can be set with the -d option.

The checkpoint file is written as a Python pickle. Remember that
Python's pickle module is currently quite slow. Give it the time it
needs to load and save the checkpoint file. When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.

Miscellaneous:

- You may find the (Tk-based) GUI version easier to use. See wcgui.py.

- Webchecker honors the "robots.txt" convention. Thanks to Skip
Montanaro for his robotparser.py module (included in this directory)!
The agent name is hardwired to "webchecker". URLs that are disallowed
by the robots.txt file are reported as external URLs.

- Because the SGML parser is a bit slow, very large SGML files are
skipped. The size limit can be set with the -m option.

- When the server or protocol does not tell us a file's type, we guess
it based on the URL's suffix. The mimetypes.py module (also in this
directory) has a built-in table mapping most currently known suffixes,
and in addition attempts to read the mime.types configuration files in
the default locations of Netscape and the NCSA HTTP daemon.

- We follow links indicated by <A>, <AREA>, <FRAME> and <IMG> tags.
We also honor the <BASE> tag.

- Checking external links is now done by default; use -x to *disable*
this feature. External links are now checked during normal
processing. (XXX The status of a checked link could be categorized
better. Later...)


Usage: webchecker.py [option] ... [rooturl] ...

Options:

-R        -- restart from checkpoint file
-d file   -- checkpoint filename (default %(DUMPFILE)s)
-m bytes  -- skip HTML pages larger than this size (default %(MAXPAGE)d)
-n        -- reports only, no checking (use with -R)
-q        -- quiet operation (also suppresses external links report)
-r number -- number of links processed per round (default %(ROUNDSIZE)d)
-v        -- verbose operation; repeating -v will increase verbosity
-x        -- don't check external links (these are often slow to check)

Arguments:

rooturl   -- URL to start checking
             (default %(DEFROOT)s)

"""
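# For example (hypothetical site and paths; every option used here is
# described in the usage message above):
#
#   webchecker.py http://www.example.com/docs/
#   webchecker.py -m 500000 -r 20 file:/usr/local/etc/httpd/htdocs/
#   webchecker.py -Rq     # reprint only the errors from the previous run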


__version__ = "0.5"


import sys
import os
from types import *
import string
import StringIO
import getopt
import pickle

import urllib
import urlparse
import sgmllib

import mimetypes
import robotparser


# Tunable parameters
DEFROOT = "file:/usr/local/etc/httpd/htdocs/"   # Default root URL
MAXPAGE = 150000                # Ignore files bigger than this
ROUNDSIZE = 50                  # Number of links processed per round
DUMPFILE = "@webchecker.pickle" # Pickled checkpoint
AGENTNAME = "webchecker"        # Agent name for robots.txt parser


# Global variables
verbose = 1
maxpage = MAXPAGE
roundsize = ROUNDSIZE


def main():
    global verbose, maxpage, roundsize
    dumpfile = DUMPFILE
    restart = 0
    checkext = 1
    norun = 0

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:vx')
    except getopt.error, msg:
        sys.stdout = sys.stderr
        print msg
        print __doc__ % globals()
        sys.exit(2)
    for o, a in opts:
        if o == '-R':
            restart = 1
        if o == '-d':
            dumpfile = a
        if o == '-m':
            maxpage = string.atoi(a)
        if o == '-n':
            norun = 1
        if o == '-q':
            verbose = 0
        if o == '-r':
            roundsize = string.atoi(a)
        if o == '-v':
            verbose = verbose + 1
        if o == '-x':
            checkext = not checkext

    if verbose > 0:
        print AGENTNAME, "version", __version__

    if restart:
        if verbose > 0:
            print "Loading checkpoint from %s ..." % dumpfile
        f = open(dumpfile, "rb")
        c = pickle.load(f)
        f.close()
        if verbose > 0:
            print "Done."
            print "Root:", string.join(c.roots, "\n      ")
    else:
        c = Checker(checkext)
    if not args:
        args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    if not norun:
        try:
            c.run()
        except KeyboardInterrupt:
            if verbose > 0:
                print "[run interrupted]"

    try:
        c.report()
    except KeyboardInterrupt:
        if verbose > 0:
            print "[report interrupted]"

    if not c.changed:
        if verbose > 0:
            print
            print "No need to save checkpoint"
    elif not dumpfile:
        if verbose > 0:
            print "No dumpfile, won't save checkpoint"
    else:
        if verbose > 0:
            print
            print "Saving checkpoint to %s ..." % dumpfile
        newfile = dumpfile + ".new"
        f = open(newfile, "wb")
        pickle.dump(c, f)
        f.close()
        try:
            os.unlink(dumpfile)
        except os.error:
            pass
        os.rename(newfile, dumpfile)
        if verbose > 0:
            print "Done."
            if dumpfile == DUMPFILE:
                print "Use ``%s -R'' to restart." % sys.argv[0]
            else:
                print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
                                                           dumpfile)


class Checker:

    def __init__(self, checkext=1):
        self.reset()
        self.checkext = checkext

    def reset(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.bad = {}
        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.errors = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def __getstate__(self):
        return (self.roots, self.todo, self.done, self.bad, self.round)

    def __setstate__(self, state):
        (self.roots, self.todo, self.done, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)
        for url in self.bad.keys():
            self.markerror(url)

    def addroot(self, root):
        if root not in self.roots:
            troot = root
            scheme, netloc, path, params, query, fragment = \
                    urlparse.urlparse(root)
            i = string.rfind(path, "/") + 1
            if 0 < i < len(path):
                path = path[:i]
                troot = urlparse.urlunparse((scheme, netloc, path,
                                             params, query, fragment))
            self.roots.append(troot)
            self.addrobot(root)
            self.newlink(root, ("<root>", root))

    def addrobot(self, root):
        root = urlparse.urljoin(root, "/")
        if self.robots.has_key(root): return
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        if verbose > 2:
            print "Parsing", url
        rp.debug = verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except IOError, msg:
            if verbose > 1:
                print "I/O error parsing", url, ":", msg

    def run(self):
        while self.todo:
            self.round = self.round + 1
            if verbose > 0:
                print
                print "Round %d (%s)" % (self.round, self.status())
                print
            urls = self.todo.keys()[:roundsize]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "%d total, %d to do, %d done, %d bad" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.bad))

    def report(self):
        print
        if not self.todo: print "Final",
        else: print "Interim",
        print "Report (%s)" % self.status()
        self.report_errors()

    def report_errors(self):
        if not self.bad:
            print
            print "No errors"
            return
        print
        print "Error Report:"
        sources = self.errors.keys()
        sources.sort()
        for source in sources:
            triples = self.errors[source]
            print
            if len(triples) > 1:
                print len(triples), "Errors in", source
            else:
                print "Error in", source
            for url, rawlink, msg in triples:
                print "  HREF", url,
                if rawlink != url: print "(%s)" % rawlink,
                print
                print "   msg", msg

    def dopage(self, url):
        if verbose > 1:
            if verbose > 2:
                show("Check ", url, "  from", self.todo[url])
            else:
                print "Check ", url
        page = self.getpage(url)
        if page:
            for info in page.getlinkinfos():
                link, rawlink = info
                origin = url, rawlink
                self.newlink(link, origin)
        self.markdone(url)

    def newlink(self, url, origin):
        if self.done.has_key(url):
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        self.done[url].append(origin)
        if verbose > 3:
            print "  Done link", url

    def newtodolink(self, url, origin):
        if self.todo.has_key(url):
            self.todo[url].append(origin)
            if verbose > 3:
                print "  Seen todo link", url
        else:
            self.todo[url] = [origin]
            if verbose > 3:
                print "  New todo link", url

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

    def inroots(self, url):
        for root in self.roots:
            if url[:len(root)] == root:
                root = urlparse.urljoin(root, "/")
                return self.robots[root].can_fetch(AGENTNAME, url)
        return 0

    def getpage(self, url):
        if url[:7] == 'mailto:' or url[:5] == 'news:':
            if verbose > 1: print " Not checking mailto/news URL"
            return None
        isint = self.inroots(url)
        if not isint and not self.checkext:
            if verbose > 1: print " Not checking ext link"
            return None
        try:
            f = self.urlopener.open(url)
        except IOError, msg:
            msg = sanitize(msg)
            if verbose > 0:
                print "Error ", msg
            if verbose > 0:
                show(" HREF ", url, "  from", self.todo[url])
            self.setbad(url, msg)
            return None
        if not isint:
            if verbose > 1: print " Not gathering links from ext URL"
            safeclose(f)
            return None
        nurl = f.geturl()
        info = f.info()
        if info.has_key('content-type'):
            ctype = string.lower(info['content-type'])
        else:
            ctype = None
        if nurl != url:
            if verbose > 1:
                print " Redirected to", nurl
        if not ctype:
            ctype, encoding = mimetypes.guess_type(nurl)
        if ctype != 'text/html':
            safeclose(f)
            if verbose > 1:
                print " Not HTML, mime type", ctype
            return None
        text = f.read()
        f.close()
        return Page(text, nurl)

    def setgood(self, url):
        if self.bad.has_key(url):
            del self.bad[url]
            self.changed = 1
            if verbose > 0:
                print "(Clear previously seen error)"

    def setbad(self, url, msg):
        if self.bad.has_key(url) and self.bad[url] == msg:
            if verbose > 0:
                print "(Seen this error before)"
            return
        self.bad[url] = msg
        self.changed = 1
        self.markerror(url)

    def markerror(self, url):
        try:
            origins = self.todo[url]
        except KeyError:
            origins = self.done[url]
        for source, rawlink in origins:
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def seterror(self, url, triple):
        try:
            self.errors[url].append(triple)
        except KeyError:
            self.errors[url] = [triple]


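# A minimal programmatic-use sketch of the Checker class (hypothetical
# root URL; this mirrors what main() does, minus option parsing and
# checkpointing):
#
#   c = Checker(checkext=0)
#   c.addroot("http://www.example.com/")
#   try:
#       c.run()
#   except KeyboardInterrupt:
#       pass
#   c.report()
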
class Page:

    def __init__(self, text, url):
        self.text = text
        self.url = url

    def getlinkinfos(self):
        size = len(self.text)
        if size > maxpage:
            if verbose > 0:
                print "Skip huge file", self.url
                print "  (%.0f Kbytes)" % (size*0.001)
            return []
        if verbose > 2:
            print "  Parsing", self.url, "(%d bytes)" % size
        parser = MyHTMLParser()
        parser.feed(self.text)
        parser.close()
        rawlinks = parser.getlinks()
        base = urlparse.urljoin(self.url, parser.getbase() or "")
        infos = []
        for rawlink in rawlinks:
            t = urlparse.urlparse(rawlink)
            t = t[:-1] + ('',)
            rawlink = urlparse.urlunparse(t)
            link = urlparse.urljoin(base, rawlink)
            infos.append((link, rawlink))
        return infos


class MyStringIO(StringIO.StringIO):

    def __init__(self, url, info):
        self.__url = url
        self.__info = info
        StringIO.StringIO.__init__(self)

    def info(self):
        return self.__info

    def geturl(self):
        return self.__url


class MyURLopener(urllib.FancyURLopener):

    http_error_default = urllib.URLopener.http_error_default

    def __init__(*args):
        self = args[0]
        apply(urllib.FancyURLopener.__init__, args)
        self.addheaders = [
            ('User-agent', 'Python-webchecker/%s' % __version__),
            ]

    def http_error_401(self, url, fp, errcode, errmsg, headers):
        return None

    def open_file(self, url):
        path = urllib.url2pathname(urllib.unquote(url))
        if path[-1] != os.sep:
            url = url + '/'
        if os.path.isdir(path):
            indexpath = os.path.join(path, "index.html")
            if os.path.exists(indexpath):
                return self.open_file(url + "index.html")
            try:
                names = os.listdir(path)
            except os.error, msg:
                raise IOError, msg, sys.exc_traceback
            names.sort()
            s = MyStringIO("file:"+url, {'content-type': 'text/html'})
            s.write('<BASE HREF="file:%s">\n' %
                    urllib.quote(os.path.join(path, "")))
            for name in names:
                q = urllib.quote(name)
                s.write('<A HREF="%s">%s</A>\n' % (q, q))
            s.seek(0)
            return s
        return urllib.FancyURLopener.open_file(self, path)


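# For a directory such as /usr/local/etc/httpd/htdocs/sub/ (hypothetical)
# containing page1.html and page2.html but no index.html, open_file()
# above synthesizes a listing page roughly like:
#
#   <BASE HREF="file:/usr/local/etc/httpd/htdocs/sub/">
#   <A HREF="page1.html">page1.html</A>
#   <A HREF="page2.html">page2.html</A>
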
class MyHTMLParser(sgmllib.SGMLParser):

    def __init__(self):
        self.base = None
        self.links = {}
        sgmllib.SGMLParser.__init__(self)

    def start_a(self, attributes):
        self.link_attr(attributes, 'href')

    def end_a(self): pass

    def do_area(self, attributes):
        self.link_attr(attributes, 'href')

    def do_img(self, attributes):
        self.link_attr(attributes, 'src', 'lowsrc')

    def do_frame(self, attributes):
        self.link_attr(attributes, 'src')

    def link_attr(self, attributes, *args):
        for name, value in attributes:
            if name in args:
                if value: value = string.strip(value)
                if value: self.links[value] = None

    def do_base(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = string.strip(value)
                if value:
                    if verbose > 1:
                        print "  Base", value
                    self.base = value

    def getlinks(self):
        return self.links.keys()

    def getbase(self):
        return self.base


def show(p1, link, p2, origins):
    print p1, link
    i = 0
    for source, rawlink in origins:
        i = i+1
        if i == 2:
            p2 = ' '*len(p2)
        print p2, source,
        if rawlink != link: print "(%s)" % rawlink,
        print


def sanitize(msg):
    if (type(msg) == TupleType and
        len(msg) >= 4 and
        msg[0] == 'http error' and
        type(msg[3]) == InstanceType):
        # Remove the Message instance -- it may contain
        # a file object which prevents pickling.
        msg = msg[:3] + msg[4:]
    return msg


def safeclose(f):
    try:
        url = f.geturl()
    except AttributeError:
        pass
    else:
        if url[:4] == 'ftp:' or url[:7] == 'file://':
            # Apparently ftp connections don't like to be closed
            # prematurely...
            text = f.read()
    f.close()


if __name__ == '__main__':
    main()