#! /usr/bin/env python

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors.  A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.

File URL extension:

To ease the checking of subwebs via the local file system, the
interpretation of ``file:'' URLs is extended to mimic the behavior of
your average HTTP daemon: if a directory pathname is given, the file
index.html in that directory is returned if it exists, otherwise a
directory listing is returned.  Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked.  In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).

Report printed:

When done, it reports pages with bad links within the subweb.  When
interrupted, it reports on the pages that it has already checked.

In verbose mode, additional messages are printed during the
information gathering phase.  By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered.  Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file and the -R option allows it to restart from the
checkpoint (assuming that the pages on the subweb that were already
processed haven't changed).  Even when it has run till completion, -R
can still be useful -- it will print the reports again, and -Rq prints
the errors only.  In this case, the checkpoint file is not written
again.  The checkpoint file can be set with the -d option.

The checkpoint file is written as a Python pickle.  Remember that
Python's pickle module is currently quite slow.  Give it the time it
needs to load and save the checkpoint file.  When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.

Miscellaneous:

- You may find the (Tk-based) GUI version easier to use.  See wcgui.py.

- Webchecker honors the "robots.txt" convention.  Thanks to Skip
Montanaro for his robotparser.py module (included in this directory)!
The agent name is hardwired to "webchecker".  URLs that are disallowed
by the robots.txt file are reported as external URLs.

- Because the SGML parser is a bit slow, very large SGML files are
skipped.  The size limit can be set with the -m option.

- When the server or protocol does not tell us a file's type, we guess
it based on the URL's suffix.  The mimetypes.py module (also in this
directory) has a built-in table mapping most currently known suffixes,
and in addition attempts to read the mime.types configuration files in
the default locations of Netscape and the NCSA HTTP daemon.

- We follow links indicated by <A>, <AREA>, <FRAME> and <IMG> tags.
We also honor the <BASE> tag.

- Checking external links is now done by default; use -x to *disable*
this feature.  External links are now checked during normal
processing.  (XXX The status of a checked link could be categorized
better.  Later...)


Usage: webchecker.py [option] ... [rooturl] ...

Options:

-R        -- restart from checkpoint file
-d file   -- checkpoint filename (default %(DUMPFILE)s)
-m bytes  -- skip HTML pages larger than this size (default %(MAXPAGE)d)
-n        -- reports only, no checking (use with -R)
-q        -- quiet operation (also suppresses external links report)
-r number -- number of links processed per round (default %(ROUNDSIZE)d)
-v        -- verbose operation; repeating -v will increase verbosity
-x        -- don't check external links (these are often slow to check)

Arguments:

rooturl   -- URL to start checking
             (default %(DEFROOT)s)

"""


__version__ = "$Revision$"


import sys
import os
from types import *
import string
import StringIO
import getopt
import pickle

import urllib
import urlparse
import sgmllib

import mimetypes
import robotparser

# Extract real version number if necessary
if __version__[0] == '$':
    _v = string.split(__version__)
    if len(_v) == 3:
        __version__ = _v[1]


# Tunable parameters
DEFROOT = "file:/usr/local/etc/httpd/htdocs/"   # Default root URL
CHECKEXT = 1                            # Check external references (1 deep)
VERBOSE = 1                             # Verbosity level (0-3)
MAXPAGE = 150000                        # Ignore files bigger than this
ROUNDSIZE = 50                          # Number of links processed per round
DUMPFILE = "@webchecker.pickle"         # Pickled checkpoint
AGENTNAME = "webchecker"                # Agent name for robots.txt parser


# Global variables


def main():
    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    dumpfile = DUMPFILE
    restart = 0
    norun = 0

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:vx')
    except getopt.error, msg:
        sys.stdout = sys.stderr
        print msg
        print __doc__%globals()
        sys.exit(2)
    for o, a in opts:
        if o == '-R':
            restart = 1
        if o == '-d':
            dumpfile = a
        if o == '-m':
            maxpage = string.atoi(a)
        if o == '-n':
            norun = 1
        if o == '-q':
            verbose = 0
        if o == '-r':
            roundsize = string.atoi(a)
        if o == '-v':
            verbose = verbose + 1
        if o == '-x':
            checkext = not checkext

    if verbose > 0:
        print AGENTNAME, "version", __version__

    if restart:
        c = load_pickle(dumpfile=dumpfile, verbose=verbose)
    else:
        c = Checker()

    c.setflags(checkext=checkext, verbose=verbose,
               maxpage=maxpage, roundsize=roundsize)

    if not restart and not args:
        args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    try:

        if not norun:
            try:
                c.run()
            except KeyboardInterrupt:
                if verbose > 0:
                    print "[run interrupted]"

        try:
            c.report()
        except KeyboardInterrupt:
            if verbose > 0:
                print "[report interrupted]"

    finally:
        if c.save_pickle(dumpfile):
            if dumpfile == DUMPFILE:
                print "Use ``%s -R'' to restart." % sys.argv[0]
            else:
                print "Use ``%s -R -d %s'' to restart." % (sys.argv[0],
                                                           dumpfile)


def load_pickle(dumpfile=DUMPFILE, verbose=VERBOSE):
    if verbose > 0:
        print "Loading checkpoint from %s ..." % dumpfile
    f = open(dumpfile, "rb")
    c = pickle.load(f)
    f.close()
    if verbose > 0:
        print "Done."
        print "Root:", string.join(c.roots, "\n ")
    return c


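# Illustrative sketch, not part of the original webchecker: the
# checkpoint handling described in the module docstring (write to a
# ".new" file, then rename it over the old checkpoint) can be tried on
# its own with any picklable object.  The sketch_* names are
# hypothetical, and unlike load_pickle() and Checker.save_pickle() the
# helpers print nothing and ignore the "changed" flag.

import os
import pickle


def sketch_save_checkpoint(state, dumpfile):
    # Dump to a temporary name first, so an interrupted dump leaves
    # the previous checkpoint file untouched.
    newfile = dumpfile + ".new"
    f = open(newfile, "wb")
    try:
        pickle.dump(state, f)
    finally:
        f.close()
    # Remove the old checkpoint (if any) before renaming, as
    # save_pickle() does.
    try:
        os.unlink(dumpfile)
    except os.error:
        pass
    os.rename(newfile, dumpfile)


def sketch_load_checkpoint(dumpfile):
    f = open(dumpfile, "rb")
    try:
        return pickle.load(f)
    finally:
        f.close()

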
class Checker:

    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE

    validflags = tuple(dir())

    def __init__(self):
        self.reset()

    def setflags(self, **kw):
        for key in kw.keys():
            if key not in self.validflags:
                raise NameError, "invalid keyword argument: %s" % str(key)
        for key, value in kw.items():
            setattr(self, key, value)

    def reset(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.bad = {}
        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.errors = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def __getstate__(self):
        return (self.roots, self.todo, self.done, self.bad, self.round)

    def __setstate__(self, state):
        self.reset()
        (self.roots, self.todo, self.done, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)
        for url in self.bad.keys():
            self.markerror(url)

    def addroot(self, root):
        if root not in self.roots:
            troot = root
            scheme, netloc, path, params, query, fragment = \
                    urlparse.urlparse(root)
            i = string.rfind(path, "/") + 1
            if 0 < i < len(path):
                path = path[:i]
                troot = urlparse.urlunparse((scheme, netloc, path,
                                             params, query, fragment))
            self.roots.append(troot)
            self.addrobot(root)
            self.newlink(root, ("<root>", root))

    def addrobot(self, root):
        root = urlparse.urljoin(root, "/")
        if self.robots.has_key(root): return
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        if self.verbose > 2:
            print "Parsing", url
            rp.debug = self.verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except IOError, msg:
            if self.verbose > 1:
                print "I/O error parsing", url, ":", msg

    def run(self):
        while self.todo:
            self.round = self.round + 1
            if self.verbose > 0:
                print
                print "Round %d (%s)" % (self.round, self.status())
                print
            urls = self.todo.keys()
            urls.sort()
            del urls[self.roundsize:]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "%d total, %d to do, %d done, %d bad" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.bad))

    def report(self):
        print
        if not self.todo: print "Final",
        else: print "Interim",
        print "Report (%s)" % self.status()
        self.report_errors()

    def report_errors(self):
        if not self.bad:
            print
            print "No errors"
            return
        print
        print "Error Report:"
        sources = self.errors.keys()
        sources.sort()
        for source in sources:
            triples = self.errors[source]
            print
            if len(triples) > 1:
                print len(triples), "Errors in", source
            else:
                print "Error in", source
            for url, rawlink, msg in triples:
                print " HREF", url,
                if rawlink != url: print "(%s)" % rawlink,
                print
                print " msg", msg

    def dopage(self, url):
        if self.verbose > 1:
            if self.verbose > 2:
                self.show("Check ", url, " from", self.todo[url])
            else:
                print "Check ", url
        page = self.getpage(url)
        if page:
            for info in page.getlinkinfos():
                link, rawlink = info
                origin = url, rawlink
                self.newlink(link, origin)
        self.markdone(url)

    def newlink(self, url, origin):
        if self.done.has_key(url):
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        self.done[url].append(origin)
        if self.verbose > 3:
            print " Done link", url

    def newtodolink(self, url, origin):
        if self.todo.has_key(url):
            self.todo[url].append(origin)
            if self.verbose > 3:
                print " Seen todo link", url
        else:
            self.todo[url] = [origin]
            if self.verbose > 3:
                print " New todo link", url

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

    def inroots(self, url):
        for root in self.roots:
            if url[:len(root)] == root:
                root = urlparse.urljoin(root, "/")
                return self.robots[root].can_fetch(AGENTNAME, url)
        return 0

    def getpage(self, url):
        if url[:7] == 'mailto:' or url[:5] == 'news:':
            if self.verbose > 1: print " Not checking mailto/news URL"
            return None
        isint = self.inroots(url)
        if not isint:
            if not self.checkext:
                if self.verbose > 1: print " Not checking ext link"
                return None
            f = self.openpage(url)
            if f:
                self.safeclose(f)
            return None
        text, nurl = self.readhtml(url)
        if nurl != url:
            if self.verbose > 1:
                print " Redirected to", nurl
            url = nurl
        if text:
            return Page(text, url, verbose=self.verbose, maxpage=self.maxpage)

    def readhtml(self, url):
        text = None
        f, url = self.openhtml(url)
        if f:
            text = f.read()
            f.close()
        return text, url

    def openhtml(self, url):
        f = self.openpage(url)
        if f:
            url = f.geturl()
            info = f.info()
            if not self.checkforhtml(info, url):
                self.safeclose(f)
                f = None
        return f, url

    def openpage(self, url):
        try:
            return self.urlopener.open(url)
        except IOError, msg:
            msg = self.sanitize(msg)
            if self.verbose > 0:
                print "Error ", msg
            if self.verbose > 0:
                self.show(" HREF ", url, " from", self.todo[url])
            self.setbad(url, msg)
            return None

    def checkforhtml(self, info, url):
        if info.has_key('content-type'):
            ctype = string.lower(info['content-type'])
        else:
            if url[-1:] == "/":
                return 1
            ctype, encoding = mimetypes.guess_type(url)
        if ctype == 'text/html':
            return 1
        else:
            if self.verbose > 1:
                print " Not HTML, mime type", ctype
            return 0

    def setgood(self, url):
        if self.bad.has_key(url):
            del self.bad[url]
            self.changed = 1
            if self.verbose > 0:
                print "(Clear previously seen error)"

    def setbad(self, url, msg):
        if self.bad.has_key(url) and self.bad[url] == msg:
            if self.verbose > 0:
                print "(Seen this error before)"
            return
        self.bad[url] = msg
        self.changed = 1
        self.markerror(url)

    def markerror(self, url):
        try:
            origins = self.todo[url]
        except KeyError:
            origins = self.done[url]
        for source, rawlink in origins:
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def seterror(self, url, triple):
        try:
            self.errors[url].append(triple)
        except KeyError:
            self.errors[url] = [triple]

    # The following used to be toplevel functions; they have been
    # changed into methods so they can be overridden in subclasses.

    def show(self, p1, link, p2, origins):
        print p1, link
        i = 0
        for source, rawlink in origins:
            i = i+1
            if i == 2:
                p2 = ' '*len(p2)
            print p2, source,
            if rawlink != link: print "(%s)" % rawlink,
            print

    def sanitize(self, msg):
        if isinstance(IOError, ClassType) and isinstance(msg, IOError):
            # Do the other branch recursively
            msg.args = self.sanitize(msg.args)
        elif isinstance(msg, TupleType):
            if len(msg) >= 4 and msg[0] == 'http error' and \
               isinstance(msg[3], InstanceType):
                # Remove the Message instance -- it may contain
                # a file object which prevents pickling.
                msg = msg[:3] + msg[4:]
        return msg

    def safeclose(self, f):
        try:
            url = f.geturl()
        except AttributeError:
            pass
        else:
            if url[:4] == 'ftp:' or url[:7] == 'file://':
                # Apparently ftp connections don't like to be closed
                # prematurely...
                text = f.read()
        f.close()

    def save_pickle(self, dumpfile=DUMPFILE):
        if not self.changed:
            if self.verbose > 0:
                print
                print "No need to save checkpoint"
        elif not dumpfile:
            if self.verbose > 0:
                print "No dumpfile, won't save checkpoint"
        else:
            if self.verbose > 0:
                print
                print "Saving checkpoint to %s ..." % dumpfile
            newfile = dumpfile + ".new"
            f = open(newfile, "wb")
            pickle.dump(self, f)
            f.close()
            try:
                os.unlink(dumpfile)
            except os.error:
                pass
            os.rename(newfile, dumpfile)
            if self.verbose > 0:
                print "Done."
            return 1


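# Illustrative sketch, not part of the original webchecker: the subweb
# rule from the module docstring ("a page belongs to the subweb if one
# of the root URLs is an initial prefix of it") reduced to two plain
# string helpers.  The sketch_* names are hypothetical, the code
# assumes string methods (so a newer Python than string.rfind() was
# written for), and the robots.txt test that Checker.inroots() also
# applies is deliberately left out.

def sketch_trim_root(path):
    # Cut a URL path back to its directory part, as addroot() does
    # before storing a root.
    i = path.rfind("/") + 1
    if 0 < i < len(path):
        return path[:i]
    return path


def sketch_in_subweb(url, roots):
    # Return 1 if any root is an initial prefix of url, else 0.
    for root in roots:
        if url[:len(root)] == root:
            return 1
    return 0

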
class Page:

    def __init__(self, text, url, verbose=VERBOSE, maxpage=MAXPAGE):
        self.text = text
        self.url = url
        self.verbose = verbose
        self.maxpage = maxpage

    def getlinkinfos(self):
        size = len(self.text)
        if size > self.maxpage:
            if self.verbose > 0:
                print "Skip huge file", self.url
                print " (%.0f Kbytes)" % (size*0.001)
            return []
        if self.verbose > 2:
            print " Parsing", self.url, "(%d bytes)" % size
        parser = MyHTMLParser(verbose=self.verbose)
        parser.feed(self.text)
        parser.close()
        rawlinks = parser.getlinks()
        base = urlparse.urljoin(self.url, parser.getbase() or "")
        infos = []
        for rawlink in rawlinks:
            t = urlparse.urlparse(rawlink)
            t = t[:-1] + ('',)
            rawlink = urlparse.urlunparse(t)
            link = urlparse.urljoin(base, rawlink)
            infos.append((link, rawlink))
        return infos


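# Illustrative sketch, not part of the original webchecker: the link
# normalization inside Page.getlinkinfos() (drop the #fragment, then
# resolve against the page's base URL) as a standalone function.  The
# names sketch_urls and sketch_resolve_link are hypothetical; the
# module is imported under an alias so that the urlparse module used
# by the rest of this file is not rebound.

try:
    import urlparse as sketch_urls          # older Python module name
except ImportError:
    import urllib.parse as sketch_urls      # Python 3 module name


def sketch_resolve_link(base, rawlink):
    # Blank out the last component (the fragment) of the parsed link.
    t = sketch_urls.urlparse(rawlink)
    t = t[:-1] + ('',)
    rawlink = sketch_urls.urlunparse(t)
    # Resolve the cleaned link relative to the base URL.
    link = sketch_urls.urljoin(base, rawlink)
    return link, rawlink

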
class MyStringIO(StringIO.StringIO):

    def __init__(self, url, info):
        self.__url = url
        self.__info = info
        StringIO.StringIO.__init__(self)

    def info(self):
        return self.__info

    def geturl(self):
        return self.__url


class MyURLopener(urllib.FancyURLopener):

    http_error_default = urllib.URLopener.http_error_default

    def __init__(*args):
        self = args[0]
        apply(urllib.FancyURLopener.__init__, args)
        self.addheaders = [
            ('User-agent', 'Python-webchecker/%s' % __version__),
            ]

    def http_error_401(self, url, fp, errcode, errmsg, headers):
        return None

    def open_file(self, url):
        path = urllib.url2pathname(urllib.unquote(url))
        if path[-1] != os.sep:
            url = url + '/'
        if os.path.isdir(path):
            indexpath = os.path.join(path, "index.html")
            if os.path.exists(indexpath):
                return self.open_file(url + "index.html")
            try:
                names = os.listdir(path)
            except os.error, msg:
                raise IOError, msg, sys.exc_traceback
            names.sort()
            s = MyStringIO("file:"+url, {'content-type': 'text/html'})
            s.write('<BASE HREF="file:%s">\n' %
                    urllib.quote(os.path.join(path, "")))
            for name in names:
                q = urllib.quote(name)
                s.write('<A HREF="%s">%s</A>\n' % (q, q))
            s.seek(0)
            return s
        return urllib.FancyURLopener.open_file(self, path)


class MyHTMLParser(sgmllib.SGMLParser):

    def __init__(self, verbose=VERBOSE):
        self.base = None
        self.links = {}
        self.myverbose = verbose
        sgmllib.SGMLParser.__init__(self)

    def start_a(self, attributes):
        self.link_attr(attributes, 'href')

    def end_a(self): pass

    def do_area(self, attributes):
        self.link_attr(attributes, 'href')

    def do_img(self, attributes):
        self.link_attr(attributes, 'src', 'lowsrc')

    def do_frame(self, attributes):
        self.link_attr(attributes, 'src')

    def link_attr(self, attributes, *args):
        for name, value in attributes:
            if name in args:
                if value: value = string.strip(value)
                if value: self.links[value] = None

    def do_base(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = string.strip(value)
                if value:
                    if self.myverbose > 1:
                        print " Base", value
                    self.base = value

    def getlinks(self):
        return self.links.keys()

    def getbase(self):
        return self.base


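# Illustrative sketch, not part of the original webchecker: the link
# collection done by MyHTMLParser, rebuilt on the standard HTMLParser
# class instead of sgmllib (a swapped-in parser, chosen only so that
# the sketch also runs where sgmllib is unavailable).  All Sketch* and
# sketch_* names are hypothetical, and the <BASE> handling is omitted.

try:
    from HTMLParser import HTMLParser as SketchBaseParser      # older name
except ImportError:
    from html.parser import HTMLParser as SketchBaseParser     # Python 3


class SketchLinkParser(SketchBaseParser):

    # Tag name -> attributes that hold a link, mirroring start_a(),
    # do_area(), do_img() and do_frame() above.
    link_attrs = {'a': ('href',), 'area': ('href',),
                  'img': ('src', 'lowsrc'), 'frame': ('src',)}

    def __init__(self):
        SketchBaseParser.__init__(self)
        self.links = {}

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.link_attrs.get(tag, ()):
                if value: value = value.strip()
                # Dictionary keys make repeated links collapse to one.
                if value: self.links[value] = None


def sketch_getlinks(text):
    parser = SketchLinkParser()
    parser.feed(text)
    parser.close()
    links = list(parser.links.keys())
    links.sort()
    return links

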
if __name__ == '__main__':
    main()