#! /usr/bin/env python

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors. A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.
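
As a rough illustration of the prefix rule (a minimal sketch in modern
Python 3, separate from this script; the helper name in_subweb and the
example.org URLs are made up for illustration):

```python
# Hypothetical helper, not part of webchecker itself.
def in_subweb(url, roots):
    # A page belongs to the subweb if any root URL is an initial prefix of it.
    return any(url.startswith(root) for root in roots)

roots = ['http://www.example.org/docs/']
print(in_subweb('http://www.example.org/docs/intro.html', roots))
print(in_subweb('http://www.example.org/other.html', roots))
```

The script's own check additionally consults robots.txt before
treating a URL as internal.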

File URL extension:

In order to ease the checking of subwebs via the local file system,
the interpretation of ``file:'' URLs is extended to mimic the behavior
of your average HTTP daemon: if a directory pathname is given, the
file index.html in that directory is returned if it exists, otherwise
a directory listing is returned. Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked. In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).
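
The directory handling described above can be sketched roughly as
follows (modern Python 3, not the script's own code; resolve_directory
is a hypothetical helper that only decides what would be returned and
does not build the listing page):

```python
import os
import tempfile

# Hypothetical helper: decide what a server-like 'file:' lookup returns.
def resolve_directory(path):
    if not os.path.isdir(path):
        return ('file', path)
    index = os.path.join(path, 'index.html')
    if os.path.exists(index):
        # A directory with an index.html is served as that file.
        return ('file', index)
    # Otherwise fall back to a sorted directory listing.
    return ('listing', sorted(os.listdir(path)))

with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, 'a.html'), 'w').close()
    print(resolve_directory(d))
    open(os.path.join(d, 'index.html'), 'w').close()
    print(resolve_directory(d))
```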

Report printed:

When done, it reports pages with bad links within the subweb. When
interrupted, it reports for the pages that it has checked already.

In verbose mode, additional messages are printed during the
information gathering phase. By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered. Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file and the -R option allows it to restart from the
checkpoint (assuming that the pages on the subweb that were already
processed haven't changed). Even when it has run till completion, -R
can still be useful -- it will print the reports again, and -Rq prints
the errors only. In this case, the checkpoint file is not written
again. The checkpoint file can be set with the -d option.

The checkpoint file is written as a Python pickle. Remember that
Python's pickle module is currently quite slow. Give it the time it
needs to load and save the checkpoint file. When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.
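
A minimal sketch of the checkpoint idea in modern Python 3, assuming a
plain dict as the state (the script pickles its Checker object
instead). The helper names are hypothetical, and os.replace is used
here where the script itself unlinks the old file and renames the new
one:

```python
import os
import pickle
import tempfile

# Hypothetical helpers; the names are illustrative only.
def save_checkpoint(state, dumpfile):
    newfile = dumpfile + '.new'
    with open(newfile, 'wb') as f:
        pickle.dump(state, f)
    # The old checkpoint stays intact until the new one is fully written.
    os.replace(newfile, dumpfile)

def load_checkpoint(dumpfile):
    with open(dumpfile, 'rb') as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'checkpoint.pickle')
    save_checkpoint({'todo': ['http://www.example.org/'], 'done': []}, path)
    print(load_checkpoint(path))
```

Writing to a separate file first is what keeps an interrupted save from
damaging the previous checkpoint.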

Miscellaneous:

- You may find the (Tk-based) GUI version easier to use. See wcgui.py.

- Webchecker honors the "robots.txt" convention. Thanks to Skip
Montanaro for his robotparser.py module (included in this directory)!
The agent name is hardwired to "webchecker". URLs that are disallowed
by the robots.txt file are reported as external URLs.

- Because the SGML parser is a bit slow, very large SGML files are
skipped. The size limit can be set with the -m option.

- When the server or protocol does not tell us a file's type, we guess
it based on the URL's suffix. The mimetypes.py module (also in this
directory) has a built-in table mapping most currently known suffixes,
and in addition attempts to read the mime.types configuration files in
the default locations of Netscape and the NCSA HTTP daemon.

- We follow links indicated by <A>, <AREA>, <FRAME> and <IMG> tags. We
also honor the <BASE> tag.

- Checking external links is now done by default; use -x to *disable*
this feature. External links are now checked during normal
processing. (XXX The status of a checked link could be categorized
better. Later...)
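
The link-following rule above can be sketched as follows in modern
Python 3. This is an illustration only: it uses html.parser where the
script uses sgmllib, it also covers <AREA> (which the script's parser
follows as well), and the names LinkCollector and extract_links are
hypothetical:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical class; maps each followed tag to its link attributes.
class LinkCollector(HTMLParser):
    LINK_ATTRS = {'a': ('href',), 'area': ('href',),
                  'frame': ('src',), 'img': ('src', 'lowsrc')}

    def __init__(self):
        super().__init__()
        self.base = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            value = value.strip()
            if tag == 'base' and name == 'href':
                # Remember the <BASE> URL for resolving relative links.
                self.base = value
            elif name in self.LINK_ATTRS.get(tag, ()) and value not in self.links:
                self.links.append(value)

def extract_links(text, page_url):
    parser = LinkCollector()
    parser.feed(text)
    parser.close()
    # Relative links resolve against <BASE> if present, else the page URL.
    base = urljoin(page_url, parser.base or '')
    return [urljoin(base, link) for link in parser.links]

html = '<base href="http://www.example.org/docs/"><a href="a.html">A</a><img src="/logo.gif">'
print(extract_links(html, 'http://www.example.org/index.html'))
```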


Usage: webchecker.py [option] ... [rooturl] ...

Options:

-R        -- restart from checkpoint file
-d file   -- checkpoint filename (default %(DUMPFILE)s)
-m bytes  -- skip HTML pages larger than this size (default %(MAXPAGE)d)
-n        -- reports only, no checking (use with -R)
-q        -- quiet operation (also suppresses external links report)
-r number -- number of links processed per round (default %(ROUNDSIZE)d)
-v        -- verbose operation; repeating -v will increase verbosity
-x        -- don't check external links (these are often slow to check)

Arguments:

rooturl   -- URL to start checking
             (default %(DEFROOT)s)

"""


__version__ = "$Revision$"


import sys
import os
from types import *
import string
import StringIO
import getopt
import pickle

import urllib
import urlparse
import sgmllib

import mimetypes
import robotparser

# Extract real version number if necessary
if __version__[0] == '$':
    _v = string.split(__version__)
    if len(_v) == 3:
        __version__ = _v[1]


# Tunable parameters
DEFROOT = "file:/usr/local/etc/httpd/htdocs/"   # Default root URL
CHECKEXT = 1                            # Check external references (1 deep)
VERBOSE = 1                             # Verbosity level (0-3)
MAXPAGE = 150000                        # Ignore files bigger than this
ROUNDSIZE = 50                          # Number of links processed per round
DUMPFILE = "@webchecker.pickle"         # Pickled checkpoint
AGENTNAME = "webchecker"                # Agent name for robots.txt parser


# Global variables


def main():
    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    dumpfile = DUMPFILE
    restart = 0
    norun = 0

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:vx')
    except getopt.error, msg:
        sys.stdout = sys.stderr
        print msg
        print __doc__%globals()
        sys.exit(2)
    for o, a in opts:
        if o == '-R':
            restart = 1
        if o == '-d':
            dumpfile = a
        if o == '-m':
            maxpage = string.atoi(a)
        if o == '-n':
            norun = 1
        if o == '-q':
            verbose = 0
        if o == '-r':
            roundsize = string.atoi(a)
        if o == '-v':
            verbose = verbose + 1
        if o == '-x':
            checkext = not checkext

    if verbose > 0:
        print AGENTNAME, "version", __version__

    if restart:
        c = load_pickle(dumpfile=dumpfile, verbose=verbose)
    else:
        c = Checker()

    c.setflags(checkext=checkext, verbose=verbose,
               maxpage=maxpage, roundsize=roundsize)

    if not restart and not args:
        args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    if not norun:
        try:
            c.run()
        except KeyboardInterrupt:
            if verbose > 0:
                print "[run interrupted]"

    try:
        c.report()
    except KeyboardInterrupt:
        if verbose > 0:
            print "[report interrupted]"

    if c.save_pickle(dumpfile):
        if dumpfile == DUMPFILE:
            print "Use ``%s -R'' to restart." % sys.argv[0]
        else:
            print "Use ``%s -R -d %s'' to restart." % (sys.argv[0], dumpfile)


def load_pickle(dumpfile=DUMPFILE, verbose=VERBOSE):
    if verbose > 0:
        print "Loading checkpoint from %s ..." % dumpfile
    f = open(dumpfile, "rb")
    c = pickle.load(f)
    f.close()
    if verbose > 0:
        print "Done."
        print "Root:", string.join(c.roots, "\n ")
    return c


class Checker:

    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE

    validflags = tuple(dir())

    def __init__(self):
        self.reset()

    def setflags(self, **kw):
        for key in kw.keys():
            if key not in self.validflags:
                raise NameError, "invalid keyword argument: %s" % str(key)
        for key, value in kw.items():
            setattr(self, key, value)

    def reset(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.bad = {}
        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.errors = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def __getstate__(self):
        return (self.roots, self.todo, self.done, self.bad, self.round)

    def __setstate__(self, state):
        self.reset()
        (self.roots, self.todo, self.done, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)
        for url in self.bad.keys():
            self.markerror(url)

    def addroot(self, root):
        if root not in self.roots:
            troot = root
            scheme, netloc, path, params, query, fragment = \
                    urlparse.urlparse(root)
            i = string.rfind(path, "/") + 1
            if 0 < i < len(path):
                path = path[:i]
                troot = urlparse.urlunparse((scheme, netloc, path,
                                             params, query, fragment))
            self.roots.append(troot)
            self.addrobot(root)
            self.newlink(root, ("<root>", root))

    def addrobot(self, root):
        root = urlparse.urljoin(root, "/")
        if self.robots.has_key(root): return
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        if self.verbose > 2:
            print "Parsing", url
            rp.debug = self.verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except IOError, msg:
            if self.verbose > 1:
                print "I/O error parsing", url, ":", msg

    def run(self):
        while self.todo:
            self.round = self.round + 1
            if self.verbose > 0:
                print
                print "Round %d (%s)" % (self.round, self.status())
                print
            urls = self.todo.keys()[:self.roundsize]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "%d total, %d to do, %d done, %d bad" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.bad))

    def report(self):
        print
        if not self.todo: print "Final",
        else: print "Interim",
        print "Report (%s)" % self.status()
        self.report_errors()

    def report_errors(self):
        if not self.bad:
            print
            print "No errors"
            return
        print
        print "Error Report:"
        sources = self.errors.keys()
        sources.sort()
        for source in sources:
            triples = self.errors[source]
            print
            if len(triples) > 1:
                print len(triples), "Errors in", source
            else:
                print "Error in", source
            for url, rawlink, msg in triples:
                print " HREF", url,
                if rawlink != url: print "(%s)" % rawlink,
                print
                print " msg", msg

    def dopage(self, url):
        if self.verbose > 1:
            if self.verbose > 2:
                self.show("Check ", url, " from", self.todo[url])
            else:
                print "Check ", url
        page = self.getpage(url)
        if page:
            for info in page.getlinkinfos():
                link, rawlink = info
                origin = url, rawlink
                self.newlink(link, origin)
        self.markdone(url)

    def newlink(self, url, origin):
        if self.done.has_key(url):
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        self.done[url].append(origin)
        if self.verbose > 3:
            print " Done link", url

    def newtodolink(self, url, origin):
        if self.todo.has_key(url):
            self.todo[url].append(origin)
            if self.verbose > 3:
                print " Seen todo link", url
        else:
            self.todo[url] = [origin]
            if self.verbose > 3:
                print " New todo link", url

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

    def inroots(self, url):
        for root in self.roots:
            if url[:len(root)] == root:
                root = urlparse.urljoin(root, "/")
                return self.robots[root].can_fetch(AGENTNAME, url)
        return 0

    def getpage(self, url):
        if url[:7] == 'mailto:' or url[:5] == 'news:':
            if self.verbose > 1: print " Not checking mailto/news URL"
            return None
        isint = self.inroots(url)
        if not isint:
            if not self.checkext:
                if self.verbose > 1: print " Not checking ext link"
                return None
            f = self.openpage(url)
            if f:
                self.safeclose(f)
            return None
        text, nurl = self.readhtml(url)
        if nurl != url:
            if self.verbose > 1:
                print " Redirected to", nurl
            url = nurl
        if text:
            return Page(text, url, verbose=self.verbose, maxpage=self.maxpage)

    def readhtml(self, url):
        text = None
        f, url = self.openhtml(url)
        if f:
            text = f.read()
            f.close()
        return text, url

    def openhtml(self, url):
        f = self.openpage(url)
        if f:
            url = f.geturl()
            info = f.info()
            if not self.checkforhtml(info, url):
                self.safeclose(f)
                f = None
        return f, url

    def openpage(self, url):
        try:
            return self.urlopener.open(url)
        except IOError, msg:
            msg = self.sanitize(msg)
            if self.verbose > 0:
                print "Error ", msg
            if self.verbose > 0:
                self.show(" HREF ", url, " from", self.todo[url])
            self.setbad(url, msg)
            return None

    def checkforhtml(self, info, url):
        if info.has_key('content-type'):
            ctype = string.lower(info['content-type'])
        else:
            if url[-1:] == "/":
                return 1
            ctype, encoding = mimetypes.guess_type(url)
        if ctype == 'text/html':
            return 1
        else:
            if self.verbose > 1:
                print " Not HTML, mime type", ctype
            return 0

    def setgood(self, url):
        if self.bad.has_key(url):
            del self.bad[url]
            self.changed = 1
            if self.verbose > 0:
                print "(Clear previously seen error)"

    def setbad(self, url, msg):
        if self.bad.has_key(url) and self.bad[url] == msg:
            if self.verbose > 0:
                print "(Seen this error before)"
            return
        self.bad[url] = msg
        self.changed = 1
        self.markerror(url)

    def markerror(self, url):
        try:
            origins = self.todo[url]
        except KeyError:
            origins = self.done[url]
        for source, rawlink in origins:
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def seterror(self, url, triple):
        try:
            self.errors[url].append(triple)
        except KeyError:
            self.errors[url] = [triple]

    # The following used to be toplevel functions; they have been
    # changed into methods so they can be overridden in subclasses.

    def show(self, p1, link, p2, origins):
        print p1, link
        i = 0
        for source, rawlink in origins:
            i = i+1
            if i == 2:
                p2 = ' '*len(p2)
            print p2, source,
            if rawlink != link: print "(%s)" % rawlink,
            print

    def sanitize(self, msg):
        if isinstance(IOError, ClassType) and isinstance(msg, IOError):
            # Do the other branch recursively
            msg.args = self.sanitize(msg.args)
        elif isinstance(msg, TupleType):
            if len(msg) >= 4 and msg[0] == 'http error' and \
               isinstance(msg[3], InstanceType):
                # Remove the Message instance -- it may contain
                # a file object which prevents pickling.
                msg = msg[:3] + msg[4:]
        return msg

    def safeclose(self, f):
        try:
            url = f.geturl()
        except AttributeError:
            pass
        else:
            if url[:4] == 'ftp:' or url[:7] == 'file://':
                # Apparently ftp connections don't like to be closed
                # prematurely...
                text = f.read()
        f.close()

    def save_pickle(self, dumpfile=DUMPFILE):
        if not self.changed:
            if self.verbose > 0:
                print
                print "No need to save checkpoint"
        elif not dumpfile:
            if self.verbose > 0:
                print "No dumpfile, won't save checkpoint"
        else:
            if self.verbose > 0:
                print
                print "Saving checkpoint to %s ..." % dumpfile
            newfile = dumpfile + ".new"
            f = open(newfile, "wb")
            pickle.dump(self, f)
            f.close()
            try:
                os.unlink(dumpfile)
            except os.error:
                pass
            os.rename(newfile, dumpfile)
            if self.verbose > 0:
                print "Done."
            return 1


class Page:

    def __init__(self, text, url, verbose=VERBOSE, maxpage=MAXPAGE):
        self.text = text
        self.url = url
        self.verbose = verbose
        self.maxpage = maxpage

    def getlinkinfos(self):
        size = len(self.text)
        if size > self.maxpage:
            if self.verbose > 0:
                print "Skip huge file", self.url
                print " (%.0f Kbytes)" % (size*0.001)
            return []
        if self.verbose > 2:
            print " Parsing", self.url, "(%d bytes)" % size
        parser = MyHTMLParser(verbose=self.verbose)
        parser.feed(self.text)
        parser.close()
        rawlinks = parser.getlinks()
        base = urlparse.urljoin(self.url, parser.getbase() or "")
        infos = []
        for rawlink in rawlinks:
            t = urlparse.urlparse(rawlink)
            t = t[:-1] + ('',)
            rawlink = urlparse.urlunparse(t)
            link = urlparse.urljoin(base, rawlink)
            infos.append((link, rawlink))
        return infos


class MyStringIO(StringIO.StringIO):

    def __init__(self, url, info):
        self.__url = url
        self.__info = info
        StringIO.StringIO.__init__(self)

    def info(self):
        return self.__info

    def geturl(self):
        return self.__url


class MyURLopener(urllib.FancyURLopener):

    http_error_default = urllib.URLopener.http_error_default

    def __init__(*args):
        self = args[0]
        apply(urllib.FancyURLopener.__init__, args)
        self.addheaders = [
            ('User-agent', 'Python-webchecker/%s' % __version__),
            ]

    def http_error_401(self, url, fp, errcode, errmsg, headers):
        return None

    def open_file(self, url):
        path = urllib.url2pathname(urllib.unquote(url))
        if path[-1] != os.sep:
            url = url + '/'
        if os.path.isdir(path):
            indexpath = os.path.join(path, "index.html")
            if os.path.exists(indexpath):
                return self.open_file(url + "index.html")
            try:
                names = os.listdir(path)
            except os.error, msg:
                raise IOError, msg, sys.exc_traceback
            names.sort()
            s = MyStringIO("file:"+url, {'content-type': 'text/html'})
            s.write('<BASE HREF="file:%s">\n' %
                    urllib.quote(os.path.join(path, "")))
            for name in names:
                q = urllib.quote(name)
                s.write('<A HREF="%s">%s</A>\n' % (q, q))
            s.seek(0)
            return s
        return urllib.FancyURLopener.open_file(self, path)


class MyHTMLParser(sgmllib.SGMLParser):

    def __init__(self, verbose=VERBOSE):
        self.base = None
        self.links = {}
        self.myverbose = verbose
        sgmllib.SGMLParser.__init__(self)

    def start_a(self, attributes):
        self.link_attr(attributes, 'href')

    def end_a(self): pass

    def do_area(self, attributes):
        self.link_attr(attributes, 'href')

    def do_img(self, attributes):
        self.link_attr(attributes, 'src', 'lowsrc')

    def do_frame(self, attributes):
        self.link_attr(attributes, 'src')

    def link_attr(self, attributes, *args):
        for name, value in attributes:
            if name in args:
                if value: value = string.strip(value)
                if value: self.links[value] = None

    def do_base(self, attributes):
        for name, value in attributes:
            if name == 'href':
                if value: value = string.strip(value)
                if value:
                    if self.myverbose > 1:
                        print " Base", value
                    self.base = value

    def getlinks(self):
        return self.links.keys()

    def getbase(self):
        return self.base


if __name__ == '__main__':
    main()