#! /usr/bin/env python

# Original code by Guido van Rossum; extensive changes by Sam Bayer,
# including code to check URL fragments.

"""Web tree checker.

This utility is handy to check a subweb of the world-wide web for
errors. A subweb is specified by giving one or more ``root URLs''; a
page belongs to the subweb if one of the root URLs is an initial
prefix of it.

File URL extension:

In order to ease the checking of subwebs via the local file system,
the interpretation of ``file:'' URLs is extended to mimic the behavior
of your average HTTP daemon: if a directory pathname is given, the
file index.html in that directory is returned if it exists, otherwise
a directory listing is returned. Now, you can point webchecker to the
document tree in the local file system of your HTTP daemon, and have
most of it checked. In fact the default works this way if your local
web tree is located at /usr/local/etc/httpd/htdocs (the default for
the NCSA HTTP daemon and probably others).

Report printed:

When done, it reports pages with bad links within the subweb. When
interrupted, it reports on the pages that it has already checked.

In verbose mode, additional messages are printed during the
information gathering phase. By default, it prints a summary of its
work status every 50 URLs (adjustable with the -r option), and it
reports errors as they are encountered. Use the -q option to disable
this output.

Checkpoint feature:

Whether interrupted or not, it dumps its state (a Python pickle) to a
checkpoint file and the -R option allows it to restart from the
checkpoint (assuming that the pages on the subweb that were already
processed haven't changed). Even when it has run till completion, -R
can still be useful -- it will print the reports again, and -Rq prints
the errors only. In this case, the checkpoint file is not written
again. The checkpoint file can be set with the -d option.

The checkpoint file is written as a Python pickle. Remember that
Python's pickle module is currently quite slow. Give it the time it
needs to load and save the checkpoint file. When interrupted while
writing the checkpoint file, the old checkpoint file is not
overwritten, but all work done in the current run is lost.

Miscellaneous:

- You may find the (Tk-based) GUI version easier to use. See wcgui.py.

- Webchecker honors the "robots.txt" convention. Thanks to Skip
Montanaro for his robotparser.py module (included in this directory)!
The agent name is hardwired to "webchecker". URLs that are disallowed
by the robots.txt file are reported as external URLs.

- Because the SGML parser is a bit slow, very large SGML files are
skipped. The size limit can be set with the -m option.

- When the server or protocol does not tell us a file's type, we guess
it based on the URL's suffix. The mimetypes.py module (also in this
directory) has a built-in table mapping most currently known suffixes,
and in addition attempts to read the mime.types configuration files in
the default locations of Netscape and the NCSA HTTP daemon.

- We follow links indicated by <A>, <FRAME> and <IMG> tags. We also
honor the <BASE> tag.

- We now check internal NAME anchor links, as well as toplevel links.

- Checking external links is now done by default; use -x to *disable*
this feature. External links are now checked during normal
processing. (XXX The status of a checked link could be categorized
better. Later...)

- If external links are not checked, you can use the -t flag to
provide specific overrides to -x.

Usage: webchecker.py [option] ... [rooturl] ...

Options:

-R        -- restart from checkpoint file
-d file   -- checkpoint filename (default %(DUMPFILE)s)
-m bytes  -- skip HTML pages larger than this size (default %(MAXPAGE)d)
-n        -- reports only, no checking (use with -R)
-q        -- quiet operation (also suppresses external links report)
-r number -- number of links processed per round (default %(ROUNDSIZE)d)
-t root   -- specify root dir which should be treated as internal (can repeat)
-v        -- verbose operation; repeating -v will increase verbosity
-x        -- don't check external links (these are often slow to check)
-a        -- don't check name anchors

Arguments:

rooturl   -- URL to start checking
             (default %(DEFROOT)s)

"""


__version__ = "$Revision$"


import sys
import os
from types import *
import io
import getopt
import pickle

import urllib.request
import urllib.parse as urlparse
import sgmllib
import cgi

import mimetypes
from urllib import robotparser

# Extract real version number if necessary
if __version__[0] == '$':
    _v = __version__.split()
    if len(_v) == 3:
        __version__ = _v[1]


# Tunable parameters
DEFROOT = "file:/usr/local/etc/httpd/htdocs/"   # Default root URL
CHECKEXT = 1                            # Check external references (1 deep)
VERBOSE = 1                             # Verbosity level (0-3)
MAXPAGE = 150000                        # Ignore files bigger than this
ROUNDSIZE = 50                          # Number of links processed per round
DUMPFILE = "@webchecker.pickle"         # Pickled checkpoint
AGENTNAME = "webchecker"                # Agent name for robots.txt parser
NONAMES = 0                             # Force name anchor checking


# Global variables


def main():
    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    dumpfile = DUMPFILE
    restart = 0
    norun = 0

    try:
        opts, args = getopt.getopt(sys.argv[1:], 'Rd:m:nqr:t:vxa')
    except getopt.error as msg:
        sys.stdout = sys.stderr
        print(msg)
        print(__doc__%globals())
        sys.exit(2)

    # The extra_roots variable collects extra roots.
    extra_roots = []
    nonames = NONAMES

    for o, a in opts:
        if o == '-R':
            restart = 1
        if o == '-d':
            dumpfile = a
        if o == '-m':
            maxpage = int(a)
        if o == '-n':
            norun = 1
        if o == '-q':
            verbose = 0
        if o == '-r':
            roundsize = int(a)
        if o == '-t':
            extra_roots.append(a)
        if o == '-a':
            nonames = not nonames
        if o == '-v':
            verbose = verbose + 1
        if o == '-x':
            checkext = not checkext

    if verbose > 0:
        print(AGENTNAME, "version", __version__)

    if restart:
        c = load_pickle(dumpfile=dumpfile, verbose=verbose)
    else:
        c = Checker()

    c.setflags(checkext=checkext, verbose=verbose,
               maxpage=maxpage, roundsize=roundsize,
               nonames=nonames
               )

    if not restart and not args:
        args.append(DEFROOT)

    for arg in args:
        c.addroot(arg)

    # The -t flag is only needed if external links are not to be
    # checked. So -t values are ignored unless -x was specified.
    if not checkext:
        for root in extra_roots:
            # Make sure it's terminated by a slash,
            # so that addroot doesn't discard the last
            # directory component.
            if root[-1] != "/":
                root = root + "/"
            c.addroot(root, add_to_do = 0)

    try:

        if not norun:
            try:
                c.run()
            except KeyboardInterrupt:
                if verbose > 0:
                    print("[run interrupted]")

        try:
            c.report()
        except KeyboardInterrupt:
            if verbose > 0:
                print("[report interrupted]")

    finally:
        if c.save_pickle(dumpfile):
            if dumpfile == DUMPFILE:
                print("Use ``%s -R'' to restart." % sys.argv[0])
            else:
                print("Use ``%s -R -d %s'' to restart." % (sys.argv[0],
                                                           dumpfile))


def load_pickle(dumpfile=DUMPFILE, verbose=VERBOSE):
    if verbose > 0:
        print("Loading checkpoint from %s ..." % dumpfile)
    f = open(dumpfile, "rb")
    c = pickle.load(f)
    f.close()
    if verbose > 0:
        print("Done.")
        print("Root:", "\n ".join(c.roots))
    return c


class Checker:

    checkext = CHECKEXT
    verbose = VERBOSE
    maxpage = MAXPAGE
    roundsize = ROUNDSIZE
    nonames = NONAMES

    validflags = tuple(dir())

    def __init__(self):
        self.reset()

    def setflags(self, **kw):
        for key in kw:
            if key not in self.validflags:
                raise NameError("invalid keyword argument: %s" % str(key))
        for key, value in kw.items():
            setattr(self, key, value)

    def reset(self):
        self.roots = []
        self.todo = {}
        self.done = {}
        self.bad = {}

        # Add a name table, so that the name URLs can be checked. Also
        # serves as an implicit cache for which URLs are done.
        self.name_table = {}

        self.round = 0
        # The following are not pickled:
        self.robots = {}
        self.errors = {}
        self.urlopener = MyURLopener()
        self.changed = 0

    def note(self, level, format, *args):
        if self.verbose > level:
            if args:
                format = format%args
            self.message(format)

    def message(self, format, *args):
        if args:
            format = format%args
        print(format)

    def __getstate__(self):
        return (self.roots, self.todo, self.done, self.bad, self.round)

    def __setstate__(self, state):
        self.reset()
        (self.roots, self.todo, self.done, self.bad, self.round) = state
        for root in self.roots:
            self.addrobot(root)
        for url in self.bad:
            self.markerror(url)

    def addroot(self, root, add_to_do = 1):
        if root not in self.roots:
            troot = root
            scheme, netloc, path, params, query, fragment = \
                    urlparse.urlparse(root)
            i = path.rfind("/") + 1
            if 0 < i < len(path):
                path = path[:i]
                troot = urlparse.urlunparse((scheme, netloc, path,
                                             params, query, fragment))
            self.roots.append(troot)
            self.addrobot(root)
            if add_to_do:
                self.newlink((root, ""), ("<root>", root))

    def addrobot(self, root):
        root = urlparse.urljoin(root, "/")
        if root in self.robots: return
        url = urlparse.urljoin(root, "/robots.txt")
        self.robots[root] = rp = robotparser.RobotFileParser()
        self.note(2, "Parsing %s", url)
        rp.debug = self.verbose > 3
        rp.set_url(url)
        try:
            rp.read()
        except (OSError, IOError) as msg:
            self.note(1, "I/O error parsing %s: %s", url, msg)

    def run(self):
        while self.todo:
            self.round = self.round + 1
            self.note(0, "\nRound %d (%s)\n", self.round, self.status())
            urls = sorted(self.todo.keys())
            del urls[self.roundsize:]
            for url in urls:
                self.dopage(url)

    def status(self):
        return "%d total, %d to do, %d done, %d bad" % (
            len(self.todo)+len(self.done),
            len(self.todo), len(self.done),
            len(self.bad))

    def report(self):
        self.message("")
        if not self.todo: s = "Final"
        else: s = "Interim"
        self.message("%s Report (%s)", s, self.status())
        self.report_errors()

    def report_errors(self):
        if not self.bad:
            self.message("\nNo errors")
            return
        self.message("\nError Report:")
        sources = sorted(self.errors.keys())
        for source in sources:
            triples = self.errors[source]
            self.message("")
            if len(triples) > 1:
                self.message("%d Errors in %s", len(triples), source)
            else:
                self.message("Error in %s", source)
            # Call self.format_url() instead of referring
            # to the URL directly, since the URL in each of these
            # triples is now a (URL, fragment) pair. The value
            # of the "source" variable comes from the list of
            # origins, and is a URL, not a pair.
            for url, rawlink, msg in triples:
                if rawlink != self.format_url(url): s = " (%s)" % rawlink
                else: s = ""
                self.message(" HREF %s%s\n msg %s",
                             self.format_url(url), s, msg)

    def dopage(self, url_pair):

        # All printing of URLs uses format_url(); argument changed to
        # url_pair for clarity.
        if self.verbose > 1:
            if self.verbose > 2:
                self.show("Check ", self.format_url(url_pair),
                          " from", self.todo[url_pair])
            else:
                self.message("Check %s", self.format_url(url_pair))
        url, local_fragment = url_pair
        if local_fragment and self.nonames:
            self.markdone(url_pair)
            return
        try:
            page = self.getpage(url_pair)
        except sgmllib.SGMLParseError as msg:
            msg = self.sanitize(msg)
            self.note(0, "Error parsing %s: %s",
                      self.format_url(url_pair), msg)
            # Don't actually mark the URL as bad - it exists, we
            # just can't parse it!
            page = None
        if page:
            # Store the page which corresponds to this URL.
            self.name_table[url] = page
            # If there is a fragment in this url_pair, and it's not
            # in the list of names for the page, call setbad(), since
            # it's a missing anchor.
            if local_fragment and local_fragment not in page.getnames():
                self.setbad(url_pair, ("Missing name anchor `%s'" % local_fragment))
            for info in page.getlinkinfos():
                # getlinkinfos() now returns the fragment as well,
                # and we store that fragment here in the "todo" dictionary.
                link, rawlink, fragment = info
                # However, we don't want the fragment as the origin, since
                # the origin is logically a page.
                origin = url, rawlink
                self.newlink((link, fragment), origin)
        else:
            # If no page has been created yet, we want to
            # record that fact.
            self.name_table[url_pair[0]] = None
        self.markdone(url_pair)

    def newlink(self, url, origin):
        if url in self.done:
            self.newdonelink(url, origin)
        else:
            self.newtodolink(url, origin)

    def newdonelink(self, url, origin):
        if origin not in self.done[url]:
            self.done[url].append(origin)

        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        self.note(3, " Done link %s", self.format_url(url))

        # Make sure that if it's bad, that the origin gets added.
        if url in self.bad:
            source, rawlink = origin
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def newtodolink(self, url, origin):
        # Call self.format_url(), since the URL here
        # is now a (URL, fragment) pair.
        if url in self.todo:
            if origin not in self.todo[url]:
                self.todo[url].append(origin)
            self.note(3, " Seen todo link %s", self.format_url(url))
        else:
            self.todo[url] = [origin]
            self.note(3, " New todo link %s", self.format_url(url))

    def format_url(self, url):
        link, fragment = url
        if fragment: return link + "#" + fragment
        else: return link

    def markdone(self, url):
        self.done[url] = self.todo[url]
        del self.todo[url]
        self.changed = 1

    def inroots(self, url):
        for root in self.roots:
            if url[:len(root)] == root:
                return self.isallowed(root, url)
        return 0

    def isallowed(self, root, url):
        root = urlparse.urljoin(root, "/")
        return self.robots[root].can_fetch(AGENTNAME, url)

    def getpage(self, url_pair):
        # Incoming argument name is a (URL, fragment) pair.
        # The page may have been cached in the name_table variable.
        url, fragment = url_pair
        if url in self.name_table:
            return self.name_table[url]

        scheme, path = urllib.request.splittype(url)
        if scheme in ('mailto', 'news', 'javascript', 'telnet'):
            self.note(1, " Not checking %s URL" % scheme)
            return None
        isint = self.inroots(url)

        # Ensure that openpage gets the URL pair to
        # print out its error message and record the error pair
        # correctly.
        if not isint:
            if not self.checkext:
                self.note(1, " Not checking ext link")
                return None
            f = self.openpage(url_pair)
            if f:
                self.safeclose(f)
            return None
        text, nurl = self.readhtml(url_pair)

        if nurl != url:
            self.note(1, " Redirected to %s", nurl)
            url = nurl
        if text:
            return Page(text, url, maxpage=self.maxpage, checker=self)

    # These next three functions take (URL, fragment) pairs as
    # arguments, so that openpage() receives the appropriate tuple to
    # record error messages.
    def readhtml(self, url_pair):
        url, fragment = url_pair
        text = None
        f, url = self.openhtml(url_pair)
        if f:
            text = f.read()
            f.close()
        return text, url

    def openhtml(self, url_pair):
        url, fragment = url_pair
        f = self.openpage(url_pair)
        if f:
            url = f.geturl()
            info = f.info()
            if not self.checkforhtml(info, url):
                self.safeclose(f)
                f = None
        return f, url

    def openpage(self, url_pair):
        url, fragment = url_pair
        try:
            return self.urlopener.open(url)
        except (OSError, IOError) as msg:
            msg = self.sanitize(msg)
            self.note(0, "Error %s", msg)
            if self.verbose > 0:
                self.show(" HREF ", url, " from", self.todo[url_pair])
            self.setbad(url_pair, msg)
            return None

    def checkforhtml(self, info, url):
        if 'content-type' in info:
            ctype = cgi.parse_header(info['content-type'])[0].lower()
            if ';' in ctype:
                # handle content-type: text/html; charset=iso8859-1 :
                ctype = ctype.split(';', 1)[0].strip()
        else:
            if url[-1:] == "/":
                return 1
            ctype, encoding = mimetypes.guess_type(url)
        if ctype == 'text/html':
            return 1
        else:
            self.note(1, " Not HTML, mime type %s", ctype)
            return 0

    def setgood(self, url):
        if url in self.bad:
            del self.bad[url]
            self.changed = 1
            self.note(0, "(Clear previously seen error)")

    def setbad(self, url, msg):
        if url in self.bad and self.bad[url] == msg:
            self.note(0, "(Seen this error before)")
            return
        self.bad[url] = msg
        self.changed = 1
        self.markerror(url)

    def markerror(self, url):
        try:
            origins = self.todo[url]
        except KeyError:
            origins = self.done[url]
        for source, rawlink in origins:
            triple = url, rawlink, self.bad[url]
            self.seterror(source, triple)

    def seterror(self, url, triple):
        try:
            # Because of the way the URLs are now processed, I need to
            # check to make sure the URL hasn't been entered in the
            # error list. The first element of the triple here is a
            # (URL, fragment) pair, but the URL key is not, since it's
            # from the list of origins.
            if triple not in self.errors[url]:
                self.errors[url].append(triple)
        except KeyError:
            self.errors[url] = [triple]

    # The following used to be toplevel functions; they have been
    # changed into methods so they can be overridden in subclasses.

    def show(self, p1, link, p2, origins):
        self.message("%s %s", p1, link)
        i = 0
        for source, rawlink in origins:
            i = i+1
            if i == 2:
                p2 = ' '*len(p2)
            if rawlink != link: s = " (%s)" % rawlink
            else: s = ""
            self.message("%s %s%s", p2, source, s)

    def sanitize(self, msg):
        if isinstance(IOError, ClassType) and isinstance(msg, IOError):
            # Do the other branch recursively
            msg.args = self.sanitize(msg.args)
        elif isinstance(msg, TupleType):
            if len(msg) >= 4 and msg[0] == 'http error' and \
               isinstance(msg[3], InstanceType):
                # Remove the Message instance -- it may contain
                # a file object which prevents pickling.
                msg = msg[:3] + msg[4:]
        return msg

    def safeclose(self, f):
        try:
            url = f.geturl()
        except AttributeError:
            pass
        else:
            if url[:4] == 'ftp:' or url[:7] == 'file://':
                # Apparently ftp connections don't like to be closed
                # prematurely...
                text = f.read()
        f.close()

639 def save_pickle(self, dumpfile=DUMPFILE):
Guido van Rossum986abac1998-04-06 14:29:28 +0000640 if not self.changed:
Guido van Rossum125700a1998-07-08 03:04:39 +0000641 self.note(0, "\nNo need to save checkpoint")
Guido van Rossum986abac1998-04-06 14:29:28 +0000642 elif not dumpfile:
Guido van Rossum125700a1998-07-08 03:04:39 +0000643 self.note(0, "No dumpfile, won't save checkpoint")
Guido van Rossum986abac1998-04-06 14:29:28 +0000644 else:
Guido van Rossum125700a1998-07-08 03:04:39 +0000645 self.note(0, "\nSaving checkpoint to %s ...", dumpfile)
Guido van Rossum986abac1998-04-06 14:29:28 +0000646 newfile = dumpfile + ".new"
647 f = open(newfile, "wb")
648 pickle.dump(self, f)
649 f.close()
650 try:
651 os.unlink(dumpfile)
652 except os.error:
653 pass
654 os.rename(newfile, dumpfile)
Guido van Rossum125700a1998-07-08 03:04:39 +0000655 self.note(0, "Done.")
Guido van Rossum986abac1998-04-06 14:29:28 +0000656 return 1
Guido van Rossum00756bd1998-02-21 20:02:09 +0000657
Guido van Rossum272b37d1997-01-30 02:44:48 +0000658
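# Illustrative sketch, not part of webchecker: the write-then-rename
# checkpoint pattern used by save_pickle() above, restated for Python 3.
# The names save_checkpoint and load_checkpoint are hypothetical. It
# assumes os.replace() is acceptable in place of the unlink/rename pair,
# since os.replace() overwrites an existing target in one call.

```python
import os
import pickle


def save_checkpoint(state, dumpfile):
    """Pickle `state` to `dumpfile` by way of a temporary ".new" file."""
    newfile = dumpfile + ".new"
    # Write the full pickle to the side file first, so a crash while
    # dumping never truncates an existing checkpoint.
    with open(newfile, "wb") as f:
        pickle.dump(state, f)
    # Move the finished file over the old checkpoint.
    os.replace(newfile, dumpfile)
    return True


def load_checkpoint(dumpfile):
    """Read back whatever save_checkpoint() stored."""
    with open(dumpfile, "rb") as f:
        return pickle.load(f)
```

# As with any pickle, a checkpoint should only be loaded if it comes
# from a trusted source.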
659class Page:
660
Guido van Rossum125700a1998-07-08 03:04:39 +0000661 def __init__(self, text, url, verbose=VERBOSE, maxpage=MAXPAGE, checker=None):
Guido van Rossum986abac1998-04-06 14:29:28 +0000662 self.text = text
663 self.url = url
664 self.verbose = verbose
665 self.maxpage = maxpage
Guido van Rossum125700a1998-07-08 03:04:39 +0000666 self.checker = checker
Guido van Rossum272b37d1997-01-30 02:44:48 +0000667
Guido van Rossume284b211999-11-17 15:40:08 +0000668 # The page is parsed here in __init__() in order to
669 # initialize the list of names the file contains. The
670 # parser is stored in an instance variable, and the URL
671 # is passed to MyHTMLParser().
672 size = len(self.text)
673 if size > self.maxpage:
674 self.note(0, "Skip huge file %s (%.0f Kbytes)", self.url, (size*0.001))
675 self.parser = None
676 return
677 self.note(2, " Parsing %s (%d bytes)", self.url, size)
678 self.parser = MyHTMLParser(url, verbose=self.verbose,
679 checker=self.checker)
680 self.parser.feed(self.text)
681 self.parser.close()
682
Guido van Rossuma42c1ee1998-08-06 21:31:13 +0000683 def note(self, level, msg, *args):
684 if self.checker:
Neal Norwitzd9108552006-03-17 08:00:19 +0000685 self.checker.note(level, msg, *args)
Guido van Rossuma42c1ee1998-08-06 21:31:13 +0000686 else:
687 if self.verbose >= level:
688 if args:
689 msg = msg%args
Collin Winter6afaeb72007-08-03 17:06:41 +0000690 print(msg)
Guido van Rossuma42c1ee1998-08-06 21:31:13 +0000691
Guido van Rossume284b211999-11-17 15:40:08 +0000692 # Method to retrieve names.
693 def getnames(self):
Guido van Rossum84306242000-03-28 20:10:39 +0000694 if self.parser:
695 return self.parser.names
696 else:
697 return []
Guido van Rossume284b211999-11-17 15:40:08 +0000698
Guido van Rossum272b37d1997-01-30 02:44:48 +0000699 def getlinkinfos(self):
Guido van Rossume284b211999-11-17 15:40:08 +0000700 # The page is read and parsed in __init__(), which leaves
701 # self.parser set to None when parsing was skipped.
702
703 # If no parser was stored, fail.
704 if not self.parser: return []
705
706 rawlinks = self.parser.getlinks()
707 base = urlparse.urljoin(self.url, self.parser.getbase() or "")
Guido van Rossum986abac1998-04-06 14:29:28 +0000708 infos = []
709 for rawlink in rawlinks:
710 t = urlparse.urlparse(rawlink)
Guido van Rossume284b211999-11-17 15:40:08 +0000711 # DON'T DISCARD THE FRAGMENT! Instead, include
712 # it in the tuples which are returned. See Checker.dopage().
713 fragment = t[-1]
Guido van Rossum986abac1998-04-06 14:29:28 +0000714 t = t[:-1] + ('',)
715 rawlink = urlparse.urlunparse(t)
716 link = urlparse.urljoin(base, rawlink)
Tim Peters182b5ac2004-07-18 06:16:08 +0000717 infos.append((link, rawlink, fragment))
Guido van Rossume284b211999-11-17 15:40:08 +0000718
Guido van Rossum986abac1998-04-06 14:29:28 +0000719 return infos
Guido van Rossum272b37d1997-01-30 02:44:48 +0000720
721
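# Illustrative sketch, not part of webchecker: how getlinkinfos() above
# separates the fragment from each raw link before resolving it against
# the base URL. It assumes Python 3's urllib.parse, and uses urldefrag()
# as a stand-in for the urlparse()/urlunparse() pair in the original.
# The function name link_infos and its parameters are hypothetical.

```python
from urllib.parse import urldefrag, urljoin


def link_infos(page_url, rawlinks, base_href=None):
    """Return (absolute link, raw link without fragment, fragment) triples."""
    # A <BASE HREF=...> value, if present, is itself resolved
    # relative to the URL of the page.
    base = urljoin(page_url, base_href or "")
    infos = []
    for rawlink in rawlinks:
        # Keep the fragment instead of discarding it, so that it can
        # later be checked against the names found in the target page.
        stripped, fragment = urldefrag(rawlink)
        infos.append((urljoin(base, stripped), stripped, fragment))
    return infos
```

# A link consisting only of a fragment, such as "#sec", resolves to the
# page itself, with "sec" carried along as the third element.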
Guido van Rossum34d19282007-08-09 01:03:29 +0000722class MyStringIO(io.StringIO):
Guido van Rossum272b37d1997-01-30 02:44:48 +0000723
724 def __init__(self, url, info):
Guido van Rossum986abac1998-04-06 14:29:28 +0000725 self.__url = url
726 self.__info = info
Guido van Rossum34d19282007-08-09 01:03:29 +0000727 super(MyStringIO, self).__init__()
Guido van Rossum272b37d1997-01-30 02:44:48 +0000728
729 def info(self):
Guido van Rossum986abac1998-04-06 14:29:28 +0000730 return self.__info
Guido van Rossum272b37d1997-01-30 02:44:48 +0000731
732 def geturl(self):
Guido van Rossum986abac1998-04-06 14:29:28 +0000733 return self.__url
Guido van Rossum272b37d1997-01-30 02:44:48 +0000734
735
Georg Brandl7d840552008-06-23 11:45:20 +0000736class MyURLopener(urllib.request.FancyURLopener):
Guido van Rossum272b37d1997-01-30 02:44:48 +0000737
Georg Brandl7d840552008-06-23 11:45:20 +0000738 http_error_default = urllib.request.URLopener.http_error_default
Guido van Rossum272b37d1997-01-30 02:44:48 +0000739
Guido van Rossumc59a5d41997-01-30 06:04:00 +0000740 def __init__(*args):
Guido van Rossum986abac1998-04-06 14:29:28 +0000741 self = args[0]
Georg Brandl7d840552008-06-23 11:45:20 +0000742 urllib.request.FancyURLopener.__init__(*args)
Guido van Rossum986abac1998-04-06 14:29:28 +0000743 self.addheaders = [
744 ('User-agent', 'Python-webchecker/%s' % __version__),
745 ]
Guido van Rossum89efda31997-05-07 15:00:56 +0000746
747 def http_error_401(self, url, fp, errcode, errmsg, headers):
748 return None
Guido van Rossumc59a5d41997-01-30 06:04:00 +0000749
Guido van Rossum272b37d1997-01-30 02:44:48 +0000750 def open_file(self, url):
Guido van Rossum986abac1998-04-06 14:29:28 +0000751 path = urllib.request.url2pathname(urllib.parse.unquote(url))
Guido van Rossum986abac1998-04-06 14:29:28 +0000752 if os.path.isdir(path):
Guido van Rossum0ec14931999-04-26 23:11:46 +0000753 if path[-1] != os.sep:
754 url = url + '/'
Guido van Rossum986abac1998-04-06 14:29:28 +0000755 indexpath = os.path.join(path, "index.html")
756 if os.path.exists(indexpath):
757 return self.open_file(url + "index.html")
758 try:
759 names = os.listdir(path)
Guido van Rossumb940e112007-01-10 16:19:56 +0000760 except os.error as msg:
Thomas Wouters0e3f5912006-08-11 14:57:12 +0000761 exc_type, exc_value, exc_tb = sys.exc_info()
Collin Winter828f04a2007-08-31 00:04:24 +0000762 raise IOError(msg).with_traceback(exc_tb)
Guido van Rossum986abac1998-04-06 14:29:28 +0000763 names.sort()
764 s = MyStringIO("file:"+url, {'content-type': 'text/html'})
765 s.write('<BASE HREF="file:%s">\n' %
766 urllib.parse.quote(os.path.join(path, "")))
767 for name in names:
768 q = urllib.parse.quote(name)
769 s.write('<A HREF="%s">%s</A>\n' % (q, q))
770 s.seek(0)
771 return s
Georg Brandl7d840552008-06-23 11:45:20 +0000772 return urllib.request.FancyURLopener.open_file(self, url)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000773
774
Guido van Rossume5605ba1997-01-31 14:43:15 +0000775class MyHTMLParser(sgmllib.SGMLParser):
Guido van Rossum272b37d1997-01-30 02:44:48 +0000776
Guido van Rossume284b211999-11-17 15:40:08 +0000777 def __init__(self, url, verbose=VERBOSE, checker=None):
Guido van Rossum125700a1998-07-08 03:04:39 +0000778 self.myverbose = verbose # now unused
779 self.checker = checker
Guido van Rossum986abac1998-04-06 14:29:28 +0000780 self.base = None
781 self.links = {}
Guido van Rossume284b211999-11-17 15:40:08 +0000782 self.names = []
783 self.url = url
Guido van Rossum986abac1998-04-06 14:29:28 +0000784 sgmllib.SGMLParser.__init__(self)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000785
Thomas Wouters73e5a5b2006-06-08 15:35:45 +0000786 def check_name_id(self, attributes):
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000787 """ Check the name or id attributes on an element.
788 """
789 # We must rescue the NAME or id (name is deprecated in XHTML)
Guido van Rossume284b211999-11-17 15:40:08 +0000790 # attributes from the anchor, in order to
791 # cache the internal anchors which are made
792 # available in the page.
793 for name, value in attributes:
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000794 if name == "name" or name == "id":
Guido van Rossume284b211999-11-17 15:40:08 +0000795 if value in self.names:
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000796 self.checker.message("WARNING: duplicate ID name %s in %s",
Guido van Rossume284b211999-11-17 15:40:08 +0000797 value, self.url)
798 else: self.names.append(value)
799 break
800
Thomas Wouters73e5a5b2006-06-08 15:35:45 +0000801 def unknown_starttag(self, tag, attributes):
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000802 """ In XHTML, you can have id attributes on any element.
803 """
804 self.check_name_id(attributes)
805
806 def start_a(self, attributes):
807 self.link_attr(attributes, 'href')
808 self.check_name_id(attributes)
809
Guido van Rossum6133ec61997-02-01 05:16:08 +0000810 def end_a(self): pass
811
Guido van Rossum2237b731997-10-06 18:54:01 +0000812 def do_area(self, attributes):
Guido van Rossum986abac1998-04-06 14:29:28 +0000813 self.link_attr(attributes, 'href')
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000814 self.check_name_id(attributes)
Guido van Rossum2237b731997-10-06 18:54:01 +0000815
Fred Drakef3186e82001-04-04 17:47:25 +0000816 def do_body(self, attributes):
Fred Draked34a9c92001-04-05 18:14:50 +0000817 self.link_attr(attributes, 'background', 'bgsound')
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000818 self.check_name_id(attributes)
Fred Drakef3186e82001-04-04 17:47:25 +0000819
Guido van Rossum6133ec61997-02-01 05:16:08 +0000820 def do_img(self, attributes):
Guido van Rossum986abac1998-04-06 14:29:28 +0000821 self.link_attr(attributes, 'src', 'lowsrc')
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000822 self.check_name_id(attributes)
Guido van Rossum6133ec61997-02-01 05:16:08 +0000823
824 def do_frame(self, attributes):
Fred Drakef3186e82001-04-04 17:47:25 +0000825 self.link_attr(attributes, 'src', 'longdesc')
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000826 self.check_name_id(attributes)
Fred Drakef3186e82001-04-04 17:47:25 +0000827
828 def do_iframe(self, attributes):
829 self.link_attr(attributes, 'src', 'longdesc')
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000830 self.check_name_id(attributes)
Fred Drakef3186e82001-04-04 17:47:25 +0000831
832 def do_link(self, attributes):
833 for name, value in attributes:
834 if name == "rel":
Walter Dörwaldaaab30e2002-09-11 20:36:02 +0000835 parts = value.lower().split()
Fred Drakef3186e82001-04-04 17:47:25 +0000836 if ( parts == ["stylesheet"]
837 or parts == ["alternate", "stylesheet"]):
838 self.link_attr(attributes, "href")
839 break
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000840 self.check_name_id(attributes)
Fred Drakef3186e82001-04-04 17:47:25 +0000841
842 def do_object(self, attributes):
843 self.link_attr(attributes, 'data', 'usemap')
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000844 self.check_name_id(attributes)
Fred Drakef3186e82001-04-04 17:47:25 +0000845
846 def do_script(self, attributes):
Guido van Rossum986abac1998-04-06 14:29:28 +0000847 self.link_attr(attributes, 'src')
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000848 self.check_name_id(attributes)
Guido van Rossum6133ec61997-02-01 05:16:08 +0000849
Fred Draked34a9c92001-04-05 18:14:50 +0000850 def do_table(self, attributes):
851 self.link_attr(attributes, 'background')
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000852 self.check_name_id(attributes)
Fred Draked34a9c92001-04-05 18:14:50 +0000853
854 def do_td(self, attributes):
855 self.link_attr(attributes, 'background')
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000856 self.check_name_id(attributes)
Fred Draked34a9c92001-04-05 18:14:50 +0000857
858 def do_th(self, attributes):
859 self.link_attr(attributes, 'background')
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000860 self.check_name_id(attributes)
Fred Draked34a9c92001-04-05 18:14:50 +0000861
862 def do_tr(self, attributes):
863 self.link_attr(attributes, 'background')
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000864 self.check_name_id(attributes)
Fred Draked34a9c92001-04-05 18:14:50 +0000865
Guido van Rossum6133ec61997-02-01 05:16:08 +0000866 def link_attr(self, attributes, *args):
Guido van Rossum986abac1998-04-06 14:29:28 +0000867 for name, value in attributes:
868 if name in args:
Walter Dörwaldaaab30e2002-09-11 20:36:02 +0000869 if value: value = value.strip()
Guido van Rossum986abac1998-04-06 14:29:28 +0000870 if value: self.links[value] = None
Guido van Rossum272b37d1997-01-30 02:44:48 +0000871
872 def do_base(self, attributes):
Guido van Rossum986abac1998-04-06 14:29:28 +0000873 for name, value in attributes:
874 if name == 'href':
Walter Dörwaldaaab30e2002-09-11 20:36:02 +0000875 if value: value = value.strip()
Guido van Rossum986abac1998-04-06 14:29:28 +0000876 if value:
Guido van Rossum125700a1998-07-08 03:04:39 +0000877 if self.checker:
878 self.checker.note(1, " Base %s", value)
Guido van Rossum986abac1998-04-06 14:29:28 +0000879 self.base = value
Andrew M. Kuchlinga982c442004-03-21 19:07:23 +0000880 self.check_name_id(attributes)
Guido van Rossum272b37d1997-01-30 02:44:48 +0000881
882 def getlinks(self):
Georg Brandlbf82e372008-05-16 17:02:34 +0000883 return list(self.links.keys())
Guido van Rossum272b37d1997-01-30 02:44:48 +0000884
885 def getbase(self):
Guido van Rossum986abac1998-04-06 14:29:28 +0000886 return self.base
Guido van Rossum272b37d1997-01-30 02:44:48 +0000887
888
Guido van Rossum272b37d1997-01-30 02:44:48 +0000889if __name__ == '__main__':
890 main()
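
# Illustrative sketch, not part of webchecker: the name/id bookkeeping
# done by MyHTMLParser.check_name_id() and link_attr() above, rebuilt on
# html.parser.HTMLParser because sgmllib is not available in Python 3.
# The class name AnchorCollector and its duplicates list are
# hypothetical; the original reports duplicates through
# checker.message() instead, and handles many more tags than <a>.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect anchor names/ids and <a href> links from one page."""

    def __init__(self):
        super().__init__()
        self.names = []       # name/id values, in document order
        self.duplicates = []  # values seen more than once
        self.links = {}       # href values, used as a set

    def handle_starttag(self, tag, attrs):
        # Any element may carry an id, so every start tag is inspected.
        # Only the first name/id attribute of an element is recorded.
        for name, value in attrs:
            if name in ("name", "id"):
                if value in self.names:
                    self.duplicates.append(value)
                else:
                    self.names.append(value)
                break
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.strip():
                    self.links[value.strip()] = None
```

# A fragment such as "#top" in a link can then be validated by testing
# whether "top" is in the names list collected for the target page.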