:mod:`HTMLParser` --- Simple HTML and XHTML parser
==================================================

.. module:: HTMLParser
   :synopsis: A simple parser that can handle HTML and XHTML.


.. versionadded:: 2.2

.. index::
   single: HTML
   single: XHTML

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
in :mod:`sgmllib`.


.. class:: HTMLParser()

   The :class:`HTMLParser` class is instantiated without arguments.

   An :class:`HTMLParser` instance is fed HTML data and calls handler methods
   when tags begin and end. The class is meant to be subclassed, with the
   handler methods overridden to provide the desired behavior.

   Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
   match start tags or call the end-tag handler for elements which are closed
   implicitly by closing an outer element.

An exception is defined as well:


.. exception:: HTMLParseError

   Exception raised by the :class:`HTMLParser` class when it encounters an error
   while parsing. This exception provides three attributes: :attr:`msg` is a brief
   message explaining the error, :attr:`lineno` is the number of the line on which
   the broken construct was detected, and :attr:`offset` is the number of
   characters into the line at which the construct starts.

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.reset()

   Reset the instance, discarding all unprocessed data. This is called
   implicitly at instantiation time.


.. method:: HTMLParser.feed(data)

   Feed some text to the parser. It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark. This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.

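The buffering behaviour of ``feed()`` and ``close()`` can be illustrated with a
minimal sketch. The ``TagLogger`` subclass and its ``events`` list are
hypothetical names introduced for this illustration, and the import fallback is
an assumption made so that the sketch also runs on Python 3, where the class
lives in ``html.parser``.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption for portability)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class TagLogger(HTMLParser):
    """Hypothetical subclass that records every start tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append((tag, attrs))


parser = TagLogger()

# The start tag is split across two calls; the first part is only buffered.
parser.feed('<a hre')
events_after_first_feed = list(parser.events)

# Once the rest arrives, the complete tag is processed.
parser.feed('f="x">')
events_after_second_feed = list(parser.events)

# Signal end of input so that any remaining buffered data is handled.
parser.close()
```

In this sketch no handler is called after the first ``feed()``, because the
start tag is still incomplete at that point.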

.. method:: HTMLParser.getpos()

   Return the current line number and offset as a ``(lineno, offset)`` tuple.

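As a rough illustration, ``getpos()`` can be called from inside a handler to
record where each tag starts. ``PositionLogger`` and ``positions`` are
hypothetical names, and the import fallback is an assumption for running the
sketch on Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption for portability)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class PositionLogger(HTMLParser):
    """Hypothetical subclass that records the position of each start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() reports the position of the construct being handled.
        self.positions.append((tag, self.getpos()))


parser = PositionLogger()
parser.feed('<p>\n  <b>')
parser.close()

# Line numbers start at 1; offsets count characters from the start of the line.
positions = parser.positions
```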

.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag. This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).

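The following sketch shows how the raw tag text differs from the normalized
arguments passed to the handler. ``RawTagLogger`` and ``raw_tags`` are
hypothetical names, and the import fallback is an assumption for Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption for portability)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class RawTagLogger(HTMLParser):
    """Hypothetical subclass that keeps the original text of each start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # 'tag' is lower-cased; the raw text keeps case and spacing as written.
        self.raw_tags.append((tag, self.get_starttag_text()))


parser = RawTagLogger()
parser.feed('<A   HREF="index.html">')
parser.close()

raw_tags = parser.raw_tags
```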

.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag. It is intended to be
   overridden by a derived class; the base class implementation does nothing.

   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets. The *name* is translated to lower case,
   quotes in the *value* are removed, and character and entity references are
   replaced. For instance, for the tag ``<A
   HREF="http://www.cwi.nl/">``, this method would be called as
   ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   .. versionchanged:: 2.6
      All entity references from :mod:`htmlentitydefs` are now replaced in the
      attribute values.

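The call described above can be reproduced with a small sketch. ``AttrLogger``
and ``events`` are hypothetical names, the second tag is an extra example
chosen to show entity replacement in attribute values, and the import fallback
is an assumption for Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption for portability)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class AttrLogger(HTMLParser):
    """Hypothetical subclass that records the arguments of handle_starttag()."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append((tag, attrs))


parser = AttrLogger()
# Upper-case names are lower-cased; '&amp;' in the value is replaced by '&'.
parser.feed('<A HREF="http://www.cwi.nl/"><img alt="R&amp;D">')
parser.close()

events = parser.events
```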

.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<a .../>``). This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.

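A short sketch contrasting the default behaviour with an overriding subclass.
``DefaultLogger``, ``EmptyTagLogger`` and ``events`` are hypothetical names,
and the import fallback is an assumption for Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption for portability)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class DefaultLogger(HTMLParser):
    """Hypothetical subclass that relies on the default handle_startendtag()."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


class EmptyTagLogger(DefaultLogger):
    """Hypothetical subclass that treats empty tags as a separate event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(('startend', tag))


default_parser = DefaultLogger()
default_parser.feed('<br/>')
default_parser.close()

custom_parser = EmptyTagLogger()
custom_parser.feed('<br/>')
custom_parser.close()
```

With the default implementation the empty tag shows up as a start event
followed by an end event; the overriding subclass sees it as a single event.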

.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element. It is intended to be
   overridden by a derived class; the base class implementation does nothing. The
   *tag* argument is the name of the tag converted to lower case.

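Because the parser does not check that end tags match start tags, a handler
only sees the end tags that actually appear in the input. The sketch below
illustrates this with a hypothetical ``NestingLogger`` subclass; the import
fallback is an assumption for Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption for portability)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class NestingLogger(HTMLParser):
    """Hypothetical subclass that records start and end tags in order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


parser = NestingLogger()
# The <i> element is never closed explicitly in this input.
parser.feed('<b><i>text</b>')
parser.close()

events = parser.events
```

No end event is reported for ``i``; an application that needs balanced
elements has to track open tags itself.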

.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data. It is intended to be
   overridden by a derived class; the base class implementation does nothing.

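A common use of this handler is collecting the text of a document. The sketch
below uses a hypothetical ``TextCollector`` subclass; it joins the pieces at
the end because text may be delivered in more than one call. The import
fallback is an assumption for Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption for portability)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class TextCollector(HTMLParser):
    """Hypothetical subclass that gathers character data and ignores tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = TextCollector()
parser.feed('<p>Hello, <b>world</b>!</p>')
parser.close()

# Join the chunks rather than relying on how the text was split up.
text = ''.join(parser.chunks)
```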

.. method:: HTMLParser.handle_charref(name)

   This method is called to process a character reference of the form ``&#ref;``.
   It is intended to be overridden by a derived class; the base class
   implementation does nothing.


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a general entity reference of the form
   ``&name;`` where *name* is a general entity reference. It is intended to be
   overridden by a derived class; the base class implementation does nothing.

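The two reference handlers can be compared with a small sketch.
``ReferenceLogger``, ``events`` and ``PARSER_KWARGS`` are hypothetical names.
The sketch assumes that, when run on Python 3.4 or later, the parser must be
created with ``convert_charrefs=False`` for these handlers to be called on
text; the Python 2 class documented here takes no arguments.

```python
try:
    from html.parser import HTMLParser
    # Assumption: Python 3.4+ converts references itself unless told not to.
    PARSER_KWARGS = {'convert_charrefs': False}
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here
    PARSER_KWARGS = {}


class ReferenceLogger(HTMLParser):
    """Hypothetical subclass that records character and entity references."""

    def __init__(self):
        HTMLParser.__init__(self, **PARSER_KWARGS)
        self.events = []

    def handle_charref(self, name):
        # 'name' is the part between '&#' and ';', e.g. '62'.
        self.events.append(('charref', name))

    def handle_entityref(self, name):
        # 'name' is the part between '&' and ';', e.g. 'gt'.
        self.events.append(('entityref', name))


parser = ReferenceLogger()
parser.feed('a &gt; b &#62; c')
parser.close()

events = parser.events
```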

.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered. The *data* argument is
   a string containing the text between the ``--`` and ``--`` delimiters, but not
   the delimiters themselves. For example, the comment ``<!--text-->`` will cause
   this method to be called with the argument ``'text'``. It is intended to be
   overridden by a derived class; the base class implementation does nothing.

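The example above can be checked with a minimal sketch. ``CommentLogger`` and
``comments`` are hypothetical names, and the import fallback is an assumption
for Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption for portability)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class CommentLogger(HTMLParser):
    """Hypothetical subclass that collects the text of comments."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


parser = CommentLogger()
# Whitespace inside the delimiters is passed through unchanged.
parser.feed('<!--text--><!-- a note -->')
parser.close()

comments = parser.comments
```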

.. method:: HTMLParser.handle_decl(decl)

   Method called when an SGML declaration is read by the parser. The *decl*
   parameter will be the entire contents of the declaration inside the ``<!``...\
   ``>`` markup. It is intended to be overridden by a derived class; the base
   class implementation does nothing.

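As an illustration, a document type declaration is the most common case. The
sketch below uses a hypothetical ``DeclLogger`` subclass; the import fallback
is an assumption for Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption for portability)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class DeclLogger(HTMLParser):
    """Hypothetical subclass that records declarations."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.declarations = []

    def handle_decl(self, decl):
        # 'decl' excludes the surrounding '<!' and '>'.
        self.declarations.append(decl)


parser = DeclLogger()
parser.feed('<!DOCTYPE html>')
parser.close()

declarations = parser.declarations
```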

.. method:: HTMLParser.handle_pi(data)

   Method called when a processing instruction is encountered. The *data*
   parameter will contain the entire processing instruction. For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``. It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions. An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.

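Both forms of processing instruction can be compared with a short sketch.
``PILogger`` and ``instructions`` are hypothetical names, and the import
fallback is an assumption for Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption for portability)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class PILogger(HTMLParser):
    """Hypothetical subclass that records processing instructions."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


parser = PILogger()
# SGML-style instruction, then an XHTML-style one with a trailing '?'.
parser.feed("<?proc color='red'>")
parser.feed('<?xml version="1.0"?>')
parser.close()

instructions = parser.instructions
```

A handler that deals with XHTML input may want to strip the trailing ``'?'``
itself.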

.. _htmlparser-example:

Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out tags as they are encountered::

   from HTMLParser import HTMLParser

   class MyHTMLParser(HTMLParser):

       def handle_starttag(self, tag, attrs):
           print "Encountered the beginning of a %s tag" % tag

       def handle_endtag(self, tag):
           print "Encountered the end of a %s tag" % tag

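To drive a parser of this kind, create an instance, pass markup to ``feed()``
and finish with ``close()``. The sketch below is a variant that collects the
messages in a list instead of printing them, an assumption made so that it runs
unchanged on Python 2 and Python 3; ``MessageParser`` and ``messages`` are
hypothetical names.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption for portability)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 module documented here


class MessageParser(HTMLParser):
    """Hypothetical variant of the example that stores its messages."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.messages = []

    def handle_starttag(self, tag, attrs):
        self.messages.append("Encountered the beginning of a %s tag" % tag)

    def handle_endtag(self, tag):
        self.messages.append("Encountered the end of a %s tag" % tag)


parser = MessageParser()
parser.feed('<html><body><p>Hi</p></body></html>')
parser.close()

messages = parser.messages
```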