:mod:`HTMLParser` --- Simple HTML and XHTML parser
==================================================

.. module:: HTMLParser
   :synopsis: A simple parser that can handle HTML and XHTML.

.. note::

   The :mod:`HTMLParser` module has been renamed to :mod:`html.parser` in Python
   3. The :term:`2to3` tool will automatically adapt imports when converting
   your sources to Python 3.


.. versionadded:: 2.2

.. index::
   single: HTML
   single: XHTML

**Source code:** :source:`Lib/HTMLParser.py`

--------------

This module defines a class :class:`.HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
in :mod:`sgmllib`.


.. class:: HTMLParser()

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered. The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   The :class:`.HTMLParser` class is instantiated without arguments.

   Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
   match start tags or call the end-tag handler for elements which are closed
   implicitly by closing an outer element.

An exception is defined as well:

.. exception:: HTMLParseError

   :class:`.HTMLParser` is able to handle broken markup, but in some cases it
   might raise this exception when it encounters an error while parsing.
   This exception provides three attributes: :attr:`msg` is a brief
   message explaining the error, :attr:`lineno` is the number of the line on
   which the broken construct was detected, and :attr:`offset` is the number of
   characters into the line at which the construct starts.


Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`.HTMLParser` class to print out start tags, end tags and data
as they are encountered::

   from HTMLParser import HTMLParser

   # create a subclass and override the handler methods
   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print "Encountered a start tag:", tag
       def handle_endtag(self, tag):
           print "Encountered an end tag :", tag
       def handle_data(self, data):
           print "Encountered some data  :", data

   # instantiate the parser and feed it some HTML
   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be::

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data  : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data  : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`.HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser. It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called. *data* can be either :class:`unicode` or
   :class:`str`, but passing :class:`unicode` is advised.
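
The buffering performed by :meth:`~HTMLParser.feed` can be observed with a
small sketch. ``TagRecorder`` is a hypothetical helper, not part of the module,
and the ``try``/``except`` import is only an assumption made so that the snippet
also runs under Python 3, where the module is named :mod:`html.parser`.

```python
try:
    from HTMLParser import HTMLParser       # Python 2 module name
except ImportError:
    from html.parser import HTMLParser      # assumption: Python 3 fallback


class TagRecorder(HTMLParser):
    """Hypothetical helper that records the names of the start tags seen."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


recorder = TagRecorder()
recorder.feed('<sp')                     # incomplete start tag: only buffered
tags_after_first_chunk = list(recorder.tags)
recorder.feed('an>')                     # the tag is now complete
tags_after_second_chunk = list(recorder.tags)
recorder.close()
```

After the first call no handler has run yet, so ``tags_after_first_chunk`` is
empty; the ``span`` tag is reported only once the second chunk completes it.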


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark. This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`.HTMLParser` base class method :meth:`close`.
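
One possible way to redefine :meth:`~HTMLParser.close` is sketched below. The
``TextCollector`` class and its ``chunks`` and ``text`` attributes are
hypothetical names chosen for illustration, and the import fallback is an
assumption that lets the snippet run under Python 3 as well.

```python
try:
    from HTMLParser import HTMLParser       # Python 2 module name
except ImportError:
    from html.parser import HTMLParser      # assumption: Python 3 fallback


class TextCollector(HTMLParser):
    """Hypothetical helper that joins all text once the input is finished."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
        self.text = None

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Call the base class method so that buffered data is processed ...
        HTMLParser.close(self)
        # ... and only then do the additional end-of-input processing.
        self.text = ''.join(self.chunks)


collector = TextCollector()
collector.feed('<p>Hello, ')
collector.feed('world</p>')
collector.close()
```

Calling the base class method before joining the chunks means that any text
still held in the buffer reaches ``handle_data`` first.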


.. method:: HTMLParser.reset()

   Reset the instance. Loses all unprocessed data. This is called implicitly at
   instantiation time.
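
Because :meth:`~HTMLParser.reset` runs at instantiation time, a subclass can
set up its own per-document state there instead of in ``__init__``. The sketch
below assumes a hypothetical ``TagRecorder`` class; the import fallback is an
assumption for running the snippet under Python 3.

```python
try:
    from HTMLParser import HTMLParser       # Python 2 module name
except ImportError:
    from html.parser import HTMLParser      # assumption: Python 3 fallback


class TagRecorder(HTMLParser):
    """Hypothetical helper whose state is (re)created by reset()."""

    def reset(self):
        HTMLParser.reset(self)    # discards any unprocessed, buffered data
        self.tags = []            # per-document state of this sketch

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


recorder = TagRecorder()          # reset() already ran, so .tags exists
recorder.feed('<b>')
tags_before_reset = list(recorder.tags)
recorder.feed('<di')              # incomplete tag, kept in the buffer
recorder.reset()                  # drops the buffered '<di' and clears .tags
recorder.feed('<p>')
tags_after_reset = list(recorder.tags)
```

The buffered ``<di`` fragment is lost by the reset, so the text fed afterwards
is parsed as if it started a new document.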


.. method:: HTMLParser.getpos()

   Return current line number and offset.
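
A handler can call :meth:`~HTMLParser.getpos` to note where a construct
begins. The following sketch assumes a hypothetical ``PositionRecorder`` class
and relies on line numbers starting at 1 and offsets at 0; the import fallback
is an assumption for running it under Python 3.

```python
try:
    from HTMLParser import HTMLParser       # Python 2 module name
except ImportError:
    from html.parser import HTMLParser      # assumption: Python 3 fallback


class PositionRecorder(HTMLParser):
    """Hypothetical helper that stores (tag, (lineno, offset)) pairs."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns the position as a (line number, offset) tuple
        self.positions.append((tag, self.getpos()))


recorder = PositionRecorder()
recorder.feed('<html>\n  <body>')
recorder.close()
# recorder.positions is now [('html', (1, 0)), ('body', (2, 2))]
```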


.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag. This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).
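
To illustrate the difference between the normalized *tag* name and the raw
text, the sketch below stores both. ``RawTagRecorder`` is a hypothetical
helper, and the import fallback is an assumption for running the snippet under
Python 3.

```python
try:
    from HTMLParser import HTMLParser       # Python 2 module name
except ImportError:
    from html.parser import HTMLParser      # assumption: Python 3 fallback


class RawTagRecorder(HTMLParser):
    """Hypothetical helper pairing each tag name with its original text."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        self.raw_tags.append((tag, self.get_starttag_text()))


recorder = RawTagRecorder()
recorder.feed('<A   HREF="index.html"  >')
recorder.close()
# recorder.raw_tags is now [('a', '<A   HREF="index.html"  >')]
```

The tag name arrives in lower case, while the raw text keeps the original case
and spacing.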


The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass. The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets. The *name* will be translated to lower case,
   and quotes in the *value* have been removed, and character and entity references
   have been replaced.

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   .. versionchanged:: 2.6
      All entity references from :mod:`htmlentitydefs` are now replaced in the
      attribute values.
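
Since *attrs* is a list of ``(name, value)`` pairs, it can be turned into a
dictionary when only one attribute is of interest. The sketch below assumes a
hypothetical ``LinkCollector`` class; the import fallback is an assumption for
running it under Python 3.

```python
try:
    from HTMLParser import HTMLParser       # Python 2 module name
except ImportError:
    from html.parser import HTMLParser      # assumption: Python 3 fallback


class LinkCollector(HTMLParser):
    """Hypothetical helper that collects the href values of <a> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':                        # tag names arrive in lower case
            href = dict(attrs).get('href')    # so do attribute names
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A>'
               '<a name="top"></a>'
               '<a href="?a=1&amp;b=2">query</a>')
collector.close()
# collector.links is now ['http://www.cwi.nl/', '?a=1&b=2']
```

The second link shows the ``&amp;`` reference already replaced in the
attribute value.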


.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``). This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
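
The default behavior and an overriding subclass can be compared side by side.
``EventRecorder`` and ``EmptyTagRecorder`` are hypothetical helper classes, and
the import fallback is an assumption for running the snippet under Python 3.

```python
try:
    from HTMLParser import HTMLParser       # Python 2 module name
except ImportError:
    from html.parser import HTMLParser      # assumption: Python 3 fallback


class EventRecorder(HTMLParser):
    """Hypothetical helper that keeps the default handle_startendtag()."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


class EmptyTagRecorder(EventRecorder):
    """Hypothetical variant that reports empty tags as a single event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(('startend', tag))


default = EventRecorder()
default.feed('<br />')
default.close()

custom = EmptyTagRecorder()
custom.feed('<br />')
custom.close()
# default.events is [('start', 'br'), ('end', 'br')]
# custom.events is [('startend', 'br')]
```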
Georg Brandl8ec7f652007-08-15 14:28:01 +0000172
173
Georg Brandl8ec7f652007-08-15 14:28:01 +0000174.. method:: HTMLParser.handle_data(data)
175
Ezio Melottic39b5522012-02-18 01:46:04 +0200176 This method is called to process arbitrary data (e.g. text nodes and the
177 content of ``<script>...</script>`` and ``<style>...</style>``).
Georg Brandl8ec7f652007-08-15 14:28:01 +0000178
179
180.. method:: HTMLParser.handle_entityref(name)
181
Ezio Melottic39b5522012-02-18 01:46:04 +0200182 This method is called to process a named character reference of the form
183 ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
184 (e.g. ``'gt'``).
185
186
187.. method:: HTMLParser.handle_charref(name)
188
189 This method is called to process decimal and hexadecimal numeric character
190 references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
191 equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
192 in this case the method will receive ``'62'`` or ``'x3E'``.
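
The *name* received by :meth:`~HTMLParser.handle_charref` can be converted to
a code point with a small function such as the one sketched here.
``charref_to_codepoint`` is a hypothetical helper, and accepting an upper-case
``X`` prefix is a precaution assumed for this sketch rather than something
stated above.

```python
def charref_to_codepoint(name):
    """Convert the *name* passed to handle_charref() into an integer."""
    if name[:1] in ('x', 'X'):
        # hexadecimal form, e.g. 'x3E' for the reference &#x3E;
        return int(name[1:], 16)
    # decimal form, e.g. '62' for the reference &#62;
    return int(name)


decimal_value = charref_to_codepoint('62')
hexadecimal_value = charref_to_codepoint('x3E')
# both values are 62, the code point of '>'
```

A handler could pass the returned integer to :func:`unichr` to obtain the
character itself.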


.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   This method is called when a processing instruction is encountered. The *data*
   parameter will contain the entire processing instruction. For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.

   .. note::

      The :class:`.HTMLParser` class uses the SGML syntactic rules for processing
      instructions. An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.
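
The effect described in the note can be seen by feeding both forms of
processing instruction to a parser. ``PIRecorder`` and the
``xml-stylesheet`` input are hypothetical examples, and the import fallback is
an assumption for running the snippet under Python 3.

```python
try:
    from HTMLParser import HTMLParser       # Python 2 module name
except ImportError:
    from html.parser import HTMLParser      # assumption: Python 3 fallback


class PIRecorder(HTMLParser):
    """Hypothetical helper that stores every processing instruction."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


recorder = PIRecorder()
recorder.feed("<?proc color='red'>")                   # SGML style
recorder.feed("<?xml-stylesheet href='style.css'?>")   # XHTML style
recorder.close()
# recorder.instructions is now
# ["proc color='red'", "xml-stylesheet href='style.css'?"]
```

A subclass that expects XHTML input may want to strip the trailing ``'?'``
itself before using *data*.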


.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup. Overriding this method in a derived class is
   sometimes useful.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from HTMLParser import HTMLParser
   from htmlentitydefs import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print "Start tag:", tag
           for attr in attrs:
               print "     attr:", attr
       def handle_endtag(self, tag):
           print "End tag  :", tag
       def handle_data(self, data):
           print "Data     :", data
       def handle_comment(self, data):
           print "Comment  :", data
       def handle_entityref(self, name):
           c = unichr(name2codepoint[name])
           print "Named ent:", c
       def handle_charref(self, name):
           if name.startswith('x'):
               c = unichr(int(name[1:], 16))
           else:
               c = unichr(int(name))
           print "Num ent  :", c
       def handle_decl(self, data):
           print "Decl     :", data

   parser = MyHTMLParser()

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
        attr: ('src', 'python-logo.png')
        attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data     : Python
   End tag  : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
        attr: ('type', 'text/css')
   Data     : #python { color: green }
   End tag  : style
   >>>
   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
        attr: ('type', 'text/javascript')
   Data     : alert("<strong>hello!</strong>");
   End tag  : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment  :  a comment
   Comment  : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``)::

   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent  : >
   Num ent  : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data     : buff
   Data     : ered
   Data     : text
   End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
        attr: ('class', 'link')
        attr: ('href', '#main')
   Data     : tag soup
   End tag  : p
   End tag  : a