:mod:`HTMLParser` --- Simple HTML and XHTML parser
==================================================

.. module:: HTMLParser
   :synopsis: A simple parser that can handle HTML and XHTML.

.. note::

   The :mod:`HTMLParser` module has been renamed to :mod:`html.parser` in Python
   3. The :term:`2to3` tool will automatically adapt imports when converting
   your sources to Python 3.


.. versionadded:: 2.2

.. index::
   single: HTML
   single: XHTML

**Source code:** :source:`Lib/HTMLParser.py`

--------------

This module defines a class :class:`.HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
in :mod:`sgmllib`.


.. class:: HTMLParser()

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered. The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   The :class:`.HTMLParser` class is instantiated without arguments.

   Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
   match start tags or call the end-tag handler for elements which are closed
   implicitly by closing an outer element.
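
   Because end tags are not checked against start tags, an application that
   needs such a check has to keep its own record of open elements. The
   following is only a minimal sketch: ``NestingChecker`` is a hypothetical
   helper, not part of this module, and the import fallback is an assumption
   made so that the snippet also runs under Python 3::

      try:
          from HTMLParser import HTMLParser    # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser   # Python 3 module name


      class NestingChecker(HTMLParser):
          """Hypothetical helper that records end tags closing the wrong element."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.open_tags = []     # stack of elements opened so far
              self.mismatches = []    # (expected, found) pairs

          def handle_starttag(self, tag, attrs):
              self.open_tags.append(tag)

          def handle_endtag(self, tag):
              # Compare the end tag with the most recently opened element.
              expected = self.open_tags.pop() if self.open_tags else None
              if expected != tag:
                  self.mismatches.append((expected, tag))


      checker = NestingChecker()
      checker.feed('<p><a href="#main">tag soup</p></a>')
      checker.close()
      print(checker.mismatches)    # [('a', 'p'), ('p', 'a')]

   The sketch treats every start tag as opening an element, so elements that
   are normally written without an end tag (such as ``<br>``) would need
   special handling in real use.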

An exception is defined as well:

.. exception:: HTMLParseError

   :class:`.HTMLParser` is able to handle broken markup, but in some cases it
   might raise this exception when it encounters an error while parsing.
   This exception provides three attributes: :attr:`msg` is a brief
   message explaining the error, :attr:`lineno` is the number of the line on
   which the broken construct was detected, and :attr:`offset` is the number of
   characters into the line at which the construct starts.


Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`.HTMLParser` class to print out start tags, end tags and data
as they are encountered::

   from HTMLParser import HTMLParser

   # create a subclass and override the handler methods
   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print "Encountered a start tag:", tag

       def handle_endtag(self, tag):
           print "Encountered an end tag :", tag

       def handle_data(self, data):
           print "Encountered some data  :", data

   # instantiate the parser and feed it some HTML
   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be::

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data  : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data  : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`.HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser. It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called. *data* can be either :class:`unicode` or
   :class:`str`, but passing :class:`unicode` is advised.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark. This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`.HTMLParser` base class method :meth:`close`.
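
   As an illustration of redefining :meth:`close`, the sketch below joins the
   collected text once the input is complete. ``TextCollector`` and its
   ``chunks`` and ``text`` attributes are hypothetical names chosen for this
   example, and the import fallback is an assumption for Python 3::

      try:
          from HTMLParser import HTMLParser    # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser   # Python 3 module name


      class TextCollector(HTMLParser):
          """Hypothetical helper that gathers all text passed to handle_data."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.chunks = []
              self.text = None    # set once close() has run

          def handle_data(self, data):
              self.chunks.append(data)

          def close(self):
              # Always let the base class flush its buffer first.
              HTMLParser.close(self)
              self.text = ''.join(self.chunks)


      collector = TextCollector()
      for piece in ['<p>Hel', 'lo, wor', 'ld</p>']:
          collector.feed(piece)
      collector.close()
      print(collector.text)    # Hello, world

   Calling the base class method first matters because any data still buffered
   is only delivered to the handlers during that call.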


.. method:: HTMLParser.reset()

   Reset the instance. Loses all unprocessed data. This is called implicitly at
   instantiation time.


.. method:: HTMLParser.getpos()

   Return current line number and offset.
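
   A handler can call :meth:`getpos` to find out where the construct it is
   handling begins. This is a minimal sketch: ``PositionLogger`` is a
   hypothetical name, the import fallback is an assumption for Python 3, and
   the values shown reflect lines being counted from 1 and offsets from 0::

      try:
          from HTMLParser import HTMLParser    # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser   # Python 3 module name


      class PositionLogger(HTMLParser):
          """Hypothetical helper that records where each start tag begins."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.positions = []

          def handle_starttag(self, tag, attrs):
              lineno, offset = self.getpos()
              self.positions.append((tag, lineno, offset))


      logger = PositionLogger()
      logger.feed('<p>\n  <b>bold</b></p>')
      logger.close()
      print(logger.positions)    # [('p', 1, 0), ('b', 2, 2)]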


.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag. This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).
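
   The difference between the normalized arguments and the original markup can
   be seen by calling this method from a start-tag handler. In this sketch
   ``RawTagCollector`` is a hypothetical name and the import fallback is an
   assumption for Python 3::

      try:
          from HTMLParser import HTMLParser    # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser   # Python 3 module name


      class RawTagCollector(HTMLParser):
          """Hypothetical helper pairing each tag name with its source text."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.raw_tags = []

          def handle_starttag(self, tag, attrs):
              # *tag* is lower-cased; the raw text keeps case and spacing.
              self.raw_tags.append((tag, self.get_starttag_text()))


      collector = RawTagCollector()
      collector.feed('<A  HREF="https://www.cwi.nl/"   Class=link>CWI</A>')
      collector.close()
      print(collector.raw_tags)
      # [('a', '<A  HREF="https://www.cwi.nl/"   Class=link>')]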


The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass. The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets. The *name* will be translated to lower case,
   and quotes in the *value* have been removed, and character and entity references
   have been replaced.

   For instance, for the tag ``<A HREF="https://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'https://www.cwi.nl/')])``.

   .. versionchanged:: 2.6
      All entity references from :mod:`htmlentitydefs` are now replaced in the
      attribute values.
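
   A common use of the *attrs* list is to pick out one attribute from one kind
   of element. The following sketch collects link targets; ``LinkCollector`` is
   a hypothetical name and the import fallback is an assumption for Python 3::

      try:
          from HTMLParser import HTMLParser    # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser   # Python 3 module name


      class LinkCollector(HTMLParser):
          """Hypothetical helper that gathers the href of every <a> tag."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.links = []

          def handle_starttag(self, tag, attrs):
              if tag != 'a':
                  return
              # attrs is a list of (name, value) pairs with lower-cased names.
              for name, value in attrs:
                  if name == 'href':
                      self.links.append(value)


      collector = LinkCollector()
      collector.feed('<A HREF="https://www.cwi.nl/">CWI</A> '
                     '<a name="top">anchor only</a> '
                     '<a href="page?x=1&amp;y=2">query</a>')
      collector.close()
      print(collector.links)    # ['https://www.cwi.nl/', 'page?x=1&y=2']

   The second link shows the entity reference ``&amp;`` already replaced in the
   attribute value by the time the handler is called.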


.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``). This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
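
   Overriding this method lets a subclass tell ``<img ... />`` apart from a
   plain start tag. In this sketch ``EmptyTagRecorder`` is a hypothetical name
   and the import fallback is an assumption for Python 3::

      try:
          from HTMLParser import HTMLParser    # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser   # Python 3 module name


      class EmptyTagRecorder(HTMLParser):
          """Hypothetical helper that records which handler saw each tag."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.events = []

          def handle_starttag(self, tag, attrs):
              self.events.append(('start', tag))

          def handle_endtag(self, tag):
              self.events.append(('end', tag))

          def handle_startendtag(self, tag, attrs):
              # Overridden, so the two handlers above are not called here.
              self.events.append(('startend', tag))


      recorder = EmptyTagRecorder()
      recorder.feed('<p><img src="logo.png" /><br></p>')
      recorder.close()
      print(recorder.events)
      # [('start', 'p'), ('startend', 'img'), ('start', 'br'), ('end', 'p')]

   Note that ``<br>`` without the trailing slash is reported as an ordinary
   start tag.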


.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``).


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.


.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   This method is called when a processing instruction is encountered. The *data*
   parameter will contain the entire processing instruction. For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.

   .. note::

      The :class:`.HTMLParser` class uses the SGML syntactic rules for processing
      instructions. An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.
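
   The effect described in the note can be observed by collecting the argument
   passed to :meth:`handle_pi`. In this sketch ``PICollector`` is a
   hypothetical name and the import fallback is an assumption for Python 3::

      try:
          from HTMLParser import HTMLParser    # Python 2 module name
      except ImportError:
          from html.parser import HTMLParser   # Python 3 module name


      class PICollector(HTMLParser):
          """Hypothetical helper that stores every processing instruction."""

          def __init__(self):
              HTMLParser.__init__(self)
              self.instructions = []

          def handle_pi(self, data):
              self.instructions.append(data)


      collector = PICollector()
      collector.feed("<?proc color='red'>")                  # SGML style
      collector.feed('<?xml-stylesheet href="style.css"?>')  # XHTML style
      collector.close()
      print(collector.instructions)
      # ["proc color='red'", 'xml-stylesheet href="style.css"?']

   A subclass that expects XHTML input may want to strip the trailing ``'?'``
   itself before using *data*.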


.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup. Overriding this method in a derived class is
   sometimes useful.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from HTMLParser import HTMLParser
   from htmlentitydefs import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print "Start tag:", tag
           for attr in attrs:
               print "     attr:", attr

       def handle_endtag(self, tag):
           print "End tag  :", tag

       def handle_data(self, data):
           print "Data     :", data

       def handle_comment(self, data):
           print "Comment  :", data

       def handle_entityref(self, name):
           c = unichr(name2codepoint[name])
           print "Named ent:", c

       def handle_charref(self, name):
           if name.startswith('x'):
               c = unichr(int(name[1:], 16))
           else:
               c = unichr(int(name))
           print "Num ent  :", c

       def handle_decl(self, data):
           print "Decl     :", data

   parser = MyHTMLParser()

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
        attr: ('src', 'python-logo.png')
        attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data     : Python
   End tag  : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
        attr: ('type', 'text/css')
   Data     : #python { color: green }
   End tag  : style

   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
        attr: ('type', 'text/javascript')
   Data     : alert("<strong>hello!</strong>");
   End tag  : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment  :  a comment
   Comment  : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``)::

   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent  : >
   Num ent  : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data     : buff
   Data     : ered
   Data     : text
   End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
        attr: ('class', 'link')
        attr: ('href', '#main')
   Data     : tag soup
   End tag  : p
   End tag  : a