:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.


.. index::
   single: HTML
   single: XHTML

**Source code:** :source:`Lib/html/parser.py`

--------------

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Markup Language) and XHTML.

.. class:: HTMLParser(*, convert_charrefs=True)

   Create a parser instance able to parse invalid markup.

   If *convert_charrefs* is ``True`` (the default), all character
   references (except the ones in ``script``/``style`` elements) are
   automatically converted to the corresponding Unicode characters.

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered.  The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   This parser does not check that end tags match start tags or call the end-tag
   handler for elements which are closed implicitly by closing an outer element.

   .. versionchanged:: 3.4
      *convert_charrefs* keyword argument added.

   .. versionchanged:: 3.5
      The default value for argument *convert_charrefs* is now ``True``.

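As a small illustration of the point about implicitly closed elements, the
sketch below records every start-tag and end-tag call for a list whose items
are never explicitly closed.  The ``TagLogger`` class and its ``events``
attribute are names invented for this example; they are not part of the module.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: records start and end tags in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = TagLogger()
# Neither <li> element has an explicit end tag in this markup.
logger.feed("<ul><li>one<li>two</ul>")
logger.close()
print(logger.events)
```

Only ``</ul>`` produces an end-tag event here.  A subclass that needs a
balanced tree therefore has to keep track of open elements itself.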

Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out start tags, end tags, and data
as they are encountered::

   from html.parser import HTMLParser

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Encountered a start tag:", tag)
       def handle_endtag(self, tag):
           print("Encountered an end tag :", tag)
       def handle_data(self, data):
           print("Encountered some data  :", data)

   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be::

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data  : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data  : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.  *data* must be :class:`str`.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.

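The interplay of :meth:`feed` and :meth:`close` can be sketched as follows.
The ``TextCollector`` class, its ``parts`` list and its ``finished`` flag are
hypothetical names used only for illustration.  How the text is split across
:meth:`handle_data` calls may vary, so the example only relies on the joined
result.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers text and notes when input has ended."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.finished = False

    def handle_data(self, data):
        self.parts.append(data)

    def close(self):
        # Always let the base class flush its buffer first.
        super().close()
        self.finished = True


collector = TextCollector()
# The markup arrives in arbitrary pieces, e.g. from a network stream.
for chunk in ["<p>Hel", "lo, wor", "ld</", "p>"]:
    collector.feed(chunk)
collector.close()
print("".join(collector.parts), collector.finished)
```

Calling ``super().close()`` before the extra bookkeeping ensures that any
buffered input has been processed by the time ``finished`` is set.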

.. method:: HTMLParser.reset()

   Reset the instance.  Loses all unprocessed data.  This is called implicitly at
   instantiation time.


.. method:: HTMLParser.getpos()

   Return current line number and offset.

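One possible use of :meth:`getpos` is to report where each tag starts.  In the
sketch below, ``PositionLogger`` and ``positions`` are illustrative names, not
part of the module.

```python
from html.parser import HTMLParser


class PositionLogger(HTMLParser):
    """Hypothetical helper: remembers where each start tag was found."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) pair.
        self.positions.append((tag, self.getpos()))


plog = PositionLogger()
plog.feed("<p>\n  <b>bold</b>\n</p>")
plog.close()
for tag, (line, offset) in plog.positions:
    print(tag, "starts at line", line, "offset", offset)
```

In this run the ``<p>`` tag is reported at line 1, offset 0 and the ``<b>`` tag
at line 2, offset 2.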

.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).

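The following sketch shows the difference between the normalized arguments
passed to a handler and the raw text returned by :meth:`get_starttag_text`.
``RawTagLogger`` and ``raw_tags`` are hypothetical names chosen for this
example.

```python
from html.parser import HTMLParser


class RawTagLogger(HTMLParser):
    """Hypothetical helper: pairs each tag name with its original source text."""

    def __init__(self):
        super().__init__()
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # 'tag' is lower-cased; the raw text keeps case and spacing.
        self.raw_tags.append((tag, self.get_starttag_text()))


raw = RawTagLogger()
raw.feed('<A   HREF="index.html"  Class=nav>home</A>')
raw.close()
print(raw.raw_tags)
```

The handler receives ``'a'`` as the tag name, while the raw text still reads
``<A   HREF="index.html"  Class=nav>``.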

The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass.  The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case.  The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  Each *name* is converted to lower case,
   quotes around the *value* are removed, and character and entity references
   in the *value* are replaced.

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.

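A minimal sketch of working with the *attrs* list is shown below.  The
``LinkCollector`` class and its ``links`` and ``valueless`` attributes are
invented for this example.  It also shows that an attribute written without a
value is reported with ``None`` as its value.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects link targets and valueless attributes."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.valueless = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict is handy for lookups
        # (if a name is repeated, the last pair wins in the dict).
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href") is not None:
            self.links.append(attr_map["href"])
        for name, value in attrs:
            if value is None:
                self.valueless.append((tag, name))


links = LinkCollector()
links.feed('<A HREF="/docs?x=1&amp;y=2">Docs</A>'
           '<input type="checkbox" CHECKED>')
links.close()
print(links.links)
print(links.valueless)
```

The ``&amp;`` in the ``href`` value arrives already decoded as ``&``, and the
attribute name ``CHECKED`` arrives as ``'checked'``.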

.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``).  This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.

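A subclass that wants to know which tags were written in the XHTML style can
override :meth:`handle_startendtag` and still delegate to the default
behavior.  ``EmptyTagLogger``, ``self_closing`` and ``started`` are
hypothetical names used only in this sketch.

```python
from html.parser import HTMLParser


class EmptyTagLogger(HTMLParser):
    """Hypothetical helper: notes which tags used the ``<tag />`` form."""

    def __init__(self):
        super().__init__()
        self.self_closing = []
        self.started = []

    def handle_starttag(self, tag, attrs):
        self.started.append(tag)

    def handle_startendtag(self, tag, attrs):
        self.self_closing.append(tag)
        # Keep the default behavior: call handle_starttag and handle_endtag.
        super().handle_startendtag(tag, attrs)


empty = EmptyTagLogger()
empty.feed('<p>a<br/>b</p><hr><img src="logo.png" />')
empty.close()
print(empty.self_closing)
print(empty.started)
```

The ``<hr>`` tag has no trailing slash, so it reaches
:meth:`handle_starttag` only and is not recorded as self-closing.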

.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).

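Because the content of ``script`` and ``style`` elements is also delivered to
:meth:`handle_data`, a text extractor usually has to filter it out itself.
The sketch below is one way to do that; ``VisibleText``, ``SKIPPED`` and
``chunks`` are names made up for this example.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Hypothetical helper: keeps text that is not inside script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore anything seen while inside a skipped element.
        if not self._skip_depth:
            self.chunks.append(data)


visible = VisibleText()
visible.feed('<style>p { color: red }</style>'
             '<p>Hello, <b>world</b>!</p>'
             '<script>alert("<i>hi</i>");</script>')
visible.close()
print("".join(visible.chunks))
```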

.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``).  This method is never called if *convert_charrefs* is
   ``True``.


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.  This method
   is never called if *convert_charrefs* is ``True``.

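The two reference handlers only come into play when the parser is created with
``convert_charrefs=False``.  The sketch below decodes references by hand under
that setting.  ``RefDecoder``, ``refs`` and ``text`` are hypothetical names,
and leaving unknown entity names untouched is a choice made for this example.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class RefDecoder(HTMLParser):
    """Hypothetical helper: decodes character references manually."""

    def __init__(self):
        # The reference handlers are never called when this is True.
        super().__init__(convert_charrefs=False)
        self.refs = []
        self.text = []

    def handle_data(self, data):
        self.text.append(data)

    def handle_entityref(self, name):
        self.refs.append(("entity", name))
        codepoint = name2codepoint.get(name)
        # Unknown names are kept as written (a choice made for this sketch).
        self.text.append(chr(codepoint) if codepoint is not None
                         else "&%s;" % name)

    def handle_charref(self, name):
        self.refs.append(("charref", name))
        if name[0] in "xX":
            self.text.append(chr(int(name[1:], 16)))
        else:
            self.text.append(chr(int(name)))


decoder = RefDecoder()
decoder.feed("a &lt; b &#38; c &#x3E; d")
decoder.close()
print(decoder.refs)
print("".join(decoder.text))
```

The handlers receive ``'lt'``, ``'38'`` and ``'x3E'``, and the reassembled text
is ``a < b & c > d``.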

.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   Method called when a processing instruction is encountered.  The *data*
   parameter will contain the entire processing instruction.  For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.  It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.

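The effect described in the note can be seen in the sketch below, which keeps
the raw *data* and also a cleaned-up form with the trailing ``'?'`` removed.
``PICollector``, ``raw`` and ``instructions`` are illustrative names only, and
splitting the data at the first space is a simplification made for this
example.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical helper: collects processing instructions."""

    def __init__(self):
        super().__init__()
        self.raw = []
        self.instructions = []

    def handle_pi(self, data):
        self.raw.append(data)
        # XHTML/XML-style instructions end with '?', which is part of data.
        if data.endswith("?"):
            data = data[:-1]
        target, _, content = data.partition(" ")
        self.instructions.append((target, content.strip()))


pis = PICollector()
pis.feed("<?proc color='red'>"
         '<?xml-stylesheet href="style.css"?>')
pis.close()
print(pis.raw)
print(pis.instructions)
```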

.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup.  Overriding it in a derived class is sometimes
   useful.  The base class implementation does nothing.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from html.parser import HTMLParser
   from html.entities import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Start tag:", tag)
           for attr in attrs:
               print("     attr:", attr)
       def handle_endtag(self, tag):
           print("End tag  :", tag)
       def handle_data(self, data):
           print("Data     :", data)
       def handle_comment(self, data):
           print("Comment  :", data)
       def handle_entityref(self, name):
           c = chr(name2codepoint[name])
           print("Named ent:", c)
       def handle_charref(self, name):
           if name.startswith('x'):
               c = chr(int(name[1:], 16))
           else:
               c = chr(int(name))
           print("Num ent  :", c)
       def handle_decl(self, data):
           print("Decl     :", data)

   parser = MyHTMLParser()

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
        attr: ('src', 'python-logo.png')
        attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data     : Python
   End tag  : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
        attr: ('type', 'text/css')
   Data     : #python { color: green }
   End tag  : style
   >>>
   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
        attr: ('type', 'text/javascript')
   Data     : alert("<strong>hello!</strong>");
   End tag  : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment  :  a comment
   Comment  : [if IE 9]>IE-specific content<![endif]
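
Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``).  The
reference handlers are only called when *convert_charrefs* is ``False``, so
the parser is recreated with that setting::

   >>> parser = MyHTMLParser(convert_charrefs=False)
   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent  : >
   Num ent  : >
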
Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once
(unless *convert_charrefs* is set to ``True``)::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data     : buff
   Data     : ered
   Data     : text
   End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
        attr: ('class', 'link')
        attr: ('href', '#main')
   Data     : tag soup
   End tag  : p
   End tag  : a