:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.

**Source code:** :source:`Lib/html/parser.py`

.. index::
   single: HTML
   single: XHTML

--------------

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

.. class:: HTMLParser(*, convert_charrefs=True)

   Create a parser instance able to parse invalid markup.

   If *convert_charrefs* is ``True`` (the default), all character
   references (except the ones in ``script``/``style`` elements) are
   automatically converted to the corresponding Unicode characters.

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered. The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   This parser does not check that end tags match start tags or call the end-tag
   handler for elements which are closed implicitly by closing an outer element.

   .. versionchanged:: 3.4
      *convert_charrefs* keyword argument added.

   .. versionchanged:: 3.5
      The default value for argument *convert_charrefs* is now ``True``.
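As a rough illustration of the *convert_charrefs* switch, the sketch below feeds the same markup to two parsers and records which text-related handlers fire. The ``EventRecorder`` class and the sample markup are invented for this example; only the handler names come from this module.

```python
from html.parser import HTMLParser

class EventRecorder(HTMLParser):
    """Hypothetical helper: remember which text-related handlers fire."""

    def __init__(self, *, convert_charrefs=True):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))

markup = "<p>a &amp; b &#62; c</p>"

converting = EventRecorder()                  # convert_charrefs=True (default)
converting.feed(markup)
converting.close()

raw = EventRecorder(convert_charrefs=False)
raw.feed(markup)
raw.close()

# With conversion on, references arrive already decoded inside the data.
converted_text = "".join(value for kind, value in converting.events
                         if kind == "data")
# With conversion off, the reference handlers are called instead.
reference_events = [event for event in raw.events if event[0] != "data"]

print(converted_text)
print(reference_events)
```

With the default setting the references never reach ``handle_entityref`` or ``handle_charref``, so a subclass that only cares about text can ignore those two handlers.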


Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out start tags, end tags, and data
as they are encountered::

   from html.parser import HTMLParser

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Encountered a start tag:", tag)

       def handle_endtag(self, tag):
           print("Encountered an end tag :", tag)

       def handle_data(self, data):
           print("Encountered some data  :", data)

   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be:

.. code-block:: none

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data  : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data  : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser. It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called. *data* must be :class:`str`.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark. This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.
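The sketch below shows one way ``feed()`` and ``close()`` might be used together: the input arrives in arbitrary pieces, and a redefined ``close()`` calls the base class method before doing its own end-of-input work. ``TextCollector``, its ``finished`` flag, and the sample pieces are assumptions made for this example.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Hypothetical helper: gather text and note when input has ended."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.finished = False

    def handle_data(self, data):
        # May be called several times for one run of text.
        self.chunks.append(data)

    def close(self):
        # Always let the base class flush its buffer first.
        super().close()
        self.finished = True

collector = TextCollector()
for piece in ["<p>Hel", "lo, wor", "ld</p", ">"]:
    collector.feed(piece)          # tags split across pieces are buffered
collector.close()

text = "".join(collector.chunks)
print(text)
```

Joining the collected chunks, rather than relying on a single ``handle_data()`` call, keeps the result independent of how the input happened to be split.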


.. method:: HTMLParser.reset()

   Reset the instance. Loses all unprocessed data. This is called implicitly at
   instantiation time.


.. method:: HTMLParser.getpos()

   Return current line number and offset.


.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag. This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).
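A small sketch of how ``getpos()`` and ``get_starttag_text()`` could be combined, for instance to report where a tag appeared and how it was originally written. The ``TagLocator`` class and the sample input are hypothetical.

```python
from html.parser import HTMLParser

class TagLocator(HTMLParser):
    """Hypothetical helper: note where each start tag was seen."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # getpos() reports the parser's current (line, offset) pair;
        # get_starttag_text() gives the tag as it was written in the input.
        self.seen.append((tag, self.getpos(), self.get_starttag_text()))

locator = TagLocator()
locator.feed('<html>\n  <BODY  class="x">')
locator.close()

for tag, (line, offset), source_text in locator.seen:
    print(tag, line, offset, repr(source_text))
```

Note that *tag* is lower-cased while the text returned by ``get_starttag_text()`` keeps the original case and spacing.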


The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass. The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets. The *name* is converted to lower case,
   quotes around the *value* are removed, and character and entity references
   are replaced.

   For instance, for the tag ``<A HREF="https://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'https://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.
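As an illustration of the *attrs* argument, the following sketch collects link targets from ``a`` elements. The ``LinkCollector`` class and the ``Title`` attribute in the sample markup are assumptions made for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical helper: collect (href, title) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; a dict is convenient,
        # although repeated attribute names collapse to the last one.
        attributes = dict(attrs)
        if "href" in attributes:
            self.links.append((attributes["href"], attributes.get("title")))

collector = LinkCollector()
collector.feed('<A HREF="https://www.cwi.nl/" Title="CWI &amp; friends">CWI</A>')
collector.close()
print(collector.links)
```

The attribute names arrive lower-cased and the ``&amp;`` in the title arrives already replaced, so the handler can compare against plain strings.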


.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``). This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
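The sketch below shows one possible override of ``handle_startendtag()`` that records XHTML-style empty tags and then defers to the default implementation, so the ordinary start and end handlers still run. ``EmptyTagTracker`` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class EmptyTagTracker(HTMLParser):
    """Hypothetical helper: track which tags were written as <tag ... />."""

    def __init__(self):
        super().__init__()
        self.empty_tags = []
        self.start_tags = []
        self.end_tags = []

    def handle_startendtag(self, tag, attrs):
        self.empty_tags.append(tag)
        # Keep the default behaviour: forward to the two ordinary handlers.
        super().handle_startendtag(tag, attrs)

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)

tracker = EmptyTagTracker()
tracker.feed('<p>first<br/>second<img src="logo.png" /></p>')
tracker.close()

print(tracker.empty_tags)
print(tracker.start_tags)
print(tracker.end_tags)
```

Leaving out the ``super()`` call would hide these tags from ``handle_starttag()`` and ``handle_endtag()``, which may or may not be what a subclass wants.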


.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``). This method is never called if *convert_charrefs* is
   ``True``.


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``. This method
   is never called if *convert_charrefs* is ``True``.


.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   Method called when a processing instruction is encountered. The *data*
   parameter will contain the entire processing instruction. For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``. It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions. An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.
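To make the note about the trailing ``'?'`` concrete, the sketch below records two processing instructions, one in SGML style and one in XHTML style. ``PIRecorder``, the sample instructions, and the clean-up step at the end are assumptions for illustration, not part of the module.

```python
from html.parser import HTMLParser

class PIRecorder(HTMLParser):
    """Hypothetical helper: keep the data of each processing instruction."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)

recorder = PIRecorder()
recorder.feed("<?proc color='red'><?xml-stylesheet href='site.css'?>")
recorder.close()

# Hypothetical clean-up: drop the trailing '?' left by XHTML-style input.
cleaned = [pi[:-1] if pi.endswith("?") else pi
           for pi in recorder.instructions]

print(recorder.instructions)
print(cleaned)
```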


.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup. Overriding this method in a derived class is
   sometimes useful. The base class implementation does nothing.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from html.parser import HTMLParser
   from html.entities import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Start tag:", tag)
           for attr in attrs:
               print("     attr:", attr)

       def handle_endtag(self, tag):
           print("End tag  :", tag)

       def handle_data(self, data):
           print("Data     :", data)

       def handle_comment(self, data):
           print("Comment  :", data)

       def handle_entityref(self, name):
           c = chr(name2codepoint[name])
           print("Named ent:", c)

       def handle_charref(self, name):
           if name.startswith('x'):
               c = chr(int(name[1:], 16))
           else:
               c = chr(int(name))
           print("Num ent  :", c)

       def handle_decl(self, data):
           print("Decl     :", data)

   parser = MyHTMLParser()

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
        attr: ('src', 'python-logo.png')
        attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data     : Python
   End tag  : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
        attr: ('type', 'text/css')
   Data     : #python { color: green }
   End tag  : style

   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
        attr: ('type', 'text/javascript')
   Data     : alert("<strong>hello!</strong>");
   End tag  : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment  :  a comment
   Comment  : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
corresponding character (note: these three references are all equivalent to
``'>'``). Because :meth:`~HTMLParser.handle_entityref` and
:meth:`~HTMLParser.handle_charref` are never called when *convert_charrefs* is
``True``, this example uses a parser created with ``convert_charrefs=False``::

   >>> parser = MyHTMLParser(convert_charrefs=False)
   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent  : >
   Num ent  : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once
(unless *convert_charrefs* is set to ``True``)::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data     : buff
   Data     : ered
   Data     : text
   End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
        attr: ('class', 'link')
        attr: ('href', '#main')
   Data     : tag soup
   End tag  : p
   End tag  : a
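Since the parser itself does not check that end tags match start tags, a subclass that cares about nesting has to track it. The sketch below keeps a simple stack of open tags and notes each end tag that does not close the innermost open element. ``NestingChecker`` is a hypothetical helper, and the approach is deliberately naive: elements that are never closed in HTML, such as ``br``, would stay on the stack.

```python
from html.parser import HTMLParser

class NestingChecker(HTMLParser):
    """Hypothetical helper: flag end tags that do not match the open tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []       # stack of currently open elements
        self.mismatches = []      # (end tag seen, tag that was expected)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            expected = self.open_tags[-1] if self.open_tags else None
            self.mismatches.append((tag, expected))

checker = NestingChecker()
checker.feed('<p><a class=link href=#main>tag soup</p ></a>')
checker.close()

print(checker.mismatches)
print(checker.open_tags)
```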