:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.

**Source code:** :source:`Lib/html/parser.py`

.. index::
   single: HTML
   single: XHTML

--------------

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

.. class:: HTMLParser(*, convert_charrefs=True)

   Create a parser instance able to parse invalid markup.

   If *convert_charrefs* is ``True`` (the default), all character
   references (except the ones in ``script``/``style`` elements) are
   automatically converted to the corresponding Unicode characters.

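   The difference between the two settings can be sketched as follows.  This is
   only an illustration: ``EventRecorder`` and its ``events`` list are
   hypothetical names introduced here and are not part of the module::

      from html.parser import HTMLParser

      class EventRecorder(HTMLParser):
          """Hypothetical helper that records text and reference events."""

          def __init__(self, **kwargs):
              super().__init__(**kwargs)
              self.events = []

          def handle_data(self, data):
              self.events.append(('data', data))

          def handle_entityref(self, name):
              self.events.append(('entityref', name))

          def handle_charref(self, name):
              self.events.append(('charref', name))

      # Default: references arrive already converted, as part of the text.
      converting = EventRecorder()
      converting.feed('<p>&gt; &#62;</p>')
      print(converting.events)

      # convert_charrefs=False: each reference goes to its own handler.
      raw = EventRecorder(convert_charrefs=False)
      raw.feed('<p>&gt; &#62;</p>')
      print(raw.events)
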
   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered.  The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   This parser does not check that end tags match start tags or call the end-tag
   handler for elements which are closed implicitly by closing an outer element.

   .. versionchanged:: 3.4
      *convert_charrefs* keyword argument added.

   .. versionchanged:: 3.5
      The default value for argument *convert_charrefs* is now ``True``.


Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out start tags, end tags, and data
as they are encountered::

   from html.parser import HTMLParser

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Encountered a start tag:", tag)

       def handle_endtag(self, tag):
           print("Encountered an end tag :", tag)

       def handle_data(self, data):
           print("Encountered some data  :", data)

   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be::

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data  : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data  : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.  *data* must be :class:`str`.


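   As a rough illustration of the buffering, the sketch below splits a start
   tag across two calls.  ``TagLogger`` and ``events`` are hypothetical names
   used only for this example::

      from html.parser import HTMLParser

      class TagLogger(HTMLParser):
          """Hypothetical helper that records events in order."""

          def __init__(self):
              super().__init__()
              self.events = []

          def handle_starttag(self, tag, attrs):
              self.events.append(('start', tag))

          def handle_endtag(self, tag):
              self.events.append(('end', tag))

          def handle_data(self, data):
              self.events.append(('data', data))

      tag_logger = TagLogger()
      tag_logger.feed('<p>Hello <b')      # '<b' is incomplete and stays buffered
      seen_after_first_feed = list(tag_logger.events)
      tag_logger.feed('>world</b></p>')   # completes the tag; parsing resumes
      tag_logger.close()

   After the first call only the ``p`` start tag and the text before ``<b``
   have been reported; the remaining events follow once the tag is complete.

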
.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.


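   A minimal sketch of such an override, assuming a hypothetical subclass
   ``TagCounter`` that computes a summary once the input is exhausted::

      from collections import Counter
      from html.parser import HTMLParser

      class TagCounter(HTMLParser):
          """Hypothetical subclass that summarizes tags at end of input."""

          def __init__(self):
              super().__init__()
              self.counts = Counter()
              self.most_common_tag = None

          def handle_starttag(self, tag, attrs):
              self.counts[tag] += 1

          def close(self):
              # Let the base class process whatever is still buffered first.
              super().close()
              # Extra end-of-input processing added by this subclass.
              if self.counts:
                  self.most_common_tag = self.counts.most_common(1)[0][0]

      counter = TagCounter()
      counter.feed('<ul><li>one</li><li>two</li></ul>')
      counter.close()
      print(counter.most_common_tag)

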
.. method:: HTMLParser.reset()

   Reset the instance.  Loses all unprocessed data.  This is called implicitly at
   instantiation time.


.. method:: HTMLParser.getpos()

   Return current line number and offset.


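   For illustration, a handler can call :meth:`getpos` to record where each
   start tag begins.  ``PositionLogger`` and ``positions`` are hypothetical
   names; this is a sketch, not a prescribed usage pattern::

      from html.parser import HTMLParser

      class PositionLogger(HTMLParser):
          """Hypothetical helper that remembers where start tags occur."""

          def __init__(self):
              super().__init__()
              self.positions = []

          def handle_starttag(self, tag, attrs):
              # getpos() returns a (line number, offset) pair.
              self.positions.append((tag, self.getpos()))

      position_logger = PositionLogger()
      position_logger.feed('<p>\n  <b>bold</b></p>')
      position_logger.close()
      print(position_logger.positions)

   In this sketch the first tag is reported at line 1, offset 0, and the ``b``
   tag at line 2, offset 2.

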
.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).


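   The sketch below contrasts the normalized *tag* argument with the raw text
   of the tag.  ``RawTagLogger`` and ``raw_tags`` are hypothetical names
   introduced for this example::

      from html.parser import HTMLParser

      class RawTagLogger(HTMLParser):
          """Hypothetical helper that keeps the original start tag text."""

          def __init__(self):
              super().__init__()
              self.raw_tags = []

          def handle_starttag(self, tag, attrs):
              # *tag* is lower-cased; the raw text keeps case and spacing.
              self.raw_tags.append((tag, self.get_starttag_text()))

      raw_logger = RawTagLogger()
      raw_logger.feed('<A  HREF="index.html">home</A>')
      raw_logger.close()
      print(raw_logger.raw_tags)

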
The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass.  The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  Each *name* is converted to lower case;
   in each *value*, the surrounding quotes are removed and character and entity
   references are replaced.

   For instance, for the tag ``<A HREF="https://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'https://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.


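   As one possible use of *attrs*, the sketch below collects link targets.
   ``LinkCollector`` and ``links`` are hypothetical names, and converting
   *attrs* to a :class:`dict` assumes that attribute order and duplicates do
   not matter::

      from html.parser import HTMLParser

      class LinkCollector(HTMLParser):
          """Hypothetical helper that gathers href values of <a> tags."""

          def __init__(self):
              super().__init__()
              self.links = []

          def handle_starttag(self, tag, attrs):
              if tag != 'a':
                  return
              # attrs is a list of (name, value) pairs with lower-case names.
              href = dict(attrs).get('href')
              if href is not None:
                  self.links.append(href)

      collector = LinkCollector()
      collector.feed('<a href="/docs">Docs</a>'
                     '<a name="top">Top</a>'
                     '<A HREF="/search?q=html&amp;page=2">Next</A>')
      collector.close()
      print(collector.links)

   The ``&amp;`` in the last attribute value reaches the handler as ``&``.

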
.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``).  This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.


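   A sketch of an override that notes the empty-tag syntax and then defers to
   the default implementation.  ``EmptyTagLogger`` and ``events`` are
   hypothetical names used only here::

      from html.parser import HTMLParser

      class EmptyTagLogger(HTMLParser):
          """Hypothetical helper that distinguishes <br/> from <br>."""

          def __init__(self):
              super().__init__()
              self.events = []

          def handle_starttag(self, tag, attrs):
              self.events.append(('start', tag))

          def handle_endtag(self, tag):
              self.events.append(('end', tag))

          def handle_startendtag(self, tag, attrs):
              self.events.append(('empty', tag))
              # Default behaviour: call the start handler, then the end handler.
              super().handle_startendtag(tag, attrs)

      empty_logger = EmptyTagLogger()
      empty_logger.feed('<br/><hr>')
      empty_logger.close()
      print(empty_logger.events)

   Only ``<br/>`` goes through :meth:`handle_startendtag`; the plain ``<hr>``
   is reported as an ordinary start tag.

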
.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``).  This method is never called if *convert_charrefs* is
   ``True``.


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.  This method
   is never called if *convert_charrefs* is ``True``.


.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   Method called when a processing instruction is encountered.  The *data*
   parameter will contain the entire processing instruction. For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.  It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.


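   The sketch below shows both forms side by side.  ``PILogger`` and
   ``instructions`` are hypothetical names introduced for illustration::

      from html.parser import HTMLParser

      class PILogger(HTMLParser):
          """Hypothetical helper that collects processing instructions."""

          def __init__(self):
              super().__init__()
              self.instructions = []

          def handle_pi(self, data):
              self.instructions.append(data)

      pi_logger = PILogger()
      pi_logger.feed("<?proc color='red'><?xml version='1.0'?>")
      pi_logger.close()
      print(pi_logger.instructions)

   The second, XHTML-style instruction keeps its trailing ``'?'`` in *data*,
   so a subclass that cares about it has to strip it itself.

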
.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup.  Overriding this method in a derived class is
   sometimes useful.  The base class implementation does nothing.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from html.parser import HTMLParser
   from html.entities import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Start tag:", tag)
           for attr in attrs:
               print("     attr:", attr)

       def handle_endtag(self, tag):
           print("End tag  :", tag)

       def handle_data(self, data):
           print("Data     :", data)

       def handle_comment(self, data):
           print("Comment  :", data)

       def handle_entityref(self, name):
           c = chr(name2codepoint[name])
           print("Named ent:", c)

       def handle_charref(self, name):
           if name.startswith('x'):
               c = chr(int(name[1:], 16))
           else:
               c = chr(int(name))
           print("Num ent  :", c)

       def handle_decl(self, data):
           print("Decl     :", data)

   parser = MyHTMLParser()

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
        attr: ('src', 'python-logo.png')
        attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data     : Python
   End tag  : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
        attr: ('type', 'text/css')
   Data     : #python { color: green }
   End tag  : style

   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
        attr: ('type', 'text/javascript')
   Data     : alert("<strong>hello!</strong>");
   End tag  : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment  :  a comment
   Comment  : [if IE 9]>IE-specific content<![endif]

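Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``).
Because :meth:`~HTMLParser.handle_entityref` and
:meth:`~HTMLParser.handle_charref` are never called when *convert_charrefs*
is ``True``, this example uses a parser created with
``convert_charrefs=False``::

   >>> charref_parser = MyHTMLParser(convert_charrefs=False)
   >>> charref_parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent  : >
   Num ent  : >
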
Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once
(unless *convert_charrefs* is set to ``True``)::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data     : buff
   Data     : ered
   Data     : text
   End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
        attr: ('class', 'link')
        attr: ('href', '#main')
   Data     : tag soup
   End tag  : p
   End tag  : a