:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.


.. index::
   single: HTML
   single: XHTML

**Source code:** :source:`Lib/html/parser.py`

--------------

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

.. class:: HTMLParser(strict=False, *, convert_charrefs=False)

   Create a parser instance.

   If *convert_charrefs* is ``True`` (default: ``False``), all character
   references (except the ones in ``script``/``style`` elements) are
   automatically converted to the corresponding Unicode characters.
   The use of ``convert_charrefs=True`` is encouraged and will become
   the default in Python 3.5.

   If *strict* is ``False`` (the default), the parser will accept and parse
   invalid markup. If *strict* is ``True`` the parser will raise an
   :exc:`~html.parser.HTMLParseError` exception instead [#]_ when it's not
   able to parse the markup. The use of ``strict=True`` is discouraged and
   the *strict* argument is deprecated.

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered. The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   This parser does not check that end tags match start tags or call the end-tag
   handler for elements which are closed implicitly by closing an outer element.

   .. versionchanged:: 3.2
      *strict* argument added.

   .. deprecated-removed:: 3.3 3.5
      The *strict* argument and the strict mode have been deprecated.
      The parser is now able to accept and parse invalid markup too.

   .. versionchanged:: 3.4
      *convert_charrefs* keyword argument added.

An exception is defined as well:


.. exception:: HTMLParseError

   Exception raised by the :class:`HTMLParser` class when it encounters an error
   while parsing and *strict* is ``True``. This exception provides three
   attributes: :attr:`msg` is a brief message explaining the error,
   :attr:`lineno` is the number of the line on which the broken construct was
   detected, and :attr:`offset` is the number of characters into the line at
   which the construct starts.

   .. deprecated-removed:: 3.3 3.5
      This exception has been deprecated because it's never raised by the parser
      (when the default non-strict mode is used).


Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out start tags, end tags, and data
as they are encountered::

   from html.parser import HTMLParser

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Encountered a start tag:", tag)
       def handle_endtag(self, tag):
           print("Encountered an end tag :", tag)
       def handle_data(self, data):
           print("Encountered some data :", data)

   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be::

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html

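The class description states that the parser does not call the end-tag handler
for elements that are closed implicitly. The following is a minimal sketch of
that behavior. ``EventRecorder`` is a hypothetical helper class, not part of
the module; it records handler calls in a list instead of printing them.

```python
from html.parser import HTMLParser

class EventRecorder(HTMLParser):
    """Hypothetical helper: record handler calls as (kind, value) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

recorder = EventRecorder()
# Neither <li> has an explicit end tag; both are closed implicitly by </ul>.
recorder.feed('<ul><li>one<li>two</ul>')
recorder.close()

for event in recorder.events:
    print(event)
```

No ``("end", "li")`` event is recorded, so code that needs a balanced element
tree has to track open elements itself.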

:class:`.HTMLParser` Methods
----------------------------

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser. It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called. *data* must be :class:`str`.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark. This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.

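As a rough sketch of the advice above, a derived class can redefine
:meth:`close` and still call the base class method first. ``TextCollector``
and its ``chunks`` and ``text`` attributes are hypothetical names chosen for
this illustration.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Hypothetical helper: gather text chunks and join them on close()."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.text = None  # set once close() has run

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Let the base class process any buffered input first ...
        super().close()
        # ... then do the extra end-of-input processing.
        self.text = "".join(self.chunks)

collector = TextCollector()
collector.feed('<p>Hello, ')
collector.feed('world</p>')
collector.close()
print(collector.text)
```

Joining the chunks only in :meth:`close` means the result does not depend on
how the input was split across :meth:`feed` calls.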

.. method:: HTMLParser.reset()

   Reset the instance. Loses all unprocessed data. This is called implicitly at
   instantiation time.


.. method:: HTMLParser.getpos()

   Return the current line number and offset.

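One possible use of :meth:`getpos` is to note where each start tag was found.
The sketch below assumes the position is read from inside a handler;
``TagLocator`` and its ``positions`` attribute are hypothetical names.

```python
from html.parser import HTMLParser

class TagLocator(HTMLParser):
    """Hypothetical helper: remember the position reported for each start tag."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) pair
        self.positions.append((tag, self.getpos()))

locator = TagLocator()
locator.feed('<html>\n  <body>\n</body></html>')
locator.close()

for tag, (lineno, offset) in locator.positions:
    print(tag, lineno, offset)
```

With this input the sketch records ``html`` at line 1, offset 0 and ``body``
at line 2, offset 2, which suggests that lines are counted from 1 and offsets
from 0.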

.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag. This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).

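A small sketch of how :meth:`get_starttag_text` preserves the original
spelling of a tag, including letter case and whitespace between attributes.
``RawTagEcho`` and ``raw_tags`` are hypothetical names used only here.

```python
from html.parser import HTMLParser

class RawTagEcho(HTMLParser):
    """Hypothetical helper: pair each normalized tag name with its raw text."""

    def __init__(self):
        super().__init__()
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # tag is lower-cased; get_starttag_text() is the text as written
        self.raw_tags.append((tag, self.get_starttag_text()))

echo = RawTagEcho()
echo.feed('<A   HREF="http://www.cwi.nl/">CWI</A>')
echo.close()

print(echo.raw_tags)
```

The handler receives ``'a'`` as the tag name, while the raw text keeps the
upper-case ``A`` and the three spaces before ``HREF``.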

The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass. The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets. Each *name* is converted to lower case; in
   each *value*, the enclosing quotes have been removed and character and entity
   references have been replaced.

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.

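The sketch below shows one way to work with the *attrs* list: it collects the
``href`` values of ``a`` elements. ``LinkCollector`` and ``links`` are
hypothetical names, and the second URL is made up to show an entity reference
being replaced in an attribute value.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical helper: collect the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs; dict() eases lookup
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)

collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A> '
               '<a href="/search?a=1&amp;b=2">search</a>')
collector.close()

print(collector.links)
```

Because tag and attribute names arrive in lower case, the comparison with
``'a'`` and the lookup of ``'href'`` also match the upper-case markup.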

.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``). This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.

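To illustrate the difference between the default implementation and an
override, the following sketch parses the same markup with two hypothetical
recorder classes. ``TagRecorder`` keeps the default
:meth:`handle_startendtag`; ``EmptyTagRecorder`` overrides it.

```python
from html.parser import HTMLParser

class TagRecorder(HTMLParser):
    """Hypothetical helper relying on the default handle_startendtag()."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

class EmptyTagRecorder(TagRecorder):
    """Same recorder, but XHTML-style empty tags become a single event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))

markup = '<p>one<br/>two</p>'

default = TagRecorder()
default.feed(markup)
default.close()

custom = EmptyTagRecorder()
custom.feed(markup)
custom.close()

print(default.events)
print(custom.events)
```

The default recorder sees ``<br/>`` as a start tag followed by an end tag,
while the overriding class receives it as one ``startend`` event.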

.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``). This method is never called if *convert_charrefs* is
   ``True``.


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``. This method
   is never called if *convert_charrefs* is ``True``.

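The effect of *convert_charrefs* on these two handlers can be sketched by
parsing the same input twice. ``RefRecorder`` is a hypothetical class; it
passes *convert_charrefs* explicitly so that the result does not depend on the
default of the Python version in use.

```python
from html.parser import HTMLParser

class RefRecorder(HTMLParser):
    """Hypothetical helper: record data and character-reference callbacks."""

    def __init__(self, *, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))

# References are reported one by one through the dedicated handlers.
manual = RefRecorder(convert_charrefs=False)
manual.feed('&gt;&#62;&#x3E;')
manual.close()

# References are converted and delivered through handle_data() only.
automatic = RefRecorder(convert_charrefs=True)
automatic.feed('&gt;&#62;&#x3E;')
automatic.close()
text = "".join(value for kind, value in automatic.events if kind == "data")

print(manual.events)
print(repr(text))
```

With ``convert_charrefs=False`` the handlers receive ``'gt'``, ``'62'`` and
``'x3E'``; with ``convert_charrefs=True`` only data events occur and the
joined text is ``'>>>'``.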

.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   Method called when a processing instruction is encountered. The *data*
   parameter will contain the entire processing instruction. For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``. It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions. An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.

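The note above can be checked with a short sketch that records what
:meth:`handle_pi` receives for an SGML-style and an XHTML-style processing
instruction. ``PIRecorder`` and ``instructions`` are hypothetical names, and
the ``<?xml ...?>`` input is just an illustrative example.

```python
from html.parser import HTMLParser

class PIRecorder(HTMLParser):
    """Hypothetical helper: collect the data of processing instructions."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)

recorder = PIRecorder()
recorder.feed("<?proc color='red'>")      # SGML style, ends with '>'
recorder.feed('<?xml version="1.0"?>')    # XHTML style, ends with '?>'
recorder.close()

for data in recorder.instructions:
    print(repr(data))
```

A handler that expects XHTML input may want to strip a trailing ``'?'`` from
*data* before using it.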

.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup. Overriding this method in a derived class is
   sometimes useful. The base class implementation raises an
   :exc:`HTMLParseError` when *strict* is ``True``.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from html.parser import HTMLParser
   from html.entities import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Start tag:", tag)
           for attr in attrs:
               print(" attr:", attr)
       def handle_endtag(self, tag):
           print("End tag :", tag)
       def handle_data(self, data):
           print("Data :", data)
       def handle_comment(self, data):
           print("Comment :", data)
       def handle_entityref(self, name):
           c = chr(name2codepoint[name])
           print("Named ent:", c)
       def handle_charref(self, name):
           if name.startswith('x'):
               c = chr(int(name[1:], 16))
           else:
               c = chr(int(name))
           print("Num ent :", c)
       def handle_decl(self, data):
           print("Decl :", data)

   parser = MyHTMLParser()

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
    attr: ('src', 'python-logo.png')
    attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data : Python
   End tag : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
    attr: ('type', 'text/css')
   Data : #python { color: green }
   End tag : style
   >>>
   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
    attr: ('type', 'text/javascript')
   Data : alert("<strong>hello!</strong>");
   End tag : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment : a comment
   Comment : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``)::

   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent : >
   Num ent : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once
(unless *convert_charrefs* is set to ``True``)::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data : buff
   Data : ered
   Data : text
   End tag : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
    attr: ('class', 'link')
    attr: ('href', '#main')
   Data : tag soup
   End tag : p
   End tag : a

.. rubric:: Footnotes

.. [#] For backward compatibility reasons *strict* mode does not raise
       exceptions for all non-compliant HTML. That is, some invalid HTML
       is tolerated even in *strict* mode.