:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.


.. index::
   single: HTML
   single: XHTML

**Source code:** :source:`Lib/html/parser.py`

--------------

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

.. class:: HTMLParser(strict=False)

   Create a parser instance. If *strict* is ``False`` (the default), the parser
   will accept and parse invalid markup. If *strict* is ``True`` the parser
   will raise an :exc:`~html.parser.HTMLParseError` exception instead [#]_ when
   it's not able to parse the markup.
   The use of ``strict=True`` is discouraged and the *strict* argument is
   deprecated.

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered. The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   This parser does not check that end tags match start tags or call the end-tag
   handler for elements which are closed implicitly by closing an outer element.

   .. versionchanged:: 3.2 *strict* keyword added

   .. deprecated-removed:: 3.3 3.5
      The *strict* argument and the strict mode have been deprecated.
      The parser is now able to accept and parse invalid markup too.

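Because the parser does not verify nesting itself, a subclass has to keep
track of open elements when that check is needed. The following is a minimal
sketch and not part of the module: the class name ``NestingChecker``, its
attributes, and the abbreviated ``VOID_ELEMENTS`` set are hypothetical. The
constructor is called without *strict* so that the sketch also runs on
versions where that argument has been removed.

```python
from html.parser import HTMLParser

# Hypothetical, abbreviated set of elements that never take an end tag.
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class NestingChecker(HTMLParser):
    """Record end tags that do not match the innermost open element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # stack of currently open tag names
        self.mismatched = []   # end tags that arrived out of order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.mismatched.append(tag)


checker = NestingChecker()
checker.feed('<p><a class=link href=#main>tag soup</p></a>')
checker.close()
print(checker.mismatched)   # end tags seen while another element was innermost
print(checker.open_tags)    # elements that were never closed in order
```

For this input the sketch reports ``p`` as mismatched, because ``</p>``
arrives while ``a`` is still the innermost open element, and ``p`` stays on
the stack.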

An exception is defined as well:


.. exception:: HTMLParseError

   Exception raised by the :class:`HTMLParser` class when it encounters an error
   while parsing and *strict* is ``True``. This exception provides three
   attributes: :attr:`msg` is a brief message explaining the error,
   :attr:`lineno` is the number of the line on which the broken construct was
   detected, and :attr:`offset` is the number of characters into the line at
   which the construct starts.

   .. deprecated-removed:: 3.3 3.5
      This exception has been deprecated because it's never raised by the parser
      (when the default non-strict mode is used).


Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out start tags, end tags, and data
as they are encountered::

   from html.parser import HTMLParser

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Encountered a start tag:", tag)
       def handle_endtag(self, tag):
           print("Encountered an end tag :", tag)
       def handle_data(self, data):
           print("Encountered some data :", data)

   parser = MyHTMLParser(strict=False)
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be::

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser. It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called. *data* must be :class:`str`.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark. This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.

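A redefined :meth:`close` might look like the sketch below. The class
``TagCounter`` and its ``counts`` and ``summary`` attributes are hypothetical
names used only for illustration; the point is that the override calls the
base class method first, so buffered data is processed before the final step
runs.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags and publish the totals once the input ends."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self.summary = None   # filled in by close()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1

    def close(self):
        # Let the base class process whatever is still buffered first.
        super().close()
        self.summary = sorted(self.counts.items())


counter = TagCounter()
counter.feed('<ul><li>one</li><li>two</li></ul>')
counter.close()
print(counter.summary)
```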

.. method:: HTMLParser.reset()

   Reset the instance. Loses all unprocessed data. This is called implicitly at
   instantiation time.

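Since :meth:`reset` runs at instantiation time, a subclass can use it to
create its own per-document state, and the same instance can then be reused
for several documents. The sketch below assumes this pattern; the class
``TextCollector`` and its ``chunks`` attribute are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes; per-document state is created in reset()."""

    def reset(self):
        # Runs once from HTMLParser.__init__() and again on explicit resets.
        super().reset()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


collector = TextCollector()
collector.feed('<p>first document</p>')
collector.close()
first = ''.join(collector.chunks)

collector.reset()   # discard all state before reusing the instance
collector.feed('<p>second document</p>')
collector.close()
second = ''.join(collector.chunks)
print(first, '|', second)
```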

.. method:: HTMLParser.getpos()

   Return current line number and offset.

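One possible use of :meth:`getpos` is to record where each start tag begins.
In the sketch below, ``TagLocator`` and ``positions`` are hypothetical names.
The result is unpacked as a ``(lineno, offset)`` pair; in this sketch the
first line is reported as 1 and the first column as 0.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Remember the position at which each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # Inside the handler, getpos() still points at the tag's "<".
        lineno, offset = self.getpos()
        self.positions.append((tag, lineno, offset))


locator = TagLocator()
locator.feed('<p>\n  <b>bold</b>\n</p>')
locator.close()
for tag, lineno, offset in locator.positions:
    print(tag, 'starts at line', lineno, 'offset', offset)
```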

.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag. This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).

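The difference between the normalized tag name and the original source text
can be seen with a small recorder. This is only a sketch: ``RawTagRecorder``
and ``seen`` are hypothetical names, and the input deliberately uses odd
casing and spacing.

```python
from html.parser import HTMLParser


class RawTagRecorder(HTMLParser):
    """Keep each parsed tag name next to its original source text."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, self.get_starttag_text()))


recorder = RawTagRecorder()
recorder.feed('<A   HREF="http://www.cwi.nl/"  >CWI</A>')
recorder.close()
for tag, raw in recorder.seen:
    print(tag, '<-', raw)
```

The handler receives the lower-cased name ``a``, while
:meth:`get_starttag_text` returns the tag as it was written, including the
upper-case letters and the extra whitespace.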

The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass. The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets. The *name* will be translated to lower case,
   and quotes in the *value* have been removed, and character and entity references
   have been replaced.

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.

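A common use of :meth:`handle_starttag` is collecting attribute values. The
sketch below gathers ``href`` values from ``a`` elements; ``LinkCollector``
and ``links`` are hypothetical names. The second link shows that ``&amp;`` in
an attribute value has already been replaced when the handler is called.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value is None for an attribute written without a value.
            if name == 'href' and value is not None:
                self.links.append(value)


link_parser = LinkCollector()
link_parser.feed('<A HREF="http://www.cwi.nl/">CWI</A> '
                 '<a href="/search?q=html&amp;lang=en">search</a>')
link_parser.close()
print(link_parser.links)
```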

.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``). This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.

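The sketch below illustrates the effect of overriding
:meth:`handle_startendtag`: an XHTML-style empty tag is then reported as a
single event instead of a start/end pair. ``EmptyTagRecorder`` and ``events``
are hypothetical names.

```python
from html.parser import HTMLParser


class EmptyTagRecorder(HTMLParser):
    """Record start, end, and XHTML-style empty tags as separate events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_startendtag(self, tag, attrs):
        # Replaces the default behavior of calling the two handlers above.
        self.events.append(('startend', tag))


empty_tags = EmptyTagRecorder()
empty_tags.feed('<p>line one<br />line two</p>')
empty_tags.close()
print(empty_tags.events)
```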

.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).

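Because the content of ``script`` and ``style`` elements also arrives through
:meth:`handle_data`, a subclass that wants only visible text has to filter it
out itself. The following is one possible sketch; ``VisibleText``,
``SKIPPED``, and ``parts`` are hypothetical names.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring the content of script and style."""

    SKIPPED = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0   # greater than zero while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


extractor = VisibleText()
extractor.feed('<style>#python { color: green }</style>'
               '<h1>Python</h1>'
               '<script>alert("<strong>hello!</strong>");</script>'
               '<p>Parse me!</p>')
extractor.close()
print(' '.join(extractor.parts))
```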

.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``).


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.

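The two reference handlers can be combined to decode references into
characters. The sketch below rests on an assumption that is not stated in
this section: it targets Python 3.4 or later, where the constructor accepts a
``convert_charrefs`` argument, and passes ``convert_charrefs=False`` so that
these handlers are called for text content. ``RefDecoder`` and ``chars`` are
hypothetical names, and unknown entity names are kept as written instead of
raising :exc:`KeyError`.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class RefDecoder(HTMLParser):
    """Turn named and numeric character references into characters."""

    def __init__(self):
        # Assumption: Python 3.4+, where convert_charrefs exists. When it is
        # true, the parser converts references itself and skips the handlers.
        super().__init__(convert_charrefs=False)
        self.chars = []

    def handle_entityref(self, name):
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            self.chars.append('&%s;' % name)   # unknown name: keep as written
        else:
            self.chars.append(chr(codepoint))

    def handle_charref(self, name):
        if name.startswith(('x', 'X')):
            self.chars.append(chr(int(name[1:], 16)))   # hexadecimal form
        else:
            self.chars.append(chr(int(name)))           # decimal form


decoder = RefDecoder()
decoder.feed('&gt;&#62;&#x3E;&bogus;')
decoder.close()
print(decoder.chars)
```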

.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   Method called when a processing instruction is encountered. The *data*
   parameter will contain the entire processing instruction. For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``. It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions. An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.

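The effect described in the note can be observed with a small recorder. In
this sketch, ``PIRecorder`` and ``instructions`` are hypothetical names, and
the second processing instruction is an invented XHTML-style example.

```python
from html.parser import HTMLParser


class PIRecorder(HTMLParser):
    """Record the data of every processing instruction."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


pi_parser = PIRecorder()
pi_parser.feed("<?proc color='red'>")               # SGML style
pi_parser.feed('<?xml-stylesheet href="a.css"?>')   # XHTML style, trailing "?"
pi_parser.close()
print(pi_parser.instructions)
```

A subclass that handles XHTML input may therefore want to strip a trailing
``'?'`` from *data* before using it.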

.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup. Overriding this method in a derived class is
   sometimes useful. The base class implementation raises an
   :exc:`HTMLParseError` when *strict* is ``True``.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from html.parser import HTMLParser
   from html.entities import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Start tag:", tag)
           for attr in attrs:
               print(" attr:", attr)
       def handle_endtag(self, tag):
           print("End tag :", tag)
       def handle_data(self, data):
           print("Data :", data)
       def handle_comment(self, data):
           print("Comment :", data)
       def handle_entityref(self, name):
           c = chr(name2codepoint[name])
           print("Named ent:", c)
       def handle_charref(self, name):
           if name.startswith('x'):
               c = chr(int(name[1:], 16))
           else:
               c = chr(int(name))
           print("Num ent :", c)
       def handle_decl(self, data):
           print("Decl :", data)

   parser = MyHTMLParser(strict=False)

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
    attr: ('src', 'python-logo.png')
    attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data : Python
   End tag : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
    attr: ('type', 'text/css')
   Data : #python { color: green }
   End tag : style
   >>>
   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
    attr: ('type', 'text/javascript')
   Data : alert("<strong>hello!</strong>");
   End tag : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment : a comment
   Comment : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``)::

   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent : >
   Num ent : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data : buff
   Data : ered
   Data : text
   End tag : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
    attr: ('class', 'link')
    attr: ('href', '#main')
   Data : tag soup
   End tag : p
   End tag : a

.. rubric:: Footnotes

.. [#] For backward compatibility reasons *strict* mode does not raise
   exceptions for all non-compliant HTML. That is, some invalid HTML
   is tolerated even in *strict* mode.
R. David Murrayb579dba2010-12-03 04:06:39 +0000352 is tolerated even in *strict* mode.