:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.


.. index::
   single: HTML
   single: XHTML

**Source code:** :source:`Lib/html/parser.py`

--------------

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

.. class:: HTMLParser(strict=False)

   Create a parser instance.  If *strict* is ``False`` (the default), the parser
   will accept and parse invalid markup.  If *strict* is ``True`` the parser
   will raise an :exc:`~html.parser.HTMLParseError` exception instead [#]_ when
   it's not able to parse the markup.
   The use of ``strict=True`` is discouraged and the *strict* argument is
   deprecated.

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered.  The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   This parser does not check that end tags match start tags, nor does it call
   the end-tag handler for elements which are closed implicitly by closing an
   outer element.

   .. versionchanged:: 3.2
      *strict* keyword added.

   .. deprecated-removed:: 3.3 3.5
      The *strict* argument and the strict mode have been deprecated.
      The parser is now able to accept and parse invalid markup too.

An exception is defined as well:


.. exception:: HTMLParseError

   Exception raised by the :class:`HTMLParser` class when it encounters an error
   while parsing and *strict* is ``True``.  This exception provides three
   attributes: :attr:`msg` is a brief message explaining the error,
   :attr:`lineno` is the number of the line on which the broken construct was
   detected, and :attr:`offset` is the number of characters into the line at
   which the construct starts.

   .. deprecated-removed:: 3.3 3.5
      This exception has been deprecated because it's never raised by the parser
      (when the default non-strict mode is used).


Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out start tags, end tags, and data
as they are encountered::

   from html.parser import HTMLParser

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Encountered a start tag:", tag)
       def handle_endtag(self, tag):
           print("Encountered an end tag :", tag)
       def handle_data(self, data):
           print("Encountered some data  :", data)

   # The deprecated *strict* argument is omitted; non-strict is the default.
   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be::

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data  : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data  : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.  *data* must be :class:`str`.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.

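As a minimal sketch of how :meth:`~HTMLParser.feed` and
:meth:`~HTMLParser.close` fit together, the example below feeds a document in
several pieces and finishes its work in an overridden :meth:`close` that calls
the base class method first. The ``TextCollector`` class and its ``chunks`` and
``text`` attributes are hypothetical names chosen for this illustration; they
are not part of the module.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Hypothetical helper that gathers all text data it is fed."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # pieces of text seen so far
        self.text = None   # joined text, available after close()

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Always let the base class flush its buffer first.
        super().close()
        self.text = "".join(self.chunks)

collector = TextCollector()
# The input may be split at arbitrary points, even inside a tag.
for piece in ["<p>Hello, ", "wor", "ld</", "p>"]:
    collector.feed(piece)
collector.close()
print(collector.text)
```

Because text may reach :meth:`~HTMLParser.handle_data` in several calls, the
sketch joins the pieces only once :meth:`close` has been called.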

.. method:: HTMLParser.reset()

   Reset the instance.  Loses all unprocessed data.  This is called implicitly at
   instantiation time.


.. method:: HTMLParser.getpos()

   Return the current line number and offset as a ``(lineno, offset)`` tuple.

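The following sketch shows one way :meth:`~HTMLParser.getpos` might be used:
calling it from inside a handler to record where each start tag begins. The
``TagLocator`` class and its ``positions`` attribute are hypothetical names
used only for this illustration.

```python
from html.parser import HTMLParser

class TagLocator(HTMLParser):
    """Hypothetical helper that remembers where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) tuple.
        self.positions.append((tag, self.getpos()))

locator = TagLocator()
locator.feed('<html>\n  <body>')
for tag, (lineno, offset) in locator.positions:
    print(tag, "starts on line", lineno, "at offset", offset)
```

In this sketch line numbers count from 1 and offsets from 0, so the ``body``
tag, which sits after two spaces on the second line, is reported at ``(2, 2)``.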

.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).

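To illustrate the difference between the normalized arguments passed to a
handler and the raw text returned by :meth:`~HTMLParser.get_starttag_text`, here
is a small sketch. ``RawTagRecorder`` and ``raw_tags`` are hypothetical names
introduced for the example.

```python
from html.parser import HTMLParser

class RawTagRecorder(HTMLParser):
    """Hypothetical helper that keeps the original spelling of start tags."""

    def __init__(self):
        super().__init__()
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # `tag` is lower-cased; get_starttag_text() preserves the
        # original case and whitespace of the tag as it was written.
        self.raw_tags.append((tag, self.get_starttag_text()))

recorder = RawTagRecorder()
recorder.feed('<A   HREF="index.html">home</A>')
print(recorder.raw_tags)
```

The recorded pair holds the lower-cased name ``'a'`` next to the untouched
source text ``'<A   HREF="index.html">'``.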

The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass.  The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case.  The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  Each *name* is converted to lower case;
   quotes around the *value* are removed, and character and entity references
   in it are replaced.

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.

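A common use of :meth:`~HTMLParser.handle_starttag` is to pick out particular
attributes. The sketch below collects the ``href`` values of ``a`` elements;
``LinkCollector`` and ``links`` are hypothetical names, and turning *attrs* into
a dictionary is merely one convenient option (it keeps only the last value if an
attribute is repeated).

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical helper that gathers the targets of <a href=...> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs with lower-cased names.
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)

link_parser = LinkCollector()
link_parser.feed('<A HREF="http://www.cwi.nl/">CWI</A>'
                 '<a href="/search?q=1&amp;page=2">next</a>'
                 '<a name="top">anchor</a>')
print(link_parser.links)
```

Note that the ``&amp;`` reference in the second link arrives already replaced
by ``&``, and that the third ``a`` element is skipped because it has no
``href`` attribute.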

.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``).  This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.

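The sketch below shows one possible way to note XHTML-style empty tags while
keeping the default behavior: the override records the tag and then delegates
to the base class, which in turn calls :meth:`~HTMLParser.handle_starttag` and
:meth:`~HTMLParser.handle_endtag`. ``EmptyTagTracker``, ``empty_tags`` and
``events`` are hypothetical names used for illustration only.

```python
from html.parser import HTMLParser

class EmptyTagTracker(HTMLParser):
    """Hypothetical helper that notes which tags were written as <tag />."""

    def __init__(self):
        super().__init__()
        self.empty_tags = []
        self.events = []

    def handle_startendtag(self, tag, attrs):
        self.empty_tags.append(tag)
        # Delegate so the start and end handlers still run.
        super().handle_startendtag(tag, attrs)

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

tracker = EmptyTagTracker()
tracker.feed('<p>line<br />break</p>')
print(tracker.empty_tags)
print(tracker.events)
```

With this input, ``br`` is the only tag recorded as empty, yet it still
produces both a start and an end event between those of ``p``.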

.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``).


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.


.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   Method called when a processing instruction is encountered.  The *data*
   parameter will contain the entire processing instruction.  For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.  It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.

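The note about the trailing ``'?'`` can be seen in a short sketch that feeds
one SGML-style and one XHTML-style processing instruction. ``PICollector`` and
``instructions`` are hypothetical names used only here, and the
``xml-stylesheet`` instruction is an arbitrary sample input.

```python
from html.parser import HTMLParser

class PICollector(HTMLParser):
    """Hypothetical helper that stores the data of processing instructions."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)

pi_parser = PICollector()
pi_parser.feed("<?proc color='red'>")                   # SGML style
pi_parser.feed('<?xml-stylesheet href="style.css"?>')   # XHTML style
print(pi_parser.instructions)
```

The second entry ends with ``?`` because the parser treats ``>`` alone as the
end of the instruction; a subclass that wants XHTML semantics could strip the
trailing ``?`` itself.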

.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup.  Overriding this method in a derived class is
   sometimes useful.  The base class implementation raises an
   :exc:`HTMLParseError` when *strict* is ``True``.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from html.parser import HTMLParser
   from html.entities import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Start tag:", tag)
           for attr in attrs:
               print("     attr:", attr)
       def handle_endtag(self, tag):
           print("End tag  :", tag)
       def handle_data(self, data):
           print("Data     :", data)
       def handle_comment(self, data):
           print("Comment  :", data)
       def handle_entityref(self, name):
           c = chr(name2codepoint[name])
           print("Named ent:", c)
       def handle_charref(self, name):
           if name.startswith('x'):
               c = chr(int(name[1:], 16))
           else:
               c = chr(int(name))
           print("Num ent  :", c)
       def handle_decl(self, data):
           print("Decl     :", data)

   # The deprecated *strict* argument is omitted; non-strict is the default.
   parser = MyHTMLParser()

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
        attr: ('src', 'python-logo.png')
        attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data     : Python
   End tag  : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
        attr: ('type', 'text/css')
   Data     : #python { color: green }
   End tag  : style
   >>>
   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
        attr: ('type', 'text/javascript')
   Data     : alert("<strong>hello!</strong>");
   End tag  : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment  :  a comment
   Comment  : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
corresponding character (note: these three references are all equivalent to
``'>'``)::

   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent  : >
   Num ent  : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data     : buff
   Data     : ered
   Data     : text
   End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
        attr: ('class', 'link')
        attr: ('href', '#main')
   Data     : tag soup
   End tag  : p
   End tag  : a

.. rubric:: Footnotes

.. [#] For backward compatibility reasons *strict* mode does not raise
   exceptions for all non-compliant HTML.  That is, some invalid HTML
   is tolerated even in *strict* mode.
R. David Murrayb579dba2010-12-03 04:06:39 +0000353 is tolerated even in *strict* mode.