:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.


.. index::
   single: HTML
   single: XHTML

**Source code:** :source:`Lib/html/parser.py`

--------------

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

.. class:: HTMLParser(strict=True)

   Create a parser instance. If *strict* is ``True`` (the default), invalid
   HTML results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If
   *strict* is ``False``, the parser uses heuristics to make a best guess at
   the intention of any invalid HTML it encounters, similar to the way most
   browsers do. Using ``strict=False`` is advised.

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered. The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   This parser does not check that end tags match start tags or call the end-tag
   handler for elements which are closed implicitly by closing an outer element.

   .. versionchanged:: 3.2 *strict* keyword added

An exception is defined as well:


.. exception:: HTMLParseError

   Exception raised by the :class:`HTMLParser` class when it encounters an error
   while parsing and *strict* is ``True``. This exception provides three
   attributes: :attr:`msg` is a brief message explaining the error,
   :attr:`lineno` is the number of the line on which the broken construct was
   detected, and :attr:`offset` is the number of characters into the line at
   which the construct starts.


Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out start tags, end tags, and data
as they are encountered::

   from html.parser import HTMLParser

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Encountered a start tag:", tag)
       def handle_endtag(self, tag):
           print("Encountered an end tag :", tag)
       def handle_data(self, data):
           print("Encountered some data :", data)

   parser = MyHTMLParser(strict=False)
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be::

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser. It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called. *data* must be :class:`str`.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark. This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.

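As an illustration of redefining :meth:`~HTMLParser.close`, the following
minimal sketch flushes the parser through the base class method and then
finalizes the collected text. The class name ``TextCollector`` and its
``chunks`` and ``text`` attributes are hypothetical, and the parser is created
with its default arguments:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Hypothetical subclass that joins all text once parsing is finished."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # pieces of text seen so far
        self.text = None   # set only after close() has run

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Always let the base class process any buffered data first.
        super().close()
        self.text = ''.join(self.chunks)

collector = TextCollector()
collector.feed('<p>Hello, ')
collector.feed('world</p>')
collector.close()
print(collector.text)   # Hello, world
```

Calling the base class method first means that any data still buffered is
delivered to the handlers before the subclass does its own end-of-input work.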

.. method:: HTMLParser.reset()

   Reset the instance. Loses all unprocessed data. This is called implicitly at
   instantiation time.


.. method:: HTMLParser.getpos()

   Return current line number and offset.

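A small sketch of how :meth:`~HTMLParser.getpos` might be used from inside a
handler to record where each start tag begins. The class name
``PositionRecorder`` and its ``positions`` attribute are hypothetical. In this
run the line numbers start at 1 and the offsets at 0:

```python
from html.parser import HTMLParser

class PositionRecorder(HTMLParser):
    """Hypothetical subclass that remembers where each start tag was seen."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) pair.
        self.positions.append((tag, self.getpos()))

recorder = PositionRecorder()
recorder.feed('<html>\n  <body>\n    <p>text</p>\n  </body>\n</html>')
recorder.close()
for tag, (lineno, offset) in recorder.positions:
    print(tag, lineno, offset)
```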

.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag. This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).

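The following sketch shows one way :meth:`~HTMLParser.get_starttag_text` could
be used to keep the original spelling of a start tag next to its normalized
name. The class name ``StartTagEcho`` and the ``raw_tags`` attribute are
hypothetical:

```python
from html.parser import HTMLParser

class StartTagEcho(HTMLParser):
    """Hypothetical subclass that stores the raw text of every start tag."""

    def __init__(self):
        super().__init__()
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # 'tag' is lower-cased; the raw text keeps case and spacing.
        self.raw_tags.append((tag, self.get_starttag_text()))

echo = StartTagEcho()
echo.feed('<A  HREF="page.html"   CLASS="nav">link</A>')
echo.close()
print(echo.raw_tags)
```

Here the stored raw text keeps the upper-case names and the extra whitespace
between attributes, while the *tag* argument is ``'a'``.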

The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass. The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets. The *name* is converted to lower case,
   quotes around the *value* are removed, and character and entity references
   are replaced.

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.

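As a sketch of working with the *attrs* list passed to
:meth:`~HTMLParser.handle_starttag`, the parser below gathers the ``href``
value of every ``a`` element. The class name ``LinkCollector`` and its
``links`` attribute are hypothetical:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical subclass that collects the targets of all links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':          # tag names arrive in lower case
            return
        for name, value in attrs:
            if name == 'href':  # attribute names arrive in lower case too
                self.links.append(value)

collector = LinkCollector()
collector.feed('<p><A HREF="http://www.cwi.nl/">CWI</A> and '
               '<a href="/docs?a=1&amp;b=2">docs</a></p>')
collector.close()
print(collector.links)   # ['http://www.cwi.nl/', '/docs?a=1&b=2']
```

The second link shows the entity reference ``&amp;`` in the attribute value
already replaced by ``&`` when the handler is called.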

.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``). This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.

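The difference between the default :meth:`~HTMLParser.handle_startendtag` and
an overridden one can be sketched as follows. The class names
``EventRecorder`` and ``EmptyTagRecorder`` and the ``events`` attribute are
hypothetical:

```python
from html.parser import HTMLParser

class EventRecorder(HTMLParser):
    """Hypothetical subclass relying on the default handle_startendtag()."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

class EmptyTagRecorder(EventRecorder):
    """Hypothetical subclass that records empty tags as a single event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(('startend', tag))

markup = '<p><br/><img src="logo.png" /></p>'

default = EventRecorder()
default.feed(markup)
default.close()

custom = EmptyTagRecorder()
custom.feed(markup)
custom.close()

print(default.events)
print(custom.events)
```

With the default implementation each empty tag produces a start event followed
by an end event; the overriding subclass sees one ``'startend'`` event instead.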

.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``).


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.


.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   Method called when a processing instruction is encountered. The *data*
   parameter will contain the entire processing instruction. For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``. It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions. An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.

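The behavior described for :meth:`~HTMLParser.handle_pi`, including the
trailing ``'?'`` of an XHTML-style processing instruction, can be observed
with a sketch like the one below. The class name ``PIRecorder`` and its
``instructions`` attribute are hypothetical:

```python
from html.parser import HTMLParser

class PIRecorder(HTMLParser):
    """Hypothetical subclass that stores every processing instruction."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)

recorder = PIRecorder()
recorder.feed("<?proc color='red'>")                   # SGML style
recorder.feed('<?xml-stylesheet href="style.css"?>')   # XHTML style
recorder.close()
print(recorder.instructions)
```

A subclass that only expects XHTML-style instructions could strip a trailing
``'?'`` from *data* itself, since the parser leaves it in place.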

.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup. Overriding this method in a derived class is
   sometimes useful. The base class implementation raises an
   :exc:`HTMLParseError` when *strict* is ``True``.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from html.parser import HTMLParser
   from html.entities import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Start tag:", tag)
           for attr in attrs:
               print(" attr:", attr)
       def handle_endtag(self, tag):
           print("End tag :", tag)
       def handle_data(self, data):
           print("Data :", data)
       def handle_comment(self, data):
           print("Comment :", data)
       def handle_entityref(self, name):
           c = chr(name2codepoint[name])
           print("Named ent:", c)
       def handle_charref(self, name):
           if name.startswith('x'):
               c = chr(int(name[1:], 16))
           else:
               c = chr(int(name))
           print("Num ent :", c)
       def handle_decl(self, data):
           print("Decl :", data)

   parser = MyHTMLParser(strict=False)

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
    attr: ('src', 'python-logo.png')
    attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data : Python
   End tag : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
    attr: ('type', 'text/css')
   Data : #python { color: green }
   End tag : style
   >>>
   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
    attr: ('type', 'text/javascript')
   Data : alert("<strong>hello!</strong>");
   End tag : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment : a comment
   Comment : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``)::

   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent : >
   Num ent : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data : buff
   Data : ered
   Data : text
   End tag : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
    attr: ('class', 'link')
    attr: ('href', '#main')
   Data : tag soup
   End tag : p
   End tag : a

.. rubric:: Footnotes

.. [#] For backward compatibility reasons *strict* mode does not raise
   exceptions for all non-compliant HTML. That is, some invalid HTML
   is tolerated even in *strict* mode.