:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.


.. index::
   single: HTML
   single: XHTML

**Source code:** :source:`Lib/html/parser.py`

--------------

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

.. class:: HTMLParser(*, convert_charrefs=False)

   Create a parser instance able to parse invalid markup.

   If *convert_charrefs* is ``True`` (default: ``False``), all character
   references (except the ones in ``script``/``style`` elements) are
   automatically converted to the corresponding Unicode characters.
   The use of ``convert_charrefs=True`` is encouraged and will become
   the default in Python 3.5.

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered. The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   This parser does not check that end tags match start tags or call the end-tag
   handler for elements which are closed implicitly by closing an outer element.

   .. versionchanged:: 3.4
      *convert_charrefs* keyword argument added.
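
Because the parser does not check that end tags match start tags, a subclass
that needs such a check has to track open elements itself. The following is a
minimal sketch under stated assumptions, not part of the module: the class name
``NestingChecker``, its ``stack`` and ``errors`` attributes, and the ``VOID``
set are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Elements that never take an end tag; an illustrative subset, not a full list.
VOID = {"br", "hr", "img", "input", "link", "meta"}

class NestingChecker(HTMLParser):
    """Hypothetical helper: records end tags that do not match the open element."""

    def __init__(self):
        super().__init__()
        self.stack = []    # names of the elements that are currently open
        self.errors = []   # (line, offset, message) tuples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return  # e.g. the closing half of <br/>
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            line, offset = self.getpos()
            self.errors.append((line, offset, "unexpected </%s>" % tag))

checker = NestingChecker()
checker.feed('<p><a href="#main">tag soup</p></a>')
checker.close()
for error in checker.errors:
    print(error)
```

In this sketch a mismatched end tag is reported and otherwise ignored, so the
``p`` element is still considered open after the input ends. A stricter or more
forgiving recovery policy is a design choice left to the subclass.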


Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out start tags, end tags, and data
as they are encountered::

   from html.parser import HTMLParser

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Encountered a start tag:", tag)
       def handle_endtag(self, tag):
           print("Encountered an end tag :", tag)
       def handle_data(self, data):
           print("Encountered some data  :", data)

   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be::

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data  : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data  : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser. It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called. *data* must be :class:`str`.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark. This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.
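
The interplay of :meth:`feed` and :meth:`close` can be sketched as follows.
This is only an illustration: ``TagCounter`` and its ``count`` and ``finished``
attributes are hypothetical names, not part of the module. An incomplete start
tag is buffered until the rest arrives, and the redefined :meth:`close` calls
the base class method before doing its own end-of-input work.

```python
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Hypothetical subclass: counts start tags and finalizes in close()."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.finished = False

    def handle_starttag(self, tag, attrs):
        self.count += 1

    def close(self):
        # Let the base class process whatever is still buffered first.
        super().close()
        self.finished = True

counter = TagCounter()
counter.feed('<di')                    # incomplete tag: buffered, no handler runs
seen_after_first_chunk = counter.count
counter.feed('v><p>')                  # completes <div>, then parses <p>
counter.close()
print(seen_after_first_chunk, counter.count, counter.finished)
```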


.. method:: HTMLParser.reset()

   Reset the instance. Loses all unprocessed data. This is called implicitly at
   instantiation time.


.. method:: HTMLParser.getpos()

   Return current line number and offset.
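
As a small sketch of how :meth:`getpos` might be used, the hypothetical
``PositionLogger`` class below (its name and ``positions`` attribute are
assumptions for illustration) records where each start tag begins. Line numbers
start at 1 and offsets at 0.

```python
from html.parser import HTMLParser

class PositionLogger(HTMLParser):
    """Hypothetical subclass: remembers the position of every start tag."""

    def __init__(self):
        super().__init__()
        self.positions = []   # (tag, line, offset) tuples

    def handle_starttag(self, tag, attrs):
        line, offset = self.getpos()
        self.positions.append((tag, line, offset))

logger = PositionLogger()
logger.feed('<ul>\n  <li>one</li>\n  <li>two</li>\n</ul>')
logger.close()
for entry in logger.positions:
    print(entry)
```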


.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag. This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).
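
The difference between the normalized tag name and the original source text can
be seen in a short sketch. ``RawTagEcho`` and ``raw_tags`` are hypothetical
names used only for this illustration.

```python
from html.parser import HTMLParser

class RawTagEcho(HTMLParser):
    """Hypothetical subclass: pairs each tag name with its original text."""

    def __init__(self):
        super().__init__()
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # tag is lower-cased; get_starttag_text() keeps case and spacing.
        self.raw_tags.append((tag, self.get_starttag_text()))

echo = RawTagEcho()
echo.feed('<A  HREF="index.html"   CLASS=nav>home</A>')
echo.close()
print(echo.raw_tags)
```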


The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass. The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets. The *name* will be translated to lower case,
   and quotes in the *value* have been removed, and character and entity references
   have been replaced.

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.
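
A common use of :meth:`handle_starttag` is collecting attribute values. The
sketch below is one possible approach, with the hypothetical names
``LinkCollector`` and ``links``; it turns the list of ``(name, value)`` pairs
into a dictionary, which silently keeps only the last value if an attribute is
repeated.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical subclass: gathers the href of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            attributes = dict(attrs)        # names are already lower case
            href = attributes.get('href')
            if href is not None:
                self.links.append(href)

collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A> <a name="top">top</a>'
               '<a href="?a=1&amp;b=2">query</a>')
collector.close()
print(collector.links)
```

The second link shows that the ``&amp;`` reference in the attribute value has
already been replaced by the time the handler sees it.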
Georg Brandl116aa622007-08-15 14:28:22 +0000138
139
Ezio Melotti4279bc72012-02-18 02:01:36 +0200140.. method:: HTMLParser.handle_endtag(tag)
141
142 This method is called to handle the end tag of an element (e.g. ``</div>``).
143
144 The *tag* argument is the name of the tag converted to lower case.
145
146
Georg Brandl116aa622007-08-15 14:28:22 +0000147.. method:: HTMLParser.handle_startendtag(tag, attrs)
148
149 Similar to :meth:`handle_starttag`, but called when the parser encounters an
Ezio Melottif99e4b52011-10-28 14:34:56 +0300150 XHTML-style empty tag (``<img ... />``). This method may be overridden by
Georg Brandl116aa622007-08-15 14:28:22 +0000151 subclasses which require this particular lexical information; the default
Ezio Melottif99e4b52011-10-28 14:34:56 +0300152 implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
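
The two behaviors can be compared side by side. In this sketch ``EventLog``
relies on the default :meth:`handle_startendtag`, while ``EmptyTagLog``
overrides it; both class names and the ``events`` and ``results`` names are
hypothetical.

```python
from html.parser import HTMLParser

class EventLog(HTMLParser):
    """Hypothetical subclass: records calls, keeps the default dispatch."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

class EmptyTagLog(EventLog):
    """Hypothetical subclass: keeps empty tags distinct from start/end pairs."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(('startend', tag))

results = {}
for cls in (EventLog, EmptyTagLog):
    log = cls()
    log.feed('<br/><img src="logo.png" />')
    log.close()
    results[cls.__name__] = log.events
    print(cls.__name__, log.events)
```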
Georg Brandl116aa622007-08-15 14:28:22 +0000153
154
Georg Brandl116aa622007-08-15 14:28:22 +0000155.. method:: HTMLParser.handle_data(data)
156
Ezio Melotti4279bc72012-02-18 02:01:36 +0200157 This method is called to process arbitrary data (e.g. text nodes and the
158 content of ``<script>...</script>`` and ``<style>...</style>``).
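
Since the content of ``script`` and ``style`` elements also arrives through
:meth:`handle_data`, a subclass that only wants visible text has to filter it
out itself. The following is a rough sketch of one way to do that; the names
``TextExtractor``, ``parts`` and ``skip_depth`` are assumptions made for this
example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical subclass: keeps text that is outside script/style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0   # > 0 while inside a script or style element

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

extractor = TextExtractor()
extractor.feed('<h1>Python</h1><style>h1 { color: green }</style>'
               '<p>Parse me!</p>')
extractor.close()
print(' '.join(extractor.parts))
```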
Georg Brandl116aa622007-08-15 14:28:22 +0000159
160
161.. method:: HTMLParser.handle_entityref(name)
162
Ezio Melotti4279bc72012-02-18 02:01:36 +0200163 This method is called to process a named character reference of the form
164 ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
Ezio Melotti95401c52013-11-23 19:52:05 +0200165 (e.g. ``'gt'``). This method is never called if *convert_charrefs* is
166 ``True``.
Ezio Melotti4279bc72012-02-18 02:01:36 +0200167
168
169.. method:: HTMLParser.handle_charref(name)
170
171 This method is called to process decimal and hexadecimal numeric character
172 references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
173 equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
Ezio Melotti95401c52013-11-23 19:52:05 +0200174 in this case the method will receive ``'62'`` or ``'x3E'``. This method
175 is never called if *convert_charrefs* is ``True``.
Georg Brandl116aa622007-08-15 14:28:22 +0000176
177
178.. method:: HTMLParser.handle_comment(data)
179
Ezio Melotti4279bc72012-02-18 02:01:36 +0200180 This method is called when a comment is encountered (e.g. ``<!--comment-->``).
181
182 For example, the comment ``<!-- comment -->`` will cause this method to be
183 called with the argument ``' comment '``.
184
185 The content of Internet Explorer conditional comments (condcoms) will also be
186 sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
187 this method will receive ``'[if IE 9]>IE-specific content<![endif]'``.
Georg Brandl116aa622007-08-15 14:28:22 +0000188
189
190.. method:: HTMLParser.handle_decl(decl)
191
Ezio Melotti4279bc72012-02-18 02:01:36 +0200192 This method is called to handle an HTML doctype declaration (e.g.
193 ``<!DOCTYPE html>``).
194
Georg Brandl46aa5c52010-07-29 13:38:37 +0000195 The *decl* parameter will be the entire contents of the declaration inside
Ezio Melotti4279bc72012-02-18 02:01:36 +0200196 the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).
Georg Brandl116aa622007-08-15 14:28:22 +0000197
198
199.. method:: HTMLParser.handle_pi(data)
200
201 Method called when a processing instruction is encountered. The *data*
202 parameter will contain the entire processing instruction. For example, for the
203 processing instruction ``<?proc color='red'>``, this method would be called as
204 ``handle_pi("proc color='red'")``. It is intended to be overridden by a derived
205 class; the base class implementation does nothing.
206
207 .. note::
208
209 The :class:`HTMLParser` class uses the SGML syntactic rules for processing
210 instructions. An XHTML processing instruction using the trailing ``'?'`` will
211 cause the ``'?'`` to be included in *data*.
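
The effect described in the note can be observed with a short sketch. The names
``PICollector``, ``instructions`` and ``stripped`` are hypothetical; removing
the trailing ``'?'`` afterwards is one possible way to normalize XHTML-style
instructions, not something the module does.

```python
from html.parser import HTMLParser

class PICollector(HTMLParser):
    """Hypothetical subclass: stores the data of each processing instruction."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)

pi = PICollector()
pi.feed("<?proc color='red'>"
        '<?xml-stylesheet href="style.css"?>')
pi.close()

# The XHTML-style instruction keeps its trailing '?'; strip it if unwanted.
stripped = [data[:-1] if data.endswith('?') else data
            for data in pi.instructions]
print(pi.instructions)
print(stripped)
```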


.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup. Overriding it in a derived class is sometimes
   useful. The base class implementation does nothing.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from html.parser import HTMLParser
   from html.entities import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Start tag:", tag)
           for attr in attrs:
               print("     attr:", attr)
       def handle_endtag(self, tag):
           print("End tag  :", tag)
       def handle_data(self, data):
           print("Data     :", data)
       def handle_comment(self, data):
           print("Comment  :", data)
       def handle_entityref(self, name):
           c = chr(name2codepoint[name])
           print("Named ent:", c)
       def handle_charref(self, name):
           if name.startswith('x'):
               c = chr(int(name[1:], 16))
           else:
               c = chr(int(name))
           print("Num ent  :", c)
       def handle_decl(self, data):
           print("Decl     :", data)

   parser = MyHTMLParser()
Georg Brandl116aa622007-08-15 14:28:22 +0000258
Ezio Melotti4279bc72012-02-18 02:01:36 +0200259Parsing a doctype::
260
261 >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
262 ... '"http://www.w3.org/TR/html4/strict.dtd">')
263 Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
264
265Parsing an element with a few attributes and a title::
266
267 >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
268 Start tag: img
269 attr: ('src', 'python-logo.png')
270 attr: ('alt', 'The Python logo')
271 >>>
272 >>> parser.feed('<h1>Python</h1>')
273 Start tag: h1
274 Data : Python
275 End tag : h1
276
277The content of ``script`` and ``style`` elements is returned as is, without
278further parsing::
279
280 >>> parser.feed('<style type="text/css">#python { color: green }</style>')
281 Start tag: style
282 attr: ('type', 'text/css')
283 Data : #python { color: green }
284 End tag : style
285 >>>
286 >>> parser.feed('<script type="text/javascript">'
287 ... 'alert("<strong>hello!</strong>");</script>')
288 Start tag: script
289 attr: ('type', 'text/javascript')
290 Data : alert("<strong>hello!</strong>");
291 End tag : script
292
293Parsing comments::
294
295 >>> parser.feed('<!-- a comment -->'
296 ... '<!--[if IE 9]>IE-specific content<![endif]-->')
297 Comment : a comment
298 Comment : [if IE 9]>IE-specific content<![endif]
299
300Parsing named and numeric character references and converting them to the
301correct char (note: these 3 references are all equivalent to ``'>'``)::
302
303 >>> parser.feed('&gt;&#62;&#x3E;')
304 Named ent: >
305 Num ent : >
306 Num ent : >
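
For comparison, a parser created with ``convert_charrefs=True`` delivers the
same references as ordinary text through :meth:`~HTMLParser.handle_data` and
never calls the reference handlers. The sketch below uses the hypothetical
names ``DataCollector``, ``chunks`` and ``reference_calls``, and wraps the
references in a ``p`` element so that the text is clearly delimited by tags.

```python
from html.parser import HTMLParser

class DataCollector(HTMLParser):
    """Hypothetical subclass: collects text and counts reference callbacks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.reference_calls = 0

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        self.reference_calls += 1   # not expected with convert_charrefs=True

    def handle_charref(self, name):
        self.reference_calls += 1   # not expected with convert_charrefs=True

data_collector = DataCollector()
data_collector.feed('<p>&gt;&#62;&#x3E;</p>')
data_collector.close()
print(data_collector.chunks, data_collector.reference_calls)
```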
307
308Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
Ezio Melotti95401c52013-11-23 19:52:05 +0200309:meth:`~HTMLParser.handle_data` might be called more than once
310(unless *convert_charrefs* is set to ``True``)::
Ezio Melotti4279bc72012-02-18 02:01:36 +0200311
312 >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
313 ... parser.feed(chunk)
314 ...
315 Start tag: span
316 Data : buff
317 Data : ered
318 Data : text
319 End tag : span
320
321Parsing invalid HTML (e.g. unquoted attributes) also works::
322
323 >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
324 Start tag: p
325 Start tag: a
326 attr: ('class', 'link')
327 attr: ('href', '#main')
328 Data : tag soup
329 End tag : p
330 End tag : a