:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.


.. index::
   single: HTML
   single: XHTML

**Source code:** :source:`Lib/html/parser.py`

--------------

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

.. class:: HTMLParser(strict=False)

   Create a parser instance.  If *strict* is ``False`` (the default), the parser
   will accept and parse invalid markup.  If *strict* is ``True`` the parser
   will raise an :exc:`~html.parser.HTMLParseError` exception instead [#]_ when
   it's not able to parse the markup.
   The use of ``strict=True`` is discouraged and the *strict* argument is
   deprecated.

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered.  The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   This parser does not check that end tags match start tags or call the end-tag
   handler for elements which are closed implicitly by closing an outer element.

   .. versionchanged:: 3.2
      The *strict* keyword was added.

   .. deprecated-removed:: 3.3 3.5
      The *strict* argument and the strict mode have been deprecated.
      The parser is now able to accept and parse invalid markup too.

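Because the parser does not match end tags against start tags itself, a
subclass that needs this has to keep its own bookkeeping.  The following is a
minimal sketch of one way to do so, written for a current Python 3; the class
name ``NestingChecker`` and its ``open_tags`` and ``mismatches`` attributes are
hypothetical and not part of this module.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Hypothetical helper that tracks open elements on a stack."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # start tags seen but not yet closed
        self.mismatches = []   # end tags that did not close the innermost element

    def handle_starttag(self, tag, attrs):
        # Note: void elements such as <br> are pushed too; a real checker
        # would need to special-case them.
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.mismatches.append(tag)


checker = NestingChecker()
checker.feed("<p><a>tag soup</p></a>")
checker.close()
print(checker.mismatches)  # end tags that arrived out of order
print(checker.open_tags)   # elements that were never closed properly
```

The parser reports ``</p>`` and ``</a>`` in document order without complaint;
deciding that ``</p>`` arrived too early is left entirely to the subclass.
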
An exception is defined as well:


.. exception:: HTMLParseError

   Exception raised by the :class:`HTMLParser` class when it encounters an error
   while parsing and *strict* is ``True``.  This exception provides three
   attributes: :attr:`msg` is a brief message explaining the error,
   :attr:`lineno` is the number of the line on which the broken construct was
   detected, and :attr:`offset` is the number of characters into the line at
   which the construct starts.

   .. deprecated-removed:: 3.3 3.5
      This exception has been deprecated because it's never raised by the parser
      (when the default non-strict mode is used).


Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out start tags, end tags, and data
as they are encountered::

   from html.parser import HTMLParser

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Encountered a start tag:", tag)
       def handle_endtag(self, tag):
           print("Encountered an end tag :", tag)
       def handle_data(self, data):
           print("Encountered some data :", data)

   # Non-strict mode is the default, so the deprecated strict argument is omitted.
   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be::

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.  *data* must be :class:`str`.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.


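The interplay of :meth:`~HTMLParser.feed` and :meth:`~HTMLParser.close` can be
sketched as follows.  This is an illustration only, written for a current
Python 3: ``TextCollector``, ``chunks`` and ``finished`` are hypothetical names,
and how the text is split across :meth:`~HTMLParser.handle_data` calls may
differ between versions, which is why the pieces are joined at the end.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper that gathers text and notes when input has ended."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.finished = False

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Always call the base class method so buffered data is flushed,
        # then do the additional end-of-input processing.
        super().close()
        self.finished = True


collector = TextCollector()
for piece in ["<p>Hel", "lo, wor", "ld</p", ">"]:
    collector.feed(piece)   # incomplete markup is buffered between calls
collector.close()

text = "".join(collector.chunks)
print(text)
```
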
.. method:: HTMLParser.reset()

   Reset the instance.  Loses all unprocessed data.  This is called implicitly at
   instantiation time.


.. method:: HTMLParser.getpos()

   Return current line number and offset.


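A handler can call :meth:`~HTMLParser.getpos` to find out where the construct
it is handling begins.  The sketch below assumes the behavior of the current
implementation, where line numbers start at 1 and offsets at 0; ``TagLocator``
and ``positions`` are hypothetical names.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Hypothetical helper that records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (line number, offset) tuple.
        self.positions.append((tag, self.getpos()))


locator = TagLocator()
locator.feed("<ul>\n  <li>one</li>\n</ul>")
locator.close()
print(locator.positions)
```
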
.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).


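As a rough illustration of the "minimal changes" use case, the following sketch
stores the original text of every start tag instead of the normalized *tag* and
*attrs* values.  ``RawTagEcho`` and ``raw_tags`` are hypothetical names.

```python
from html.parser import HTMLParser


class RawTagEcho(HTMLParser):
    """Hypothetical helper that keeps start tags exactly as written."""

    def __init__(self):
        super().__init__()
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # tag and attrs are normalized; the raw text keeps case and spacing.
        self.raw_tags.append(self.get_starttag_text())


echo = RawTagEcho()
echo.feed('<A  HREF="x.html"   Class=big>link</A>')
echo.close()
print(echo.raw_tags)
```
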
The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass.  The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case.  The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets.  The *name* is converted to lower case,
   quotes around the *value* are removed, and character and entity references
   in the value are replaced.

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.


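Since *attrs* is a list of pairs, it is often convenient to turn it into a
dictionary.  The sketch below does that and shows the normalization described
above; ``AttrRecorder`` and ``seen`` are hypothetical names.  That an attribute
written without a value is reported with ``None`` is an observation about the
current implementation rather than something stated on this page.

```python
from html.parser import HTMLParser


class AttrRecorder(HTMLParser):
    """Hypothetical helper that stores each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # dict() keeps the last value if an attribute name is repeated.
        self.seen.append((tag, dict(attrs)))


recorder = AttrRecorder()
recorder.feed('<A HREF="?a=1&amp;b=2" Title=\'Home\'>'
              '<input type=checkbox checked>')
recorder.close()
for tag, attributes in recorder.seen:
    print(tag, attributes)
```
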
.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``).  This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.


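The difference between the default behavior and an override can be sketched
with two small subclasses.  ``EventLog`` and ``EmptyTagLog`` are hypothetical
names used only for this illustration.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Hypothetical helper: records start and end events only."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class EmptyTagLog(EventLog):
    """Same log, but keeps an empty tag as one distinct event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


default_log = EventLog()
default_log.feed("<br/><p>text</p>")
default_log.close()

empty_tag_log = EmptyTagLog()
empty_tag_log.feed("<br/><p>text</p>")
empty_tag_log.close()

print(default_log.events)    # <br/> shows up as a start event plus an end event
print(empty_tag_log.events)  # <br/> shows up as a single "startend" event
```
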
.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``).


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``.  For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.


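A sketch of decoding both kinds of reference follows.  It rests on an
assumption that goes beyond this page: on Python 3.4 and later the constructor
accepts a *convert_charrefs* argument, and these two handlers are only invoked
when it is false, so the example passes ``convert_charrefs=False``.
``RefDecoder`` and ``chars`` are hypothetical names.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class RefDecoder(HTMLParser):
    """Hypothetical helper that decodes character references itself."""

    def __init__(self):
        # Assumption: convert_charrefs exists (Python 3.4+); turning it off
        # makes the parser call the two handlers below.
        super().__init__(convert_charrefs=False)
        self.chars = []

    def handle_entityref(self, name):
        # Raises KeyError for names missing from name2codepoint.
        self.chars.append(chr(name2codepoint[name]))

    def handle_charref(self, name):
        if name[0] in "xX":
            self.chars.append(chr(int(name[1:], 16)))  # hexadecimal form
        else:
            self.chars.append(chr(int(name)))          # decimal form


decoder = RefDecoder()
decoder.feed("&gt;&#62;&#x3E;")
decoder.close()
print(decoder.chars)
```
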
.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


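Both handlers receive only the text inside the delimiters, as this small
sketch illustrates.  ``PrologReader`` and its ``doctype`` and ``comments``
attributes are hypothetical names.

```python
from html.parser import HTMLParser


class PrologReader(HTMLParser):
    """Hypothetical helper that keeps the doctype and any comments."""

    def __init__(self):
        super().__init__()
        self.doctype = None
        self.comments = []

    def handle_decl(self, decl):
        self.doctype = decl          # text between "<!" and ">"

    def handle_comment(self, data):
        self.comments.append(data)   # text between "<!--" and "-->"


reader = PrologReader()
reader.feed("<!DOCTYPE html>\n<!-- build 42 --><p>hi</p>")
reader.close()
print(reader.doctype)
print(reader.comments)
```
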
.. method:: HTMLParser.handle_pi(data)

   Method called when a processing instruction is encountered.  The *data*
   parameter will contain the entire processing instruction.  For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.  It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions.  An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.


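The effect described in the note can be seen with a short sketch that feeds
one SGML-style and one XHTML-style processing instruction.  ``PICollector``
and ``instructions`` are hypothetical names.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical helper that stores processing instructions."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


pi_collector = PICollector()
pi_collector.feed("<?proc color='red'>"
                  '<?xml version="1.0"?>')
pi_collector.close()
for instruction in pi_collector.instructions:
    # The second entry ends with "?", which the subclass may want to strip.
    print(repr(instruction))
```
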
.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup.  It is sometimes useful to be overridden by a
   derived class.  The base class implementation raises an :exc:`HTMLParseError`
   when *strict* is ``True``.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from html.parser import HTMLParser
   from html.entities import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Start tag:", tag)
           for attr in attrs:
               print(" attr:", attr)
       def handle_endtag(self, tag):
           print("End tag :", tag)
       def handle_data(self, data):
           print("Data :", data)
       def handle_comment(self, data):
           print("Comment :", data)
       def handle_entityref(self, name):
           c = chr(name2codepoint[name])
           print("Named ent:", c)
       def handle_charref(self, name):
           if name.startswith('x'):
               c = chr(int(name[1:], 16))
           else:
               c = chr(int(name))
           print("Num ent :", c)
       def handle_decl(self, data):
           print("Decl :", data)

   # Non-strict mode is the default, so the deprecated strict argument is omitted.
   parser = MyHTMLParser()

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
    attr: ('src', 'python-logo.png')
    attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data : Python
   End tag : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
    attr: ('type', 'text/css')
   Data : #python { color: green }
   End tag : style
   >>>
   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
    attr: ('type', 'text/javascript')
   Data : alert("<strong>hello!</strong>");
   End tag : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment : a comment
   Comment : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``)::

   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent : >
   Num ent : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data : buff
   Data : ered
   Data : text
   End tag : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
    attr: ('class', 'link')
    attr: ('href', '#main')
   Data : tag soup
   End tag : p
   End tag : a

.. rubric:: Footnotes

.. [#] For backward compatibility reasons strict mode does not raise
   exceptions for all non-compliant HTML.  That is, some invalid HTML
   is tolerated even in strict mode.