Blame - Doc/library/html.parser.rst - platform/external/python/cpython3

Fred Drake	3c50ea4	2008-05-17 22:02:32 +0000	[diff] [blame]	1	:mod:`html.parser` --- Simple HTML and XHTML parser
				2	===================================================
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	3
Fred Drake	3c50ea4	2008-05-17 22:02:32 +0000	[diff] [blame]	4	.. module:: html.parser
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	5	:synopsis: A simple parser that can handle HTML and XHTML.
				6
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	7	Source code: :source:`Lib/html/parser.py`
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	8
Georg Brandl	9087b7f	2008-05-18 07:53:01 +0000	[diff] [blame]	9	.. index::
				10	single: HTML
				11	single: XHTML
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	12
Raymond Hettinger	a199368	2011-01-27 01:20:32 +0000	[diff] [blame]	13	--------------
				14
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	15	This module defines a class :class:`HTMLParser` which serves as the basis for
				16	parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	17
Ezio Melotti	6fc16d8	2014-08-02 18:36:12 +0300	[diff] [blame]	18	.. class:: HTMLParser(*, convert_charrefs=True)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	19
Ezio Melotti	73a4359	2014-08-02 14:10:30 +0300	[diff] [blame]	20	Create a parser instance able to parse invalid markup.
Ezio Melotti	95401c5	2013-11-23 19:52:05 +0200	[diff] [blame]	21
Ezio Melotti	6fc16d8	2014-08-02 18:36:12 +0300	[diff] [blame]	22	If convert_charrefs is ``True`` (the default), all character
Ezio Melotti	95401c5	2013-11-23 19:52:05 +0200	[diff] [blame]	23	references (except the ones in ``script``/``style`` elements) are
				24	automatically converted to the corresponding Unicode characters.
Ezio Melotti	95401c5	2013-11-23 19:52:05 +0200	[diff] [blame]	25
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	26	An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
				27	when start tags, end tags, text, comments, and other markup elements are
				28	encountered. The user should subclass :class:`.HTMLParser` and override its
				29	methods to implement the desired behavior.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	30
Georg Brandl	877b10a	2008-06-01 21:25:55 +0000	[diff] [blame]	31	This parser does not check that end tags match start tags or call the end-tag
				32	handler for elements which are closed implicitly by closing an outer element.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	33
Ezio Melotti	95401c5	2013-11-23 19:52:05 +0200	[diff] [blame]	34	.. versionchanged:: 3.4
				35	convert_charrefs keyword argument added.
				36
Ezio Melotti	6fc16d8	2014-08-02 18:36:12 +0300	[diff] [blame]	37	.. versionchanged:: 3.5
				38	The default value for argument convert_charrefs is now ``True``.
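The effect of convert_charrefs can be seen by logging which handlers are called. The following is a minimal sketch rather than part of the module: ``EventLogger`` and ``parse`` are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: record text-related handler calls."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def parse(text, **kwargs):
    """Feed text to a fresh EventLogger and return its recorded events."""
    parser = EventLogger(**kwargs)
    parser.feed(text)
    parser.close()
    return parser.events


markup = "<p>1 &lt; 2 &#38; 3</p>"

# Default: references are converted and arrive through handle_data().
converted = parse(markup)

# convert_charrefs=False: references are reported by their own handlers.
raw = parse(markup, convert_charrefs=False)
```

With the default setting the text reaches ``handle_data`` already decoded; with ``convert_charrefs=False`` the same input also triggers ``handle_entityref`` and ``handle_charref``.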
				39
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	40
				41	Example HTML Parser Application
				42	-------------------------------
				43
				44	As a basic example, below is a simple HTML parser that uses the
				45	:class:`HTMLParser` class to print out start tags, end tags, and data
				46	as they are encountered::
				47
				48	from html.parser import HTMLParser
				49
				50	class MyHTMLParser(HTMLParser):
				51	    def handle_starttag(self, tag, attrs):
				52	        print("Encountered a start tag:", tag)
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	53
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	54	    def handle_endtag(self, tag):
				55	        print("Encountered an end tag :", tag)
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	56
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	57	    def handle_data(self, data):
				58	        print("Encountered some data :", data)
				59
Ezio Melotti	88ebfb1	2013-11-02 17:08:24 +0200	[diff] [blame]	60	parser = MyHTMLParser()
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	61	parser.feed('<html><head><title>Test</title></head>'
				62	'<body><h1>Parse me!</h1></body></html>')
				63
Martin Panter	1050d2d	2016-07-26 11:18:21 +0200	[diff] [blame]	64	The output will then be:
				65
				66	.. code-block:: none
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	67
				68	Encountered a start tag: html
				69	Encountered a start tag: head
				70	Encountered a start tag: title
				71	Encountered some data : Test
				72	Encountered an end tag : title
				73	Encountered an end tag : head
				74	Encountered a start tag: body
				75	Encountered a start tag: h1
				76	Encountered some data : Parse me!
				77	Encountered an end tag : h1
				78	Encountered an end tag : body
				79	Encountered an end tag : html
				80
				81
				82	:class:`.HTMLParser` Methods
				83	----------------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	84
				85	:class:`HTMLParser` instances have the following methods:
				86
				87
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	88	.. method:: HTMLParser.feed(data)
				89
				90	Feed some text to the parser. It is processed insofar as it consists of
				91	complete elements; incomplete data is buffered until more data is fed or
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	92	:meth:`close` is called. data must be :class:`str`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	93
				94
				95	.. method:: HTMLParser.close()
				96
				97	Force processing of all buffered data as if it were followed by an end-of-file
				98	mark. This method may be redefined by a derived class to define additional
				99	processing at the end of the input, but the redefined version should always call
				100	the :class:`HTMLParser` base class method :meth:`close`.
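As a sketch of the pattern described for :meth:`close`, the subclass below finishes its own bookkeeping only after the base class has flushed its buffer. ``TextCollector`` and its ``parts`` and ``text`` attributes are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical subclass: gather all text and join it on close()."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.text = None

    def handle_data(self, data):
        self.parts.append(data)

    def close(self):
        # Let the base class process any buffered input first.
        super().close()
        # Extra end-of-input processing added by this subclass.
        self.text = "".join(self.parts)


collector = TextCollector()
# Data may arrive in arbitrary chunks; incomplete markup is buffered.
for chunk in ["<p>Hello, ", "wor", "ld</p> trailing"]:
    collector.feed(chunk)
collector.close()
```

Joining the pieces in ``close`` makes the result independent of how many times ``handle_data`` was called for the chunked input.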
				101
				102
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	103	.. method:: HTMLParser.reset()
				104
				105	Reset the instance. Loses all unprocessed data. This is called implicitly at
				106	instantiation time.
				107
				108
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	109	.. method:: HTMLParser.getpos()
				110
				111	Return current line number and offset.
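A handler can call :meth:`getpos` to find out where the construct being handled begins. The sketch below assumes a hypothetical ``TagLocator`` subclass; the line number is 1-based and the offset is 0-based.

```python
from html.parser import HTMLParser


class TagLocator(HTMLParser):
    """Hypothetical subclass: remember where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.locations = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (lineno, offset) tuple.
        self.locations.append((tag, self.getpos()))


locator = TagLocator()
locator.feed("<html>\n  <body>\n<p>hi</p></body></html>")
locator.close()
# locator.locations == [('html', (1, 0)), ('body', (2, 2)), ('p', (3, 0))]
```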
				112
				113
				114	.. method:: HTMLParser.get_starttag_text()
				115
				116	Return the text of the most recently opened start tag. This should not normally
				117	be needed for structured processing, but may be useful in dealing with HTML "as
				118	deployed" or for re-generating input with minimal changes (whitespace between
				119	attributes can be preserved, etc.).
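To illustrate, the following sketch records the original source text of each start tag next to the normalized arguments. ``RawTagRecorder`` is a hypothetical name used only here.

```python
from html.parser import HTMLParser


class RawTagRecorder(HTMLParser):
    """Hypothetical subclass: keep the source text of each start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # tag and attrs are normalized; get_starttag_text() is verbatim.
        self.seen.append((tag, attrs, self.get_starttag_text()))


recorder = RawTagRecorder()
recorder.feed('<A  HREF="x"   Class=Y>link</A>')
recorder.close()
tag, attrs, raw_text = recorder.seen[0]
```

Here ``tag`` is ``'a'`` and the attribute names are lower-cased, while ``raw_text`` keeps the original case and spacing.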
				120
				121
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	122	The following methods are called when data or markup elements are encountered
				123	and they are meant to be overridden in a subclass. The base class
				124	implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
				125
				126
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	127	.. method:: HTMLParser.handle_starttag(tag, attrs)
				128
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	129	This method is called to handle the start of a tag (e.g. ``<div id="main">``).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	130
				131	The tag argument is the name of the tag converted to lower case. The attrs
				132	argument is a list of ``(name, value)`` pairs containing the attributes found
				133	inside the tag's ``<>`` brackets. The name has been converted to lower case,
				134	quotes around the value have been removed, and character and entity references
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	135	have been replaced.
				136
Serhiy Storchaka	6dff020	2016-05-07 10:49:07 +0300	[diff] [blame]	137	For instance, for the tag ``<A HREF="https://www.cwi.nl/">``, this method
				138	would be called as ``handle_starttag('a', [('href', 'https://www.cwi.nl/')])``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	139
Georg Brandl	9087b7f	2008-05-18 07:53:01 +0000	[diff] [blame]	140	All entity references from :mod:`html.entities` are replaced in the attribute
				141	values.
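A common use of ``handle_starttag`` is collecting attribute values. The sketch below gathers link targets; ``LinkCollector`` is a hypothetical name, and turning the attrs list into a dict is a convenience that assumes attribute names are not repeated within a tag.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical subclass: collect href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs with lower-cased names.
        href = dict(attrs).get("href")
        if href is not None:
            self.links.append(href)


collector = LinkCollector()
collector.feed('<a href="/search?q=1&amp;page=2">next</a>'
               '<A HREF=top>Top</A>'
               '<a name="x">anchor</a>')
collector.close()
# collector.links == ['/search?q=1&page=2', 'top']
```

The ``&amp;`` in the first attribute value arrives already replaced by ``&``, and the upper-case, unquoted second tag is normalized in the same way.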
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	142
				143
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	144	.. method:: HTMLParser.handle_endtag(tag)
				145
				146	This method is called to handle the end tag of an element (e.g. ``</div>``).
				147
				148	The tag argument is the name of the tag converted to lower case.
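Because the parser itself does not check that end tags match start tags, a subclass that needs this has to track open elements on its own. One possible sketch, using a hypothetical ``NestingChecker`` class and ignoring void elements such as ``<br>``:

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Hypothetical subclass: report end tags that do not match."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.problems = []    # end tags that did not match the stack top

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append(tag)


checker = NestingChecker()
checker.feed("<div><b>bold</div></b>")
checker.close()
# checker.problems == ['div']; 'div' is still on the stack afterwards.
```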
				149
				150
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	151	.. method:: HTMLParser.handle_startendtag(tag, attrs)
				152
				153	Similar to :meth:`handle_starttag`, but called when the parser encounters an
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	154	XHTML-style empty tag (``<img ... />``). This method may be overridden by
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	155	subclasses which require this particular lexical information; the default
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	156	implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
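The difference between the default behaviour and an override can be sketched as follows. ``DefaultEvents`` and ``EmptyTagEvents`` are hypothetical names used only for this comparison.

```python
from html.parser import HTMLParser


class DefaultEvents(HTMLParser):
    """Hypothetical subclass: log start and end tag calls."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class EmptyTagEvents(DefaultEvents):
    """Same logging, but empty tags get their own event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


markup = '<br/><img src="a.png" />'

default = DefaultEvents()
default.feed(markup)
default.close()

custom = EmptyTagEvents()
custom.feed(markup)
custom.close()
```

With the base implementation each empty tag produces a start event followed by an end event; the override receives a single call per tag instead.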
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	157
				158
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	159	.. method:: HTMLParser.handle_data(data)
				160
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	161	This method is called to process arbitrary data (e.g. text nodes and the
				162	content of ``<script>...</script>`` and ``<style>...</style>``).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	163
				164
				165	.. method:: HTMLParser.handle_entityref(name)
				166
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	167	This method is called to process a named character reference of the form
				168	``&name;`` (e.g. ``&gt;``), where name is a general entity reference
Ezio Melotti	95401c5	2013-11-23 19:52:05 +0200	[diff] [blame]	169	(e.g. ``'gt'``). This method is never called if convert_charrefs is
				170	``True``.
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	171
				172
				173	.. method:: HTMLParser.handle_charref(name)
				174
				175	This method is called to process decimal and hexadecimal numeric character
				176	references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
				177	equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
Ezio Melotti	95401c5	2013-11-23 19:52:05 +0200	[diff] [blame]	178	in this case the method will receive ``'62'`` or ``'x3E'``. This method
				179	is never called if convert_charrefs is ``True``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	180
				181
				182	.. method:: HTMLParser.handle_comment(data)
				183
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	184	This method is called when a comment is encountered (e.g. ``<!--comment-->``).
				185
				186	For example, the comment ``<!-- comment -->`` will cause this method to be
				187	called with the argument ``' comment '``.
				188
				189	The content of Internet Explorer conditional comments (condcoms) will also be
				190	sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
R David Murray	87cbfb2	2015-08-24 12:55:03 -0400	[diff] [blame]	191	this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.
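A short sketch of collecting comment text, including a conditional comment, with a hypothetical ``CommentCollector`` subclass:

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical subclass: gather the text of every comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data excludes the surrounding <!-- and --> delimiters.
        self.comments.append(data)


collector = CommentCollector()
collector.feed('<!-- a --><p>x</p>'
               '<!--[if IE 9]>y<![endif]-->')
collector.close()
```

Whitespace inside the delimiters is preserved, so the first comment is delivered as ``' a '``.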
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	192
				193
				194	.. method:: HTMLParser.handle_decl(decl)
				195
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	196	This method is called to handle an HTML doctype declaration (e.g.
				197	``<!DOCTYPE html>``).
				198
Georg Brandl	46aa5c5	2010-07-29 13:38:37 +0000	[diff] [blame]	199	The decl parameter will be the entire contents of the declaration inside
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	200	the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).
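For instance, a subclass might remember the doctype it has seen. ``DoctypeSniffer`` and its ``doctype`` attribute are hypothetical names in this sketch.

```python
from html.parser import HTMLParser


class DoctypeSniffer(HTMLParser):
    """Hypothetical subclass: remember the doctype declaration."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the text between '<!' and '>'.
        self.doctype = decl


sniffer = DoctypeSniffer()
sniffer.feed("<!DOCTYPE html><html></html>")
sniffer.close()
# sniffer.doctype == 'DOCTYPE html'
```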
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	201
				202
				203	.. method:: HTMLParser.handle_pi(data)
				204
				205	Method called when a processing instruction is encountered. The data
				206	parameter will contain the entire processing instruction. For example, for the
				207	processing instruction ``<?proc color='red'>``, this method would be called as
				208	``handle_pi("proc color='red'")``. It is intended to be overridden by a derived
				209	class; the base class implementation does nothing.
				210
				211	.. note::
				212
				213	The :class:`HTMLParser` class uses the SGML syntactic rules for processing
				214	instructions. An XHTML processing instruction using the trailing ``'?'`` will
				215	cause the ``'?'`` to be included in data.
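The note above can be checked with a small sketch. ``PICollector`` is a hypothetical name; the second input uses the XHTML-style trailing ``?``.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical subclass: gather processing instructions."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        # data is the text between '<?' and the closing '>'.
        self.instructions.append(data)


collector = PICollector()
collector.feed("<?proc color='red'>")
collector.feed('<?xml version="1.0"?>')
collector.close()
```

The second entry keeps its trailing ``?``, so code that expects XHTML-style instructions may want to strip it.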
				216
				217
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	218	.. method:: HTMLParser.unknown_decl(data)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	219
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	220	This method is called when an unrecognized declaration is read by the parser.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	221
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	222	The data parameter will be the entire contents of the declaration inside
				223	the ``<![...]>`` markup. Overriding it in a derived class is sometimes
Ezio Melotti	73a4359	2014-08-02 14:10:30 +0300	[diff] [blame]	224	useful. The base class implementation does nothing.
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	225
				226
				227	.. _htmlparser-examples:
				228
				229	Examples
				230	--------
				231
				232	The following class implements a parser that will be used to illustrate more
				233	examples::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	234
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	235	from html.parser import HTMLParser
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	236	from html.entities import name2codepoint
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	237
				238	class MyHTMLParser(HTMLParser):
				239	    def handle_starttag(self, tag, attrs):
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	240	        print("Start tag:", tag)
				241	        for attr in attrs:
				242	            print(" attr:", attr)
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	243
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	244	    def handle_endtag(self, tag):
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	245	        print("End tag :", tag)
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	246
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	247	    def handle_data(self, data):
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	248	        print("Data :", data)
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	249
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	250	    def handle_comment(self, data):
				251	        print("Comment :", data)
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	252
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	253	    def handle_entityref(self, name):
				254	        c = chr(name2codepoint[name])
				255	        print("Named ent:", c)
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	256
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	257	    def handle_charref(self, name):
				258	        if name.startswith('x'):
				259	            c = chr(int(name[1:], 16))
				260	        else:
				261	            c = chr(int(name))
				262	        print("Num ent :", c)
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	263
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	264	    def handle_decl(self, data):
				265	        print("Decl :", data)
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	266
Ezio Melotti	88ebfb1	2013-11-02 17:08:24 +0200	[diff] [blame]	267	parser = MyHTMLParser()
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	268
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	269	Parsing a doctype::
				270
				271	>>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
				272	... '"http://www.w3.org/TR/html4/strict.dtd">')
				273	Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
				274
				275	Parsing an element with a few attributes and a title::
				276
				277	>>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
				278	Start tag: img
				279	attr: ('src', 'python-logo.png')
				280	attr: ('alt', 'The Python logo')
				281	>>>
				282	>>> parser.feed('<h1>Python</h1>')
				283	Start tag: h1
				284	Data : Python
				285	End tag : h1
				286
				287	The content of ``script`` and ``style`` elements is returned as is, without
				288	further parsing::
				289
				290	>>> parser.feed('<style type="text/css">#python { color: green }</style>')
				291	Start tag: style
				292	attr: ('type', 'text/css')
				293	Data : #python { color: green }
				294	End tag : style
Serhiy Storchaka	dba9039	2016-05-10 12:01:23 +0300	[diff] [blame]	295
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	296	>>> parser.feed('<script type="text/javascript">'
				297	... 'alert("<strong>hello!</strong>");</script>')
				298	Start tag: script
				299	attr: ('type', 'text/javascript')
				300	Data : alert("<strong>hello!</strong>");
				301	End tag : script
				302
				303	Parsing comments::
				304
				305	>>> parser.feed('<!-- a comment -->'
				306	... '<!--[if IE 9]>IE-specific content<![endif]-->')
				307	Comment : a comment
				308	Comment : [if IE 9]>IE-specific content<![endif]
				309
				310	Parsing named and numeric character references and converting them to the
				311	correct char (note: these 3 references are all equivalent to ``'>'``, and the output assumes ``parser = MyHTMLParser(convert_charrefs=False)``)::
				312
				313	>>> parser.feed('&gt;&#62;&#x3E;')
				314	Named ent: >
				315	Num ent : >
				316	Num ent : >
				317
				318	Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
Ezio Melotti	95401c5	2013-11-23 19:52:05 +0200	[diff] [blame]	319	:meth:`~HTMLParser.handle_data` might be called more than once
				320	(unless convert_charrefs is set to ``True``)::
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	321
				322	>>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
				323	...     parser.feed(chunk)
				324	...
				325	Start tag: span
				326	Data : buff
				327	Data : ered
				328	Data : text
				329	End tag : span
				330
				331	Parsing invalid HTML (e.g. unquoted attributes) also works::
				332
				333	>>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
				334	Start tag: p
				335	Start tag: a
				336	attr: ('class', 'link')
				337	attr: ('href', '#main')
				338	Data : tag soup
				339	End tag : p
				340	End tag : a