:mod:`html.parser` --- Simple HTML and XHTML parser
===================================================

.. module:: html.parser
   :synopsis: A simple parser that can handle HTML and XHTML.

.. index::
   single: HTML
   single: XHTML

Source code: :source:`Lib/html/parser.py`

--------------

This module defines a class :class:`HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.

.. class:: HTMLParser(*, convert_charrefs=True)

   Create a parser instance able to parse invalid markup.

   If *convert_charrefs* is ``True`` (the default), all character
   references (except the ones in ``script``/``style`` elements) are
   automatically converted to the corresponding Unicode characters.

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered. The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   This parser does not check that end tags match start tags or call the end-tag
   handler for elements which are closed implicitly by closing an outer element.

   .. versionchanged:: 3.4
      *convert_charrefs* keyword argument added.

   .. versionchanged:: 3.5
      The default value for argument *convert_charrefs* is now ``True``.

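The effect of *convert_charrefs* can be seen by recording which handlers are
called for the same input under both settings. The following is a minimal
sketch; the ``Recorder`` class and the ``record`` helper are hypothetical names
introduced only for this illustration.

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Hypothetical helper that records text-related handler calls."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def record(markup, **kwargs):
    """Parse *markup* with a fresh Recorder and return its events."""
    parser = Recorder(**kwargs)
    parser.feed(markup)
    parser.close()
    return parser.events


markup = "<p>a &gt; b &#38; c</p>"
converted = record(markup)                      # convert_charrefs=True
raw = record(markup, convert_charrefs=False)    # references reported separately

print(converted)
print(raw)
```

With the default setting the references arrive already decoded inside the text
passed to ``handle_data``; with ``convert_charrefs=False`` they are reported
through ``handle_entityref`` and ``handle_charref`` instead.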
Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`HTMLParser` class to print out start tags, end tags, and data
as they are encountered::

   from html.parser import HTMLParser

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Encountered a start tag:", tag)

       def handle_endtag(self, tag):
           print("Encountered an end tag :", tag)

       def handle_data(self, data):
           print("Encountered some data :", data)

   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be::

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser. It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called. *data* must be :class:`str`.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark. This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`HTMLParser` base class method :meth:`close`.


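As an illustration of :meth:`feed` and :meth:`close` working together, the
sketch below feeds a document in arbitrary chunks and overrides :meth:`close`
while still calling the base class method. ``TitleParser`` and its attributes
are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Hypothetical parser that collects the text of the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []
        self.finished = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self.in_title:
            self.title_parts.append(data)

    def close(self):
        # Always let the base class flush its buffer first.
        super().close()
        self.finished = True


parser = TitleParser()
# Chunk boundaries deliberately fall in the middle of tags.
for chunk in ["<html><head><ti", "tle>Chunked ", "input</title></he", "ad></html>"]:
    parser.feed(chunk)
parser.close()

title = "".join(parser.title_parts)
print(title)
```

Joining the collected pieces rather than keeping only the last one keeps the
result independent of where the chunk boundaries happen to fall.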
.. method:: HTMLParser.reset()

   Reset the instance. Loses all unprocessed data. This is called implicitly at
   instantiation time.


.. method:: HTMLParser.getpos()

   Return current line number and offset.


.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag. This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).


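A short sketch of :meth:`getpos` and :meth:`get_starttag_text` used from inside
a handler follows. It assumes both are called while the start tag is being
handled, in which case the position refers to the beginning of that tag.
``PositionParser`` is a hypothetical name.

```python
from html.parser import HTMLParser


class PositionParser(HTMLParser):
    """Hypothetical parser that records where each start tag was found."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        line, offset = self.getpos()          # 1-based line, 0-based column
        original = self.get_starttag_text()   # tag exactly as written
        self.seen.append((tag, line, offset, original))


parser = PositionParser()
parser.feed('<ul>\n  <LI  Class="item">one</li>\n</ul>')
parser.close()

for entry in parser.seen:
    print(entry)
```

The tag name is lowercased, while the text returned by
:meth:`get_starttag_text` keeps the original case and spacing.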
The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass. The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets. The *name* will be translated to lower case,
   and quotes in the *value* have been removed, and character and entity references
   have been replaced.

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.

   All entity references from :mod:`html.entities` are replaced in the attribute
   values.


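The *attrs* list lends itself to simple extraction tasks. The sketch below
collects link targets; ``LinkCollector`` is a hypothetical name, and the check
for ``None`` assumes that an attribute written without a value is reported
with ``None`` as its value.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical parser that gathers the href of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Names are already lowercased; values are unquoted and unescaped.
            if name == "href" and value is not None:
                self.links.append(value)


collector = LinkCollector()
collector.feed('<p><A HREF="https://www.python.org/?a=1&amp;b=2">Python</A> '
               '<a name="top">anchor</a> '
               '<a href=docs.html>Docs</a></p>')
collector.close()

print(collector.links)
```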
.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``). This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.


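The difference between the default :meth:`handle_startendtag` and an overridden
one can be sketched as follows. ``EventParser`` and ``EmptyTagParser`` are
hypothetical names used only here.

```python
from html.parser import HTMLParser


class EventParser(HTMLParser):
    """Hypothetical parser that records start and end tag events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


class EmptyTagParser(EventParser):
    """Same as EventParser, but reports XHTML-style empty tags separately."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startend", tag))


markup = "<p>line<br/>break</p>"

default = EventParser()
default.feed(markup)
default.close()

custom = EmptyTagParser()
custom.feed(markup)
custom.close()

print(default.events)
print(custom.events)
```

With the default implementation, ``<br/>`` shows up as a start event followed
by an end event; the subclass sees it as a single event instead.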
.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).


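A common use of :meth:`handle_data` is to recover the plain text of a
fragment. The following sketch relies on the default ``convert_charrefs=True``
so that character references are already decoded; ``TextCollector`` is a
hypothetical name.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical parser that concatenates all text it encounters."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called once per run of text between tags.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)


collector = TextCollector()
collector.feed("<p>Fish &amp; <b>chips</b> at the caf&#233;</p>")
collector.close()

plain = collector.text()
print(plain)
```

Note that this simple version would also collect the content of ``script`` and
``style`` elements, since that content is delivered through the same method.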
.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``). This method is never called if *convert_charrefs* is
   ``True``.


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``. This method
   is never called if *convert_charrefs* is ``True``.


.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   Method called when a processing instruction is encountered. The *data*
   parameter will contain the entire processing instruction. For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``. It is intended to be overridden by a derived
   class; the base class implementation does nothing.

   .. note::

      The :class:`HTMLParser` class uses the SGML syntactic rules for processing
      instructions. An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.


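The note about the trailing ``'?'`` can be illustrated with a small sketch that
records what :meth:`handle_pi` receives for an SGML-style and an XHTML-style
processing instruction. ``PIParser`` is a hypothetical name.

```python
from html.parser import HTMLParser


class PIParser(HTMLParser):
    """Hypothetical parser that records processing instructions."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


parser = PIParser()
parser.feed("<?proc color='red'>")          # SGML style, closed by '>'
parser.feed('<?xml version="1.0"?>')        # XHTML style, closed by '?>'
parser.close()

# A caller that wants the XHTML form without the marker can strip it.
cleaned = [pi[:-1] if pi.endswith("?") else pi for pi in parser.instructions]

print(parser.instructions)
print(cleaned)
```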
.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup. Overriding this method in a derived class is
   sometimes useful. The base class implementation does nothing.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from html.parser import HTMLParser
   from html.entities import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print("Start tag:", tag)
           for attr in attrs:
               print(" attr:", attr)

       def handle_endtag(self, tag):
           print("End tag :", tag)

       def handle_data(self, data):
           print("Data :", data)

       def handle_comment(self, data):
           print("Comment :", data)

       def handle_entityref(self, name):
           c = chr(name2codepoint[name])
           print("Named ent:", c)

       def handle_charref(self, name):
           if name.startswith('x'):
               c = chr(int(name[1:], 16))
           else:
               c = chr(int(name))
           print("Num ent :", c)

       def handle_decl(self, data):
           print("Decl :", data)

   # handle_entityref() and handle_charref() are only called when
   # convert_charrefs is False, so it is disabled for these examples.
   parser = MyHTMLParser(convert_charrefs=False)

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
    attr: ('src', 'python-logo.png')
    attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data : Python
   End tag : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
    attr: ('type', 'text/css')
   Data : #python { color: green }
   End tag : style
   >>>
   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
    attr: ('type', 'text/javascript')
   Data : alert("<strong>hello!</strong>");
   End tag : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment :  a comment
   Comment : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``)::

   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent : >
   Num ent : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once
(unless *convert_charrefs* is set to ``True``)::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data : buff
   Data : ered
   Data : text
   End tag : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
    attr: ('class', 'link')
    attr: ('href', '#main')
   Data : tag soup
   End tag : p
   End tag : a
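
Because the parser itself does not check that end tags match start tags, a
subclass that needs such a check has to keep its own stack. The following is a
rough sketch of that idea under a simplifying assumption: every start tag is
expected to have an explicit end tag, which is not true of void elements such
as ``<br>`` in real HTML. ``NestingChecker`` is a hypothetical name.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Hypothetical parser that reports end tags closing the wrong element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open elements
        self.problems = []    # (end tag seen, element that was actually open)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            expected = self.open_tags[-1] if self.open_tags else None
            self.problems.append((tag, expected))


checker = NestingChecker()
checker.feed("<p><a class=link href=#main>tag soup</p ></a>")
checker.close()

print(checker.problems)    # mismatched end tags
print(checker.open_tags)   # elements never closed in the right order
```

For the tag soup above, the ``</p>`` arrives while ``a`` is still open, so it is
recorded as a problem and ``p`` remains on the stack at the end.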