Blame - Doc/library/html.parser.rst - platform/external/python/cpython3

blob: f3c36ec886719b8e2352f13df349e385e4148d47

Fred Drake	3c50ea4	2008-05-17 22:02:32 +0000	[diff] [blame]	1	:mod:`html.parser` --- Simple HTML and XHTML parser
				2	===================================================
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	3
Fred Drake	3c50ea4	2008-05-17 22:02:32 +0000	[diff] [blame]	4	.. module:: html.parser
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	5	:synopsis: A simple parser that can handle HTML and XHTML.
				6
				7
Georg Brandl	9087b7f	2008-05-18 07:53:01 +0000	[diff] [blame]	8	.. index::
				9	single: HTML
				10	single: XHTML
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	11
Raymond Hettinger	a199368	2011-01-27 01:20:32 +0000	[diff] [blame]	12	Source code: :source:`Lib/html/parser.py`
				13
				14	--------------
				15
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	16	This module defines a class :class:`HTMLParser` which serves as the basis for
				17	parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	18
R. David Murray	b579dba	2010-12-03 04:06:39 +0000	[diff] [blame]	19	.. class:: HTMLParser(strict=True)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	20
R. David Murray	b579dba	2010-12-03 04:06:39 +0000	[diff] [blame]	21	Create a parser instance. If strict is ``True`` (the default), invalid
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	22	HTML results in :exc:`~html.parser.HTMLParseError` exceptions [#]_. If
R. David Murray	b579dba	2010-12-03 04:06:39 +0000	[diff] [blame]	23	strict is ``False``, the parser uses heuristics to make a best guess at
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	24	the intention of any invalid HTML it encounters, similar to the way most
				25	browsers do. Using ``strict=False`` is advised.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	26
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	27	An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
				28	when start tags, end tags, text, comments, and other markup elements are
				29	encountered. The user should subclass :class:`.HTMLParser` and override its
				30	methods to implement the desired behavior.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	31
Georg Brandl	877b10a	2008-06-01 21:25:55 +0000	[diff] [blame]	32	This parser does not check that end tags match start tags or call the end-tag
				33	handler for elements which are closed implicitly by closing an outer element.
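Because the parser itself does no matching, a subclass that needs it has to track open elements on its own. The following is a minimal sketch, not part of the module: the ``NestingChecker`` class and its ``open_tags`` and ``mismatches`` attributes are hypothetical names, and the parser is created with default arguments (pass ``strict=False`` where that keyword is accepted and lenient parsing is wanted). Void elements such as ``<br>`` are not special-cased here.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Hypothetical helper that records end tags not matching the open element."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # stack of element names seen in start tags
        self.mismatches = []   # (expected, found) pairs for bad end tags

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # The parser reports every end tag as-is; the comparison is ours.
        expected = self.open_tags.pop() if self.open_tags else None
        if expected != tag:
            self.mismatches.append((expected, tag))


checker = NestingChecker()
checker.feed('<p><a>tag soup</p></a>')
checker.close()
print(checker.mismatches)
```

Both end tags are reported to ``handle_endtag`` in document order, so the sketch flags each of them as closing the wrong element.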
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	34
R. David Murray	bb7b753	2010-12-03 04:26:18 +0000	[diff] [blame]	35	.. versionchanged:: 3.2 The strict keyword argument was added.
				36
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	37	An exception is defined as well:
				38
				39
				40	.. exception:: HTMLParseError
				41
				42	Exception raised by the :class:`HTMLParser` class when it encounters an error
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	43	while parsing and strict is ``True``. This exception provides three
				44	attributes: :attr:`msg` is a brief message explaining the error,
				45	:attr:`lineno` is the number of the line on which the broken construct was
				46	detected, and :attr:`offset` is the number of characters into the line at
				47	which the construct starts.
				48
				49
				50	Example HTML Parser Application
				51	-------------------------------
				52
				53	As a basic example, below is a simple HTML parser that uses the
				54	:class:`HTMLParser` class to print out start tags, end tags, and data
				55	as they are encountered::
				56
				57	from html.parser import HTMLParser
				58
				59	class MyHTMLParser(HTMLParser):
				60	    def handle_starttag(self, tag, attrs):
				61	        print("Encountered a start tag:", tag)
				62	    def handle_endtag(self, tag):
				63	        print("Encountered an end tag :", tag)
				64	    def handle_data(self, data):
				65	        print("Encountered some data :", data)
				66
				67	parser = MyHTMLParser(strict=False)
				68	parser.feed('<html><head><title>Test</title></head>'
				69	            '<body><h1>Parse me!</h1></body></html>')
				70
				71	The output will then be::
				72
				73	Encountered a start tag: html
				74	Encountered a start tag: head
				75	Encountered a start tag: title
				76	Encountered some data : Test
				77	Encountered an end tag : title
				78	Encountered an end tag : head
				79	Encountered a start tag: body
				80	Encountered a start tag: h1
				81	Encountered some data : Parse me!
				82	Encountered an end tag : h1
				83	Encountered an end tag : body
				84	Encountered an end tag : html
				85
				86
				87	:class:`.HTMLParser` Methods
				88	----------------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	89
				90	:class:`HTMLParser` instances have the following methods:
				91
				92
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	93	.. method:: HTMLParser.feed(data)
				94
				95	Feed some text to the parser. It is processed insofar as it consists of
				96	complete elements; incomplete data is buffered until more data is fed or
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	97	:meth:`close` is called. The data argument must be a :class:`str`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	98
				99
				100	.. method:: HTMLParser.close()
				101
				102	Force processing of all buffered data as if it were followed by an end-of-file
				103	mark. This method may be redefined by a derived class to define additional
				104	processing at the end of the input, but the redefined version should always call
				105	the :class:`HTMLParser` base class method :meth:`close`.
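As an illustration of feeding data in pieces and of redefining :meth:`close`, here is a small sketch. The ``TextCollector`` class and its ``chunks`` and ``finished`` attributes are hypothetical names introduced for this example only.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical subclass that gathers text and notes when input has ended."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.finished = False

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Call the base class method first so buffered data is processed.
        super().close()
        self.finished = True


collector = TextCollector()
collector.feed('<p>Hello, ')   # may be handled now or buffered
collector.feed('world</p>')
collector.close()
text = ''.join(collector.chunks)
print(text)
```

Joining the chunks rather than relying on a single ``handle_data`` call keeps the result independent of how the input was split across ``feed`` calls.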
				106
				107
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	108	.. method:: HTMLParser.reset()
				109
				110	Reset the instance. Loses all unprocessed data. This is called implicitly at
				111	instantiation time.
				112
				113
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	114	.. method:: HTMLParser.getpos()
				115
				116	Return the current line number and offset as a ``(lineno, offset)`` tuple.
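One possible use of :meth:`getpos` is to record where each start tag begins. The sketch below assumes a hypothetical ``PositionLogger`` subclass with a ``positions`` list. It relies on the position being reported from inside a handler, before the parser moves past the tag.

```python
from html.parser import HTMLParser


class PositionLogger(HTMLParser):
    """Hypothetical subclass that stores (tag, (lineno, offset)) pairs."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() is queried while the tag is being handled.
        self.positions.append((tag, self.getpos()))


logger = PositionLogger()
logger.feed('<html>\n  <body>\n    <p>text</p>\n  </body>\n</html>')
logger.close()
for tag, (lineno, offset) in logger.positions:
    print(tag, lineno, offset)
```

Line numbers start at 1 and offsets at 0, so the first tag of a document is reported at ``(1, 0)``.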
				117
				118
				119	.. method:: HTMLParser.get_starttag_text()
				120
				121	Return the text of the most recently opened start tag. This should not normally
				122	be needed for structured processing, but may be useful in dealing with HTML "as
				123	deployed" or for re-generating input with minimal changes (whitespace between
				124	attributes can be preserved, etc.).
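A short sketch of how the original tag text can be recovered inside a handler. ``RawTagEcho`` and ``raw_tags`` are hypothetical names, and the input uses quoted attributes so that it is accepted regardless of strictness.

```python
from html.parser import HTMLParser


class RawTagEcho(HTMLParser):
    """Hypothetical subclass that keeps the source text of each start tag."""

    def __init__(self):
        super().__init__()
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # tag and attrs are normalized; get_starttag_text() is not.
        self.raw_tags.append(self.get_starttag_text())


echo = RawTagEcho()
echo.feed('<A  HREF="x.html"   CLASS="link">text</A>')
echo.close()
print(echo.raw_tags)
```

The stored string keeps the original letter case and the spacing between attributes, which the ``tag`` and ``attrs`` arguments do not.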
				125
				126
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	127	The following methods are called when data or markup elements are encountered
				128	and they are meant to be overridden in a subclass. The base class
				129	implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
				130
				131
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	132	.. method:: HTMLParser.handle_starttag(tag, attrs)
				133
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	134	This method is called to handle the start of a tag (e.g. ``<div id="main">``).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	135
				136	The tag argument is the name of the tag converted to lower case. The attrs
				137	argument is a list of ``(name, value)`` pairs containing the attributes found
				138	inside the tag's ``<>`` brackets. The name is converted to lower case, quotes
				139	around the value are removed, and character and entity references in the
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	140	value are replaced.
				141
				142	For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
				143	would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	144
Georg Brandl	9087b7f	2008-05-18 07:53:01 +0000	[diff] [blame]	145	All entity references from :mod:`html.entities` are replaced in the attribute
				146	values.
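Since attrs is a list of ``(name, value)`` pairs, it is often convenient to turn it into a dictionary. The following sketch collects link targets; ``LinkCollector`` and ``links`` are hypothetical names, and converting to a ``dict`` keeps only the last value if an attribute is repeated.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical subclass that gathers the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':                      # tag names arrive in lower case
            href = dict(attrs).get('href')  # attribute names do as well
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<p><A HREF="http://www.cwi.nl/">CWI</A> and '
               '<a href="/docs?a=1&amp;b=2">docs</a></p>')
collector.close()
print(collector.links)
```

The second link shows the entity replacement described above: the handler receives ``&`` rather than ``&amp;`` in the attribute value.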
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	147
				148
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	149	.. method:: HTMLParser.handle_endtag(tag)
				150
				151	This method is called to handle the end tag of an element (e.g. ``</div>``).
				152
				153	The tag argument is the name of the tag converted to lower case.
				154
				155
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	156	.. method:: HTMLParser.handle_startendtag(tag, attrs)
				157
				158	Similar to :meth:`handle_starttag`, but called when the parser encounters an
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	159	XHTML-style empty tag (``<img ... />``). This method may be overridden by
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	160	subclasses which require this particular lexical information; the default
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	161	implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
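The difference between the default behavior and an override can be sketched as follows. ``EventRecorder`` and ``events`` are hypothetical names. The subclass handles empty tags itself instead of letting the base class split them into a start and an end event.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical subclass that records which handler each tag reached."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_startendtag(self, tag, attrs):
        # Overridden: the base class would call the two handlers above.
        self.events.append(('startend', tag))


recorder = EventRecorder()
recorder.feed('<p><img src="a.png" /><br></p>')
recorder.close()
print(recorder.events)
```

Only the tag written with a trailing ``/>`` reaches ``handle_startendtag``; the plain ``<br>`` is reported as an ordinary start tag with no matching end event.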
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	162
				163
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	164	.. method:: HTMLParser.handle_data(data)
				165
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	166	This method is called to process arbitrary data (e.g. text nodes and the
				167	content of ``<script>...</script>`` and ``<style>...</style>``).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	168
				169
				170	.. method:: HTMLParser.handle_entityref(name)
				171
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	172	This method is called to process a named character reference of the form
				173	``&name;`` (e.g. ``&gt;``), where name is a general entity reference
				174	(e.g. ``'gt'``).
				175
				176
				177	.. method:: HTMLParser.handle_charref(name)
				178
				179	This method is called to process decimal and hexadecimal numeric character
				180	references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
				181	equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
				182	in this case the method will receive ``'62'`` or ``'x3E'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	183
				184
				185	.. method:: HTMLParser.handle_comment(data)
				186
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	187	This method is called when a comment is encountered (e.g. ``<!--comment-->``).
				188
				189	For example, the comment ``<!-- comment -->`` will cause this method to be
				190	called with the argument ``' comment '``.
				191
				192	The content of Internet Explorer conditional comments (condcoms) will also be
				193	sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
				194	this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	195
				196
				197	.. method:: HTMLParser.handle_decl(decl)
				198
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	199	This method is called to handle an HTML doctype declaration (e.g.
				200	``<!DOCTYPE html>``).
				201
Georg Brandl	46aa5c5	2010-07-29 13:38:37 +0000	[diff] [blame]	202	The decl parameter will be the entire contents of the declaration inside
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	203	the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	204
				205
				206	.. method:: HTMLParser.handle_pi(data)
				207
				208	Method called when a processing instruction is encountered. The data
				209	parameter will contain the entire processing instruction. For example, for the
				210	processing instruction ``<?proc color='red'>``, this method would be called as
				211	``handle_pi("proc color='red'")``. It is intended to be overridden by a derived
				212	class; the base class implementation does nothing.
				213
				214	.. note::
				215
				216	The :class:`HTMLParser` class uses the SGML syntactic rules for processing
				217	instructions. An XHTML processing instruction using the trailing ``'?'`` will
				218	cause the ``'?'`` to be included in data.
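The effect described in the note can be seen with a small sketch. ``PICollector`` and ``instructions`` are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical subclass that stores the text of processing instructions."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


collector = PICollector()
# The first instruction uses SGML syntax, the second the XHTML form with '?>'.
collector.feed("<?proc color='red'><?xml version='1.0'?>")
collector.close()
print(collector.instructions)
```

A caller that expects XHTML-style instructions can strip a trailing ``'?'`` from data before using it.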
				219
				220
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	221	.. method:: HTMLParser.unknown_decl(data)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	222
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	223	This method is called when an unrecognized declaration is read by the parser.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	224
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	225	The data parameter will be the entire contents of the declaration inside
				226	the ``<![...]>`` markup. Overriding this method in a derived class is
				227	sometimes useful. The base class implementation raises an :exc:`HTMLParseError`
				228	when strict is ``True``.
				229
				230
				231	.. _htmlparser-examples:
				232
				233	Examples
				234	--------
				235
				236	The following class implements a parser that will be used to illustrate more
				237	examples::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	238
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	239	from html.parser import HTMLParser
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	240	from html.entities import name2codepoint
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	241
				242	class MyHTMLParser(HTMLParser):
				243	    def handle_starttag(self, tag, attrs):
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	244	        print("Start tag:", tag)
				245	        for attr in attrs:
				246	            print(" attr:", attr)
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	247	    def handle_endtag(self, tag):
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	248	        print("End tag :", tag)
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	249	    def handle_data(self, data):
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	250	        print("Data :", data)
				251	    def handle_comment(self, data):
				252	        print("Comment :", data)
				253	    def handle_entityref(self, name):
				254	        c = chr(name2codepoint[name])
				255	        print("Named ent:", c)
				256	    def handle_charref(self, name):
				257	        if name.startswith('x'):
				258	            c = chr(int(name[1:], 16))
				259	        else:
				260	            c = chr(int(name))
				261	        print("Num ent :", c)
				262	    def handle_decl(self, data):
				263	        print("Decl :", data)
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	264
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	265	parser = MyHTMLParser(strict=False)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	266
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame^]	267	Parsing a doctype::
				268
				269	>>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
				270	... '"http://www.w3.org/TR/html4/strict.dtd">')
				271	Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
				272
				273	Parsing an element with a few attributes and a title::
				274
				275	>>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
				276	Start tag: img
				277	attr: ('src', 'python-logo.png')
				278	attr: ('alt', 'The Python logo')
				279	>>>
				280	>>> parser.feed('<h1>Python</h1>')
				281	Start tag: h1
				282	Data : Python
				283	End tag : h1
				284
				285	The content of ``script`` and ``style`` elements is returned as is, without
				286	further parsing::
				287
				288	>>> parser.feed('<style type="text/css">#python { color: green }</style>')
				289	Start tag: style
				290	attr: ('type', 'text/css')
				291	Data : #python { color: green }
				292	End tag : style
				293	>>>
				294	>>> parser.feed('<script type="text/javascript">'
				295	... 'alert("<strong>hello!</strong>");</script>')
				296	Start tag: script
				297	attr: ('type', 'text/javascript')
				298	Data : alert("<strong>hello!</strong>");
				299	End tag : script
				300
				301	Parsing comments::
				302
				303	>>> parser.feed('<!-- a comment -->'
				304	... '<!--[if IE 9]>IE-specific content<![endif]-->')
				305	Comment : a comment
				306	Comment : [if IE 9]>IE-specific content<![endif]
				307
				308	Parsing named and numeric character references and converting them to the
				309	correct char (note: these 3 references are all equivalent to ``'>'``)::
				310
				311	>>> parser.feed('&gt;&#62;&#x3E;')
				312	Named ent: >
				313	Num ent : >
				314	Num ent : >
				315
				316	Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
				317	:meth:`~HTMLParser.handle_data` might be called more than once::
				318
				319	>>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
				320	...     parser.feed(chunk)
				321	...
				322	Start tag: span
				323	Data : buff
				324	Data : ered
				325	Data : text
				326	End tag : span
				327
				328	Parsing invalid HTML (e.g. unquoted attributes) also works::
				329
				330	>>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
				331	Start tag: p
				332	Start tag: a
				333	attr: ('class', 'link')
				334	attr: ('href', '#main')
				335	Data : tag soup
				336	End tag : p
				337	End tag : a
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	338
R. David Murray	b579dba	2010-12-03 04:06:39 +0000	[diff] [blame]	339	.. rubric:: Footnotes
				340
R. David Murray	bb7b753	2010-12-03 04:26:18 +0000	[diff] [blame]	341	.. [#] For backward compatibility reasons strict mode does not raise
				342	exceptions for all non-compliant HTML. That is, some invalid HTML
R. David Murray	b579dba	2010-12-03 04:06:39 +0000	[diff] [blame]	343	is tolerated even in strict mode.