Blame - Doc/library/html.parser.rst - platform/external/python/cpython3

blob: 0ea964457c2bc1494499ec935c8373b2f59c2749

Fred Drake	3c50ea4	2008-05-17 22:02:32 +0000	[diff] [blame]	1	:mod:`html.parser` --- Simple HTML and XHTML parser
				2	===================================================
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	3
Fred Drake	3c50ea4	2008-05-17 22:02:32 +0000	[diff] [blame]	4	.. module:: html.parser
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	5	:synopsis: A simple parser that can handle HTML and XHTML.
				6
				7
Georg Brandl	9087b7f	2008-05-18 07:53:01 +0000	[diff] [blame]	8	.. index::
				9	single: HTML
				10	single: XHTML
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	11
Raymond Hettinger	a199368	2011-01-27 01:20:32 +0000	[diff] [blame]	12	Source code: :source:`Lib/html/parser.py`
				13
				14	--------------
				15
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	16	This module defines a class :class:`HTMLParser` which serves as the basis for
				17	parsing text files formatted in HTML (HyperText Markup Language) and XHTML.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	18
Ezio Melotti	3861d8b	2012-06-23 15:27:51 +0200	[diff] [blame]	19	.. class:: HTMLParser(strict=False)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	20
Ezio Melotti	3861d8b	2012-06-23 15:27:51 +0200	[diff] [blame]	21	Create a parser instance. If *strict* is ``False`` (the default), the parser
				22	will accept and parse invalid markup. If *strict* is ``True``, the parser
				23	will instead raise an :exc:`~html.parser.HTMLParseError` exception [#]_ when
				24	it is not able to parse the markup.
				25	The use of ``strict=True`` is discouraged and the *strict* argument is
				26	deprecated.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	27
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	28	An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
				29	when start tags, end tags, text, comments, and other markup elements are
				30	encountered. The user should subclass :class:`.HTMLParser` and override its
				31	methods to implement the desired behavior.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	32
Georg Brandl	877b10a	2008-06-01 21:25:55 +0000	[diff] [blame]	33	This parser does not check that end tags match start tags or call the end-tag
				34	handler for elements which are closed implicitly by closing an outer element.
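
Because the parser does no tag matching of its own, a subclass has to track
open elements itself if it needs that check. The following is a minimal sketch,
not part of the module: the class name ``TagBalanceChecker`` and its
``open_tags`` and ``mismatches`` attributes are hypothetical, and void elements
such as ``<br>`` are deliberately not special-cased.

```python
from html.parser import HTMLParser


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: HTMLParser itself keeps no such bookkeeping."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # stack of currently open tag names
        self.mismatches = []   # (expected, found) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Compare the end tag with the most recently opened element.
        expected = self.open_tags.pop() if self.open_tags else None
        if expected != tag:
            self.mismatches.append((expected, tag))


checker = TagBalanceChecker()
checker.feed('<p><b>bold</p></b>')
checker.close()
print(checker.mismatches)
```

A real checker would also need to skip elements that never take an end tag;
this sketch only illustrates where the bookkeeping has to live.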
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	35
Georg Brandl	61063cc	2012-06-24 22:48:30 +0200	[diff] [blame]	36	.. versionchanged:: 3.2
				37	strict keyword added.
R. David Murray	bb7b753	2010-12-03 04:26:18 +0000	[diff] [blame]	38
Ezio Melotti	3861d8b	2012-06-23 15:27:51 +0200	[diff] [blame]	39	.. deprecated-removed:: 3.3 3.5
				40	The strict argument and the strict mode have been deprecated.
				41	The parser is now able to accept and parse invalid markup too.
				42
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	43	An exception is defined as well:
				44
				45
				46	.. exception:: HTMLParseError
				47
				48	Exception raised by the :class:`HTMLParser` class when it encounters an error
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	49	while parsing and strict is ``True``. This exception provides three
				50	attributes: :attr:`msg` is a brief message explaining the error,
				51	:attr:`lineno` is the number of the line on which the broken construct was
				52	detected, and :attr:`offset` is the number of characters into the line at
				53	which the construct starts.
				54
Ezio Melotti	3861d8b	2012-06-23 15:27:51 +0200	[diff] [blame]	55	.. deprecated-removed:: 3.3 3.5
				56	This exception has been deprecated because it's never raised by the parser
				57	(when the default non-strict mode is used).
				58
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	59
				60	Example HTML Parser Application
				61	-------------------------------
				62
				63	As a basic example, below is a simple HTML parser that uses the
				64	:class:`HTMLParser` class to print out start tags, end tags, and data
				65	as they are encountered::
				66
				67	from html.parser import HTMLParser
				68
				69	class MyHTMLParser(HTMLParser):
				70	    def handle_starttag(self, tag, attrs):
				71	        print("Encountered a start tag:", tag)
				72	    def handle_endtag(self, tag):
				73	        print("Encountered an end tag :", tag)
				74	    def handle_data(self, data):
				75	        print("Encountered some data :", data)
				76
Ezio Melotti	88ebfb1	2013-11-02 17:08:24 +0200	[diff] [blame^]	77	parser = MyHTMLParser()
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	78	parser.feed('<html><head><title>Test</title></head>'
				79	'<body><h1>Parse me!</h1></body></html>')
				80
				81	The output will then be::
				82
				83	Encountered a start tag: html
				84	Encountered a start tag: head
				85	Encountered a start tag: title
				86	Encountered some data : Test
				87	Encountered an end tag : title
				88	Encountered an end tag : head
				89	Encountered a start tag: body
				90	Encountered a start tag: h1
				91	Encountered some data : Parse me!
				92	Encountered an end tag : h1
				93	Encountered an end tag : body
				94	Encountered an end tag : html
				95
				96
				97	:class:`.HTMLParser` Methods
				98	----------------------------
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	99
				100	:class:`HTMLParser` instances have the following methods:
				101
				102
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	103	.. method:: HTMLParser.feed(data)
				104
				105	Feed some text to the parser. It is processed insofar as it consists of
				106	complete elements; incomplete data is buffered until more data is fed or
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	107	:meth:`close` is called. *data* must be :class:`str`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	108
				109
				110	.. method:: HTMLParser.close()
				111
				112	Force processing of all buffered data as if it were followed by an end-of-file
				113	mark. This method may be redefined by a derived class to define additional
				114	processing at the end of the input, but the redefined version should always call
				115	the :class:`HTMLParser` base class method :meth:`close`.
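
As an illustration of redefining :meth:`close`, the sketch below joins the
collected text once the input has ended. The class name ``TextCollector`` and
its ``chunks`` and ``text`` attributes are assumptions made for this example
only.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical subclass that finalizes its result in close()."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.text = None

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Always call the base class method first so buffered data is flushed.
        super().close()
        self.text = ''.join(self.chunks)


collector = TextCollector()
collector.feed('<p>Hello, ')
collector.feed('world')
collector.close()
print(collector.text)
```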
				116
				117
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	118	.. method:: HTMLParser.reset()
				119
				120	Reset the instance. Loses all unprocessed data. This is called implicitly at
				121	instantiation time.
				122
				123
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	124	.. method:: HTMLParser.getpos()
				125
				126	Return the current line number and offset as a ``(lineno, offset)`` tuple.
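
A handler can call :meth:`getpos` to record where a construct was found. In
this sketch the class name ``PositionLogger`` and the ``positions`` attribute
are hypothetical. It assumes, based on the current implementation, that line
numbers start at 1, offsets start at 0, and the position reported inside
``handle_starttag`` is the start of the tag.

```python
from html.parser import HTMLParser


class PositionLogger(HTMLParser):
    """Hypothetical subclass that records where each start tag begins."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() returns a (lineno, offset) tuple.
        self.positions.append((tag, self.getpos()))


logger = PositionLogger()
logger.feed('<html>\n  <body>\n<p>text</p>')
logger.close()
for tag, (lineno, offset) in logger.positions:
    print(tag, lineno, offset)
```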
				127
				128
				129	.. method:: HTMLParser.get_starttag_text()
				130
				131	Return the text of the most recently opened start tag. This should not normally
				132	be needed for structured processing, but may be useful in dealing with HTML "as
				133	deployed" or for re-generating input with minimal changes (whitespace between
				134	attributes can be preserved, etc.).
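
The sketch below shows how the raw start tag text differs from the normalized
arguments passed to the handler. The class name ``RawTagEcho`` and the
``raw_tags`` attribute are made up for the example.

```python
from html.parser import HTMLParser


class RawTagEcho(HTMLParser):
    """Hypothetical subclass that keeps start tags exactly as written."""

    def __init__(self):
        super().__init__()
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # tag and attrs are normalized; the raw text keeps case and spacing.
        self.raw_tags.append(self.get_starttag_text())


echo = RawTagEcho()
echo.feed('<A  HREF="x.html"   Class=big>link</a>')
echo.close()
print(echo.raw_tags)
```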
				135
				136
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	137	The following methods are called when data or markup elements are encountered
				138	and they are meant to be overridden in a subclass. The base class
				139	implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):
				140
				141
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	142	.. method:: HTMLParser.handle_starttag(tag, attrs)
				143
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	144	This method is called to handle the start of a tag (e.g. ``<div id="main">``).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	145
				146	The *tag* argument is the name of the tag converted to lower case. The *attrs*
				147	argument is a list of ``(name, value)`` pairs containing the attributes found
				148	inside the tag's ``<>`` brackets. The *name* is converted to lower case,
				149	quotes in the *value* are removed, and character and entity references
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	150	are replaced.
				151
				152	For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, this method
				153	would be called as ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	154
Georg Brandl	9087b7f	2008-05-18 07:53:01 +0000	[diff] [blame]	155	All entity references from :mod:`html.entities` are replaced in the attribute
				156	values.
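
Since *attrs* is a list of pairs, converting it to a dictionary is often
convenient. The following sketch collects link targets; the class name
``LinkCollector`` and the ``links`` attribute are hypothetical. Note that
``dict(attrs)`` keeps only the last value if an attribute is repeated.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical subclass that gathers the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':                       # tag names arrive in lower case
            href = dict(attrs).get('href')   # attribute names do too
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A> '
               '<a name="top">anchor</a> '
               '<a href="?a=1&amp;b=2">query</a>')
collector.close()
print(collector.links)
```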
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	157
				158
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	159	.. method:: HTMLParser.handle_endtag(tag)
				160
				161	This method is called to handle the end tag of an element (e.g. ``</div>``).
				162
				163	The tag argument is the name of the tag converted to lower case.
				164
				165
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	166	.. method:: HTMLParser.handle_startendtag(tag, attrs)
				167
				168	Similar to :meth:`handle_starttag`, but called when the parser encounters an
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	169	XHTML-style empty tag (``<img ... />``). This method may be overridden by
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	170	subclasses which require this particular lexical information; the default
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	171	implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
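
To make the difference concrete, this sketch compares the default behavior
with an override. The class names ``EventRecorder`` and ``EmptyTagRecorder``
and the ``events`` attribute are assumptions introduced for the example.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical subclass relying on the default handle_startendtag()."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


class EmptyTagRecorder(EventRecorder):
    """Same recorder, but empty tags are reported as a single event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(('startend', tag))


default = EventRecorder()
default.feed('<p>line<br/>break</p>')
default.close()

custom = EmptyTagRecorder()
custom.feed('<p>line<br/>break</p>')
custom.close()

print(default.events)
print(custom.events)
```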
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	172
				173
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	174	.. method:: HTMLParser.handle_data(data)
				175
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	176	This method is called to process arbitrary data (e.g. text nodes and the
				177	content of ``<script>...</script>`` and ``<style>...</style>``).
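
Because script and style content also reaches this method, a subclass that
wants only visible text has to filter it out. A minimal sketch follows; the
class name ``VisibleText`` and its attributes are hypothetical.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Hypothetical subclass that ignores data inside script and style."""

    SKIPPED = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep the data only when no script or style element is open.
        if not self._skip_depth:
            self.parts.append(data)


extractor = VisibleText()
extractor.feed('<style>p { color: red }</style><p>Hello</p>'
               '<script>alert("<b>hi</b>");</script><p>world</p>')
extractor.close()
text = ' '.join(part.strip() for part in extractor.parts if part.strip())
print(text)
```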
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	178
				179
				180	.. method:: HTMLParser.handle_entityref(name)
				181
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	182	This method is called to process a named character reference of the form
				183	``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
				184	(e.g. ``'gt'``).
				185
				186
				187	.. method:: HTMLParser.handle_charref(name)
				188
				189	This method is called to process decimal and hexadecimal numeric character
				190	references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
				191	equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
				192	in this case the method will receive ``'62'`` or ``'x3E'``.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	193
				194
				195	.. method:: HTMLParser.handle_comment(data)
				196
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	197	This method is called when a comment is encountered (e.g. ``<!--comment-->``).
				198
				199	For example, the comment ``<!-- comment -->`` will cause this method to be
				200	called with the argument ``' comment '``.
				201
				202	The content of Internet Explorer conditional comments (condcoms) will also be
				203	sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
				204	this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.
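
The two cases described here can be reproduced with a small subclass. The
class name ``CommentCollector`` and the ``comments`` attribute are
hypothetical.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical subclass that stores the body of every comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data excludes the surrounding "<!--" and "-->" markers.
        self.comments.append(data)


comment_parser = CommentCollector()
comment_parser.feed('<!-- comment --><p>text</p>'
                    '<!--[if IE 9]>IE9-specific content<![endif]-->')
comment_parser.close()
print(comment_parser.comments)
```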
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	205
				206
				207	.. method:: HTMLParser.handle_decl(decl)
				208
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	209	This method is called to handle an HTML doctype declaration (e.g.
				210	``<!DOCTYPE html>``).
				211
Georg Brandl	46aa5c5	2010-07-29 13:38:37 +0000	[diff] [blame]	212	The decl parameter will be the entire contents of the declaration inside
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	213	the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).
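
A short sketch of capturing the declaration, using a hypothetical class
``DoctypeSniffer`` with a made-up ``doctype`` attribute:

```python
from html.parser import HTMLParser


class DoctypeSniffer(HTMLParser):
    """Hypothetical subclass that remembers the doctype declaration."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">".
        self.doctype = decl


sniffer = DoctypeSniffer()
sniffer.feed('<!DOCTYPE html><html></html>')
sniffer.close()
print(sniffer.doctype)
```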
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	214
				215
				216	.. method:: HTMLParser.handle_pi(data)
				217
				218	Method called when a processing instruction is encountered. The data
				219	parameter will contain the entire processing instruction. For example, for the
				220	processing instruction ``<?proc color='red'>``, this method would be called as
				221	``handle_pi("proc color='red'")``. It is intended to be overridden by a derived
				222	class; the base class implementation does nothing.
				223
				224	.. note::
				225
				226	The :class:`HTMLParser` class uses the SGML syntactic rules for processing
				227	instructions. An XHTML processing instruction using the trailing ``'?'`` will
				228	cause the ``'?'`` to be included in data.
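
One way to cope with the trailing ``'?'`` is to strip it in the handler. This
is only a sketch: the class name ``PICollector`` and the ``instructions``
attribute are hypothetical, and the stripping rule is an assumption about what
a caller might want rather than anything the module does itself.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical subclass that normalizes processing instructions."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        # Drop the '?' left over from XML-style "<?...?>" syntax.
        if data.endswith('?'):
            data = data[:-1]
        self.instructions.append(data)


pi_parser = PICollector()
pi_parser.feed("<?proc color='red'><?xml-stylesheet href=\"a.css\"?>")
pi_parser.close()
print(pi_parser.instructions)
```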
				229
				230
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	231	.. method:: HTMLParser.unknown_decl(data)
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	232
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	233	This method is called when an unrecognized declaration is read by the parser.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	234
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	235	The data parameter will be the entire contents of the declaration inside
				236	the ``<![...]>`` markup. It is sometimes useful to override this method in a
				237	derived class. The base class implementation raises an :exc:`HTMLParseError`
				238	when *strict* is ``True``.
				239
				240
				241	.. _htmlparser-examples:
				242
				243	Examples
				244	--------
				245
				246	The following class implements a parser that will be used to illustrate more
				247	examples::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	248
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	249	from html.parser import HTMLParser
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	250	from html.entities import name2codepoint
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	251
				252	class MyHTMLParser(HTMLParser):
				253	    def handle_starttag(self, tag, attrs):
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	254	        print("Start tag:", tag)
				255	        for attr in attrs:
				256	            print(" attr:", attr)
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	257	    def handle_endtag(self, tag):
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	258	        print("End tag :", tag)
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	259	    def handle_data(self, data):
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	260	        print("Data :", data)
				261	    def handle_comment(self, data):
				262	        print("Comment :", data)
				263	    def handle_entityref(self, name):
				264	        c = chr(name2codepoint[name])
				265	        print("Named ent:", c)
				266	    def handle_charref(self, name):
				267	        if name.startswith('x'):
				268	            c = chr(int(name[1:], 16))
				269	        else:
				270	            c = chr(int(name))
				271	        print("Num ent :", c)
				272	    def handle_decl(self, data):
				273	        print("Decl :", data)
Ezio Melotti	f99e4b5	2011-10-28 14:34:56 +0300	[diff] [blame]	274
Ezio Melotti	88ebfb1	2013-11-02 17:08:24 +0200	[diff] [blame^]	275	parser = MyHTMLParser()
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	276
Ezio Melotti	4279bc7	2012-02-18 02:01:36 +0200	[diff] [blame]	277	Parsing a doctype::
				278
				279	>>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
				280	... '"http://www.w3.org/TR/html4/strict.dtd">')
				281	Decl : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"
				282
				283	Parsing an element with a few attributes and a title::
				284
				285	>>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
				286	Start tag: img
				287	attr: ('src', 'python-logo.png')
				288	attr: ('alt', 'The Python logo')
				289	>>>
				290	>>> parser.feed('<h1>Python</h1>')
				291	Start tag: h1
				292	Data : Python
				293	End tag : h1
				294
				295	The content of ``script`` and ``style`` elements is returned as is, without
				296	further parsing::
				297
				298	>>> parser.feed('<style type="text/css">#python { color: green }</style>')
				299	Start tag: style
				300	attr: ('type', 'text/css')
				301	Data : #python { color: green }
				302	End tag : style
				303	>>>
				304	>>> parser.feed('<script type="text/javascript">'
				305	... 'alert("<strong>hello!</strong>");</script>')
				306	Start tag: script
				307	attr: ('type', 'text/javascript')
				308	Data : alert("<strong>hello!</strong>");
				309	End tag : script
				310
				311	Parsing comments::
				312
				313	>>> parser.feed('<!-- a comment -->'
				314	... '<!--[if IE 9]>IE-specific content<![endif]-->')
				315	Comment : a comment
				316	Comment : [if IE 9]>IE-specific content<![endif]
				317
				318	Parsing named and numeric character references and converting them to the
				319	correct char (note: these 3 references are all equivalent to ``'>'``)::
				320
				321	>>> parser.feed('&gt;&#62;&#x3E;')
				322	Named ent: >
				323	Num ent : >
				324	Num ent : >
				325
				326	Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
				327	:meth:`~HTMLParser.handle_data` might be called more than once::
				328
				329	>>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
				330	...     parser.feed(chunk)
				331	...
				332	Start tag: span
				333	Data : buff
				334	Data : ered
				335	Data : text
				336	End tag : span
				337
				338	Parsing invalid HTML (e.g. unquoted attributes) also works::
				339
				340	>>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
				341	Start tag: p
				342	Start tag: a
				343	attr: ('class', 'link')
				344	attr: ('href', '#main')
				345	Data : tag soup
				346	End tag : p
				347	End tag : a
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	348
R. David Murray	b579dba	2010-12-03 04:06:39 +0000	[diff] [blame]	349	.. rubric:: Footnotes
				350
R. David Murray	bb7b753	2010-12-03 04:26:18 +0000	[diff] [blame]	351	.. [#] For backward compatibility reasons strict mode does not raise
				352	exceptions for all non-compliant HTML. That is, some invalid HTML
R. David Murray	b579dba	2010-12-03 04:06:39 +0000	[diff] [blame]	353	is tolerated even in strict mode.