:mod:`HTMLParser` --- Simple HTML and XHTML parser
==================================================

.. module:: HTMLParser
   :synopsis: A simple parser that can handle HTML and XHTML.

.. note::

   The :mod:`HTMLParser` module has been renamed to :mod:`html.parser` in Python
   3. The :term:`2to3` tool will automatically adapt imports when converting
   your sources to Python 3.


.. versionadded:: 2.2

.. index::
   single: HTML
   single: XHTML

**Source code:** :source:`Lib/HTMLParser.py`

--------------

This module defines a class :class:`.HTMLParser` which serves as the basis for
parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML.
Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser
in :mod:`sgmllib`.


.. class:: HTMLParser()

   An :class:`.HTMLParser` instance is fed HTML data and calls handler methods
   when start tags, end tags, text, comments, and other markup elements are
   encountered. The user should subclass :class:`.HTMLParser` and override its
   methods to implement the desired behavior.

   The :class:`.HTMLParser` class is instantiated without arguments.

   Unlike the parser in :mod:`htmllib`, this parser does not check that end tags
   match start tags or call the end-tag handler for elements which are closed
   implicitly by closing an outer element.

An exception is defined as well:

.. exception:: HTMLParseError

   :class:`.HTMLParser` is able to handle broken markup, but in some cases it
   might raise this exception when it encounters an error while parsing.
   This exception provides three attributes: :attr:`msg` is a brief
   message explaining the error, :attr:`lineno` is the number of the line on
   which the broken construct was detected, and :attr:`offset` is the number of
   characters into the line at which the construct starts.


Example HTML Parser Application
-------------------------------

As a basic example, below is a simple HTML parser that uses the
:class:`.HTMLParser` class to print out start tags, end tags and data
as they are encountered::

   from HTMLParser import HTMLParser

   # create a subclass and override the handler methods
   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print "Encountered a start tag:", tag

       def handle_endtag(self, tag):
           print "Encountered an end tag :", tag

       def handle_data(self, data):
           print "Encountered some data  :", data

   # instantiate the parser and feed it some HTML
   parser = MyHTMLParser()
   parser.feed('<html><head><title>Test</title></head>'
               '<body><h1>Parse me!</h1></body></html>')

The output will then be::

   Encountered a start tag: html
   Encountered a start tag: head
   Encountered a start tag: title
   Encountered some data  : Test
   Encountered an end tag : title
   Encountered an end tag : head
   Encountered a start tag: body
   Encountered a start tag: h1
   Encountered some data  : Parse me!
   Encountered an end tag : h1
   Encountered an end tag : body
   Encountered an end tag : html


:class:`.HTMLParser` Methods
----------------------------

:class:`.HTMLParser` instances have the following methods:


.. method:: HTMLParser.feed(data)

   Feed some text to the parser. It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called. *data* can be either :class:`unicode` or
   :class:`str`, but passing :class:`unicode` is advised.


.. method:: HTMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark. This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the :class:`.HTMLParser` base class method :meth:`close`.
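
The interplay of ``feed()`` and ``close()`` can be sketched as follows. This is only an illustration: ``TextCollector`` and its ``chunks`` and ``finished`` attributes are hypothetical names, and the import fallback is there only so the sketch also runs under the Python 3 name of the module.

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3 name of the same module


class TextCollector(HTMLParser):
    """Hypothetical subclass: collects text and records the end of input."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
        self.finished = False

    def handle_data(self, data):
        self.chunks.append(data)

    def close(self):
        # Delegate to the base class first so buffered data is processed.
        HTMLParser.close(self)
        self.finished = True


collector = TextCollector()
# The markup arrives in arbitrary pieces; incomplete parts are buffered.
for piece in ['<p>Hello, ', 'world', '</p>']:
    collector.feed(piece)
collector.close()
text = ''.join(collector.chunks)
```

The text is joined at the end because ``handle_data()`` may be called several times for one run of text.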


.. method:: HTMLParser.reset()

   Reset the instance. All unprocessed data is lost. This is called implicitly
   at instantiation time.


.. method:: HTMLParser.getpos()

   Return the current line number and offset as a ``(lineno, offset)`` tuple.
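
As a rough sketch, ``getpos()`` can be called from inside a handler to record where each construct starts. ``PositionRecorder`` and ``positions`` are hypothetical names, and the import fallback is an assumption made so the sketch also runs on Python 3.

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3 name of the same module


class PositionRecorder(HTMLParser):
    """Hypothetical subclass: remembers where each start tag begins."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() reports the position of the construct being handled.
        self.positions.append((tag, self.getpos()))


recorder = PositionRecorder()
recorder.feed('<html>\n  <body>\n    <p>text</p>')
```

In this sketch the line number counts from 1 and the offset counts from 0, so the ``<body>`` tag on the second line is recorded at ``(2, 2)``.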


.. method:: HTMLParser.get_starttag_text()

   Return the text of the most recently opened start tag. This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).
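
A minimal sketch of ``get_starttag_text()``, assuming a hypothetical ``RawTagRecorder`` subclass that keeps the original spelling of every start tag (the import fallback is only there so the sketch also runs on Python 3):

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3 name of the same module


class RawTagRecorder(HTMLParser):
    """Hypothetical subclass: stores start tags exactly as written."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # 'tag' is already lower-cased; the raw text keeps case and spacing.
        self.raw_tags.append((tag, self.get_starttag_text()))


recorder = RawTagRecorder()
recorder.feed('<A   HREF="index.html"  CLASS=nav>home</A>')
```

The recorded raw text keeps the upper-case names and the extra whitespace, which the ``tag`` and ``attrs`` arguments do not preserve.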


The following methods are called when data or markup elements are encountered
and they are meant to be overridden in a subclass. The base class
implementations do nothing (except for :meth:`~HTMLParser.handle_startendtag`):


.. method:: HTMLParser.handle_starttag(tag, attrs)

   This method is called to handle the start of a tag (e.g. ``<div id="main">``).

   The *tag* argument is the name of the tag converted to lower case. The *attrs*
   argument is a list of ``(name, value)`` pairs containing the attributes found
   inside the tag's ``<>`` brackets. The *name* is converted to lower case,
   quotes around the *value* are removed, and character and entity references
   in the *value* are replaced.

   For instance, for the tag ``<A HREF="https://www.cwi.nl/">``, this method
   would be called as ``handle_starttag('a', [('href', 'https://www.cwi.nl/')])``.

   .. versionchanged:: 2.6
      All entity references from :mod:`htmlentitydefs` are now replaced in the
      attribute values.
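
A common use of ``handle_starttag()`` is collecting attribute values. The sketch below gathers ``href`` values from ``a`` elements. ``LinkCollector`` and ``links`` are hypothetical names, and the import fallback is an assumption so the sketch also runs on Python 3.

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3 name of the same module


class LinkCollector(HTMLParser):
    """Hypothetical subclass: gathers the href of every <a> start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # Names arrive lower-cased; a bare attribute has value None.
            if name == 'href' and value is not None:
                self.links.append(value)


collector = LinkCollector()
collector.feed('<p><A HREF="https://www.cwi.nl/">CWI</A> and '
               '<a href="/docs?a=1&amp;b=2">docs</a></p>')
```

Comparing against lower-case names is enough here because the parser has already normalized the tag and attribute names, and the ``&amp;`` in the second link is delivered as a plain ``&``.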


.. method:: HTMLParser.handle_endtag(tag)

   This method is called to handle the end tag of an element (e.g. ``</div>``).

   The *tag* argument is the name of the tag converted to lower case.


.. method:: HTMLParser.handle_startendtag(tag, attrs)

   Similar to :meth:`handle_starttag`, but called when the parser encounters an
   XHTML-style empty tag (``<img ... />``). This method may be overridden by
   subclasses which require this particular lexical information; the default
   implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`.
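
The following sketch shows one way to tell XHTML-style empty tags apart from ordinary start and end tags. ``EventRecorder`` and ``events`` are hypothetical names, and the import fallback is an assumption so the sketch also runs on Python 3.

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3 name of the same module


class EventRecorder(HTMLParser):
    """Hypothetical subclass: logs which handler saw each tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_startendtag(self, tag, attrs):
        # Overriding replaces the default start-then-end dispatch.
        self.events.append(('startend', tag))


recorder = EventRecorder()
recorder.feed('<p>line<br/>break</p>')
```

Because the override does not call the other two handlers, ``<br/>`` produces a single ``'startend'`` event rather than a start and end pair.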


.. method:: HTMLParser.handle_data(data)

   This method is called to process arbitrary data (e.g. text nodes and the
   content of ``<script>...</script>`` and ``<style>...</style>``).
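
Since the content of ``script`` and ``style`` elements also arrives through ``handle_data()``, a text extractor usually has to track whether it is inside one of them. The sketch below is one possible approach. ``VisibleText``, ``skip_depth`` and ``parts`` are hypothetical names, and the import fallback is an assumption so the sketch also runs on Python 3.

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3 name of the same module


class VisibleText(HTMLParser):
    """Hypothetical subclass: keeps text outside <script> and <style>."""

    SKIPPED = ('script', 'style')

    def __init__(self):
        HTMLParser.__init__(self)
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Ignore data delivered while inside a skipped element.
        if not self.skip_depth:
            self.parts.append(data)


extractor = VisibleText()
extractor.feed('<style>p { color: red }</style><p>Hello</p>'
               '<script>alert("<b>x</b>");</script>')
extractor.close()
visible = ''.join(extractor.parts)
```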


.. method:: HTMLParser.handle_entityref(name)

   This method is called to process a named character reference of the form
   ``&name;`` (e.g. ``&gt;``), where *name* is a general entity reference
   (e.g. ``'gt'``).


.. method:: HTMLParser.handle_charref(name)

   This method is called to process decimal and hexadecimal numeric character
   references of the form ``&#NNN;`` and ``&#xNNN;``. For example, the decimal
   equivalent for ``&gt;`` is ``&#62;``, whereas the hexadecimal is ``&#x3E;``;
   in this case the method will receive ``'62'`` or ``'x3E'``.


.. method:: HTMLParser.handle_comment(data)

   This method is called when a comment is encountered (e.g. ``<!--comment-->``).

   For example, the comment ``<!-- comment -->`` will cause this method to be
   called with the argument ``' comment '``.

   The content of Internet Explorer conditional comments (condcoms) will also be
   sent to this method, so, for ``<!--[if IE 9]>IE9-specific content<![endif]-->``,
   this method will receive ``'[if IE 9]>IE9-specific content<![endif]'``.


.. method:: HTMLParser.handle_decl(decl)

   This method is called to handle an HTML doctype declaration (e.g.
   ``<!DOCTYPE html>``).

   The *decl* parameter will be the entire contents of the declaration inside
   the ``<!...>`` markup (e.g. ``'DOCTYPE html'``).


.. method:: HTMLParser.handle_pi(data)

   This method is called when a processing instruction is encountered. The *data*
   parameter will contain the entire processing instruction. For example, for the
   processing instruction ``<?proc color='red'>``, this method would be called as
   ``handle_pi("proc color='red'")``.

   .. note::

      The :class:`.HTMLParser` class uses the SGML syntactic rules for processing
      instructions. An XHTML processing instruction using the trailing ``'?'`` will
      cause the ``'?'`` to be included in *data*.
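
The difference between the SGML and XHTML forms of a processing instruction can be sketched as follows. ``PIRecorder`` and ``instructions`` are hypothetical names, and the import fallback is an assumption so the sketch also runs on Python 3.

```python
try:
    from HTMLParser import HTMLParser      # Python 2
except ImportError:
    from html.parser import HTMLParser     # Python 3 name of the same module


class PIRecorder(HTMLParser):
    """Hypothetical subclass: stores every processing instruction."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


recorder = PIRecorder()
# First the SGML form, then the XHTML form with a trailing '?'.
recorder.feed("<?proc color='red'><?xml version='1.0'?>")

# One way to normalize the XHTML form: drop a single trailing '?'.
cleaned = [pi[:-1] if pi.endswith('?') else pi
           for pi in recorder.instructions]
```

Stripping the trailing ``'?'`` is a design choice of this sketch, not something the parser does itself.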


.. method:: HTMLParser.unknown_decl(data)

   This method is called when an unrecognized declaration is read by the parser.

   The *data* parameter will be the entire contents of the declaration inside
   the ``<![...]>`` markup. Overriding this method in a derived class is
   sometimes useful.


.. _htmlparser-examples:

Examples
--------

The following class implements a parser that will be used to illustrate more
examples::

   from HTMLParser import HTMLParser
   from htmlentitydefs import name2codepoint

   class MyHTMLParser(HTMLParser):
       def handle_starttag(self, tag, attrs):
           print "Start tag:", tag
           for attr in attrs:
               print "     attr:", attr

       def handle_endtag(self, tag):
           print "End tag  :", tag

       def handle_data(self, data):
           print "Data     :", data

       def handle_comment(self, data):
           print "Comment  :", data

       def handle_entityref(self, name):
           c = unichr(name2codepoint[name])
           print "Named ent:", c

       def handle_charref(self, name):
           if name.startswith('x'):
               c = unichr(int(name[1:], 16))
           else:
               c = unichr(int(name))
           print "Num ent  :", c

       def handle_decl(self, data):
           print "Decl     :", data

   parser = MyHTMLParser()

Parsing a doctype::

   >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
   ...             '"http://www.w3.org/TR/html4/strict.dtd">')
   Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

Parsing an element with a few attributes and a title::

   >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
   Start tag: img
        attr: ('src', 'python-logo.png')
        attr: ('alt', 'The Python logo')
   >>>
   >>> parser.feed('<h1>Python</h1>')
   Start tag: h1
   Data     : Python
   End tag  : h1

The content of ``script`` and ``style`` elements is returned as is, without
further parsing::

   >>> parser.feed('<style type="text/css">#python { color: green }</style>')
   Start tag: style
        attr: ('type', 'text/css')
   Data     : #python { color: green }
   End tag  : style

   >>> parser.feed('<script type="text/javascript">'
   ...             'alert("<strong>hello!</strong>");</script>')
   Start tag: script
        attr: ('type', 'text/javascript')
   Data     : alert("<strong>hello!</strong>");
   End tag  : script

Parsing comments::

   >>> parser.feed('<!-- a comment -->'
   ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
   Comment  :  a comment
   Comment  : [if IE 9]>IE-specific content<![endif]

Parsing named and numeric character references and converting them to the
correct char (note: these 3 references are all equivalent to ``'>'``)::

   >>> parser.feed('&gt;&#62;&#x3E;')
   Named ent: >
   Num ent  : >
   Num ent  : >

Feeding incomplete chunks to :meth:`~HTMLParser.feed` works, but
:meth:`~HTMLParser.handle_data` might be called more than once::

   >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
   ...     parser.feed(chunk)
   ...
   Start tag: span
   Data     : buff
   Data     : ered
   Data     : text
   End tag  : span

Parsing invalid HTML (e.g. unquoted attributes) also works::

   >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
   Start tag: p
   Start tag: a
        attr: ('class', 'link')
        attr: ('href', '#main')
   Data     : tag soup
   End tag  : p
   End tag  : a