|  | 
 | :mod:`HTMLParser` --- Simple HTML and XHTML parser | 
 | ================================================== | 
 |  | 
 | .. module:: HTMLParser | 
 |    :synopsis: A simple parser that can handle HTML and XHTML. | 
 |  | 
 |  | 
 | .. versionadded:: 2.2 | 
 |  | 
 | .. index:: | 
 |    single: HTML | 
 |    single: XHTML | 
 |  | 
 | This module defines a class :class:`HTMLParser` which serves as the basis for | 
 | parsing text files formatted in HTML (HyperText Mark-up Language) and XHTML. | 
 | Unlike the parser in :mod:`htmllib`, this parser is not based on the SGML parser | 
 | in :mod:`sgmllib`. | 
 |  | 
 |  | 
 | .. class:: HTMLParser() | 
 |  | 
 |    The :class:`HTMLParser` class is instantiated without arguments. | 
 |  | 
 |    An :class:`HTMLParser` instance is fed HTML data and calls handler methods when | 
 |    tags begin and end.  The :class:`HTMLParser` class is meant to be subclassed, | 
 |    with the handler methods overridden to provide the desired behavior. | 
 |  | 
 |    Unlike the parser in :mod:`htmllib`, this parser does not check that end tags | 
 |    match start tags or call the end-tag handler for elements which are closed | 
 |    implicitly by closing an outer element. | 
 |  | 
 | An exception is defined as well: | 
 |  | 
 |  | 
 | .. exception:: HTMLParseError | 
 |  | 
 |    Exception raised by the :class:`HTMLParser` class when it encounters an error | 
 |    while parsing.  This exception provides three attributes: :attr:`msg` is a brief | 
 |    message explaining the error, :attr:`lineno` is the number of the line on which | 
 |    the broken construct was detected, and :attr:`offset` is the number of | 
 |    characters into the line at which the construct starts. | 
 |  | 
 | :class:`HTMLParser` instances have the following methods: | 
 |  | 
 |  | 
 | .. method:: HTMLParser.reset() | 
 |  | 
 |    Reset the instance.  Loses all unprocessed data.  This is called implicitly at | 
 |    instantiation time. | 
 |  | 
 |  | 
 | .. method:: HTMLParser.feed(data) | 
 |  | 
 |    Feed some text to the parser.  It is processed insofar as it consists of | 
 |    complete elements; incomplete data is buffered until more data is fed or | 
 |    :meth:`close` is called. | 
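The buffering behavior can be sketched as follows. This is a minimal illustration, not part of the module: ``EventCollector`` and its ``events`` list are hypothetical names, and the ``try``/``except`` import is an assumption that lets the sketch also run where the module is available as ``html.parser``.

```python
try:
    from HTMLParser import HTMLParser   # module documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: later name of the module


class EventCollector(HTMLParser):
    """Hypothetical helper that records handler calls in a list."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


parser = EventCollector()
parser.feed('<p cla')                 # incomplete start tag: buffered
events_after_first_chunk = list(parser.events)
parser.feed('ss="x"></p>')            # completes the buffered tag
parser.close()
```

After the first call no handler has run; the start tag is reported only once the second chunk completes it.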
 |  | 
 |  | 
 | .. method:: HTMLParser.close() | 
 |  | 
 |    Force processing of all buffered data as if it were followed by an end-of-file | 
 |    mark.  This method may be redefined by a derived class to define additional | 
 |    processing at the end of the input, but the redefined version should always call | 
 |    the :class:`HTMLParser` base class method :meth:`close`. | 
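A redefined :meth:`close` might look like the sketch below. The ``TagCounter`` class and its ``counts`` and ``summary`` attributes are hypothetical names chosen for illustration, and the fallback import is an assumption for environments where the module is named ``html.parser``.

```python
try:
    from HTMLParser import HTMLParser   # module documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: later name of the module


class TagCounter(HTMLParser):
    """Hypothetical subclass that summarizes start tags at end of input."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.counts = {}
        self.summary = None

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1

    def close(self):
        # Call the base class method first so buffered data is processed.
        HTMLParser.close(self)
        # Additional end-of-input processing.
        self.summary = sorted(self.counts.items())


counter = TagCounter()
counter.feed('<ul><li>a</li><li>b</li></ul>')
counter.close()
```

Calling the base class method before the extra processing ensures the summary reflects all of the input.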
 |  | 
 |  | 
 | .. method:: HTMLParser.getpos() | 
 |  | 
 |    Return the current line number and offset as a ``(lineno, offset)`` tuple. | 
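As an illustration, a handler can call :meth:`getpos` to record where each start tag was found. ``PositionRecorder`` and ``positions`` are hypothetical names, and the fallback import is an assumption for environments where the module is named ``html.parser``.

```python
try:
    from HTMLParser import HTMLParser   # module documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: later name of the module


class PositionRecorder(HTMLParser):
    """Hypothetical subclass that notes the position of each start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives the line number and the offset into that line.
        self.positions.append((tag, self.getpos()))


recorder = PositionRecorder()
recorder.feed('<html>\n  <body>')
recorder.close()
# recorder.positions == [('html', (1, 0)), ('body', (2, 2))]
```

In this sketch the first line is numbered 1 and the offset counts characters from 0.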
 |  | 
 |  | 
 | .. method:: HTMLParser.get_starttag_text() | 
 |  | 
 |    Return the text of the most recently opened start tag.  This should not normally | 
 |    be needed for structured processing, but may be useful in dealing with HTML "as | 
 |    deployed" or for re-generating input with minimal changes (whitespace between | 
 |    attributes can be preserved, etc.). | 
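The sketch below shows one way to capture the original tag text from inside a handler. ``RawTagRecorder`` and ``raw_tags`` are hypothetical names, and the fallback import is an assumption for environments where the module is named ``html.parser``.

```python
try:
    from HTMLParser import HTMLParser   # module documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: later name of the module


class RawTagRecorder(HTMLParser):
    """Hypothetical subclass that keeps the source text of each start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.raw_tags = []

    def handle_starttag(self, tag, attrs):
        # The tag name is lower-cased, but the raw text is left untouched.
        self.raw_tags.append((tag, self.get_starttag_text()))


recorder = RawTagRecorder()
recorder.feed('<A   HREF="http://www.cwi.nl/">CWI</A>')
recorder.close()
# recorder.raw_tags == [('a', '<A   HREF="http://www.cwi.nl/">')]
```

The original case and the extra whitespace between the tag name and the attribute survive in the recorded text.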
 |  | 
 |  | 
 | .. method:: HTMLParser.handle_starttag(tag, attrs) | 
 |  | 
 |    This method is called to handle the start of a tag.  It is intended to be | 
 |    overridden by a derived class; the base class implementation does nothing. | 
 |  | 
 |    The *tag* argument is the name of the tag converted to lower case.  The *attrs* | 
 |    argument is a list of ``(name, value)`` pairs containing the attributes found | 
 |    inside the tag's ``<>`` brackets.  Each *name* is converted to lower case; in | 
 |    each *value*, the surrounding quotes are removed and character and entity | 
 |    references are replaced.  For instance, for the tag | 
 |    ``<A HREF="http://www.cwi.nl/">``, this method would be called as | 
 |    ``handle_starttag('a', [('href', 'http://www.cwi.nl/')])``. | 
 |  | 
 |    .. versionchanged:: 2.6 | 
 |       All entity references from :mod:`htmlentitydefs` are now replaced in the | 
 |       attribute values. | 
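A common use of :meth:`handle_starttag` is to pick out particular attributes. The following sketch collects link targets; ``LinkCollector`` and ``links`` are hypothetical names, and the fallback import is an assumption for environments where the module is named ``html.parser``.

```python
try:
    from HTMLParser import HTMLParser   # module documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: later name of the module


class LinkCollector(HTMLParser):
    """Hypothetical subclass that gathers the href of every <a> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs with lower-cased names.
        for name, value in attrs:
            if name == 'href':
                self.links.append(value)


collector = LinkCollector()
collector.feed('<A HREF="http://www.cwi.nl/">CWI</A> <a name="top">top</a>')
collector.close()
# collector.links == ['http://www.cwi.nl/']
```

Because tag and attribute names arrive in lower case, the comparison does not need to handle ``A`` or ``HREF`` separately.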
 |  | 
 |  | 
 | .. method:: HTMLParser.handle_startendtag(tag, attrs) | 
 |  | 
 |    Similar to :meth:`handle_starttag`, but called when the parser encounters an | 
 |    XHTML-style empty tag (``<a .../>``).  This method may be overridden by | 
 |    subclasses which require this particular lexical information; the default | 
 |    implementation simply calls :meth:`handle_starttag` and :meth:`handle_endtag`. | 
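The difference between the default behavior and an overriding subclass can be sketched like this. ``DefaultEvents``, ``EmptyTagEvents`` and ``events`` are hypothetical names, and the fallback import is an assumption for environments where the module is named ``html.parser``.

```python
try:
    from HTMLParser import HTMLParser   # module documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: later name of the module


class DefaultEvents(HTMLParser):
    """Hypothetical subclass that leaves handle_startendtag untouched."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


class EmptyTagEvents(DefaultEvents):
    """Hypothetical subclass that treats empty tags as their own event."""

    def handle_startendtag(self, tag, attrs):
        self.events.append(('empty', tag))


default = DefaultEvents()
default.feed('<img src="a.png" />')
default.close()

custom = EmptyTagEvents()
custom.feed('<img src="a.png" />')
custom.close()
```

With the default implementation the empty tag is reported as a start tag followed by an end tag; the overriding subclass sees a single event instead.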
 |  | 
 |  | 
 | .. method:: HTMLParser.handle_endtag(tag) | 
 |  | 
 |    This method is called to handle the end tag of an element.  It is intended to be | 
 |    overridden by a derived class; the base class implementation does nothing.  The | 
 |    *tag* argument is the name of the tag converted to lower case. | 
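Since the parser itself does not check that end tags match start tags, a subclass that needs this can track open elements on its own. The sketch below is one possible approach; ``NestingChecker``, ``open_tags`` and ``mismatches`` are hypothetical names, and the fallback import is an assumption for environments where the module is named ``html.parser``.

```python
try:
    from HTMLParser import HTMLParser   # module documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: later name of the module


class NestingChecker(HTMLParser):
    """Hypothetical subclass that reports end tags closing the wrong element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.open_tags = []
        self.mismatches = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            # Record the end tag seen and the element that was still open.
            expected = self.open_tags[-1] if self.open_tags else None
            self.mismatches.append((tag, expected))


checker = NestingChecker()
checker.feed('<div><p>text</div>')
checker.close()
# checker.mismatches == [('div', 'p')]
```

This simple stack treats every unclosed element as an error, which is stricter than HTML's rules for optional end tags.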
 |  | 
 |  | 
 | .. method:: HTMLParser.handle_data(data) | 
 |  | 
 |    This method is called to process arbitrary data.  It is intended to be | 
 |    overridden by a derived class; the base class implementation does nothing. | 
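For instance, :meth:`handle_data` can be used to extract the text of a document. In this sketch ``TextExtractor`` and ``chunks`` are hypothetical names, and the fallback import is an assumption for environments where the module is named ``html.parser``. The pieces are joined at the end because text may be delivered in more than one call.

```python
try:
    from HTMLParser import HTMLParser   # module documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: later name of the module


class TextExtractor(HTMLParser):
    """Hypothetical subclass that collects the text between tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


extractor = TextExtractor()
extractor.feed('<p>Hello, <b>world</b>!</p>')
extractor.close()
text = ''.join(extractor.chunks)
# text == 'Hello, world!'
```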
 |  | 
 |  | 
 | .. method:: HTMLParser.handle_charref(name) | 
 |  | 
 |    This method is called to process a character reference of the form ``&#ref;``. | 
 |    It is intended to be overridden by a derived class; the base class | 
 |    implementation does nothing. | 
 |  | 
 |  | 
 | .. method:: HTMLParser.handle_entityref(name) | 
 |  | 
 |    This method is called to process a general entity reference of the form | 
 |    ``&name;``, where *name* is the name of a general entity.  It is intended to be | 
 |    overridden by a derived class; the base class implementation does nothing. | 
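The two reference handlers can be sketched together. ``ReferenceRecorder`` and ``refs`` are hypothetical names. Two assumptions are made so the sketch also runs where the module is named ``html.parser``: the fallback import, and the ``convert_charrefs=False`` argument, which that later version needs in order to call these handlers at all; the constructor documented here takes no arguments.

```python
try:
    from HTMLParser import HTMLParser   # module documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: later name of the module


class ReferenceRecorder(HTMLParser):
    """Hypothetical subclass that records character and entity references."""

    def __init__(self):
        try:
            # Assumption: later versions convert references on their own
            # unless asked not to, so the handlers would never be called.
            HTMLParser.__init__(self, convert_charrefs=False)
        except TypeError:
            # The constructor documented here takes no arguments.
            HTMLParser.__init__(self)
        self.refs = []

    def handle_charref(self, name):
        self.refs.append(('charref', name))

    def handle_entityref(self, name):
        self.refs.append(('entityref', name))


recorder = ReferenceRecorder()
recorder.feed('caf&#233; &amp; tea')
recorder.close()
# recorder.refs == [('charref', '233'), ('entityref', 'amp')]
```

In both cases the handler receives only the text between the leading ``&#`` or ``&`` and the closing ``;``.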
 |  | 
 |  | 
 | .. method:: HTMLParser.handle_comment(data) | 
 |  | 
 |    This method is called when a comment is encountered.  The *data* argument is | 
 |    a string containing the text between the ``--`` and ``--`` delimiters, but not | 
 |    the delimiters themselves.  For example, the comment ``<!--text-->`` will cause | 
 |    this method to be called with the argument ``'text'``.  It is intended to be | 
 |    overridden by a derived class; the base class implementation does nothing. | 
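A short sketch of comment handling follows. ``CommentCollector`` and ``comments`` are hypothetical names, and the fallback import is an assumption for environments where the module is named ``html.parser``.

```python
try:
    from HTMLParser import HTMLParser   # module documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: later name of the module


class CommentCollector(HTMLParser):
    """Hypothetical subclass that gathers the text of each comment."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


collector = CommentCollector()
collector.feed('<p>x</p><!--text--><!-- note -->')
collector.close()
# collector.comments == ['text', ' note ']
```

Whitespace inside the comment is passed through unchanged.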
 |  | 
 |  | 
 | .. method:: HTMLParser.handle_decl(decl) | 
 |  | 
 |    Method called when an SGML declaration is read by the parser.  The *decl* | 
 |    parameter will be the entire contents of the declaration inside the ``<!``...\ | 
 |    ``>`` markup.  It is intended to be overridden by a derived class; the base | 
 |    class implementation does nothing. | 
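For example, a document type declaration can be captured as sketched below. ``DeclCollector`` and ``decls`` are hypothetical names, and the fallback import is an assumption for environments where the module is named ``html.parser``.

```python
try:
    from HTMLParser import HTMLParser   # module documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: later name of the module


class DeclCollector(HTMLParser):
    """Hypothetical subclass that records declarations."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.decls = []

    def handle_decl(self, decl):
        self.decls.append(decl)


collector = DeclCollector()
collector.feed('<!DOCTYPE html><html></html>')
collector.close()
# collector.decls == ['DOCTYPE html']
```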
 |  | 
 |  | 
 | .. method:: HTMLParser.handle_pi(data) | 
 |  | 
 |    Method called when a processing instruction is encountered.  The *data* | 
 |    parameter will contain the entire processing instruction. For example, for the | 
 |    processing instruction ``<?proc color='red'>``, this method would be called as | 
 |    ``handle_pi("proc color='red'")``.  It is intended to be overridden by a derived | 
 |    class; the base class implementation does nothing. | 
 |  | 
 |    .. note:: | 
 |  | 
 |       The :class:`HTMLParser` class uses the SGML syntactic rules for processing | 
 |       instructions.  An XHTML processing instruction using the trailing ``'?'`` will | 
 |       cause the ``'?'`` to be included in *data*. | 
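Both cases, the SGML form and the XHTML form with a trailing ``'?'``, can be sketched as follows. ``PICollector`` and ``instructions`` are hypothetical names, and the fallback import is an assumption for environments where the module is named ``html.parser``.

```python
try:
    from HTMLParser import HTMLParser   # module documented here
except ImportError:
    from html.parser import HTMLParser  # assumption: later name of the module


class PICollector(HTMLParser):
    """Hypothetical subclass that records processing instructions."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.instructions = []

    def handle_pi(self, data):
        self.instructions.append(data)


collector = PICollector()
collector.feed("<?proc color='red'>")                 # SGML-style
collector.feed('<?xml-stylesheet href="a.css"?>')     # XHTML-style
collector.close()
```

A subclass that wants XHTML semantics would need to strip the trailing ``'?'`` from the second entry itself.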
 |  | 
 |  | 
 | .. _htmlparser-example: | 
 |  | 
 | Example HTML Parser Application | 
 | ------------------------------- | 
 |  | 
 | As a basic example, below is a simple HTML parser that uses the | 
 | :class:`HTMLParser` class to print out tags as they are encountered:: | 
 |  | 
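 |    from HTMLParser import HTMLParser | 
 |  | 
 |    # Subclass HTMLParser and override the handlers of interest. | 
 |    class MyHTMLParser(HTMLParser): | 
 |  | 
 |        def handle_starttag(self, tag, attrs): | 
 |            print "Encountered the beginning of a %s tag" % tag | 
 |  | 
 |        def handle_endtag(self, tag): | 
 |            print "Encountered the end of a %s tag" % tag | 
 |  | 
 |    # Instantiate the parser and feed it some HTML. | 
 |    parser = MyHTMLParser() | 
 |    parser.feed('<html><head><title>Test</title></head></html>') | 
 |    parser.close() | 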
 |  |