| \section{Standard Module \sectcode{htmllib}} | 
 | \label{module-htmllib} | 
 | \stmodindex{htmllib} | 
 | \index{HTML} | 
 | \index{hypertext} | 
 |  | 
 | \renewcommand{\indexsubitem}{(in module htmllib)} | 
 |  | 
 | This module defines a class which can serve as a base for parsing text | 
 | files formatted in the HyperText Mark-up Language (HTML).  The class | 
 | is not directly concerned with I/O --- it must be provided with input | 
 | in string form via a method, and makes calls to methods of a | 
 | ``formatter'' object in order to produce output.  The | 
 | \code{HTMLParser} class is designed to be used as a base class for | 
 | other classes in order to add functionality, and allows most of its | 
 | methods to be extended or overridden.  In turn, this class is derived | 
 | from and extends the \code{SGMLParser} class defined in module | 
 | \code{sgmllib}.  Two implementations of formatter objects are | 
 | provided in the \code{formatter} module; refer to the documentation | 
 | for that module for information on the formatter interface. | 
 | \index{SGML} | 
 | \stmodindex{sgmllib} | 
 | \ttindex{SGMLParser} | 
 | \index{formatter} | 
 | \stmodindex{formatter} | 
 |  | 
 | The following is a summary of the interface defined by | 
 | \code{sgmllib.SGMLParser}: | 
 |  | 
 | \begin{itemize} | 
 |  | 
 | \item | 
 | The interface to feed data to an instance is through the \code{feed()} | 
 | method, which takes a string argument.  This can be called with as | 
 | little or as much text at a time as desired; \code{p.feed(a); | 
 | p.feed(b)} has the same effect as \code{p.feed(a+b)}.  When the data | 
 | contains complete HTML tags, these are processed immediately; | 
 | incomplete elements are saved in a buffer.  To force processing of all | 
 | unprocessed data, call the \code{close()} method. | 
 |  | 
 | For example, to parse the entire contents of a file, use: | 
 | \bcode\begin{verbatim} | 
 | parser.feed(open('myfile.html').read()) | 
 | parser.close() | 
 | \end{verbatim}\ecode | 
 | % | 
 | \item | 
 | The interface to define semantics for HTML tags is very simple: derive | 
 | a class and define methods called \code{start_\var{tag}()}, | 
 | \code{end_\var{tag}()}, or \code{do_\var{tag}()}.  The parser will | 
 | call these at appropriate moments: \code{start_\var{tag}} or | 
 | \code{do_\var{tag}} is called when an opening tag of the form | 
 | \code{<\var{tag} ...>} is encountered; \code{end_\var{tag}} is called | 
 | when a closing tag of the form \code{<\var{tag}>} is encountered.  If | 
 | an opening tag requires a corresponding closing tag, like \code{<H1>} | 
 | ... \code{</H1>}, the class should define the \code{start_\var{tag}} | 
 | method; if a tag requires no closing tag, like \code{<P>}, the class | 
 | should define the \code{do_\var{tag}} method. | 
 |  | 
 | \end{itemize} | 
 |  | 
 | The module defines a single class: | 
 |  | 
 | \begin{funcdesc}{HTMLParser}{formatter} | 
 | This is the basic HTML parser class.  It supports all entity names | 
 | required by the HTML 2.0 specification (RFC 1866).  It also defines | 
 | handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements. | 
 | \end{funcdesc} | 
 |  | 
 | In addition to tag methods, the \code{HTMLParser} class provides some | 
 | additional methods and instance variables for use within tag methods. | 
 |  | 
 | \renewcommand{\indexsubitem}{({\tt HTMLParser} method)} | 
 |  | 
 | \begin{datadesc}{formatter} | 
 | This is the formatter instance associated with the parser. | 
 | \end{datadesc} | 
 |  | 
 | \begin{datadesc}{nofill} | 
 | Boolean flag which should be true when whitespace should not be | 
 | collapsed, or false when it should be.  In general, this should only | 
 | be true when character data is to be treated as ``preformatted'' text, | 
 | as within a \code{<PRE>} element.  The default value is false.  This | 
 | affects the operation of \code{handle_data()} and \code{save_end()}. | 
 | \end{datadesc} | 
 |  | 
 | \begin{funcdesc}{anchor_bgn}{href\, name\, type} | 
 | This method is called at the start of an anchor region.  The arguments | 
 | correspond to the attributes of the \code{<A>} tag with the same | 
 | names.  The default implementation maintains a list of hyperlinks | 
 | (defined by the \code{href} argument) within the document.  The list | 
 | of hyperlinks is available as the data attribute \code{anchorlist}. | 
 | \end{funcdesc} | 
 |  | 
 | \begin{funcdesc}{anchor_end}{} | 
 | This method is called at the end of an anchor region.  The default | 
 | implementation adds a textual footnote marker using an index into the | 
 | list of hyperlinks created by \code{anchor_bgn()}. | 
 | \end{funcdesc} | 
 |  | 
 | \begin{funcdesc}{handle_image}{source\, alt\optional{\, ismap\optional{\, align\optional{\, width\optional{\, height}}}}} | 
 | This method is called to handle images.  The default implementation | 
 | simply passes the \code{alt} value to the \code{handle_data()} | 
 | method. | 
 | \end{funcdesc} | 
 |  | 
 | \begin{funcdesc}{save_bgn}{} | 
 | Begins saving character data in a buffer instead of sending it to the | 
 | formatter object.  Retrieve the stored data via \code{save_end()} | 
 | Use of the \code{save_bgn()} / \code{save_end()} pair may not be | 
 | nested. | 
 | \end{funcdesc} | 
 |  | 
 | \begin{funcdesc}{save_end}{} | 
 | Ends buffering character data and returns all data saved since the | 
 | preceeding call to \code{save_bgn()}.  If \code{nofill} flag is false, | 
 | whitespace is collapsed to single spaces.  A call to this method | 
 | without a preceeding call to \code{save_bgn()} will raise a | 
 | \code{TypeError} exception. | 
 | \end{funcdesc} |