| \section{\module{htmllib} --- |
| A parser for HTML documents} |
| |
| \declaremodule{standard}{htmllib} |
| \modulesynopsis{A parser for HTML documents.} |
| |
| \index{HTML} |
| \index{hypertext} |
| |
| |
| This module defines a class which can serve as a base for parsing text |
| files formatted in the HyperText Markup Language (HTML). The class |
| is not directly concerned with I/O; it must be given its input as |
| strings via the \method{feed()} method, and it calls methods of a |
| ``formatter'' object to produce output. The |
| \class{HTMLParser} class is designed to be used as a base class for |
| other classes in order to add functionality, and allows most of its |
| methods to be extended or overridden. In turn, this class is derived |
| from and extends the \class{SGMLParser} class defined in module |
| \refmodule{sgmllib}\refstmodindex{sgmllib}. The \class{HTMLParser} |
| implementation supports the HTML 2.0 language as described in |
| \rfc{1866}. Two implementations of formatter objects are provided in |
| the \refmodule{formatter}\refstmodindex{formatter} module; refer to the |
| documentation for that module for information on the formatter |
| interface. |
| \withsubitem{(in module sgmllib)}{\ttindex{SGMLParser}} |
| |
| The following is a summary of the interface defined by |
| \class{sgmllib.SGMLParser}: |
| |
| \begin{itemize} |
| |
| \item |
| The interface to feed data to an instance is through the \method{feed()} |
| method, which takes a string argument. This can be called with as |
| little or as much text at a time as desired; \samp{p.feed(a); |
| p.feed(b)} has the same effect as \samp{p.feed(a+b)}. When the data |
| contains complete HTML tags, these are processed immediately; |
| incomplete elements are saved in a buffer. To force processing of all |
| unprocessed data, call the \method{close()} method. |
| |
| For example, to parse the entire contents of a file, use: |
| \begin{verbatim} |
| parser.feed(open('myfile.html').read()) |
| parser.close() |
| \end{verbatim} |
| |
| \item |
| The interface to define semantics for HTML tags is very simple: derive |
| a class and define methods called \method{start_\var{tag}()}, |
| \method{end_\var{tag}()}, or \method{do_\var{tag}()}. The parser will |
| call these at appropriate moments: \method{start_\var{tag}()} or |
| \method{do_\var{tag}()} is called when an opening tag of the form |
| \code{<\var{tag} ...>} is encountered; \method{end_\var{tag}()} is called |
| when a closing tag of the form \code{</\var{tag}>} is encountered. If |
| an opening tag requires a corresponding closing tag, like \code{<H1>} |
| ... \code{</H1>}, the class should define the \method{start_\var{tag}()} |
| method; if a tag requires no closing tag, like \code{<P>}, the class |
| should define the \method{do_\var{tag}()} method. |
| |
| \end{itemize} |
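
The two conventions summarized above can be illustrated with a small,
self-contained sketch. The \class{ToyTagDispatcher} and
\class{HeadingLogger} classes below are hypothetical and are not the
\module{sgmllib} implementation: they ignore attributes, character data,
and entities, and exist only to show how incomplete input can be buffered
between \method{feed()} calls and how tag names might be mapped onto
\method{start_\var{tag}()}, \method{end_\var{tag}()}, and
\method{do_\var{tag}()} methods.

```python
import re


class ToyTagDispatcher(object):
    """Toy stand-in for the dispatch convention; NOT sgmllib's real code."""

    # Matches only simple tags such as <H1>, </H1> or <P>; attributes are ignored.
    TAG = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>')

    def __init__(self):
        self.buffer = ''   # unprocessed input, possibly ending in a partial tag
        self.events = []   # filled in by the handlers of subclasses

    def feed(self, data):
        self.buffer += data
        pos = 0
        for match in self.TAG.finditer(self.buffer):
            closing, name = match.group(1), match.group(2).lower()
            if closing:
                candidates = ['end_' + name]
            else:
                candidates = ['start_' + name, 'do_' + name]
            # Call the first handler the subclass defines, if any.
            for method_name in candidates:
                handler = getattr(self, method_name, None)
                if handler is not None:
                    handler()
                    break
            pos = match.end()
        # Keep whatever follows the last complete tag for the next call.
        self.buffer = self.buffer[pos:]


class HeadingLogger(ToyTagDispatcher):
    def start_h1(self):            # <H1> requires a closing tag
        self.events.append('start h1')

    def end_h1(self):
        self.events.append('end h1')

    def do_p(self):                # <P> requires no closing tag
        self.events.append('p')


whole = HeadingLogger()
whole.feed('<H1>Title</H1><P>Body')

pieces = HeadingLogger()
for chunk in ('<H1>Ti', 'tle</H', '1><P>Body'):
    pieces.feed(chunk)
```

In this sketch both instances record the same sequence of handler calls,
which mirrors the statement that \samp{p.feed(a); p.feed(b)} has the same
effect as \samp{p.feed(a+b)}.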
| |
| The module defines a single class: |
| |
| \begin{classdesc}{HTMLParser}{formatter} |
| This is the basic HTML parser class. It supports all entity names |
| required by the HTML 2.0 specification (\rfc{1866}). It also defines |
| handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements. |
| \end{classdesc} |
| |
| |
| \begin{seealso} |
| \seemodule{HTMLParser}{Alternate HTML parser that offers a slightly |
| lower-level view of the input, but is |
| designed to work with XHTML. It does not |
| implement some of the SGML syntax that is not |
| used in ``HTML as deployed'' and is not legal |
| for XHTML.} |
| \seemodule{htmlentitydefs}{Definition of replacement text for HTML |
| 2.0 entities.} |
| \seemodule{sgmllib}{Base class for \class{HTMLParser}.} |
| \end{seealso} |
| |
| |
| \subsection{HTMLParser Objects \label{html-parser-objects}} |
| |
| Besides the tag methods, the \class{HTMLParser} class provides the |
| following methods and instance variables for use within tag methods. |
| |
| \begin{memberdesc}{formatter} |
| This is the formatter instance associated with the parser. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{nofill} |
| Boolean flag: true when whitespace should be preserved, false when |
| runs of whitespace should be collapsed. In general, this should only |
| be true when character data is to be treated as ``preformatted'' text, |
| as within a \code{<PRE>} element. The default value is false. This |
| affects the operation of \method{handle_data()} and \method{save_end()}. |
| \end{memberdesc} |
| |
| |
| \begin{methoddesc}{anchor_bgn}{href, name, type} |
| This method is called at the start of an anchor region. The arguments |
| correspond to the attributes of the \code{<A>} tag with the same |
| names. The default implementation maintains a list of hyperlinks |
| (defined by the \code{HREF} attribute for \code{<A>} tags) within the |
| document. The list of hyperlinks is available as the data attribute |
| \member{anchorlist}. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{anchor_end}{} |
| This method is called at the end of an anchor region. The default |
| implementation adds a textual footnote marker using an index into the |
| list of hyperlinks created by \method{anchor_bgn()}. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{handle_image}{source, alt\optional{, ismap\optional{, align\optional{, width\optional{, height}}}}} |
| This method is called to handle images. The default implementation |
| simply passes the \var{alt} value to the \method{handle_data()} |
| method. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{save_bgn}{} |
| Begins saving character data in a buffer instead of sending it to the |
| formatter object. Retrieve the stored data via \method{save_end()}. |
| Calls to \method{save_bgn()} and \method{save_end()} may not be |
| nested. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{save_end}{} |
| Ends buffering character data and returns all data saved since the |
| preceding call to \method{save_bgn()}. If the \member{nofill} flag is |
| false, whitespace is collapsed to single spaces. A call to this |
| method without a preceding call to \method{save_bgn()} will raise a |
| \exception{TypeError} exception. |
| \end{methoddesc} |
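
The effect of the \member{nofill} flag on the data returned by
\method{save_end()} can be approximated in a few lines. The function
\function{collapse_whitespace()} below is a hypothetical helper, not part
of \module{htmllib}; it assumes that collapsing also drops leading and
trailing whitespace, which the description above does not state
explicitly.

```python
def collapse_whitespace(data, nofill=False):
    """Hypothetical helper: mimic the collapsing described for save_end()."""
    if nofill:
        # Preformatted text: return the data unchanged.
        return data
    # split() with no arguments splits on any run of whitespace and drops
    # leading/trailing whitespace; joining with ' ' leaves single spaces.
    return ' '.join(data.split())


filled = collapse_whitespace('  one \n two\tthree ')
preformatted = collapse_whitespace(' a  b ', nofill=True)
```

With \var{nofill} false, runs of spaces, tabs, and newlines each become a
single space; with \var{nofill} true, the saved text is passed through
untouched, as would be wanted inside a \code{<PRE>} element.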
| |
| |
| |
| \section{\module{htmlentitydefs} --- |
| Definitions of HTML general entities} |
| |
| \declaremodule{standard}{htmlentitydefs} |
| \modulesynopsis{Definitions of HTML general entities.} |
| \sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org} |
| |
| This module defines three dictionaries, \code{name2codepoint}, |
| \code{codepoint2name}, and \code{entitydefs}. \code{entitydefs} is |
| used by the \refmodule{htmllib} module to provide the |
| \member{entitydefs} member of the \class{HTMLParser} class. The |
| definition provided here contains all the entities defined by XHTML 1.0 |
| that can be handled using simple textual substitution in the Latin-1 |
| character set (ISO-8859-1). |
| |
| |
| \begin{datadesc}{entitydefs} |
| A dictionary mapping XHTML 1.0 entity names to their |
| replacement text in ISO Latin-1. |
| \end{datadesc} |
| |
| \begin{datadesc}{name2codepoint} |
| A dictionary that maps HTML entity names to their Unicode code points. |
| \versionadded{2.3} |
| \end{datadesc} |
| |
| \begin{datadesc}{codepoint2name} |
| A dictionary that maps Unicode code points to HTML entity names. |
| \versionadded{2.3} |
| \end{datadesc} |
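
As a brief illustration of the two code point dictionaries, the sketch
below looks up an entity name in each direction. The helper
\function{entity_reference()} is hypothetical and not part of this
module. The fallback import assumes that, on interpreters where
\module{htmlentitydefs} is unavailable, the same tables are provided by
\module{html.entities}.

```python
try:
    # The module described in this section.
    from htmlentitydefs import name2codepoint, codepoint2name
except ImportError:
    # Assumption: newer interpreters expose the same tables here.
    from html.entities import name2codepoint, codepoint2name


def entity_reference(codepoint):
    """Hypothetical helper: named reference if one exists, else numeric."""
    name = codepoint2name.get(codepoint)
    if name is not None:
        return '&%s;' % name
    return '&#%d;' % codepoint


eacute = name2codepoint['eacute']      # name -> code point
named = entity_reference(eacute)       # code point -> named reference
numeric = entity_reference(0x4E2D)     # no entity name for this code point
```

Falling back to a numeric character reference is only one possible
policy; an application might equally choose to emit the character itself.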