\section{Standard Module \sectcode{htmllib}}
\label{module-htmllib}
\stmodindex{htmllib}
\index{HTML}
\index{hypertext}

\renewcommand{\indexsubitem}{(in module htmllib)}

This module defines a class which can serve as a base for parsing text
files formatted in the HyperText Mark-up Language (HTML).  The class
is not directly concerned with I/O; it must be provided with input
in string form via a method, and makes calls to methods of a
``formatter'' object in order to produce output.  The
\code{HTMLParser} class is designed to be used as a base class for
other classes in order to add functionality, and allows most of its
methods to be extended or overridden.  In turn, this class is derived
from and extends the \code{SGMLParser} class defined in module
\code{sgmllib}.  Two implementations of formatter objects are
provided in the \code{formatter} module; refer to the documentation
for that module for information on the formatter interface.
\index{SGML}
\stmodindex{sgmllib}
\ttindex{SGMLParser}
\index{formatter}
\stmodindex{formatter}

The following is a summary of the interface defined by
\code{sgmllib.SGMLParser}:

\begin{itemize}

\item
The interface to feed data to an instance is through the \code{feed()}
method, which takes a string argument.  This can be called with as
little or as much text at a time as desired; \code{p.feed(a);
p.feed(b)} has the same effect as \code{p.feed(a+b)}.  When the data
contains complete HTML tags, these are processed immediately;
incomplete elements are saved in a buffer.  To force processing of all
unprocessed data, call the \code{close()} method.

For example, to parse the entire contents of a file, use:
\bcode\begin{verbatim}
parser.feed(open('myfile.html').read())
parser.close()
\end{verbatim}\ecode
%
\item
The interface to define semantics for HTML tags is very simple: derive
a class and define methods called \code{start_\var{tag}()},
\code{end_\var{tag}()}, or \code{do_\var{tag}()}.  The parser will
call these at appropriate moments: \code{start_\var{tag}} or
\code{do_\var{tag}} is called when an opening tag of the form
\code{<\var{tag} ...>} is encountered; \code{end_\var{tag}} is called
when a closing tag of the form \code{</\var{tag}>} is encountered.  If
an opening tag requires a corresponding closing tag, like \code{<H1>}
... \code{</H1>}, the class should define the \code{start_\var{tag}}
method; if a tag requires no closing tag, like \code{<P>}, the class
should define the \code{do_\var{tag}} method.

\end{itemize}
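
The two points above can be illustrated with a small, self-contained
sketch.  The class \code{TinyTagParser} below is a hypothetical
stand-in written for this illustration; it is not the
\code{SGMLParser} implementation and handles only the simplest tags.
It passes the raw attribute text to the opening-tag handlers, which is
a simplification assumed here, and it records handler calls in an
\code{events} list purely for demonstration.

```python
import re


class TinyTagParser:
    """Hypothetical stand-in for the feed()/close() and tag-method
    conventions described above; not the real SGMLParser."""

    # One complete tag such as <H1>, </H1> or <A HREF="x">.
    _tag = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>')

    def __init__(self):
        self.rawdata = ''   # unprocessed input, possibly a partial tag
        self.events = []    # record of handler calls (demonstration only)

    def feed(self, data):
        self.rawdata += data
        self._process(final=False)

    def close(self):
        self._process(final=True)

    def handle_data(self, data):
        # Merge adjacent character data so chunk boundaries do not show.
        if self.events and self.events[-1][0] == 'data':
            data = self.events.pop()[1] + data
        self.events.append(('data', data))

    def _process(self, final):
        text, pos = self.rawdata, 0
        while pos < len(text):
            lt = text.find('<', pos)
            if lt < 0:
                lt = len(text)
            if lt > pos:                  # character data before next tag
                self.handle_data(text[pos:lt])
                pos = lt
                continue
            match = self._tag.match(text, pos)
            if match is None:
                if final:                 # close(): flush what is left
                    self.handle_data(text[pos:])
                    pos = len(text)
                break                     # else keep partial tag buffered
            closing, name, rest = match.groups()
            name = name.lower()
            if closing:
                handler = getattr(self, 'end_' + name, None)
                if handler:
                    handler()
            else:
                handler = (getattr(self, 'start_' + name, None)
                           or getattr(self, 'do_' + name, None))
                if handler:
                    handler(rest.strip())
            pos = match.end()
        self.rawdata = text[pos:]


class Demo(TinyTagParser):
    def start_h1(self, attrs):            # <H1> needs a closing tag
        self.events.append(('start', 'h1'))

    def end_h1(self):
        self.events.append(('end', 'h1'))

    def do_p(self, attrs):                # <P> needs no closing tag
        self.events.append(('do', 'p'))


one = Demo()
one.feed('<H1>Title</H1><P>Body')
one.close()

two = Demo()
for chunk in ('<H1>Ti', 'tle</H', '1><P>Body'):
    two.feed(chunk)
two.close()
```

In this sketch both instances end up with the same sequence of events,
which mirrors the statement that feeding the text in pieces has the
same effect as feeding it all at once.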

The module defines a single class:

\begin{funcdesc}{HTMLParser}{formatter}
This is the basic HTML parser class.  It supports all entity names
required by the HTML 2.0 specification (RFC 1866).  It also defines
handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
\end{funcdesc}

In addition to tag methods, the \code{HTMLParser} class provides some
additional methods and instance variables for use within tag methods.

\renewcommand{\indexsubitem}{({\tt HTMLParser} method)}

\begin{datadesc}{formatter}
This is the formatter instance associated with the parser.
\end{datadesc}

\begin{datadesc}{nofill}
Boolean flag which should be true when whitespace should not be
collapsed, or false when it should be.  In general, this should only
be true when character data is to be treated as ``preformatted'' text,
as within a \code{<PRE>} element.  The default value is false.  This
affects the operation of \code{handle_data()} and \code{save_end()}.
\end{datadesc}

\begin{funcdesc}{anchor_bgn}{href\, name\, type}
This method is called at the start of an anchor region.  The arguments
correspond to the attributes of the \code{<A>} tag with the same
names.  The default implementation maintains a list of hyperlinks
(defined by the \code{href} argument) within the document.  The list
of hyperlinks is available as the data attribute \code{anchorlist}.
\end{funcdesc}

\begin{funcdesc}{anchor_end}{}
This method is called at the end of an anchor region.  The default
implementation adds a textual footnote marker using an index into the
list of hyperlinks created by \code{anchor_bgn()}.
\end{funcdesc}

\begin{funcdesc}{handle_image}{source\, alt\optional{\, ismap\optional{\, align\optional{\, width\optional{\, height}}}}}
This method is called to handle images.  The default implementation
simply passes the \code{alt} value to the \code{handle_data()}
method.
\end{funcdesc}
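
A derived class may want to do more with images than emit the
\code{alt} text.  The sketch below shows one way this could be done.
\code{ImageSketch} is a hypothetical stand-in for the documented
default behaviour, and \code{ImageCollector}, the default values of
the optional arguments and the file names are likewise assumptions of
this example.

```python
class ImageSketch:
    """Hypothetical stand-in for the described default handle_image()."""

    def __init__(self):
        self.output = []   # stands in for data sent to the formatter

    def handle_data(self, data):
        self.output.append(data)

    def handle_image(self, source, alt, ismap='', align='',
                     width=0, height=0):
        # Default behaviour as described: only the alt text is used.
        self.handle_data(alt)


class ImageCollector(ImageSketch):
    """Override that also records the source of every image."""

    def __init__(self):
        ImageSketch.__init__(self)
        self.images = []

    def handle_image(self, source, alt, *args):
        self.images.append(source)
        # Fall back to the default so the alt text is still emitted.
        ImageSketch.handle_image(self, source, alt, *args)


page = ImageCollector()
page.handle_image('logo.gif', 'Company logo')
page.handle_image('map.gif', 'Site map', 'ismap', 'top', 200, 100)
```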

\begin{funcdesc}{save_bgn}{}
Begins saving character data in a buffer instead of sending it to the
formatter object.  Retrieve the stored data via \code{save_end()}.
Use of the \code{save_bgn()} / \code{save_end()} pair may not be
nested.
\end{funcdesc}

\begin{funcdesc}{save_end}{}
Ends buffering character data and returns all data saved since the
preceding call to \code{save_bgn()}.  If the \code{nofill} flag is
false, whitespace is collapsed to single spaces.  A call to this method
without a preceding call to \code{save_bgn()} will raise a
\code{TypeError} exception.
\end{funcdesc}
Guido van Rossum86751151995-02-28 17:14:32 +0000120\end{funcdesc}