\section{Standard Module \sectcode{htmllib}}
\label{module-htmllib}
\stmodindex{htmllib}
\rfcindex{1866}
\index{HTML}
\index{hypertext}

\renewcommand{\indexsubitem}{(in module htmllib)}

This module defines a class which can serve as a base for parsing text
files formatted in the HyperText Markup Language (HTML). The class
is not directly concerned with I/O: it must be provided with input
in string form via a method, and it makes calls to methods of a
``formatter'' object in order to produce output. The
\code{HTMLParser} class is designed to be used as a base class for
other classes in order to add functionality, and allows most of its
methods to be extended or overridden. In turn, this class is derived
from and extends the \code{SGMLParser} class defined in module
\code{sgmllib}. Two implementations of formatter objects are
provided in the \code{formatter} module; refer to the documentation
for that module for information on the formatter interface.
\index{SGML}
\refstmodindex{sgmllib}
\ttindex{SGMLParser}
\index{formatter}
\refstmodindex{formatter}

The following is a summary of the interface defined by
\code{sgmllib.SGMLParser}:

\begin{itemize}

\item
The interface to feed data to an instance is through the \code{feed()}
method, which takes a string argument. This can be called with as
little or as much text at a time as desired; \code{p.feed(a);
p.feed(b)} has the same effect as \code{p.feed(a+b)}. When the data
contains complete HTML tags, these are processed immediately;
incomplete elements are saved in a buffer. To force processing of all
unprocessed data, call the \code{close()} method.

For example, to parse the entire contents of a file, use:
\bcode\begin{verbatim}
parser.feed(open('myfile.html').read())
parser.close()
\end{verbatim}\ecode
%
\item
The interface to define semantics for HTML tags is very simple: derive
a class and define methods called \code{start_\var{tag}()},
\code{end_\var{tag}()}, or \code{do_\var{tag}()}. The parser will
call these at appropriate moments: \code{start_\var{tag}} or
\code{do_\var{tag}} is called when an opening tag of the form
\code{<\var{tag} ...>} is encountered; \code{end_\var{tag}} is called
when a closing tag of the form \code{</\var{tag}>} is encountered. If
an opening tag requires a corresponding closing tag, like \code{<H1>}
... \code{</H1>}, the class should define the \code{start_\var{tag}}
method; if a tag requires no closing tag, like \code{<P>}, the class
should define the \code{do_\var{tag}} method.

\end{itemize}
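
The naming convention for tag methods can be illustrated with a small,
self-contained sketch. The \code{ToyTagParser} and \code{HeadingParser}
classes below are hypothetical and are not part of \code{sgmllib} or
\code{htmllib}; they only mimic how a tag name might be mapped to a
\code{start_\var{tag}}, \code{end_\var{tag}}, or \code{do_\var{tag}}
method. Handler arguments (such as tag attributes) and the buffering of
incomplete tags are deliberately not modeled.

```python
import re


# Hypothetical toy parser: it mimics only the start_/end_/do_ naming
# convention described above and is NOT the sgmllib implementation.
class ToyTagParser:
    # Matches an opening or closing tag, or a run of character data.
    _token = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)')

    def __init__(self):
        self.events = []

    def feed(self, text):
        for closing, tag, data in self._token.findall(text):
            if data:
                self.events.append(('data', data))
            elif closing:
                self._dispatch('end_', tag)
            elif not self._dispatch('start_', tag):
                # No start_<tag> method: fall back to do_<tag>.
                self._dispatch('do_', tag)

    def _dispatch(self, prefix, tag):
        # Look up a method named e.g. start_h1; report whether it exists.
        method = getattr(self, prefix + tag.lower(), None)
        if method is None:
            return False
        method()
        return True


class HeadingParser(ToyTagParser):
    def start_h1(self):      # <H1> requires a closing tag
        self.events.append(('start', 'h1'))

    def end_h1(self):
        self.events.append(('end', 'h1'))

    def do_p(self):          # <P> requires no closing tag
        self.events.append(('do', 'p'))


parser = HeadingParser()
parser.feed('<H1>Title</H1><P>Body')
print(parser.events)
```

Only the subclass knows about individual tags; the base class merely
looks up methods by name, which is why new tags can be supported by
adding methods without touching the parsing loop.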

The module defines a single class:

\begin{funcdesc}{HTMLParser}{formatter}
This is the basic HTML parser class. It supports all entity names
required by the HTML 2.0 specification (\rfc{1866}). It also defines
handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
\end{funcdesc}

In addition to tag methods, the \code{HTMLParser} class provides some
additional methods and instance variables for use within tag methods.

\renewcommand{\indexsubitem}{(HTMLParser method)}

\begin{datadesc}{formatter}
This is the formatter instance associated with the parser.
\end{datadesc}

\begin{datadesc}{nofill}
Boolean flag which should be true when whitespace should not be
collapsed, or false when it should be. In general, this should only
be true when character data is to be treated as ``preformatted'' text,
as within a \code{<PRE>} element. The default value is false. This
affects the operation of \code{handle_data()} and \code{save_end()}.
\end{datadesc}

\begin{funcdesc}{anchor_bgn}{href\, name\, type}
This method is called at the start of an anchor region. The arguments
correspond to the attributes of the \code{<A>} tag with the same
names. The default implementation maintains a list of hyperlinks
(defined by the \code{href} argument) within the document. The list
of hyperlinks is available as the data attribute \code{anchorlist}.
\end{funcdesc}

\begin{funcdesc}{anchor_end}{}
This method is called at the end of an anchor region. The default
implementation adds a textual footnote marker using an index into the
list of hyperlinks created by \code{anchor_bgn()}.
\end{funcdesc}
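
The interplay of \code{anchor_bgn()}, \code{anchor_end()} and
\code{anchorlist} can be sketched as follows. The \code{AnchorTracker}
class is a hypothetical stand-in rather than the \code{HTMLParser}
implementation; the \code{output} list (standing in for the formatter)
and the \code{[\var{n}]} marker format are assumptions made for
illustration.

```python
# Hypothetical stand-in for the anchor bookkeeping described above;
# AnchorTracker is not part of htmllib.
class AnchorTracker:
    def __init__(self):
        self.anchorlist = []   # hyperlinks seen so far
        self.anchor = None     # href of the anchor currently open
        self.output = []       # stands in for the formatter

    def anchor_bgn(self, href, name, type):
        # Arguments mirror the HREF, NAME and TYPE attributes of <A>.
        self.anchor = href
        if href:
            self.anchorlist.append(href)

    def anchor_end(self):
        if self.anchor:
            # Assumed marker format: 1-based index into anchorlist.
            self.output.append('[%d]' % len(self.anchorlist))
            self.anchor = None


tracker = AnchorTracker()
tracker.anchor_bgn('http://www.python.org/', '', '')
tracker.output.append('Python')      # text inside the anchor
tracker.anchor_end()
print(tracker.output, tracker.anchorlist)
```

A subclass that wants a different presentation of links, for example
collecting them without emitting markers, would override
\code{anchor_end()} and leave \code{anchor_bgn()} alone.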

\begin{funcdesc}{handle_image}{source\, alt\optional{\, ismap\optional{\, align\optional{\, width\optional{\, height}}}}}
This method is called to handle images. The default implementation
simply passes the \code{alt} value to the \code{handle_data()}
method.
\end{funcdesc}

\begin{funcdesc}{save_bgn}{}
Begins saving character data in a buffer instead of sending it to the
formatter object. Retrieve the stored data via \code{save_end()}.
Uses of the \code{save_bgn()} / \code{save_end()} pair may not be
nested.
\end{funcdesc}

\begin{funcdesc}{save_end}{}
Ends buffering character data and returns all data saved since the
preceding call to \code{save_bgn()}. If the \code{nofill} flag is false,
whitespace is collapsed to single spaces. A call to this method
without a preceding call to \code{save_bgn()} will raise a
\code{TypeError} exception.
\end{funcdesc}
Guido van Rossum86751151995-02-28 17:14:32 +0000121\end{funcdesc}