\section{Standard Module \module{htmllib}}
\label{module-htmllib}
\stmodindex{htmllib}
\index{HTML}
\index{hypertext}


This module defines a class which can serve as a base for parsing text
files formatted in the HyperText Markup Language (HTML). The class
is not directly concerned with I/O: it must be provided with input
in string form via a method, and it makes calls to methods of a
``formatter'' object in order to produce output. The
\class{HTMLParser} class is designed to be used as a base class for
other classes in order to add functionality, and allows most of its
methods to be extended or overridden. In turn, this class is derived
from and extends the \class{SGMLParser} class defined in module
\module{sgmllib}\refstmodindex{sgmllib}. The \class{HTMLParser}
implementation supports the HTML 2.0 language as described in
\rfc{1866}. Two implementations of formatter objects are provided in
the \module{formatter}\refstmodindex{formatter} module; refer to the
documentation for that module for information on the formatter
interface.
\index{SGML}
\withsubitem{(in module sgmllib)}{\ttindex{SGMLParser}}
\index{formatter}

The following is a summary of the interface defined by
\class{sgmllib.SGMLParser}:

\begin{itemize}

\item
Data is fed to an instance through the \method{feed()} method, which
takes a string argument. This can be called with as
little or as much text at a time as desired; \samp{p.feed(a);
p.feed(b)} has the same effect as \samp{p.feed(a+b)}. When the data
contains complete HTML tags, these are processed immediately;
incomplete elements are saved in a buffer. To force processing of all
unprocessed data, call the \method{close()} method.

For example, to parse the entire contents of a file, use:
\begin{verbatim}
parser.feed(open('myfile.html').read())
parser.close()
\end{verbatim}

\item
The interface to define semantics for HTML tags is very simple: derive
a class and define methods called \code{start_\var{tag}()},
\code{end_\var{tag}()}, or \code{do_\var{tag}()}. The parser will
call these at appropriate moments: \code{start_\var{tag}()} or
\code{do_\var{tag}()} is called when an opening tag of the form
\code{<\var{tag} ...>} is encountered; \code{end_\var{tag}()} is called
when a closing tag of the form \code{</\var{tag}>} is encountered. If
an opening tag requires a corresponding closing tag, like \code{<H1>}
... \code{</H1>}, the class should define the \code{start_\var{tag}()}
method; if a tag requires no closing tag, like \code{<P>}, the class
should define the \code{do_\var{tag}()} method.

\end{itemize}
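The naming convention described above can be illustrated with a small
stand-in dispatcher. This is only a sketch of the idea, not the
\module{sgmllib} implementation: the classes \class{MiniDispatcher} and
\class{HeadingLogger} and the methods \method{open_tag()} and
\method{close_tag()} are hypothetical names invented for this example,
and no real parsing is performed.

```python
class MiniDispatcher:
    """Hypothetical stand-in showing the start_/end_/do_ lookup only."""

    def __init__(self):
        self.events = []

    def open_tag(self, tag, attrs):
        # Prefer start_<tag>; fall back to do_<tag> for tags that
        # have no closing tag.  Unknown tags are silently ignored.
        method = (getattr(self, 'start_' + tag, None)
                  or getattr(self, 'do_' + tag, None))
        if method is not None:
            method(attrs)

    def close_tag(self, tag):
        # Closing tags are routed to end_<tag>, if it is defined.
        method = getattr(self, 'end_' + tag, None)
        if method is not None:
            method()


class HeadingLogger(MiniDispatcher):
    def start_h1(self, attrs):
        self.events.append('start h1')

    def end_h1(self):
        self.events.append('end h1')

    def do_p(self, attrs):
        self.events.append('paragraph break')


logger = HeadingLogger()
logger.open_tag('h1', [])    # as if <H1> had been seen
logger.close_tag('h1')       # as if </H1> had been seen
logger.open_tag('p', [])     # as if <P> had been seen
logger.open_tag('blink', []) # no handler defined, so nothing happens
print(logger.events)
```

A subclass only defines handlers for the tags it cares about; the
lookup by method name is what keeps the base class unaware of the
individual tags.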

The module defines a single class:

\begin{classdesc}{HTMLParser}{formatter}
This is the basic HTML parser class. It supports all entity names
required by the HTML 2.0 specification (\rfc{1866}). It also defines
handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
\end{classdesc}

In addition to tag methods, the \class{HTMLParser} class provides some
additional methods and instance variables for use within tag methods.

\begin{memberdesc}{formatter}
This is the formatter instance associated with the parser.
\end{memberdesc}

\begin{memberdesc}{nofill}
Boolean flag which should be true when whitespace should not be
collapsed, or false when it should be. In general, this should only
be true when character data is to be treated as ``preformatted'' text,
as within a \code{<PRE>} element. The default value is false. This
affects the operation of \method{handle_data()} and \method{save_end()}.
\end{memberdesc}


\begin{methoddesc}{anchor_bgn}{href, name, type}
This method is called at the start of an anchor region. The arguments
correspond to the attributes of the \code{<A>} tag with the same
names. The default implementation maintains a list of hyperlinks
(defined by the \code{href} attribute) within the document. The list
of hyperlinks is available as the data attribute \code{anchorlist}.
\end{methoddesc}

\begin{methoddesc}{anchor_end}{}
This method is called at the end of an anchor region. The default
implementation adds a textual footnote marker using an index into the
list of hyperlinks created by \method{anchor_bgn()}.
\end{methoddesc}
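
The interplay between \method{anchor_bgn()} and \method{anchor_end()}
can be sketched with a self-contained stand-in. The class
\class{AnchorTracker}, its \code{output} list, the \samp{[n]} marker
format and the example URLs are assumptions made for illustration; the
sketch does not reproduce the \class{HTMLParser} code itself.

```python
class AnchorTracker:
    """Hypothetical stand-in for the anchor bookkeeping described above."""

    def __init__(self):
        self.anchorlist = []   # hyperlinks seen so far
        self.anchor = None     # href of the anchor currently open
        self.output = []       # stands in for the formatter

    def handle_data(self, data):
        self.output.append(data)

    def anchor_bgn(self, href, name, type):
        # Only anchors that carry an href are recorded as hyperlinks.
        self.anchor = href
        if href:
            self.anchorlist.append(href)

    def anchor_end(self):
        # Emit a footnote marker whose number indexes anchorlist
        # (1-based); the "[n]" format is an assumption of this sketch.
        if self.anchor:
            self.handle_data('[%d]' % len(self.anchorlist))
            self.anchor = None


tracker = AnchorTracker()
tracker.anchor_bgn('http://www.example.com/a.html', '', '')
tracker.handle_data('first link')
tracker.anchor_end()
tracker.anchor_bgn('', 'section-2', '')   # named anchor without href
tracker.handle_data('a target')
tracker.anchor_end()
tracker.anchor_bgn('http://www.example.com/b.html', '', '')
tracker.handle_data('second link')
tracker.anchor_end()
print(tracker.output)
print(tracker.anchorlist)
```

A subclass that wants to render links differently would override both
methods together, since \method{anchor_end()} relies on the state left
behind by \method{anchor_bgn()}.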

\begin{methoddesc}{handle_image}{source, alt\optional{, ismap\optional{, align\optional{, width\optional{, height}}}}}
This method is called to handle images. The default implementation
simply passes the \var{alt} value to the \method{handle_data()}
method.
\end{methoddesc}

\begin{methoddesc}{save_bgn}{}
Begins saving character data in a buffer instead of sending it to the
formatter object. Retrieve the stored data via \method{save_end()}.
Calls to the \method{save_bgn()} / \method{save_end()} pair may not be
nested.
\end{methoddesc}

\begin{methoddesc}{save_end}{}
Ends buffering character data and returns all data saved since the
preceding call to \method{save_bgn()}. If the \code{nofill} flag is
false, whitespace is collapsed to single spaces. A call to this
method without a preceding call to \method{save_bgn()} will raise a
\exception{TypeError} exception.
\end{methoddesc}
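
The buffering behaviour of \method{save_bgn()} and \method{save_end()}
can be sketched as follows. This is a minimal stand-in written for
illustration, not the \class{HTMLParser} code: the class
\class{SaveBuffer}, its \code{savedata} and \code{sent} attributes, and
the explicit \keyword{raise} are assumptions of the example. Note that
the whitespace collapsing used here also drops leading and trailing
whitespace.

```python
class SaveBuffer:
    """Hypothetical stand-in for the save_bgn()/save_end() pair."""

    def __init__(self):
        self.nofill = False    # true means: keep whitespace as-is
        self.savedata = None   # None means: not currently saving
        self.sent = []         # stands in for the formatter

    def handle_data(self, data):
        # While saving, divert character data into the buffer.
        if self.savedata is not None:
            self.savedata += data
        else:
            self.sent.append(data)

    def save_bgn(self):
        self.savedata = ''

    def save_end(self):
        if self.savedata is None:
            # Raised explicitly in this sketch to mirror the
            # documented behaviour.
            raise TypeError('save_end() called without save_bgn()')
        data, self.savedata = self.savedata, None
        if not self.nofill:
            # Collapse every run of whitespace to a single space.
            data = ' '.join(data.split())
        return data


buf = SaveBuffer()
buf.handle_data('before')
buf.save_bgn()
buf.handle_data('  A   title\n')
buf.handle_data('in two parts ')
title = buf.save_end()
print(repr(title))

buf.nofill = True
buf.save_bgn()
buf.handle_data('kept   as\n  is')
preformatted = buf.save_end()
print(repr(preformatted))
```

A typical use inside a tag method would be to call
\method{save_bgn()} in \code{start_title()} and collect the text with
\method{save_end()} in \code{end_title()}.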