\section{\module{htmllib} ---
         A parser for HTML documents}

\declaremodule{standard}{htmllib}
\modulesynopsis{A parser for HTML documents.}

\index{HTML}
\index{hypertext}


This module defines a class which can serve as a base for parsing text
files formatted in the HyperText Mark-up Language (HTML).  The class
is not directly concerned with I/O: it must be provided with input
in string form via a method, and makes calls to methods of a
``formatter'' object in order to produce output.  The
\class{HTMLParser} class is designed to be used as a base class for
other classes in order to add functionality, and allows most of its
methods to be extended or overridden.  In turn, this class is derived
from and extends the \class{SGMLParser} class defined in module
\refmodule{sgmllib}\refstmodindex{sgmllib}.  The \class{HTMLParser}
implementation supports the HTML 2.0 language as described in
\rfc{1866}.  Two implementations of formatter objects are provided in
the \refmodule{formatter}\refstmodindex{formatter}\ module; refer to the
documentation for that module for information on the formatter
interface.
\withsubitem{(in module sgmllib)}{\ttindex{SGMLParser}}

The following is a summary of the interface defined by
\class{sgmllib.SGMLParser}:

\begin{itemize}

\item
The interface to feed data to an instance is through the \method{feed()}
method, which takes a string argument.  This can be called with as
little or as much text at a time as desired; \samp{p.feed(a);
p.feed(b)} has the same effect as \samp{p.feed(a+b)}.  When the data
contains complete HTML markup constructs, these are processed immediately;
incomplete constructs are saved in a buffer.  To force processing of all
unprocessed data, call the \method{close()} method.

For example, to parse the entire contents of a file, use:
\begin{verbatim}
parser.feed(open('myfile.html').read())
parser.close()
\end{verbatim}

\item
The interface to define semantics for HTML tags is very simple: derive
a class and define methods called \method{start_\var{tag}()},
\method{end_\var{tag}()}, or \method{do_\var{tag}()}.  The parser will
call these at appropriate moments: \method{start_\var{tag}()} or
\method{do_\var{tag}()} is called when an opening tag of the form
\code{<\var{tag} ...>} is encountered; \method{end_\var{tag}()} is called
when a closing tag of the form \code{</\var{tag}>} is encountered.  If
an opening tag requires a corresponding closing tag, like \code{<H1>}
... \code{</H1>}, the class should define the \method{start_\var{tag}()}
method; if a tag requires no closing tag, like \code{<P>}, the class
should define the \method{do_\var{tag}()} method.

\end{itemize}

The module defines a parser class and an exception:

\begin{classdesc}{HTMLParser}{formatter}
This is the basic HTML parser class.  It supports all entity names
required by the XHTML 1.0 Recommendation (\url{http://www.w3.org/TR/xhtml1}).
It also defines handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
\end{classdesc}

\begin{excdesc}{HTMLParseError}
Exception raised by the \class{HTMLParser} class when it encounters an
error while parsing.
\versionadded{2.4}
\end{excdesc}


\begin{seealso}
  \seemodule{formatter}{Interface definition for transforming an
                        abstract flow of formatting events into
                        specific output events on writer objects.}
  \seemodule{HTMLParser}{Alternate HTML parser that offers a slightly
                         lower-level view of the input, but is
                         designed to work with XHTML, and does not
                         implement some of the SGML syntax not used in
                         ``HTML as deployed'' and which isn't legal
                         for XHTML.}
  \seemodule{htmlentitydefs}{Definition of replacement text for XHTML 1.0 
                             entities.}
  \seemodule{sgmllib}{Base class for \class{HTMLParser}.}
\end{seealso}


\subsection{HTMLParser Objects \label{html-parser-objects}}

In addition to tag methods, the \class{HTMLParser} class provides some
additional methods and instance variables for use within tag methods.

\begin{memberdesc}{formatter}
This is the formatter instance associated with the parser.
\end{memberdesc}

\begin{memberdesc}{nofill}
Boolean flag which should be true when whitespace should not be
collapsed, or false when it should be.  In general, this should only
be true when character data is to be treated as ``preformatted'' text,
as within a \code{<PRE>} element.  The default value is false.  This
affects the operation of \method{handle_data()} and \method{save_end()}.
\end{memberdesc}


\begin{methoddesc}{anchor_bgn}{href, name, type}
This method is called at the start of an anchor region.  The arguments
correspond to the attributes of the \code{<A>} tag with the same
names.  The default implementation maintains a list of hyperlinks
(defined by the \code{HREF} attribute for \code{<A>} tags) within the
document.  The list of hyperlinks is available as the data attribute
\member{anchorlist}.
\end{methoddesc}

\begin{methoddesc}{anchor_end}{}
This method is called at the end of an anchor region.  The default
implementation adds a textual footnote marker using an index into the
list of hyperlinks created by \method{anchor_bgn()}.
\end{methoddesc}

\begin{methoddesc}{handle_image}{source, alt\optional{, ismap\optional{,
                                 align\optional{, width\optional{, height}}}}}
This method is called to handle images.  The default implementation
simply passes the \var{alt} value to the \method{handle_data()}
method.
\end{methoddesc}

\begin{methoddesc}{save_bgn}{}
Begins saving character data in a buffer instead of sending it to the
formatter object.  Retrieve the stored data via \method{save_end()}.
Calls to the \method{save_bgn()} / \method{save_end()} pair may not be
nested.
\end{methoddesc}

\begin{methoddesc}{save_end}{}
Ends buffering character data and returns all data saved since the
preceding call to \method{save_bgn()}.  If the \member{nofill} flag is
false, whitespace is collapsed to single spaces.  A call to this
method without a preceding call to \method{save_bgn()} will raise a
\exception{TypeError} exception.
\end{methoddesc}



\section{\module{htmlentitydefs} ---
         Definitions of HTML general entities}

\declaremodule{standard}{htmlentitydefs}
\modulesynopsis{Definitions of HTML general entities.}
\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}

This module defines three dictionaries, \code{name2codepoint},
\code{codepoint2name}, and \code{entitydefs}. \code{entitydefs} is
used by the \refmodule{htmllib} module to provide the
\member{entitydefs} member of the \class{HTMLParser} class.  The
definition provided here contains all the entities defined by XHTML 1.0 
that can be handled using simple textual substitution in the Latin-1
character set (ISO-8859-1).


\begin{datadesc}{entitydefs}
  A dictionary mapping XHTML 1.0 entity definitions to their
  replacement text in ISO Latin-1.
\end{datadesc}

\begin{datadesc}{name2codepoint}
  A dictionary that maps HTML entity names to Unicode codepoints.
  \versionadded{2.3}
\end{datadesc}

\begin{datadesc}{codepoint2name}
  A dictionary that maps Unicode codepoints to HTML entity names.
  \versionadded{2.3}
\end{datadesc}
Walter Dörwald5688b7a2003-04-16 09:46:13 +0000181\end{datadesc}