\section{Standard Module \sectcode{htmllib}}
\stmodindex{htmllib}
\index{HTML}
\index{hypertext}

\renewcommand{\indexsubitem}{(in module htmllib)}

This module defines a number of classes which can serve as a basis for
parsing text files formatted in HTML (HyperText Mark-up Language).
The classes are not directly concerned with I/O; they have to be fed
their input in string form, and will make calls to methods of a
``formatter'' object in order to produce output. The classes are
designed to be used as base classes for other classes in order to add
functionality, and allow most of their methods to be extended or
overridden. In turn, the classes are derived from and extend the
class \code{SGMLParser} defined in module \code{sgmllib}.
\index{SGML}
\stmodindex{sgmllib}
\ttindex{SGMLParser}
\index{formatter}

The following is a summary of the interface defined by
\code{sgmllib.SGMLParser}:

\begin{itemize}

\item
The interface to feed data to an instance is through the \code{feed()}
method, which takes a string argument. This can be called with as
little or as much text at a time as desired;
\code{p.feed(a); p.feed(b)} has the same effect as \code{p.feed(a+b)}.
When the data contains complete
HTML elements, these are processed immediately; incomplete elements
are saved in a buffer. To force processing of all unprocessed data,
call the \code{close()} method.

Example: to parse the entire contents of a file, do\\
\code{parser.feed(open(file).read()); parser.close()}.

\item
The interface to define semantics for HTML tags is very simple: derive
a class and define methods called \code{start_\var{tag}()},
\code{end_\var{tag}()}, or \code{do_\var{tag}()}. The parser will
call these at appropriate moments: \code{start_\var{tag}} or
\code{do_\var{tag}} is called when an opening tag of the form
\code{<\var{tag} ...>} is encountered; \code{end_\var{tag}} is called
when a closing tag of the form \code{</\var{tag}>} is encountered. If
an opening tag requires a corresponding closing tag, like \code{<H1>}
... \code{</H1>}, the class should define the \code{start_\var{tag}}
method; if a tag requires no closing tag, like \code{<P>}, the class
should define the \code{do_\var{tag}} method.

\end{itemize}
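The two conventions summarized above (incremental \code{feed()} calls
and handler lookup by method name) can be illustrated with a small
self-contained sketch. The class \code{TinyTagParser} and its subclass
\code{HeadingLogger} are hypothetical and are not part of
\code{sgmllib}; the sketch only mimics the naming scheme, ignores text
and attributes, and calls handlers without arguments, which the real
parser may not do.

```python
import re

# Hypothetical stand-in for the dispatch convention; NOT the sgmllib code.
_TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")


class TinyTagParser:
    def __init__(self):
        self.buffer = ""

    def feed(self, data):
        """Accept more input and dispatch every complete tag seen so far."""
        self.buffer += data
        pos = 0
        for match in _TAG.finditer(self.buffer):
            self._dispatch(match.group(1) == "/", match.group(2).lower())
            pos = match.end()
        # Keep only a possibly incomplete tag for the next feed() call.
        rest = self.buffer[pos:]
        cut = rest.rfind("<")
        self.buffer = rest[cut:] if cut >= 0 else ""

    def close(self):
        """Discard whatever incomplete input is left in the buffer."""
        self.buffer = ""

    def _dispatch(self, closing, tag):
        # Opening tags try start_TAG first, then do_TAG; closing tags
        # use end_TAG.  Tags without a handler are silently skipped.
        names = ["end_" + tag] if closing else ["start_" + tag, "do_" + tag]
        for name in names:
            method = getattr(self, name, None)
            if method is not None:
                method()
                return


class HeadingLogger(TinyTagParser):
    def __init__(self):
        super().__init__()
        self.events = []

    def start_h1(self):          # <H1> needs a matching </H1>
        self.events.append("start h1")

    def end_h1(self):
        self.events.append("end h1")

    def do_p(self):              # <P> needs no closing tag
        self.events.append("paragraph break")


def run(chunks):
    """Feed the given strings one by one and return the logged events."""
    parser = HeadingLogger()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.events
```

Calling \code{run()} with one string or with the same text cut into
arbitrary pieces yields the same list of events, which is the property
stated for \code{feed()} above.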

The module defines the following classes:

\begin{funcdesc}{HTMLParser}{}
This is the most basic HTML parser class. It defines one additional
entity name over the names defined by the \code{SGMLParser} base
class, \code{\&bullet;}. It also defines handlers for the following
tags: \code{<LISTING>...</LISTING>}, \code{<XMP>...</XMP>}, and
\code{<PLAINTEXT>} (the latter is terminated only by end of file).
\end{funcdesc}

\begin{funcdesc}{CollectingParser}{}
This class, derived from \code{HTMLParser}, collects various useful
bits of information from the HTML text. To this end it defines
additional handlers for the following tags: \code{<A>...</A>},
\code{<HEAD>...</HEAD>}, \code{<BODY>...</BODY>},
\code{<TITLE>...</TITLE>}, \code{<NEXTID>}, and \code{<ISINDEX>}.
\end{funcdesc}

\begin{funcdesc}{FormattingParser}{formatter\, stylesheet}
This class, derived from \code{CollectingParser}, interprets a wide
selection of HTML tags so it can produce formatted output from the
parsed data. It is initialized with two objects, a \var{formatter}
which should define a number of methods to format text into
paragraphs, and a \var{stylesheet} which defines a number of static
parameters for the formatting process. Formatters and style sheets
are documented later in this section.
\index{formatter}
\index{style sheet}
\end{funcdesc}

\begin{funcdesc}{AnchoringParser}{formatter\, stylesheet}
This class, derived from \code{FormattingParser}, extends the handling
of the \code{<A>...</A>} tag pair to call the formatter's
\code{bgn_anchor()} and \code{end_anchor()} methods. This allows the
formatter to display the anchor in a different font or color, etc.
\end{funcdesc}

Instances of \code{CollectingParser} (and thus also instances of
\code{FormattingParser} and \code{AnchoringParser}) have the following
instance variables:

\begin{datadesc}{anchornames}
A list of the values of the \code{NAME} attributes of the \code{<A>}
tags encountered.
\end{datadesc}

\begin{datadesc}{anchors}
A list of the values of \code{HREF} attributes of the \code{<A>} tags
encountered.
\end{datadesc}

\begin{datadesc}{anchortypes}
A list of the values of the \code{TYPE} attributes of the \code{<A>}
tags encountered.
\end{datadesc}

\begin{datadesc}{inanchor}
Outside an \code{<A>...</A>} tag pair, this is zero. Inside such a
pair, it is a unique integer, which is positive if the anchor has a
\code{HREF} attribute, negative if it hasn't. Its absolute value is
one more than the index of the anchor in the \code{anchors},
\code{anchornames} and \code{anchortypes} lists.
\end{datadesc}

\begin{datadesc}{isindex}
True if the \code{<ISINDEX>} tag has been encountered.
\end{datadesc}

\begin{datadesc}{nextid}
The attribute list of the last \code{<NEXTID>} tag encountered, or
an empty list if none.
\end{datadesc}

\begin{datadesc}{title}
The text inside the last \code{<TITLE>...</TITLE>} tag pair, or
\code{''} if no title has been encountered yet.
\end{datadesc}

The \code{anchors}, \code{anchornames} and \code{anchortypes} lists
are ``parallel arrays'': items in these lists with the same index
pertain to the same anchor. Missing attributes default to the empty
string. Anchors with neither a \code{HREF} nor a \code{NAME}
attribute are not entered in these lists at all.
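As an illustration of how the three lists and \code{inanchor} relate,
the sketch below uses made-up list contents of the shape described
above; the helper functions \code{anchor_record()} and
\code{has_href()} are hypothetical and are not defined by this module.

```python
# Hypothetical values, shaped like the three lists a CollectingParser
# instance would hold after a page with three anchors.  The second
# anchor has a NAME but no HREF, so its HREF slot is the empty string.
anchors = ["intro.html", "", "http://www.python.org/"]
anchornames = ["", "top", "home"]
anchortypes = ["", "", ""]


def anchor_record(inanchor):
    """Return (href, name, type) for a non-zero inanchor value."""
    if inanchor == 0:
        raise ValueError("not inside an <A>...</A> pair")
    index = abs(inanchor) - 1   # absolute value is one more than the index
    return anchors[index], anchornames[index], anchortypes[index]


def has_href(inanchor):
    """Positive values mean the anchor carries an HREF attribute."""
    return inanchor > 0


# Items with the same index pertain to the same anchor.
for href, name, type_ in zip(anchors, anchornames, anchortypes):
    print(repr(href), repr(name), repr(type_))
```

With these values, the second anchor would be reported as
\code{inanchor == -2} while the parser is inside it.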

The module also defines a number of style sheet classes. These should
never be instantiated; their class variables are all that is
required. Note that style sheets are specifically designed for a
particular formatter implementation. The currently defined style
sheets are:
\index{style sheet}

\begin{datadesc}{NullStylesheet}
A style sheet for use on a dumb output device such as an \ASCII{}
terminal.
\end{datadesc}

\begin{datadesc}{X11Stylesheet}
A style sheet for use with an X11 server.
\end{datadesc}

\begin{datadesc}{MacStylesheet}
A style sheet for use on Apple Macintosh computers.
\end{datadesc}

\begin{datadesc}{StdwinStylesheet}
A style sheet for use with the \code{stdwin} module; it is an alias
for either \code{X11Stylesheet} or \code{MacStylesheet}.
\bimodindex{stdwin}
\end{datadesc}

\begin{datadesc}{GLStylesheet}
A style sheet for use with the SGI Graphics Library and its font
manager (the SGI-specific built-in modules \code{gl} and \code{fm}).
\bimodindex{gl}
\bimodindex{fm}
\end{datadesc}

Style sheets have the following class variables:

\begin{datadesc}{stdfontset}
A list of up to four font definitions, respectively for the roman,
italic, bold and constant-width variant of a font for normal text. If
the list contains fewer than four font definitions, the last item is
used as the default for missing items. The type of a font definition
depends on the formatter in use; its only use is as a parameter to the
formatter's \code{setfont()} method.
\end{datadesc}

\begin{datadesc}{h1fontset}
\dataline{h2fontset}
\dataline{h3fontset}
The font set used for various headers (text inside \code{<H1>...</H1>}
tag pairs etc.).
\end{datadesc}

\begin{datadesc}{stdindent}
The indentation of normal text. This is measured in the ``native''
units of the formatter in use; for some formatters these are
characters, for others (especially those that actually support
variable-spacing fonts) pixels or printer points.
\end{datadesc}

\begin{datadesc}{ddindent}
The indentation used for the first level of \code{<DD>} tags.
\end{datadesc}

\begin{datadesc}{ulindent}
The indentation used for the first level of \code{<UL>} tags.
\end{datadesc}

\begin{datadesc}{h1indent}
The indentation used for level 1 headers.
\end{datadesc}

\begin{datadesc}{h2indent}
The indentation used for level 2 headers.
\end{datadesc}

\begin{datadesc}{literalindent}
The indentation used for literal text (text inside
\code{<PRE>...</PRE>} and similar tag pairs).
\end{datadesc}
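The class variables listed above could be collected in a style sheet
along the following lines. This is only a sketch: the class
\code{ExampleStylesheet}, its values, and the helper
\code{expand_fontset()} are hypothetical, and font definitions are
represented by plain strings because their real type depends on the
formatter in use.

```python
class ExampleStylesheet:
    """Hypothetical style sheet for a character-cell formatter.

    Used as a class only; it is never instantiated.
    """
    stdfontset = ["roman", "italic", "bold"]   # constant-width is missing
    h1fontset = ["bold"]
    h2fontset = ["bold"]
    h3fontset = ["italic"]
    stdindent = 0        # all indents are in characters (an assumption)
    ddindent = 4
    ulindent = 4
    h1indent = 0
    h2indent = 0
    literalindent = 8


def expand_fontset(fontset):
    """Pad a font set to four entries, repeating the last item."""
    if not fontset:
        raise ValueError("a font set needs at least one font definition")
    return list(fontset) + [fontset[-1]] * (4 - len(fontset))


roman, italic, bold, fixed = expand_fontset(ExampleStylesheet.stdfontset)
```

Here \code{fixed} ends up equal to \code{bold}, following the rule that
the last item serves as the default for missing items.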

Although no documented implementation of a formatter exists, the
\code{FormattingParser} class assumes that formatters have a
certain interface. This interface requires the following methods:
\index{formatter}

\begin{funcdesc}{setfont}{fontspec}
Set the font to be used subsequently. The \var{fontspec} argument is
an item in a style sheet's font set.
\end{funcdesc}

\begin{funcdesc}{flush}{}
Finish the current line, if not empty, and begin a new one.
\end{funcdesc}

\begin{funcdesc}{setleftindent}{n}
Set the left indentation of the following lines to \var{n} units.
\end{funcdesc}

\begin{funcdesc}{needvspace}{n}
Require at least \var{n} blank lines before the next line. Implies
\code{flush()}.
\end{funcdesc}

\begin{funcdesc}{addword}{word\, space}
Add a \var{word} to the current paragraph, followed by \var{space}
spaces.
\end{funcdesc}

\begin{datadesc}{nospace}
If this instance variable is true, empty words should be ignored by
\code{addword}. It should be set to false after a non-empty word has
been added.
\end{datadesc}

\begin{funcdesc}{setjust}{justification}
Set the justification of the current paragraph. The
\var{justification} can be \code{'c'} (center), \code{'l'} (left
justified), \code{'r'} (right justified) or \code{'lr'} (left and
right justified).
\end{funcdesc}

\begin{funcdesc}{bgn_anchor}{id}
Begin an anchor. The \var{id} parameter is the value of the parser's
\code{inanchor} attribute.
\end{funcdesc}

\begin{funcdesc}{end_anchor}{id}
End an anchor. The \var{id} parameter is the value of the parser's
\code{inanchor} attribute.
\end{funcdesc}
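To make the interface above concrete, here is a minimal sketch of an
object that provides every listed method. The class
\code{ListFormatter} is hypothetical: it collects finished lines in a
list, measures indentation in characters, does no line wrapping, and
merely records fonts, justification and anchor calls. The initial
value of \code{nospace} is an assumption.

```python
class ListFormatter:
    """Hypothetical formatter that collects finished lines in a list."""

    def __init__(self):
        self.lines = []
        self.current = ""
        self.indent = 0
        self.line_indent = 0
        self.font = None
        self.just = "l"
        self.nospace = True      # assumption: start by ignoring empty words
        self.anchor_events = []

    def setfont(self, fontspec):
        self.font = fontspec     # recorded only; plain text has no fonts

    def flush(self):
        # Finish the current line, if not empty, and begin a new one.
        if self.current:
            self.lines.append(" " * self.line_indent + self.current.rstrip())
            self.current = ""

    def setleftindent(self, n):
        self.indent = n          # takes effect on the following lines

    def needvspace(self, n):
        self.flush()
        # Count the blank lines already at the end of the output.
        blanks = 0
        for line in reversed(self.lines):
            if line:
                break
            blanks += 1
        self.lines.extend([""] * max(0, n - blanks))

    def addword(self, word, space):
        if not word and self.nospace:
            return
        if self.current == "":
            self.line_indent = self.indent
        self.current += word + " " * space
        if word:
            self.nospace = False

    def setjust(self, justification):
        if justification not in ("c", "l", "r", "lr"):
            raise ValueError("unknown justification: %r" % (justification,))
        self.just = justification

    def bgn_anchor(self, id):
        self.anchor_events.append(("bgn", id))

    def end_anchor(self, id):
        self.anchor_events.append(("end", id))


def demo():
    """Drive the formatter by hand, the way a parser might."""
    f = ListFormatter()
    f.setleftindent(2)
    f.addword("Hello,", 1)
    f.addword("world", 0)
    f.needvspace(1)
    f.bgn_anchor(1)
    f.addword("link", 0)
    f.end_anchor(1)
    f.flush()
    return f
```

A real formatter would additionally break lines at the available width
and honor the justification and font settings.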

A sample formatter implementation can be found in the module
\code{fmt}, which in turn uses the module \code{Para}. These modules are
not intended as standard library modules; they are available as an
example of how to write a formatter.
\ttindex{fmt}
\ttindex{Para}