% Revisions: Guido van Rossum, a12ef94 (1995-02-27 17:53:25 +0000) and 8675115 (1995-02-28 17:14:32 +0000)
\section{Built-in module \sectcode{htmllib}}
\stmodindex{htmllib}
\index{HTML}
\index{hypertext}

\renewcommand{\indexsubitem}{(in module htmllib)}

This module defines a number of classes which can serve as a basis for
parsing text files formatted in HTML (HyperText Mark-up Language).
The classes are not directly concerned with I/O; they have to be fed
their input in string form, and will make calls to methods of a
``formatter'' object in order to produce output. The classes are
designed to be used as base classes for other classes in order to add
functionality, and allow most of their methods to be extended or
overridden. In turn, the classes are derived from and extend the
class \code{SGMLParser} defined in module \code{sgmllib}.
\index{SGML}
\stmodindex{sgmllib}
\ttindex{SGMLParser}
\index{formatter}

The following is a summary of the interface defined by
\code{sgmllib.SGMLParser}:

\begin{itemize}

\item
The interface to feed data to an instance is through the \code{feed()}
method, which takes a string argument. This can be called with as
little or as much text at a time as desired. When the data contains
complete HTML elements, these are processed immediately; incomplete
elements are saved in a buffer. To force processing of all unprocessed
data, call the \code{close()} method.

Example: to parse the entire contents of a file, do
\code{parser.feed(open(file).read()); parser.close()}.

\item
The interface to define semantics for HTML tags is very simple: derive
a class and define methods called \code{start_\var{tag}()},
\code{end_\var{tag}()}, or \code{do_\var{tag}()}. The parser will
call these at appropriate moments: \code{start_\var{tag}} or
\code{do_\var{tag}} is called when an opening tag of the form
\code{<\var{tag} ...>} is encountered; \code{end_\var{tag}} is called
when a closing tag of the form \code{</\var{tag}>} is encountered. If
an opening tag requires a corresponding closing tag, like \code{<H1>}
... \code{</H1>}, the class should define the \code{start_\var{tag}}
method; if a tag requires no closing tag, like \code{<P>}, the class
should define the \code{do_\var{tag}} method.

\end{itemize}
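
The conventions summarized above can be illustrated with a small,
self-contained sketch. The class \code{ToyTagParser} below is
hypothetical and is not \code{sgmllib.SGMLParser}: it ignores
attributes, entities and comments, and its \code{events} list exists
only for demonstration. It shows one way \code{feed()} might buffer an
incomplete tag and how handlers named \code{start_\var{tag}},
\code{end_\var{tag}} and \code{do_\var{tag}} might be looked up.

```python
import re

# Hypothetical stand-in for the dispatch convention described above.
# It is NOT sgmllib.SGMLParser; attributes and entities are ignored.
_TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")


class ToyTagParser:
    def __init__(self):
        self.buffer = ""   # unprocessed input, possibly a partial tag
        self.events = []   # record of handler calls, for demonstration

    def feed(self, data):
        """Accept any amount of text and process every complete tag."""
        self.buffer += data
        pos = 0
        while True:
            match = _TAG.search(self.buffer, pos)
            if match is None:
                break
            self.handle_data(self.buffer[pos:match.start()])
            self.dispatch(match.group(1) == "/", match.group(2).lower())
            pos = match.end()
        # Keep everything from an unmatched '<' onwards for the next call.
        cut = self.buffer.find("<", pos)
        if cut == -1:
            cut = len(self.buffer)
        self.handle_data(self.buffer[pos:cut])
        self.buffer = self.buffer[cut:]

    def close(self):
        """Force processing of whatever is still buffered."""
        self.handle_data(self.buffer)
        self.buffer = ""

    def dispatch(self, closing, tag):
        if closing:
            names = ["end_" + tag]
        else:
            names = ["start_" + tag, "do_" + tag]
        for name in names:
            handler = getattr(self, name, None)
            if handler is not None:
                handler()
                return

    def handle_data(self, text):
        if text:
            self.events.append(("data", text))


class HeadingParser(ToyTagParser):
    def start_h1(self):          # <H1> requires a closing tag
        self.events.append(("start", "h1"))

    def end_h1(self):
        self.events.append(("end", "h1"))

    def do_p(self):              # <P> requires no closing tag
        self.events.append(("do", "p"))


parser = HeadingParser()
parser.feed("<H1>Ti")        # data may arrive in arbitrary pieces
parser.feed("tle</H")        # the partial closing tag is buffered
parser.feed("1><P>Body")
parser.close()
```

After these calls, \code{parser.events} holds the handler calls and the
text fragments in document order, and the buffer is empty.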

The module defines the following classes:

\begin{funcdesc}{HTMLParser}{}
This is the most basic HTML parser class. It defines one additional
entity name over the names defined by the \code{SGMLParser} base
class, \code{\&bullet;}. It also defines handlers for the following
tags: \code{<LISTING>...</LISTING>}, \code{<XMP>...</XMP>}, and
\code{<PLAINTEXT>} (the last is terminated only by end of file).
\end{funcdesc}

\begin{funcdesc}{CollectingParser}{}
This class, derived from \code{HTMLParser}, collects various useful
bits of information from the HTML text. To this end it defines
additional handlers for the following tags: \code{<A>...</A>},
\code{<HEAD>...</HEAD>}, \code{<BODY>...</BODY>},
\code{<TITLE>...</TITLE>}, \code{<NEXTID>}, and \code{<ISINDEX>}.
\end{funcdesc}

\begin{funcdesc}{FormattingParser}{formatter\, stylesheet}
This class, derived from \code{CollectingParser}, interprets a wide
selection of HTML tags so it can produce formatted output from the
parsed data. It is initialized with two objects, a \var{formatter}
which should define a number of methods to format text into
paragraphs, and a \var{stylesheet} which defines a number of static
parameters for the formatting process. Formatters and style sheets
are documented later in this section.
\index{formatter}
\index{style sheet}
\end{funcdesc}

\begin{funcdesc}{AnchoringParser}{formatter\, stylesheet}
This class, derived from \code{FormattingParser}, extends the handling
of the \code{<A>...</A>} tag pair to call the formatter's
\code{bgn_anchor()} and \code{end_anchor()} methods. This allows the
formatter to display the anchor in a different font or color, etc.
\end{funcdesc}

Instances of \code{CollectingParser} (and thus also instances of
\code{FormattingParser} and \code{AnchoringParser}) have the following
instance variables:

\begin{datadesc}{anchornames}
A list of the values of the \code{NAME} attributes of the \code{<A>}
tags encountered.
\end{datadesc}

\begin{datadesc}{anchors}
A list of the values of the \code{HREF} attributes of the \code{<A>}
tags encountered.
\end{datadesc}

\begin{datadesc}{anchortypes}
A list of the values of the \code{TYPE} attributes of the \code{<A>}
tags encountered.
\end{datadesc}

\begin{datadesc}{inanchor}
Outside an \code{<A>...</A>} tag pair, this is zero. Inside such a
pair, it is a unique integer, which is positive if the anchor has a
\code{HREF} attribute, negative if it hasn't. Its absolute value is
one more than the index of the anchor in the \code{anchors},
\code{anchornames} and \code{anchortypes} lists.
\end{datadesc}

\begin{datadesc}{isindex}
True if the \code{<ISINDEX>} tag has been encountered.
\end{datadesc}

\begin{datadesc}{nextid}
The attribute list of the last \code{<NEXTID>} tag encountered, or
an empty list if none.
\end{datadesc}

\begin{datadesc}{title}
The text inside the last \code{<TITLE>...</TITLE>} tag pair, or
\code{''} if no title has been encountered yet.
\end{datadesc}

The \code{anchors}, \code{anchornames} and \code{anchortypes} lists
are ``parallel arrays'': items in these lists with the same index
pertain to the same anchor. Missing attributes default to the empty
string. Anchors with neither a \code{HREF} nor a \code{NAME}
attribute are not entered in these lists at all.
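
The relationship between the three parallel lists and \code{inanchor}
can be sketched as follows. The class \code{AnchorRecord} and its
method names are hypothetical and are not part of the module; the
sketch only mirrors the bookkeeping described above. What
\code{inanchor} holds inside an anchor that has neither attribute is
not stated here, so leaving it unchanged in that case is an assumption.

```python
class AnchorRecord:
    """Hypothetical container mirroring the documented variables."""

    def __init__(self):
        self.anchors = []      # HREF values
        self.anchornames = []  # NAME values
        self.anchortypes = []  # TYPE values
        self.inanchor = 0      # zero outside an <A>...</A> pair

    def open_anchor(self, href="", name="", anchor_type=""):
        """Record one <A> tag; missing attributes default to ''."""
        if not href and not name:
            # Not entered in the lists. Leaving inanchor unchanged
            # here is an assumption, not documented behaviour.
            return
        self.anchors.append(href)
        self.anchornames.append(name)
        self.anchortypes.append(anchor_type)
        index = len(self.anchors) - 1
        # Absolute value is index + 1; the sign says whether HREF exists.
        self.inanchor = (index + 1) if href else -(index + 1)

    def close_anchor(self):
        self.inanchor = 0

    def current_anchor(self):
        """Decode inanchor into (href, name, type), or None outside."""
        if self.inanchor == 0:
            return None
        index = abs(self.inanchor) - 1
        return (self.anchors[index],
                self.anchornames[index],
                self.anchortypes[index])


record = AnchorRecord()
record.open_anchor(href="intro.html", anchor_type="text/html")
first = (record.inanchor, record.current_anchor())
record.close_anchor()
record.open_anchor(name="section2")
second = (record.inanchor, record.current_anchor())
record.close_anchor()
```

Because the same index is used for all three lists, a formatter that
receives only the integer can recover every recorded attribute of the
anchor.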

The module also defines a number of style sheet classes. These should
never be instantiated; their class variables are the only behaviour
required. Note that style sheets are specifically designed for a
particular formatter implementation. The currently defined style
sheets are:
\index{style sheet}

\begin{datadesc}{NullStylesheet}
A style sheet for use on a dumb output device such as an ASCII
terminal.
\end{datadesc}

\begin{datadesc}{X11Stylesheet}
A style sheet for use with an X11 server.
\end{datadesc}

\begin{datadesc}{MacStylesheet}
A style sheet for use on Apple Macintosh computers.
\end{datadesc}

\begin{datadesc}{StdwinStylesheet}
A style sheet for use with the \code{stdwin} module; it is an alias
for either \code{X11Stylesheet} or \code{MacStylesheet}.
\bimodindex{stdwin}
\end{datadesc}

\begin{datadesc}{GLStylesheet}
A style sheet for use with the SGI Graphics Library and its font
manager (the SGI-specific built-in modules \code{gl} and \code{fm}).
\bimodindex{gl}
\bimodindex{fm}
\end{datadesc}

Style sheets have the following class variables:

\begin{datadesc}{stdfontset}
A list of up to four font definitions, respectively for the roman,
italic, bold and constant-width variant of a font for normal text. If
the list contains fewer than four font definitions, the last item is
used as the default for missing items. The type of a font definition
depends on the formatter in use; its only use is as a parameter to the
formatter's \code{setfont()} method.
\end{datadesc}
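
The fallback rule for short font sets can be written out as a small
helper. The function \code{expand_fontset} and the font names are
hypothetical and serve only to illustrate the rule; raising an
exception for an empty list is an assumption, since that case is not
described here.

```python
def expand_fontset(fontset):
    """Return [roman, italic, bold, fixed], padding with the last item.

    Hypothetical helper; the real module may resolve fonts differently.
    """
    if not fontset:
        # Assumption: an empty font set is treated as an error.
        raise ValueError("a font set needs at least one font definition")
    fonts = list(fontset[:4])
    while len(fonts) < 4:
        fonts.append(fonts[-1])   # the last item fills the missing slots
    return fonts


# Made-up font definitions; their real type depends on the formatter.
roman, italic, bold, fixed = expand_fontset(["times-12", "times-italic-12"])
```

With only two definitions supplied, the bold and constant-width
variants both fall back to the italic entry, because it is the last
item in the list.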

\begin{datadesc}{h1fontset}
\dataline{h2fontset}
\dataline{h3fontset}
The font set used for various headers (text inside \code{<H1>...</H1>}
tag pairs etc.).
\end{datadesc}

\begin{datadesc}{stdindent}
The indentation of normal text. This is measured in the ``native''
units of the formatter in use; for some formatters these are
characters, for others (especially those that actually support
variable-spacing fonts) pixels or printer points.
\end{datadesc}

\begin{datadesc}{ddindent}
The indentation used for the first level of \code{<DD>} tags.
\end{datadesc}

\begin{datadesc}{ulindent}
The indentation used for the first level of \code{<UL>} tags.
\end{datadesc}

\begin{datadesc}{h1indent}
The indentation used for level 1 headers.
\end{datadesc}

\begin{datadesc}{h2indent}
The indentation used for level 2 headers.
\end{datadesc}

\begin{datadesc}{literalindent}
The indentation used for literal text (text inside
\code{<PRE>...</PRE>} and similar tag pairs).
\end{datadesc}
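
A style sheet is therefore little more than a namespace of constants.
The sketch below defines a hypothetical style sheet,
\code{CharCellStylesheet}, for a formatter assumed to measure
indentation in characters; the class name and every value in it are
illustrative and are not taken from the module. The helper
\code{describe} reads the class variables directly, without creating
an instance.

```python
class CharCellStylesheet:
    """Hypothetical style sheet; all values are illustrative only."""
    stdfontset = [None]    # a single entry, reused for all four variants
    h1fontset = [None]
    h2fontset = [None]
    h3fontset = [None]
    stdindent = 0          # units are characters for this formatter
    ddindent = 8
    ulindent = 4
    h1indent = 0
    h2indent = 2
    literalindent = 4


def describe(stylesheet):
    """Collect the indentation settings from the class itself."""
    names = ["stdindent", "ddindent", "ulindent",
             "h1indent", "h2indent", "literalindent"]
    return {name: getattr(stylesheet, name) for name in names}


# The class object is passed around; it is never instantiated.
indents = describe(CharCellStylesheet)
```

A parser written along these lines would receive the class object
itself as its \var{stylesheet} argument.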

Although no documented implementation of a formatter exists, the
\code{FormattingParser} class assumes that formatters have a
certain interface. This interface requires the following methods:
\index{formatter}

\begin{funcdesc}{setfont}{fontspec}
Set the font to be used subsequently. The \var{fontspec} argument is
an item in a style sheet's font set.
\end{funcdesc}

\begin{funcdesc}{flush}{}
Finish the current line, if not empty, and begin a new one.
\end{funcdesc}

\begin{funcdesc}{setleftindent}{n}
Set the left indentation of the following lines to \var{n} units.
\end{funcdesc}

\begin{funcdesc}{needvspace}{n}
Require at least \var{n} blank lines before the next line. Implies
\code{flush()}.
\end{funcdesc}

\begin{funcdesc}{addword}{word\, space}
Add a \var{word} to the current paragraph, followed by \var{space}
spaces.
\end{funcdesc}

\begin{datadesc}{nospace}
If this instance variable is true, empty words are ignored by
\code{addword}. It is set to false after a non-empty word has been
added.
\end{datadesc}

\begin{funcdesc}{setjust}{justification}
Set the justification of the current paragraph. The
\var{justification} can be \code{'c'} (center), \code{'l'} (left
justified), \code{'r'} (right justified) or \code{'lr'} (left and
right justified).
\end{funcdesc}

\begin{funcdesc}{bgn_anchor}{id}
Begin an anchor. The \var{id} parameter is the value of the parser's
\code{inanchor} attribute.
\end{funcdesc}

\begin{funcdesc}{end_anchor}{id}
End an anchor. The \var{id} parameter is the value of the parser's
\code{inanchor} attribute.
\end{funcdesc}
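
To make the interface concrete, here is a minimal sketch of an object
that satisfies it. The class \code{PlainTextFormatter} is hypothetical
and is not the \code{fmt} module: it performs no line wrapping or
justification, merely records fonts and anchors, and assumes that
\code{nospace} starts out true. The attributes \code{lines},
\code{current}, \code{line_indent} and \code{anchor_log} are inventions
of this sketch.

```python
class PlainTextFormatter:
    """Hypothetical formatter that collects output lines in a list."""

    def __init__(self):
        self.lines = []        # finished output lines
        self.current = ""      # text of the line being built
        self.indent = 0        # indentation for lines started from now on
        self.line_indent = 0   # indentation of the line being built
        self.font = None
        self.just = "l"
        self.nospace = True    # assumed initial value
        self.anchor_log = []   # (event, id) pairs, for inspection

    def setfont(self, fontspec):
        self.font = fontspec   # recorded only; plain text has no fonts

    def flush(self):
        text = self.current.rstrip()
        if text:
            self.lines.append(" " * self.line_indent + text)
        self.current = ""

    def setleftindent(self, n):
        self.indent = n

    def needvspace(self, n):
        self.flush()
        # Count the blank lines already at the end of the output.
        blank = 0
        for line in reversed(self.lines):
            if line:
                break
            blank += 1
        self.lines.extend([""] * max(0, n - blank))

    def addword(self, word, space):
        if not word and self.nospace:
            return             # empty words are ignored while nospace is true
        if not self.current:
            self.line_indent = self.indent
        self.current += word + " " * space
        if word:
            self.nospace = False

    def setjust(self, justification):
        self.just = justification   # recorded only in this sketch

    def bgn_anchor(self, id):
        self.anchor_log.append(("bgn", id))

    def end_anchor(self, id):
        self.anchor_log.append(("end", id))


fmt = PlainTextFormatter()
fmt.addword("", 1)         # ignored: nospace is still true
fmt.addword("Title", 0)
fmt.needvspace(1)          # flushes, then adds one blank line
fmt.setleftindent(4)
fmt.bgn_anchor(1)
fmt.addword("Hello", 1)
fmt.addword("world", 0)
fmt.end_anchor(1)
fmt.flush()
```

The calls at the end imitate what a parser might issue for a heading
followed by an indented, anchored paragraph; the result accumulates in
\code{fmt.lines}.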

A sample formatter implementation can be found in the module
\code{fmt}, which in turn uses the module \code{Para}. These are
currently not intended as a
\ttindex{fmt}
\ttindex{Para}