\section{Standard module \sectcode{htmllib}}
\stmodindex{htmllib}
\index{HTML}
\index{hypertext}

\renewcommand{\indexsubitem}{(in module htmllib)}

This module defines a number of classes which can serve as a basis for
parsing text files formatted in HTML (HyperText Mark-up Language).
The classes are not directly concerned with I/O; they have to be fed
their input in string form, and they make calls to methods of a
``formatter'' object in order to produce output. The classes are
designed to be used as base classes for other classes in order to add
functionality, and allow most of their methods to be extended or
overridden. In turn, the classes are derived from and extend the
class \code{SGMLParser} defined in module \code{sgmllib}.
\index{SGML}
\stmodindex{sgmllib}
\ttindex{SGMLParser}
\index{formatter}

The following is a summary of the interface defined by
\code{sgmllib.SGMLParser}:

\begin{itemize}

\item
Data is fed to an instance through the \code{feed()} method, which
takes a string argument. It can be called with as little or as much
text at a time as desired. When the data contains complete
HTML elements, these are processed immediately; incomplete elements
are saved in a buffer. To force processing of all unprocessed data,
call the \code{close()} method.

Example: to parse the entire contents of a file, do
\code{parser.feed(open(file).read()); parser.close()}.

\item
The interface to define semantics for HTML tags is very simple: derive
a class and define methods called \code{start_\var{tag}()},
\code{end_\var{tag}()}, or \code{do_\var{tag}()}. The parser will
call these at appropriate moments: \code{start_\var{tag}} or
\code{do_\var{tag}} is called when an opening tag of the form
\code{<\var{tag} ...>} is encountered; \code{end_\var{tag}} is called
when a closing tag of the form \code{</\var{tag}>} is encountered. If
an opening tag requires a corresponding closing tag, like \code{<H1>}
... \code{</H1>}, the class should define the \code{start_\var{tag}}
method; if a tag requires no closing tag, like \code{<P>}, the class
should define the \code{do_\var{tag}} method.

\end{itemize}
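
The two conventions summarized above (incremental \code{feed()} with a
final \code{close()}, and handler lookup by method name) can be
sketched as follows. This is a toy stand-in written for illustration
only, not the implementation of \code{sgmllib.SGMLParser}: the names
\code{ToyTagParser}, \code{EventLogger}, \code{rawdata} and
\code{handle_data()} are assumptions of the sketch, attributes and
entities are ignored, and handlers take no arguments.

```python
import re


class ToyTagParser:
    """Toy stand-in for the conventions described above.

    NOT the real sgmllib.SGMLParser: attributes, entities and comments
    are ignored, and handlers take no arguments.
    """

    # A complete opening or closing tag such as <H1>, </H1> or <A HREF="x">.
    _tag = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")

    def __init__(self):
        self.rawdata = ""  # text received but not yet processed

    def feed(self, data):
        """Accept any amount of text and process what is complete."""
        self.rawdata += data
        self._process(final=False)

    def close(self):
        """Force processing of whatever is still buffered."""
        self._process(final=True)

    def handle_data(self, data):
        """Called with text found between tags; override in subclasses."""

    def _process(self, final):
        text, pos = self.rawdata, 0
        while pos < len(text):
            lt = text.find("<", pos)
            if lt < 0:
                self.handle_data(text[pos:])
                pos = len(text)
                break
            if lt > pos:
                self.handle_data(text[pos:lt])
                pos = lt
            match = self._tag.match(text, pos)
            if match is None:
                if final:
                    # No more input will come: treat the rest as text.
                    self.handle_data(text[pos:])
                    pos = len(text)
                break  # otherwise keep the incomplete tag buffered
            closing, name = match.group(1), match.group(2).lower()
            if closing:
                handler = getattr(self, "end_" + name, None)
            else:
                # Prefer start_TAG; fall back to do_TAG for unpaired tags.
                handler = (getattr(self, "start_" + name, None)
                           or getattr(self, "do_" + name, None))
            if handler is not None:
                handler()
            pos = match.end()
        self.rawdata = text[pos:]


class EventLogger(ToyTagParser):
    """Hypothetical subclass that records which handlers were called."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def start_h1(self):  # <H1> needs a closing tag, so use start_/end_
        self.events.append(("start", "h1"))

    def end_h1(self):
        self.events.append(("end", "h1"))

    def do_p(self):  # <P> needs no closing tag, so use do_
        self.events.append(("do", "p"))


parser = EventLogger()
for chunk in ("<H1>Ti", "tle</H", "1><P>Body"):  # tags may be split anywhere
    parser.feed(chunk)
parser.close()
```

In this sketch the closing tag that is split across two \code{feed()}
calls stays in the buffer until its final \code{>} arrives, which is
the behaviour the first item describes.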

The module defines the following classes:

\begin{funcdesc}{HTMLParser}{}
This is the most basic HTML parser class. It defines one additional
entity name over the names defined by the \code{SGMLParser} base
class, \code{\&bullet;}. It also defines handlers for the following
tags: \code{<LISTING>...</LISTING>}, \code{<XMP>...</XMP>}, and
\code{<PLAINTEXT>} (the last of which is terminated only by end of file).
\end{funcdesc}

\begin{funcdesc}{CollectingParser}{}
This class, derived from \code{HTMLParser}, collects various useful
bits of information from the HTML text. To this end it defines
additional handlers for the following tags: \code{<A>...</A>},
\code{<HEAD>...</HEAD>}, \code{<BODY>...</BODY>},
\code{<TITLE>...</TITLE>}, \code{<NEXTID>}, and \code{<ISINDEX>}.
\end{funcdesc}

\begin{funcdesc}{FormattingParser}{formatter\, stylesheet}
This class, derived from \code{CollectingParser}, interprets a wide
selection of HTML tags so it can produce formatted output from the
parsed data. It is initialized with two objects: a \var{formatter},
which should define a number of methods to format text into
paragraphs, and a \var{stylesheet}, which defines a number of static
parameters for the formatting process. Formatters and style sheets
are documented later in this section.
\index{formatter}
\index{style sheet}
\end{funcdesc}

\begin{funcdesc}{AnchoringParser}{formatter\, stylesheet}
This class, derived from \code{FormattingParser}, extends the handling
of the \code{<A>...</A>} tag pair to call the formatter's
\code{bgn_anchor()} and \code{end_anchor()} methods. This allows the
formatter to display the anchor in a different font or color, etc.
\end{funcdesc}

Instances of \code{CollectingParser} (and thus also instances of
\code{FormattingParser} and \code{AnchoringParser}) have the following
instance variables:

\begin{datadesc}{anchornames}
A list of the values of the \code{NAME} attributes of the \code{<A>}
tags encountered.
\end{datadesc}

\begin{datadesc}{anchors}
A list of the values of the \code{HREF} attributes of the \code{<A>}
tags encountered.
\end{datadesc}

\begin{datadesc}{anchortypes}
A list of the values of the \code{TYPE} attributes of the \code{<A>}
tags encountered.
\end{datadesc}

\begin{datadesc}{inanchor}
Outside an \code{<A>...</A>} tag pair, this is zero. Inside such a
pair, it is a unique integer, which is positive if the anchor has a
\code{HREF} attribute and negative if it does not. Its absolute value is
one more than the index of the anchor in the \code{anchors},
\code{anchornames} and \code{anchortypes} lists.
\end{datadesc}

\begin{datadesc}{isindex}
True if the \code{<ISINDEX>} tag has been encountered.
\end{datadesc}

\begin{datadesc}{nextid}
The attribute list of the last \code{<NEXTID>} tag encountered, or
an empty list if none.
\end{datadesc}

\begin{datadesc}{title}
The text inside the last \code{<TITLE>...</TITLE>} tag pair, or
\code{''} if no title has been encountered yet.
\end{datadesc}

The \code{anchors}, \code{anchornames} and \code{anchortypes} lists
are ``parallel arrays'': items in these lists with the same index
pertain to the same anchor. Missing attributes default to the empty
string. Anchors with neither a \code{HREF} nor a \code{NAME}
attribute are not entered in these lists at all.
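
The relation between \code{inanchor} and the three parallel lists can
be illustrated with plain Python lists. The list contents and the
helper \code{describe_anchor()} are made up for this sketch and are
not part of the module; only the indexing rule (absolute value minus
one, sign indicating the presence of \code{HREF}) is taken from the
description above.

```python
# Hypothetical contents of the three parallel lists after parsing a page
# with three anchors; missing attributes are empty strings.
anchors = ["intro.html", "", "http://www.python.org/"]
anchornames = ["", "top", "home"]
anchortypes = ["", "", ""]


def describe_anchor(inanchor):
    """Decode an inanchor value into the data for the current anchor.

    Returns None outside an anchor (inanchor == 0).
    """
    if inanchor == 0:
        return None
    index = abs(inanchor) - 1  # absolute value is index + 1
    return {
        "index": index,
        "has_href": inanchor > 0,  # positive means HREF is present
        "href": anchors[index],
        "name": anchornames[index],
        "type": anchortypes[index],
    }


first = describe_anchor(1)    # anchor 0, which has an HREF
second = describe_anchor(-2)  # anchor 1, NAME only
```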

The module also defines a number of style sheet classes. These should
never be instantiated; their class variables are the only behaviour
required. Note that style sheets are specifically designed for a
particular formatter implementation. The currently defined style
sheets are:
\index{style sheet}

\begin{datadesc}{NullStylesheet}
A style sheet for use on a dumb output device such as an ASCII
terminal.
\end{datadesc}

\begin{datadesc}{X11Stylesheet}
A style sheet for use with an X11 server.
\end{datadesc}

\begin{datadesc}{MacStylesheet}
A style sheet for use on Apple Macintosh computers.
\end{datadesc}

\begin{datadesc}{StdwinStylesheet}
A style sheet for use with the \code{stdwin} module; it is an alias
for either \code{X11Stylesheet} or \code{MacStylesheet}.
\bimodindex{stdwin}
\end{datadesc}

\begin{datadesc}{GLStylesheet}
A style sheet for use with the SGI Graphics Library and its font
manager (the SGI-specific built-in modules \code{gl} and \code{fm}).
\bimodindex{gl}
\bimodindex{fm}
\end{datadesc}

Style sheets have the following class variables:

\begin{datadesc}{stdfontset}
A list of up to four font definitions, respectively for the roman,
italic, bold and constant-width variant of a font for normal text. If
the list contains fewer than four font definitions, the last item is
used as the default for missing items. The type of a font definition
depends on the formatter in use; its only use is as a parameter to the
formatter's \code{setfont()} method.
\end{datadesc}

\begin{datadesc}{h1fontset}
\dataline{h2fontset}
\dataline{h3fontset}
The font sets used for the various headers (text inside
\code{<H1>...</H1>} tag pairs etc.).
\end{datadesc}

\begin{datadesc}{stdindent}
The indentation of normal text. This is measured in the ``native''
units of the formatter in use; for some formatters these are
characters, for others (especially those that actually support
variable-spacing fonts) pixels or printer points.
\end{datadesc}

\begin{datadesc}{ddindent}
The indentation used for the first level of \code{<DD>} tags.
\end{datadesc}

\begin{datadesc}{ulindent}
The indentation used for the first level of \code{<UL>} tags.
\end{datadesc}

\begin{datadesc}{h1indent}
The indentation used for level 1 headers.
\end{datadesc}

\begin{datadesc}{h2indent}
The indentation used for level 2 headers.
\end{datadesc}

\begin{datadesc}{literalindent}
The indentation used for literal text (text inside
\code{<PRE>...</PRE>} and similar tag pairs).
\end{datadesc}
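
As an illustration of the class variables listed above, here is a
sketch of what a style sheet for a character-cell device might look
like. The class name \code{ToyStylesheet}, every value in it, and the
helper \code{expand_fontset()} are assumptions made for this example;
they are not taken from the module. The helper only spells out the
rule that the last item of a short font set stands in for the missing
ones.

```python
class ToyStylesheet:
    """Hypothetical style sheet; used as a class, never instantiated."""

    # Font definitions are opaque to the parser; strings are used here
    # purely as placeholders.
    stdfontset = ["roman", "italic"]  # bold and constant-width missing
    h1fontset = ["bold"]
    h2fontset = ["bold"]
    h3fontset = ["italic"]

    # Indentation in the formatter's native units (characters here).
    stdindent = 0
    ddindent = 4
    ulindent = 4
    h1indent = 0
    h2indent = 0
    literalindent = 8


def expand_fontset(fontset):
    """Return exactly four fonts: roman, italic, bold, constant-width.

    Missing trailing entries default to the last item given.
    """
    if not fontset:
        raise ValueError("a font set needs at least one font definition")
    fonts = list(fontset[:4])
    while len(fonts) < 4:
        fonts.append(fonts[-1])
    return fonts


roman, italic, bold, fixed = expand_fontset(ToyStylesheet.stdfontset)
```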

Although no documented implementation of a formatter exists, the
\code{FormattingParser} class assumes that formatters have a
certain interface. This interface requires the following methods:
\index{formatter}

\begin{funcdesc}{setfont}{fontspec}
Set the font to be used subsequently. The \var{fontspec} argument is
an item in a style sheet's font set.
\end{funcdesc}

\begin{funcdesc}{flush}{}
Finish the current line, if not empty, and begin a new one.
\end{funcdesc}

\begin{funcdesc}{setleftindent}{n}
Set the left indentation of the following lines to \var{n} units.
\end{funcdesc}

\begin{funcdesc}{needvspace}{n}
Require at least \var{n} blank lines before the next line. Implies
\code{flush()}.
\end{funcdesc}

\begin{funcdesc}{addword}{word\, space}
Add a \var{word} to the current paragraph, followed by \var{space}
spaces.
\end{funcdesc}

\begin{datadesc}{nospace}
If this instance variable is true, empty words are ignored by
\code{addword()}. It is set to false after a non-empty word has been
added.
\end{datadesc}

\begin{funcdesc}{setjust}{justification}
Set the justification of the current paragraph. The
\var{justification} can be \code{'c'} (center), \code{'l'} (left
justified), \code{'r'} (right justified) or \code{'lr'} (left and
right justified).
\end{funcdesc}

\begin{funcdesc}{bgn_anchor}{id}
Begin an anchor. The \var{id} parameter is the value of the parser's
\code{inanchor} attribute.
\end{funcdesc}

\begin{funcdesc}{end_anchor}{id}
End an anchor. The \var{id} parameter is the value of the parser's
\code{inanchor} attribute.
\end{funcdesc}
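
To make the interface above concrete, the following sketch shows one
way a very simple formatter for plain text could satisfy it. It is an
assumption-laden illustration, not the \code{fmt} implementation: the
class name \code{ToyFormatter}, the attributes \code{lines},
\code{words} and \code{anchor_events}, and the initial value of
\code{nospace} are all invented here. Fonts and justification are
merely recorded, and no line wrapping is performed.

```python
class ToyFormatter:
    """Hypothetical plain-text formatter with the methods listed above."""

    def __init__(self):
        self.lines = []          # finished output lines
        self.words = []          # pieces of the line being built
        self.indent = 0          # left indentation, in characters
        self.font = None         # last fontspec passed to setfont()
        self.just = "l"          # recorded only; not applied in this sketch
        self.nospace = True      # assumption: ignore empty words at the start
        self.anchor_events = []  # record of bgn_anchor()/end_anchor() calls

    def setfont(self, fontspec):
        self.font = fontspec

    def flush(self):
        # Finish the current line if it is not empty.
        if self.words:
            text = "".join(self.words).rstrip()
            self.lines.append(" " * self.indent + text)
            self.words = []

    def setleftindent(self, n):
        self.indent = n

    def needvspace(self, n):
        self.flush()
        # Count the blank lines already at the end of the output.
        trailing = 0
        for line in reversed(self.lines):
            if line:
                break
            trailing += 1
        self.lines.extend([""] * max(0, n - trailing))

    def addword(self, word, space):
        if not word and self.nospace:
            return  # empty words are ignored while nospace is true
        self.words.append(word + " " * space)
        if word:
            self.nospace = False

    def setjust(self, justification):
        if justification not in ("c", "l", "r", "lr"):
            raise ValueError("unknown justification: %r" % (justification,))
        self.just = justification

    def bgn_anchor(self, id):
        self.anchor_events.append(("bgn", id))

    def end_anchor(self, id):
        self.anchor_events.append(("end", id))


f = ToyFormatter()
f.addword("", 1)        # ignored: nothing has been added yet
f.addword("Hello,", 1)
f.addword("world", 0)
f.needvspace(1)         # finishes the line and adds one blank line
f.setleftindent(4)
f.bgn_anchor(1)
f.addword("link", 0)
f.end_anchor(1)
f.flush()
```

After these calls \code{f.lines} holds the three lines
\code{'Hello, world'}, \code{''} and \code{'    link'}.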

A sample formatter implementation can be found in the module
\code{fmt}, which in turn uses the module \code{Para}. These are
currently not intended as a
\ttindex{fmt}
\ttindex{Para}