\section{Standard Module \sectcode{sgmllib}}
\label{module-sgmllib}
\stmodindex{sgmllib}
\index{SGML}

This module defines a class \code{SGMLParser} which serves as the
basis for parsing text files formatted in SGML (Standard Generalized
Mark-up Language).  In fact, it does not provide a full SGML parser;
it only parses SGML insofar as it is used by HTML, and the module
exists only as a base for the \code{htmllib} module.
\refstmodindex{htmllib}

In particular, the parser is hardcoded to recognize the following
constructs:

\begin{itemize}

\item
Opening and closing tags of the form
\samp{<\var{tag} \var{attr}="\var{value}" ...>} and
\samp{</\var{tag}>}, respectively.

\item
Numeric character references of the form \samp{\&\#\var{name};}.

\item
Entity references of the form \samp{\&\var{name};}.

\item
SGML comments of the form \samp{<!--\var{text}-->}.  Note that
spaces, tabs, and newlines are allowed between the trailing
\samp{>} and the immediately preceding \samp{--}.

\end{itemize}

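
The four constructs listed above can be approximated with regular
expressions.  The following is only a rough sketch: the patterns and
the helper \code{classify()} are hypothetical names written for this
example, they are not the expressions used inside \code{sgmllib}, and
they are deliberately looser than real SGML.

```python
import re

# Hypothetical patterns for illustration only; NOT sgmllib's own expressions.
CONSTRUCTS = [
    # Comment: whitespace is allowed between the closing "--" and ">".
    ("comment", re.compile(r"<!--(.*?)--\s*>", re.DOTALL)),
    ("endtag", re.compile(r"</([a-zA-Z][-.a-zA-Z0-9]*)\s*>")),
    ("starttag", re.compile(r"<([a-zA-Z][-.a-zA-Z0-9]*)((?:\s+[^<>]*)?)>")),
    ("charref", re.compile(r"&#([0-9]+);")),
    ("entityref", re.compile(r"&([a-zA-Z][a-zA-Z0-9]*);")),
]


def classify(text):
    """Return (kind, first captured group) if text is one construct, else None."""
    for kind, pattern in CONSTRUCTS:
        match = pattern.fullmatch(text)
        if match:
            return kind, match.group(1)
    return None
```

Note that the captured tag name keeps the case used in the input; the
parser described here reports tag names in lower case.
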
The \code{SGMLParser} class must be instantiated without arguments.
It has the following interface methods:

\setindexsubitem{(SGMLParser method)}

\begin{funcdesc}{reset}{}
Reset the instance.  Loses all unprocessed data.  This is called
implicitly at instantiation time.
\end{funcdesc}

\begin{funcdesc}{setnomoretags}{}
Stop processing tags.  Treat all following input as literal input
(CDATA).  (This is only provided so the HTML tag \code{<PLAINTEXT>}
can be implemented.)
\end{funcdesc}

\begin{funcdesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{funcdesc}

\begin{funcdesc}{feed}{data}
Feed some text to the parser.  It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \code{close()} is called.
\end{funcdesc}

\begin{funcdesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark.  This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call \code{SGMLParser.close()}.
\end{funcdesc}

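
The interplay of \code{feed()} and \code{close()} can be pictured with
a small standalone model.  This is a sketch under stated assumptions,
not the module's code: the class \code{BufferSketch}, its
\code{events} list and its helper \code{_goahead()} are hypothetical,
a ``complete element'' is simplified to anything between \samp{<} and
\samp{>}, and flushing an unfinished tag as plain data on
\code{close()} is an assumption of this example.

```python
class BufferSketch:
    """Standalone model of buffered feeding; it does not import sgmllib."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.rawdata = ""   # input that could not be processed yet
        self.events = []    # (kind, text) pairs, for inspection only

    def feed(self, data):
        self.rawdata += data
        self._goahead(end=False)

    def close(self):
        self._goahead(end=True)

    def _goahead(self, end):
        raw = self.rawdata
        i = 0
        while i < len(raw):
            if raw[i] != "<":
                # Plain data runs up to the next "<" or the end of the buffer.
                j = raw.find("<", i)
                if j < 0:
                    j = len(raw)
                self.events.append(("data", raw[i:j]))
                i = j
                continue
            j = raw.find(">", i)
            if j < 0:
                # Incomplete tag: keep it buffered unless input has ended.
                if end:
                    self.events.append(("data", raw[i:]))
                    i = len(raw)
                break
            self.events.append(("tag", raw[i:j + 1]))
            i = j + 1
        self.rawdata = raw[i:]
```

Feeding \code{'a<B'} followed by \code{'>b'} yields the tag only after
the second call, because the first call leaves \code{'<B'} in the
buffer.
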
\begin{funcdesc}{handle_starttag}{tag\, method\, attributes}
This method is called to handle start tags for which either a
\code{start_\var{tag}()} or \code{do_\var{tag}()} method has been
defined.  The \var{tag} argument is the name of the tag converted to
lower case, and the \var{method} argument is the bound method which
should be used to support semantic interpretation of the start tag.
The \var{attributes} argument is a list of (\var{name}, \var{value})
pairs containing the attributes found inside the tag's \code{<>}
brackets.  The \var{name} has been translated to lower case and double
quotes and backslashes in the \var{value} have been interpreted.  For
instance, for the tag \code{<A HREF="http://www.cwi.nl/">}, \var{tag}
would be \code{'a'} and \var{attributes} would be
\code{[('href', 'http://www.cwi.nl/')]}.  The base implementation
simply calls \var{method} with \var{attributes} as the only argument.
\end{funcdesc}

\begin{funcdesc}{handle_endtag}{tag\, method}
This method is called to handle end tags for which an
\code{end_\var{tag}()} method has been defined.  The \var{tag}
argument is the name of the tag converted to lower case, and the
\var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag.  If no
\code{end_\var{tag}()} method is defined for the closing element, this
handler is not called.  The base implementation simply calls
\var{method}.
\end{funcdesc}

\begin{funcdesc}{handle_data}{data}
This method is called to process arbitrary data.  It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{funcdesc}

\begin{funcdesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}.  In the base implementation, \var{ref} must
be a decimal number in the range 0--255.  It translates the character
to \ASCII{} and calls the method \code{handle_data()} with the
character as argument.  If \var{ref} is invalid or out of range, the
method \code{unknown_charref(\var{ref})} is called to handle the
error.  A subclass must override this method to provide support for
named character entities.
\end{funcdesc}

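
The control flow described for \code{handle_charref()} can be sketched
as follows.  This is a standalone model rather than the module's
source: the class name \code{CharRefSketch} and the two recording
lists are hypothetical, and only the decimal range 0--255 and the two
callbacks are taken from the description above.

```python
class CharRefSketch:
    """Standalone model of the documented behaviour; it does not import sgmllib."""

    def __init__(self):
        self.data = []      # text passed to handle_data()
        self.unknown = []   # references passed to unknown_charref()

    def handle_charref(self, ref):
        try:
            n = int(ref)                # ref must be a decimal number
        except ValueError:
            self.unknown_charref(ref)   # invalid reference
            return
        if not 0 <= n <= 255:
            self.unknown_charref(ref)   # out of range
            return
        self.handle_data(chr(n))

    def handle_data(self, data):
        self.data.append(data)

    def unknown_charref(self, ref):
        self.unknown.append(ref)
```
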
\begin{funcdesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the form
\samp{\&\var{ref};} where \var{ref} is the name of a general
entity.  It looks for \var{ref} in the instance (or class)
variable \code{entitydefs} which should be a mapping from entity names
to corresponding translations.
If a translation is found, it calls the method \code{handle_data()}
with the translation; otherwise, it calls the method
\code{unknown_entityref(\var{ref})}.  The default \code{entitydefs}
defines translations for \code{\&amp;}, \code{\&apos;}, \code{\&gt;},
\code{\&lt;}, and \code{\&quot;}.
\end{funcdesc}

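
The lookup performed by \code{handle_entityref()} amounts to a
dictionary access with a fallback.  The sketch below is a standalone
model, not the module's source: the class names and the recording
lists are hypothetical, the table holds the natural translations of
the five entities named above, and the \code{eacute} entry in the
subclass is an assumption added only to show how a derived class
might extend the mapping.

```python
class EntityRefSketch:
    """Standalone model of the documented lookup; it does not import sgmllib."""

    # Class variable, so a subclass or an instance can replace it.
    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    def __init__(self):
        self.data = []      # translations passed to handle_data()
        self.unknown = []   # names passed to unknown_entityref()

    def handle_entityref(self, ref):
        table = self.entitydefs
        if ref in table:
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)


class ExtendedSketch(EntityRefSketch):
    # Hypothetical extension: one extra entity on top of the defaults.
    entitydefs = dict(EntityRefSketch.entitydefs, eacute="\u00e9")
```
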
\begin{funcdesc}{handle_comment}{comment}
This method is called when a comment is encountered.  The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves.  For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}.  The
default method does nothing.
\end{funcdesc}

\begin{funcdesc}{report_unbalanced}{tag}
This method is called when an end tag is found which does not
correspond to any open element.
\end{funcdesc}

\begin{funcdesc}{unknown_starttag}{tag\, attributes}
This method is called to process an unknown start tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{funcdesc}

\begin{funcdesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{funcdesc}

\begin{funcdesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references.  It is intended to be overridden by a derived class; the
base class implementation does nothing.
\end{funcdesc}

\begin{funcdesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference.  It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{funcdesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods of the following form to define
processing of specific tags.  Tag names in the input stream are case
independent; the \var{tag} occurring in method names must be in lower
case:

\begin{funcdescni}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}.  It takes
precedence over \code{do_\var{tag}()}.  The \var{attributes} argument
has the same meaning as described for \code{handle_starttag()} above.
\end{funcdescni}

\begin{funcdescni}{do_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag} that does
not come with a matching closing tag.  The \var{attributes} argument
has the same meaning as described for \code{handle_starttag()} above.
\end{funcdescni}

\begin{funcdescni}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{funcdescni}

Note that the parser maintains a stack of open elements for which no
end tag has been found yet.  Only tags processed by
\code{start_\var{tag}()} are pushed on this stack.  Definition of an
\code{end_\var{tag}()} method is optional for these tags.  For tags
processed by \code{do_\var{tag}()} or by \code{unknown_starttag()}, no
\code{end_\var{tag}()} method should be defined; if one is defined, it
will not be used.  If both \code{start_\var{tag}()} and
\code{do_\var{tag}()} methods exist for a tag, the
\code{start_\var{tag}()} method takes precedence.
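
The dispatch rules and the stack of open elements described above can
be modelled in a few lines.  This is a simplified standalone sketch,
not the module's implementation: the classes \code{DispatchSketch} and
\code{Demo}, the entry points \code{starttag()} and \code{endtag()},
and the \code{log} list are hypothetical, and the real module may
order the end-tag checks differently.

```python
class DispatchSketch:
    """Standalone model of the dispatch rules in the text; not sgmllib itself."""

    def __init__(self):
        self.stack = []   # open elements that were handled by start_<tag>()
        self.log = []     # record of calls, for inspection only

    def starttag(self, tag, attributes):
        tag = tag.lower()                      # method names use lower case
        method = getattr(self, "start_" + tag, None)
        if method is not None:                 # start_<tag>() wins over do_<tag>()
            self.stack.append(tag)
            self.handle_starttag(tag, method, attributes)
            return
        method = getattr(self, "do_" + tag, None)
        if method is not None:                 # do_<tag>(): nothing is pushed
            self.handle_starttag(tag, method, attributes)
            return
        self.unknown_starttag(tag, attributes)

    def endtag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            self.report_unbalanced(tag)
            return
        while self.stack.pop() != tag:         # close elements nested inside it
            pass
        method = getattr(self, "end_" + tag, None)
        if method is not None:                 # end_<tag>() is optional
            self.handle_endtag(tag, method)

    def handle_starttag(self, tag, method, attributes):
        method(attributes)

    def handle_endtag(self, tag, method):
        method()

    def unknown_starttag(self, tag, attributes):
        self.log.append(("unknown_starttag", tag))

    def report_unbalanced(self, tag):
        self.log.append(("report_unbalanced", tag))


class Demo(DispatchSketch):
    def start_a(self, attributes):
        self.log.append(("start_a", attributes))

    def do_a(self, attributes):                # never called: start_a() exists
        self.log.append(("do_a", attributes))

    def end_a(self):
        self.log.append(("end_a",))

    def do_br(self, attributes):
        self.log.append(("do_br", attributes))
```

In this model, an \samp{<A>} tag is pushed on the stack and later
closed by \code{end_a()}, while \samp{<BR>} is handled by
\code{do_br()} without touching the stack, so a stray \samp{</BR>} is
reported as unbalanced.
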