\section{Standard Module \sectcode{sgmllib}}
\label{module-sgmllib}
\stmodindex{sgmllib}
\index{SGML}

This module defines a class \class{SGMLParser} which serves as the
basis for parsing text files formatted in SGML (Standard Generalized
Mark-up Language). In fact, it does not provide a full SGML parser;
it only parses SGML insofar as it is used by HTML, and the module
exists only as a base for the \module{htmllib}\refstmodindex{htmllib}
module.

\begin{classdesc}{SGMLParser}{}
The \class{SGMLParser} class is instantiated without arguments.
The parser is hardcoded to recognize the following
constructs:

\begin{itemize}
\item
Opening and closing tags of the form
\samp{<\var{tag} \var{attr}="\var{value}" ...>} and
\samp{</\var{tag}>}, respectively.

\item
Numeric character references of the form \samp{\&\#\var{name};}.

\item
Entity references of the form \samp{\&\var{name};}.

\item
SGML comments of the form \samp{<!--\var{text}-->}. Note that
spaces, tabs, and newlines are allowed between the trailing
\samp{>} and the immediately preceding \samp{--}.

\end{itemize}
\end{classdesc}

\class{SGMLParser} instances have the following interface methods:

\setindexsubitem{(SGMLParser method)}

\begin{funcdesc}{reset}{}
Reset the instance. Loses all unprocessed data. This is called
implicitly at instantiation time.
\end{funcdesc}

\begin{funcdesc}{setnomoretags}{}
Stop processing tags. Treat all following input as literal input
(CDATA). (This is only provided so the HTML tag \code{<PLAINTEXT>}
can be implemented.)
\end{funcdesc}

\begin{funcdesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{funcdesc}

\begin{funcdesc}{feed}{data}
Feed some text to the parser. It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{funcdesc}

\begin{funcdesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class's \method{close()}.
\end{funcdesc}

\begin{funcdesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which either a
\code{start_\var{tag}()} or \code{do_\var{tag}()} method has been
defined. The \var{tag} argument is the name of the tag converted to
lower case, and the \var{method} argument is the bound method which
should be used to support semantic interpretation of the start tag.
The \var{attributes} argument is a list of \code{(\var{name}, \var{value})}
pairs containing the attributes found inside the tag's \code{<>}
brackets. The \var{name} has been translated to lower case and double
quotes and backslashes in the \var{value} have been interpreted. For
instance, for the tag \code{<A HREF="http://www.cwi.nl/">}, \var{tag}
is \code{'a'} and \var{attributes} is
\samp{[('href', 'http://www.cwi.nl/')]}. The base implementation simply
calls \var{method} with \var{attributes} as the only argument.
\end{funcdesc}

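The lookup order described for start tags can be illustrated with a
small stand-alone sketch. It does not import \module{sgmllib} and is
not the module's implementation; the class \code{DispatchSketch} and
its helper \code{dispatch_start()} are hypothetical names introduced
only for this example, and the use of \code{getattr()} for the lookup
is an assumption.

```python
class DispatchSketch:
    """Hypothetical stand-in that mimics the documented dispatch rules."""

    def __init__(self):
        self.calls = []

    def handle_starttag(self, tag, method, attributes):
        # Documented base behaviour: call method with attributes only.
        method(attributes)

    def unknown_starttag(self, tag, attributes):
        self.calls.append(("unknown", tag, attributes))

    def dispatch_start(self, tag, attributes):
        # Hypothetical helper: tag and attribute names are lower-cased,
        # then start_<tag>() is preferred over do_<tag>().
        tag = tag.lower()
        attributes = [(name.lower(), value) for name, value in attributes]
        method = getattr(self, "start_" + tag, None)
        if method is None:
            method = getattr(self, "do_" + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)

    def start_a(self, attributes):
        self.calls.append(("start_a", attributes))


sketch = DispatchSketch()
sketch.dispatch_start("A", [("HREF", "http://www.cwi.nl/")])
sketch.dispatch_start("BLINK", [])  # no handler defined for this tag
```

After the two calls, \code{sketch.calls} records one call to
\code{start_a()} with the lower-cased attribute list and one call to
\method{unknown_starttag()} for the tag without a handler.
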
\begin{funcdesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an
\code{end_\var{tag}()} method has been defined. The \var{tag}
argument is the name of the tag converted to lower case, and the
\var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag. If no
\code{end_\var{tag}()} method is defined for the closing element,
this handler is not called. The base implementation simply calls
\var{method}.
\end{funcdesc}

\begin{funcdesc}{handle_data}{data}
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{funcdesc}

\begin{funcdesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}. In the base implementation, \var{ref} must
be a decimal number in the
range 0--255. It translates the character to \ASCII{} and calls the
method \method{handle_data()} with the character as argument. If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error. A
subclass must override this method to provide support for named
character entities.
\end{funcdesc}

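The rule for numeric character references can be sketched as
follows. This is a stand-alone illustration written from the
description above, not the code in \module{sgmllib}; the class name
\code{CharRefSketch} and the lists used to record calls are
hypothetical, and parsing \var{ref} with \code{int()} is an
assumption about how a decimal number might be recognized.

```python
class CharRefSketch:
    """Hypothetical stand-in for the documented handle_charref() rule."""

    def __init__(self):
        self.data = []      # arguments passed to handle_data()
        self.unknown = []   # arguments passed to unknown_charref()

    def handle_data(self, data):
        self.data.append(data)

    def unknown_charref(self, ref):
        self.unknown.append(ref)

    def handle_charref(self, ref):
        # Assumption: int() decides whether ref is a decimal number.
        try:
            number = int(ref)
        except ValueError:
            self.unknown_charref(ref)
            return
        # Only values in the range 0-255 are translated.
        if not 0 <= number <= 255:
            self.unknown_charref(ref)
            return
        self.handle_data(chr(number))


sketch = CharRefSketch()
for ref in ("65", "300", "x41"):
    sketch.handle_charref(ref)
```

Here \code{'65'} is passed on to \method{handle_data()} as a single
character, while the out-of-range \code{'300'} and the non-decimal
\code{'x41'} are both routed to \method{unknown_charref()}.
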
\begin{funcdesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the
form \samp{\&\var{ref};} where \var{ref} is the name of a general
entity. It looks for \var{ref} in the instance (or class)
variable \member{entitydefs} which should be a mapping from entity
names to corresponding translations.
If a translation is found, it calls the method \method{handle_data()}
with the translation; otherwise, it calls the method
\code{unknown_entityref(\var{ref})}. The default \member{entitydefs}
defines translations for \code{\&amp;}, \code{\&apos;}, \code{\&gt;},
\code{\&lt;}, and \code{\&quot;}.
\end{funcdesc}

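The \member{entitydefs} lookup, and the way a derived class might
extend the table, can be sketched as below. The sketch is
self-contained and does not use \module{sgmllib}; the class names
\code{EntityRefSketch} and \code{WithCopyright}, and the translation
chosen for the \code{copy} entity, are hypothetical and serve only as
an illustration.

```python
class EntityRefSketch:
    """Hypothetical stand-in for the documented entitydefs lookup."""

    # The five default entity names listed above, with their translations.
    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    def __init__(self):
        self.data = []      # arguments passed to handle_data()
        self.unknown = []   # arguments passed to unknown_entityref()

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)

    def handle_entityref(self, ref):
        # self.entitydefs resolves to an instance variable if one is
        # set, and to the class variable otherwise.
        table = self.entitydefs
        if ref in table:
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)


class WithCopyright(EntityRefSketch):
    # A derived class can supply a larger table as a class variable.
    entitydefs = dict(EntityRefSketch.entitydefs, copy="(c)")


parser = WithCopyright()
for ref in ("amp", "copy", "nbsp"):
    parser.handle_entityref(ref)
```

Copying the base table before adding to it keeps the default
translations of the base class unchanged for other subclasses.
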
\begin{funcdesc}{handle_comment}{comment}
This method is called when a comment is encountered. The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves. For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing.
\end{funcdesc}

\begin{funcdesc}{report_unbalanced}{tag}
This method is called when an end tag is found which does not
correspond to any open element.
\end{funcdesc}

\begin{funcdesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{funcdesc}

\begin{funcdesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{funcdesc}

\begin{funcdesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references. Refer to \method{handle_charref()} to determine what is
handled by default. It is intended to be overridden by a derived
class; the base class implementation does nothing.
\end{funcdesc}

\begin{funcdesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{funcdesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods of the following form to define
processing of specific tags. Tag names in the input stream are case
independent; the \var{tag} occurring in method names must be in lower
case:

\begin{funcdescni}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}. It has
preference over \code{do_\var{tag}()}. The \var{attributes}
argument has the same meaning as described for
\method{handle_starttag()} above.
\end{funcdescni}

\begin{funcdescni}{do_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag} that does
not come with a matching closing tag. The \var{attributes} argument
has the same meaning as described for \method{handle_starttag()} above.
\end{funcdescni}

\begin{funcdescni}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{funcdescni}

Note that the parser maintains a stack of open elements for which no
end tag has been found yet. Only tags processed by
\code{start_\var{tag}()} are pushed on this stack. Definition of an
\code{end_\var{tag}()} method is optional for these tags. For tags
processed by \code{do_\var{tag}()} or by \method{unknown_starttag()}, an
\code{end_\var{tag}()} method need not be defined; if one is defined,
it will not be used. If both \code{start_\var{tag}()} and
\code{do_\var{tag}()} methods exist for a tag, the
\code{start_\var{tag}()} method takes precedence.
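
The stack rules above can be pictured with a simplified,
self-contained sketch. It is not the \module{sgmllib} implementation:
the class \code{StackSketch} and its helpers \code{open_tag()} and
\code{close_tag()} are hypothetical, unknown tags are left out for
brevity, and the choice to close every element opened after the one
being closed is an assumption made for this example.

```python
class StackSketch:
    """Hypothetical stand-in for the documented open-element stack."""

    def __init__(self):
        self.stack = []   # tags opened through start_<tag>() only
        self.log = []

    def report_unbalanced(self, tag):
        self.log.append(("unbalanced", tag))

    def open_tag(self, tag, attributes=()):
        attributes = list(attributes)
        start = getattr(self, "start_" + tag, None)
        if start is not None:
            self.stack.append(tag)      # only start_<tag>() pushes
            start(attributes)
            return
        do = getattr(self, "do_" + tag, None)
        if do is not None:
            do(attributes)              # do_<tag>() never pushes

    def close_tag(self, tag):
        if tag not in self.stack:
            self.report_unbalanced(tag)
            return
        # Assumption: elements opened after `tag` are closed with it.
        while self.stack:
            current = self.stack.pop()
            end = getattr(self, "end_" + current, None)
            if end is not None:         # end_<tag>() is optional
                end()
            if current == tag:
                break


class Demo(StackSketch):
    def start_ul(self, attributes):
        self.log.append("start_ul")

    def end_ul(self):
        self.log.append("end_ul")

    def start_li(self, attributes):     # no end_li() is defined
        self.log.append("start_li")

    def do_br(self, attributes):
        self.log.append("do_br")

    def end_br(self):                   # defined, but never used
        self.log.append("end_br")


demo = Demo()
demo.open_tag("ul")
demo.open_tag("li")
demo.open_tag("br")
stack_after_open = list(demo.stack)
demo.close_tag("br")    # br was never pushed, so it is unbalanced
demo.close_tag("ul")    # closes li (no handler), then ul
```

In this run \code{end_br()} is never called, because tags handled by
\code{do_\var{tag}()} are not placed on the stack.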