\section{Standard Module \sectcode{sgmllib}}
\label{module-sgmllib}
\stmodindex{sgmllib}
\index{SGML}

This module defines a class \class{SGMLParser} which serves as the
basis for parsing text files formatted in SGML (Standard Generalized
Mark-up Language).  It does not provide a full SGML parser; it only
parses SGML insofar as it is used by HTML, and the module exists only
as a base for the \module{htmllib}\refstmodindex{htmllib} module.


\begin{classdesc}{SGMLParser}{}
The \class{SGMLParser} class is instantiated without arguments.
The parser is hardcoded to recognize the following
constructs:

\begin{itemize}
\item
Opening and closing tags of the form
\samp{<\var{tag} \var{attr}="\var{value}" ...>} and
\samp{</\var{tag}>}, respectively.

\item
Numeric character references of the form \samp{\&\#\var{name};}.

\item
Entity references of the form \samp{\&\var{name};}.

\item
SGML comments of the form \samp{<!--\var{text}-->}.  Note that
spaces, tabs, and newlines are allowed between the trailing
\samp{>} and the immediately preceding \samp{--}.

\end{itemize}
\end{classdesc}
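
The four constructs listed above can be approximated with regular
expressions.  The following is only a rough sketch: the pattern names
and the \code{classify()} helper are hypothetical, the patterns are not
the ones used by the module, and they make simplifying assumptions
(for example, attribute values must be double-quoted).

```python
import re

# Hypothetical patterns written for illustration; they are NOT taken
# from the module and cover only the simple forms described above.
NAME = r'[A-Za-z][-.A-Za-z0-9]*'
STARTTAG = re.compile(r'<(' + NAME + r')((?:\s+' + NAME + r'\s*=\s*"[^"]*")*)\s*>')
ENDTAG = re.compile(r'</(' + NAME + r')\s*>')
CHARREF = re.compile(r'&#([0-9]+);')
ENTITYREF = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')
# Whitespace is allowed between the closing "--" and the final ">".
COMMENT = re.compile(r'<!--(.*?)--\s*>', re.DOTALL)

PATTERNS = (
    ("comment", COMMENT),
    ("endtag", ENDTAG),
    ("starttag", STARTTAG),
    ("charref", CHARREF),
    ("entityref", ENTITYREF),
)


def classify(text):
    """Return (kind, payload) for a string holding exactly one construct.

    Anything that matches none of the patterns is reported as data.
    """
    for kind, pattern in PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            return kind, match.group(1)
    return "data", text
```

For instance, \code{classify('</a>')} reports an end tag named
\code{'a'}; note that this sketch does not fold tag names to lower case.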

\class{SGMLParser} instances have the following interface methods:


\begin{methoddesc}{reset}{}
Reset the instance.  Loses all unprocessed data.  This is called
implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{setnomoretags}{}
Stop processing tags.  Treat all following input as literal input
(CDATA).  (This is only provided so the HTML tag \code{<PLAINTEXT>}
can be implemented.)
\end{methoddesc}

\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser.  It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}

\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark.  This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the \method{close()} method of
the base class.
\end{methoddesc}
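
The buffering behaviour of \method{feed()} and \method{close()} can be
illustrated with a small stand-in class.  This is a sketch under stated
assumptions, not the module's implementation: it does not import
\module{sgmllib}, the class names and the \code{events} and
\code{rawdata} attributes are hypothetical, and a ``complete element''
is simplified to mean any text up to the next \samp{>}.

```python
class BufferingSketch:
    """Toy stand-in that holds back an incomplete tag until more input arrives."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Drop any unprocessed data, as reset() is described to do.
        self.rawdata = ""   # input that has not been processed yet
        self.events = []    # (kind, text) pairs processed so far

    def feed(self, data):
        self.rawdata += data
        self._process(at_end=False)

    def close(self):
        # Treat whatever is still buffered as if end-of-file followed it.
        self._process(at_end=True)

    def _process(self, at_end):
        buf = self.rawdata
        pos = 0
        while pos < len(buf):
            lt = buf.find("<", pos)
            if lt < 0:
                # No markup left: everything remaining is plain data.
                self.events.append(("data", buf[pos:]))
                pos = len(buf)
                break
            if lt > pos:
                self.events.append(("data", buf[pos:lt]))
                pos = lt
            gt = buf.find(">", pos)
            if gt < 0:
                # Incomplete tag: keep it buffered unless input has ended.
                if at_end:
                    self.events.append(("data", buf[pos:]))
                    pos = len(buf)
                break
            self.events.append(("tag", buf[pos:gt + 1]))
            pos = gt + 1
        self.rawdata = buf[pos:]


class SummarizingSketch(BufferingSketch):
    """Derived class adding end-of-input processing."""

    def close(self):
        # Always call the base class version first.
        BufferingSketch.close(self)
        self.events.append(("eof", ""))
```

After \code{feed('Hello <b')} only the text before the \samp{<} has been
processed; the partial tag stays in the buffer until the rest of it is
fed or \method{close()} is called.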

\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which either a
\code{start_\var{tag}()} or \code{do_\var{tag}()} method has been
defined.  The \var{tag} argument is the name of the tag converted to
lower case, and the \var{method} argument is the bound method which
should be used to support semantic interpretation of the start tag.
The \var{attributes} argument is a list of \code{(\var{name}, \var{value})}
pairs containing the attributes found inside the tag's \code{<>}
brackets.  The \var{name} has been translated to lower case and double
quotes and backslashes in the \var{value} have been interpreted.  For
instance, for the tag \code{<A HREF="http://www.cwi.nl/">}, the
\var{tag} argument would be \code{'a'} and the \var{attributes}
argument would be \code{[('href', 'http://www.cwi.nl/')]}.  The base
implementation simply calls \var{method} with \var{attributes} as the
only argument.
\end{methoddesc}

\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an
\code{end_\var{tag}()} method has been defined.  The \var{tag}
argument is the name of the tag converted to lower case, and the
\var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag.  If no
\code{end_\var{tag}()} method is defined for the closing element,
this handler is not called.  The base implementation simply calls
\var{method}.
\end{methoddesc}

\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data.  It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}

\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}.  In the base implementation, \var{ref} must
be a decimal number in the range 0--255.  It translates the character
to \ASCII{} and calls the method \method{handle_data()} with the
character as argument.  If \var{ref} is invalid or out of range, the
method \code{unknown_charref(\var{ref})} is called to handle the
error.  A subclass must override this method to provide support for
named character entities.
\end{methoddesc}
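
The rule described for \method{handle_charref()} can be sketched as
follows.  This is a stand-alone illustration that does not import
\module{sgmllib}; the class name and the \code{data} and \code{unknown}
lists are hypothetical and exist only to make the outcome observable.

```python
class CharRefSketch:
    """Toy model of the handle_charref() behaviour described above."""

    def __init__(self):
        self.data = []      # characters passed to handle_data()
        self.unknown = []   # references passed to unknown_charref()

    def handle_data(self, data):
        self.data.append(data)

    def unknown_charref(self, ref):
        self.unknown.append(ref)

    def handle_charref(self, ref):
        # ref is the text between "&#" and ";".
        try:
            number = int(ref)
        except ValueError:
            # Not a decimal number at all.
            self.unknown_charref(ref)
            return
        if not 0 <= number <= 255:
            # Outside the range accepted by the base implementation.
            self.unknown_charref(ref)
            return
        self.handle_data(chr(number))
```

With this sketch, \code{handle_charref('65')} passes \code{'A'} to
\method{handle_data()}, while \code{'300'} and \code{'x41'} are routed
to \method{unknown_charref()}.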

\begin{methoddesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the
form \samp{\&\var{ref};} where \var{ref} is a general entity
name.  It looks for \var{ref} in the instance (or class)
variable \member{entitydefs} which should be a mapping from entity
names to corresponding translations.
If a translation is found, it calls the method \method{handle_data()}
with the translation; otherwise, it calls the method
\code{unknown_entityref(\var{ref})}.  The default \member{entitydefs}
defines translations for \code{\&amp;}, \code{\&apos;}, \code{\&gt;},
\code{\&lt;}, and \code{\&quot;}.
\end{methoddesc}
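
The \member{entitydefs} lookup can be pictured with the following
stand-alone sketch.  It does not import \module{sgmllib}; the class
names and the \code{data} and \code{unknown} lists are hypothetical,
and the \code{nbsp} entry in the derived class is only an example of
extending the mapping.

```python
class EntityRefSketch:
    """Toy model of the handle_entityref() lookup described above."""

    # Class variable mapping entity names to their translations.
    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    def __init__(self):
        self.data = []      # translations passed to handle_data()
        self.unknown = []   # names passed to unknown_entityref()

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)

    def handle_entityref(self, ref):
        # Looked up through self, so an instance or subclass may
        # supply its own mapping.
        table = self.entitydefs
        if ref in table:
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)


class ExtendedEntities(EntityRefSketch):
    """Derived class that adds one more entity (illustrative only)."""

    entitydefs = dict(EntityRefSketch.entitydefs, nbsp="\xa0")
```

Because the mapping is found through the instance, a derived class can
extend it without overriding \method{handle_entityref()} itself.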

\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered.  The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves.  For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}.  The
default method does nothing.
\end{methoddesc}

\begin{methoddesc}{report_unbalanced}{tag}
This method is called when an end tag is found which does not
correspond to any open element.
\end{methoddesc}

\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references.  Refer to \method{handle_charref()} to determine what is
handled by default.  It is intended to be overridden by a derived
class; the base class implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference.  It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods of the following form to define
processing of specific tags.  Tag names in the input stream are case
independent; the \var{tag} occurring in method names must be in lower
case:

\begin{methoddescni}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}.  It has
preference over \code{do_\var{tag}()}.  The \var{attributes}
argument has the same meaning as described for
\method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{do_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag} that does
not come with a matching closing tag.  The \var{attributes} argument
has the same meaning as described for \method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{methoddescni}

Note that the parser maintains a stack of open elements for which no
end tag has been found yet.  Only tags processed by
\code{start_\var{tag}()} are pushed on this stack.  Definition of an
\code{end_\var{tag}()} method is optional for these tags.  For tags
processed by \code{do_\var{tag}()} or by \method{unknown_starttag()}, no
\code{end_\var{tag}()} method should be defined; if defined, it will not
be used.  If both \code{start_\var{tag}()} and \code{do_\var{tag}()}
methods exist for a tag, the \code{start_\var{tag}()} method takes
precedence.
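
The dispatch rules and the stack of open elements can be sketched with
a stand-alone class.  This is an illustration under stated assumptions,
not the module's code: it does not import \module{sgmllib}, the names
\code{DispatchSketch}, \code{finish_starttag()}, \code{finish_endtag()},
\code{Demo} and \code{log} are hypothetical, and the choice to close
any elements opened after the matching one is an assumption of this
sketch.

```python
class DispatchSketch:
    """Toy dispatcher for start_<tag>(), do_<tag>() and end_<tag>() methods."""

    def __init__(self):
        self.stack = []   # open elements that went through start_<tag>()

    def finish_starttag(self, tag, attributes):
        tag = tag.lower()                      # tag names are case independent
        method = getattr(self, "start_" + tag, None)
        if method is not None:                 # start_<tag>() takes precedence
            self.stack.append(tag)
            self.handle_starttag(tag, method, attributes)
            return
        method = getattr(self, "do_" + tag, None)
        if method is not None:                 # do_<tag>(): nothing is pushed
            self.handle_starttag(tag, method, attributes)
            return
        self.unknown_starttag(tag, attributes)

    def finish_endtag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            if hasattr(self, "end_" + tag):
                self.report_unbalanced(tag)    # known tag, but nothing open
            else:
                self.unknown_endtag(tag)
            return
        # Assumption of this sketch: also close elements opened later.
        while self.stack:
            open_tag = self.stack.pop()
            method = getattr(self, "end_" + open_tag, None)
            if method is not None:             # end_<tag>() is optional
                self.handle_endtag(open_tag, method)
            if open_tag == tag:
                break

    # Base behaviour as described in the text above.
    def handle_starttag(self, tag, method, attributes):
        method(attributes)

    def handle_endtag(self, tag, method):
        method()

    def unknown_starttag(self, tag, attributes):
        pass

    def unknown_endtag(self, tag):
        pass

    def report_unbalanced(self, tag):
        pass


class Demo(DispatchSketch):
    """Derived class that records which handler was used."""

    def __init__(self):
        DispatchSketch.__init__(self)
        self.log = []

    def start_a(self, attributes):
        self.log.append(("start_a", attributes))

    def end_a(self):
        self.log.append(("end_a",))

    def do_br(self, attributes):
        self.log.append(("do_br", attributes))

    def unknown_starttag(self, tag, attributes):
        self.log.append(("unknown_starttag", tag))

    def unknown_endtag(self, tag):
        self.log.append(("unknown_endtag", tag))

    def report_unbalanced(self, tag):
        self.log.append(("report_unbalanced", tag))
```

In this sketch only \code{<A>} ends up on the stack; \code{<BR>} is
handled by \code{do_br()} and leaves the stack unchanged, so a later
\code{</BR>} is treated as an unknown end tag.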