\section{\module{sgmllib} ---
         Simple SGML parser.}
\declaremodule{standard}{sgmllib}

\modulesynopsis{Only as much of an SGML parser as needed to parse HTML.}

\index{SGML}

This module defines a class \class{SGMLParser} which serves as the
basis for parsing text files formatted in SGML (Standard Generalized
Mark-up Language).  It does not provide a full SGML parser; it only
parses SGML insofar as it is used by HTML, and the module exists only
as a base for the \module{htmllib}\refstmodindex{htmllib} module.


\begin{classdesc}{SGMLParser}{}
The \class{SGMLParser} class is instantiated without arguments.
The parser is hardcoded to recognize the following
constructs:

\begin{itemize}
\item
Opening and closing tags of the form
\samp{<\var{tag} \var{attr}="\var{value}" ...>} and
\samp{</\var{tag}>}, respectively.

\item
Numeric character references of the form \samp{\&\#\var{name};}.

\item
Entity references of the form \samp{\&\var{name};}.

\item
SGML comments of the form \samp{<!--\var{text}-->}.  Note that
spaces, tabs, and newlines are allowed between the trailing
\samp{>} and the immediately preceding \samp{--}.

\end{itemize}
\end{classdesc}

\class{SGMLParser} instances have the following interface methods:


\begin{methoddesc}{reset}{}
Reset the instance, discarding all unprocessed data.  This is called
implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{setnomoretags}{}
Stop processing tags.  Treat all following input as literal input
(CDATA).  (This is provided only so that the HTML tag
\code{<PLAINTEXT>} can be implemented.)
\end{methoddesc}

\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser.  It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}

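The buffering described for \method{feed()} and \method{close()} can be
pictured with a small stand-alone model.  This is only a sketch and not
the module's implementation: the class \class{BufferingSketch}, its
attribute names, and the rule of holding back everything from an
unterminated \samp{<} onward are assumptions made for illustration.

```python
class BufferingSketch:
    """Toy model of feed()/close() buffering; not the sgmllib source."""

    def __init__(self):
        self.rawdata = ''      # text received but not yet processed
        self.processed = []    # chunks that were complete enough to process

    def feed(self, data):
        self.rawdata = self.rawdata + data
        self._goahead(end=False)

    def close(self):
        # Treat whatever is still buffered as if end-of-file followed it.
        self._goahead(end=True)

    def _goahead(self, end):
        # Hypothetical helper: hold back text from an unterminated '<'
        # onward, unless end-of-input forces it out.
        cut = len(self.rawdata)
        if not end:
            lt = self.rawdata.rfind('<')
            if lt != -1 and self.rawdata.find('>', lt) == -1:
                cut = lt
        if cut:
            self.processed.append(self.rawdata[:cut])
        self.rawdata = self.rawdata[cut:]


p = BufferingSketch()
p.feed('<p>Hello <a hr')      # the tag '<a hr' is incomplete and is held back
p.feed('ef="x">link</a>')     # the rest arrives; everything is now complete
p.close()
```

After the first call only \code{'<p>Hello '} has been processed; the
partial tag stays in the buffer until the second call completes it.
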
\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark.  This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class \method{close()}.
\end{methoddesc}

\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which either a
\code{start_\var{tag}()} or \code{do_\var{tag}()} method has been
defined.  The \var{tag} argument is the name of the tag converted to
lower case, and the \var{method} argument is the bound method which
should be used to support semantic interpretation of the start tag.
The \var{attributes} argument is a list of \code{(\var{name}, \var{value})}
pairs containing the attributes found inside the tag's \code{<>}
brackets.  The \var{name} has been translated to lower case and double
quotes and backslashes in the \var{value} have been interpreted.  For
instance, for the tag \code{<A HREF="http://www.cwi.nl/">}, this
method would be called as \samp{handle_starttag('a', \var{method},
[('href', 'http://www.cwi.nl/')])}, where \var{method} is the bound
\code{start_a()} or \code{do_a()} method.  The base implementation
simply calls \var{method} with \var{attributes} as the only argument.
\end{methoddesc}

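The dispatch described for \method{handle_starttag()} can be sketched as
follows.  This is an illustrative model under stated assumptions, not the
module's source: the class \class{DispatchSketch}, the helper
\method{dispatch_starttag()}, and the \member{log} attribute are
hypothetical names; only the \method{handle_starttag()},
\method{unknown_starttag()}, and \code{start_\var{tag}()} /
\code{do_\var{tag}()} conventions come from the text.

```python
class DispatchSketch:
    """Toy model of start-tag dispatch; not the sgmllib source."""

    def __init__(self):
        self.log = []

    def dispatch_starttag(self, tag, attributes):
        # Hypothetical helper: pick a handler for an already-parsed tag.
        tag = tag.lower()                       # tag names are lower-cased
        method = getattr(self, 'start_' + tag, None)
        if method is None:
            method = getattr(self, 'do_' + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)

    def handle_starttag(self, tag, method, attributes):
        # As described for the base implementation: call the bound
        # method with the attribute list as its only argument.
        method(attributes)

    def unknown_starttag(self, tag, attributes):
        self.log.append(('unknown', tag, attributes))

    def start_a(self, attributes):
        self.log.append(('start_a', attributes))


d = DispatchSketch()
d.dispatch_starttag('A', [('href', 'http://www.cwi.nl/')])
d.dispatch_starttag('BLINK', [])   # no start_blink() or do_blink() defined
```

A derived class that wants to intercept every known start tag (for
example, to log it) would override \method{handle_starttag()} and then
call \var{method} itself.
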
\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an
\code{end_\var{tag}()} method has been defined.  The \var{tag}
argument is the name of the tag converted to lower case, and the
\var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag.  If no
\code{end_\var{tag}()} method is defined for the closing element,
this handler is not called.  The base implementation simply calls
\var{method}.
\end{methoddesc}

\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data.  It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}

\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}.  In the base implementation, \var{ref} must
be a decimal number in the
range 0-255.  It translates the character to \ASCII{} and calls the
method \method{handle_data()} with the character as argument.  If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error.  A
subclass must override this method to provide support for named
character entities.
\end{methoddesc}

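The rule described for \method{handle_charref()} can be modelled in a
few lines.  This is a hedged sketch rather than the module's code: the
class \class{CharrefSketch} and its \member{data} and \member{unknown}
lists are hypothetical, and the use of \function{int()} to validate the
number is an assumption.

```python
class CharrefSketch:
    """Toy model of the handle_charref() behaviour described above."""

    def __init__(self):
        self.data = []       # characters passed to handle_data()
        self.unknown = []    # references passed to unknown_charref()

    def handle_charref(self, ref):
        try:
            n = int(ref)                 # ref must be a decimal number
        except ValueError:
            self.unknown_charref(ref)
            return
        if not 0 <= n <= 255:            # outside the supported range
            self.unknown_charref(ref)
            return
        self.handle_data(chr(n))

    def handle_data(self, data):
        self.data.append(data)

    def unknown_charref(self, ref):
        self.unknown.append(ref)


p = CharrefSketch()
for ref in ('65', '300', 'x41'):
    p.handle_charref(ref)
```

Here \code{'65'} is delivered to \method{handle_data()} as a character,
while \code{'300'} (out of range) and \code{'x41'} (not decimal) are
routed to \method{unknown_charref()}.
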
\begin{methoddesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the
form \samp{\&\var{ref};} where \var{ref} is the name of a general
entity.  It looks for \var{ref} in the instance (or class)
variable \member{entitydefs} which should be a mapping from entity
names to corresponding translations.
If a translation is found, it calls the method \method{handle_data()}
with the translation; otherwise, it calls the method
\code{unknown_entityref(\var{ref})}.  The default \member{entitydefs}
defines translations for \code{\&amp;}, \code{\&apos;}, \code{\&gt;},
\code{\&lt;}, and \code{\&quot;}.
\end{methoddesc}

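The \member{entitydefs} lookup can be illustrated with a stand-alone
model.  The sketch below is not the module's implementation: the classes
\class{EntityrefSketch} and \class{WithCopyright}, the \member{data} and
\member{unknown} lists, and the \code{copy} entry are hypothetical and
serve only to show how a derived class might extend the table.

```python
class EntityrefSketch:
    """Toy model of the handle_entityref() lookup described above."""

    # Default table as listed in the text: amp, apos, gt, lt, quot.
    entitydefs = {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': "'"}

    def __init__(self):
        self.data = []       # translations passed to handle_data()
        self.unknown = []    # names passed to unknown_entityref()

    def handle_entityref(self, ref):
        table = self.entitydefs          # instance attribute, else class
        if ref in table:
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)


class WithCopyright(EntityrefSketch):
    # Copy the default table before extending it, so the base class
    # mapping is left untouched.
    entitydefs = dict(EntityrefSketch.entitydefs)
    entitydefs['copy'] = '(c)'


p = WithCopyright()
for ref in ('lt', 'copy', 'nbsp'):
    p.handle_entityref(ref)
```

Copying the mapping before adding to it keeps the extension local to the
derived class.
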
\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered.  The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves.  For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}.  The
default method does nothing.
\end{methoddesc}

\begin{methoddesc}{report_unbalanced}{tag}
This method is called when an end tag is found which does not
correspond to any open element.
\end{methoddesc}

\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references.  Refer to \method{handle_charref()} to determine what is
handled by default.  It is intended to be overridden by a derived
class; the base class implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference.  It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods of the following form to define
processing of specific tags.  Tag names in the input stream are
case-insensitive; the \var{tag} occurring in method names must be in
lower case:

\begin{methoddescni}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}.  It takes
precedence over \code{do_\var{tag}()}.  The \var{attributes}
argument has the same meaning as described for
\method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{do_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag} that does
not come with a matching closing tag.  The \var{attributes} argument
has the same meaning as described for \method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{methoddescni}

Note that the parser maintains a stack of open elements for which no
end tag has been found yet.  Only tags processed by
\code{start_\var{tag}()} are pushed on this stack.  Definition of an
\code{end_\var{tag}()} method is optional for these tags.  For tags
processed by \code{do_\var{tag}()} or by \method{unknown_starttag()},
no \code{end_\var{tag}()} method should be defined; if defined, it will
not be used.  If both \code{start_\var{tag}()} and \code{do_\var{tag}()}
methods exist for a tag, the \code{start_\var{tag}()} method takes
precedence.
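
The stack rules above can be made concrete with a small model.  This is
a sketch under stated assumptions and not the module's source: the class
\class{StackSketch}, the helpers \method{starttag()} and
\method{endtag()}, the \member{log} list, and in particular the policy of
popping elements until the matching one is reached are all hypothetical.

```python
class StackSketch:
    """Toy model of the open-element stack; not the sgmllib source."""

    def __init__(self):
        self.stack = []   # open elements started via start_<tag>()
        self.log = []

    def starttag(self, tag, attributes=()):
        tag = tag.lower()
        attributes = list(attributes)
        start = getattr(self, 'start_' + tag, None)
        if start is not None:            # start_<tag>() takes precedence
            self.stack.append(tag)       # only these tags are pushed
            start(attributes)
            return
        do = getattr(self, 'do_' + tag, None)
        if do is not None:               # no closing tag is expected
            do(attributes)
        else:
            self.unknown_starttag(tag, attributes)

    def endtag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            self.report_unbalanced(tag)
            return
        # Assumed policy: unwind the stack down to the matching element.
        while self.stack:
            top = self.stack.pop()
            end = getattr(self, 'end_' + top, None)
            if end is not None:          # end_<tag>() is optional
                end()
            if top == tag:
                break

    def report_unbalanced(self, tag):
        self.log.append('unbalanced ' + tag)

    def unknown_starttag(self, tag, attributes):
        self.log.append('unknown ' + tag)

    def start_b(self, attributes):
        self.log.append('start_b')

    def end_b(self):
        self.log.append('end_b')

    def start_i(self, attributes):       # note: no end_i() is defined
        self.log.append('start_i')

    def do_br(self, attributes):
        self.log.append('do_br')


p = StackSketch()
for name in ('B', 'BR', 'I'):
    p.starttag(name)
p.endtag('I')     # popped silently, since end_i() does not exist
p.endtag('B')     # popped, and end_b() is called
p.endtag('BR')    # never pushed, so it is reported as unbalanced
```

In this model \code{BR} is handled by \code{do_br()} and never enters
the stack, which is why its end tag finds no open element.
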