\section{\module{sgmllib} ---
         Simple SGML parser}

\declaremodule{standard}{sgmllib}
\modulesynopsis{Only as much of an SGML parser as needed to parse HTML.}

\index{SGML}

This module defines a class \class{SGMLParser} which serves as the
basis for parsing text files formatted in SGML (Standard Generalized
Mark-up Language).  In fact, it does not provide a full SGML parser
--- it only parses SGML insofar as it is used by HTML, and the module
only exists as a base for the \refmodule{htmllib} module.  Another
HTML parser which supports XHTML and offers a somewhat different
interface is available in the \refmodule{HTMLParser} module.

\begin{classdesc}{SGMLParser}{}
The \class{SGMLParser} class is instantiated without arguments.
The parser is hardcoded to recognize the following
constructs:

\begin{itemize}
\item
Opening and closing tags of the form
\samp{<\var{tag} \var{attr}="\var{value}" ...>} and
\samp{</\var{tag}>}, respectively.

\item
Numeric character references of the form \samp{\&\#\var{name};}.

\item
Entity references of the form \samp{\&\var{name};}.

\item
SGML comments of the form \samp{<!--\var{text}-->}.  Note that
spaces, tabs, and newlines are allowed between the trailing
\samp{>} and the immediately preceding \samp{--}.

\end{itemize}
\end{classdesc}

A single exception is defined as well:

\begin{excdesc}{SGMLParseError}
Exception raised by the \class{SGMLParser} class when it encounters an
error while parsing.
\versionadded{2.1}
\end{excdesc}


\class{SGMLParser} instances have the following methods:


\begin{methoddesc}{reset}{}
Reset the instance.  Loses all unprocessed data.  This is called
implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{setnomoretags}{}
Stop processing tags.  Treat all following input as literal input
(CDATA).  (This is only provided so the HTML tag
\code{<PLAINTEXT>} can be implemented.)
\end{methoddesc}

\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser.  It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}
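
The buffering behaviour of \method{feed()} and \method{close()} can be
pictured with the following sketch.  It does not use \module{sgmllib}
itself: \class{BufferingFeeder} and its \method{_goahead()} helper are
hypothetical names chosen for illustration, and the stand-in recognizes
only bare \samp{<...>} tags and plain text.  It is meant as a rough
model of the idea under those assumptions, not as the module's
implementation.

```python
class BufferingFeeder:
    """Hypothetical stand-in: reports complete tags and text, buffers the rest."""

    def __init__(self):
        self.rawdata = ''   # unprocessed input carried over between calls
        self.events = []    # what a real parser would pass to its handlers

    def feed(self, data):
        self.rawdata += data
        self._goahead(end=False)

    def close(self):
        # Treat the buffered input as if it were followed by end-of-file.
        self._goahead(end=True)

    def _goahead(self, end):
        raw = self.rawdata
        i = 0
        while i < len(raw):
            if raw[i] == '<':
                j = raw.find('>', i)
                if j < 0:
                    break               # incomplete tag: wait for more data
                self.events.append(('tag', raw[i + 1:j]))
                i = j + 1
            else:
                j = raw.find('<', i)
                if j < 0:
                    j = len(raw)
                self.events.append(('data', raw[i:j]))
                i = j
        if end and i < len(raw):
            # At end of input, flush whatever is left as plain data.
            self.events.append(('data', raw[i:]))
            i = len(raw)
        self.rawdata = raw[i:]


feeder = BufferingFeeder()
feeder.feed('<p>Hi <b')                  # the trailing '<b' is incomplete
pending_after_first_feed = feeder.rawdata
feeder.feed('r>there')                   # completes the '<br>' tag
feeder.close()
```

After the first call the incomplete fragment stays in the buffer; the
second call completes it, so the tag is reported only once it has been
seen in full.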

\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark.  This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class
\method{close()}.
\end{methoddesc}

\begin{methoddesc}{get_starttag_text}{}
Return the text of the most recently opened start tag.  This should
not normally be needed for structured processing, but may be useful in
dealing with HTML ``as deployed'' or for re-generating input with
minimal changes (whitespace between attributes can be preserved,
etc.).
\end{methoddesc}

\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which either a
\method{start_\var{tag}()} or \method{do_\var{tag}()} method has been
defined.  The \var{tag} argument is the name of the tag converted to
lower case, and the \var{method} argument is the bound method which
should be used to support semantic interpretation of the start tag.
The \var{attributes} argument is a list of \code{(\var{name},
\var{value})} pairs containing the attributes found inside the tag's
\code{<>} brackets.  The \var{name} has been translated to lower case
and double quotes and backslashes in the \var{value} have been interpreted.
For instance, for the tag \code{<A HREF="http://www.cwi.nl/">}, the
\var{tag} argument would be \code{'a'} and the \var{attributes}
argument would be \samp{[('href', 'http://www.cwi.nl/')]}.  The base
implementation simply calls \var{method} with \var{attributes} as the
only argument.
\end{methoddesc}

\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an
\method{end_\var{tag}()} method has been defined.  The
\var{tag} argument is the name of the tag converted to lower case, and
the \var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag.  If no
\method{end_\var{tag}()} method is defined for the closing element,
this handler is not called.  The base implementation simply calls
\var{method}.
\end{methoddesc}

\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data.  It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}

\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}.  In the base implementation, \var{ref} must
be a decimal number in the
range 0-255.  It translates the character to \ASCII{} and calls the
method \method{handle_data()} with the character as argument.  If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error.  A
subclass must override this method to provide support for named
character entities.
\end{methoddesc}
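
The translation rule described for \method{handle_charref()} can be
sketched as a small stand-alone function.  The name
\function{translate_charref} is hypothetical and the function is not
part of \module{sgmllib}; it only mirrors the documented rule (decimal
digits, range 0-255) and returns \code{None} where the parser would
call \method{unknown_charref()} instead.

```python
def translate_charref(ref):
    """Return the character for a decimal reference, or None if unusable.

    Hypothetical helper: None marks the cases in which the documented
    behaviour is to call unknown_charref(ref).
    """
    try:
        number = int(ref)        # only decimal references are accepted
    except ValueError:
        return None              # e.g. a hexadecimal or named reference
    if not 0 <= number <= 255:
        return None              # outside the documented range
    return chr(number)


letter = translate_charref('65')       # as in the reference &#65;
too_large = translate_charref('256')
not_decimal = translate_charref('x41')
```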

\begin{methoddesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the
form \samp{\&\var{ref};} where \var{ref} is a general entity
reference.  It looks for \var{ref} in the instance (or class)
variable \member{entitydefs} which should be a mapping from entity
names to corresponding translations.  If a translation is found, it
calls the method \method{handle_data()} with the translation;
otherwise, it calls the method \code{unknown_entityref(\var{ref})}.
The default \member{entitydefs} defines translations for
\code{\&amp;}, \code{\&apos;}, \code{\&gt;}, \code{\&lt;}, and
\code{\&quot;}.
\end{methoddesc}
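
The lookup performed by \method{handle_entityref()} amounts to a
mapping lookup with a fallback.  The sketch below is a stand-alone
illustration rather than the module's code: the function
\function{resolve_entityref} and its callback parameters are
hypothetical, and the table is an assumed equivalent of the default
\member{entitydefs} covering the five entities listed above.

```python
# Assumed contents of a default-style table: entity name -> translation.
entitydefs = {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': "'"}


def resolve_entityref(ref, table, handle_data, unknown_entityref):
    """Hypothetical helper mirroring the documented lookup order."""
    if ref in table:
        handle_data(table[ref])      # translation found: treat it as data
    else:
        unknown_entityref(ref)       # no translation: report the name


translated, unknown = [], []
resolve_entityref('amp', entitydefs, translated.append, unknown.append)
resolve_entityref('nbsp', entitydefs, translated.append, unknown.append)
```

A derived class that needs more entities would extend the mapping
rather than change the lookup itself.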

\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered.  The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves.  For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}.  The
default method does nothing.
\end{methoddesc}
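
As an illustration of which part of a comment reaches
\method{handle_comment()}, the following sketch extracts the comment
text with a regular expression.  The pattern and the function
\function{comment_text} are assumptions made for this example, not the
module's own code; the pattern also accepts whitespace between the
closing \samp{--} and \samp{>}, as described for the recognized
constructs.

```python
import re

# Assumed pattern: text between '<!--' and '--', optional whitespace, '>'.
COMMENT = re.compile(r'<!--(.*?)--\s*>', re.DOTALL)


def comment_text(markup):
    """Return the text a comment handler would receive, or None."""
    match = COMMENT.match(markup)
    if match is None:
        return None                  # not a comment at all
    return match.group(1)            # delimiters are not included


plain = comment_text('<!--text-->')
spaced = comment_text('<!--text-- \n>')
not_a_comment = comment_text('<p>')
```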

\begin{methoddesc}{handle_decl}{data}
Method called when an SGML declaration is read by the parser.  In
practice, the \code{DOCTYPE} declaration is the only thing observed in
HTML, but the parser does not discriminate among different (or broken)
declarations.  Internal subsets in a \code{DOCTYPE} declaration are
not supported.  The \var{data} parameter will be the entire contents
of the declaration inside the \code{<!}...\code{>} markup.  The
default implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{report_unbalanced}{tag}
This method is called when an end tag is found which does not
correspond to any open element.
\end{methoddesc}

\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references.  Refer to \method{handle_charref()} to determine what is
handled by default.  It is intended to be overridden by a derived
class; the base class implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference.  It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods of the following form to define
processing of specific tags.  Tag names in the input stream are case
independent; the \var{tag} occurring in method names must be in lower
case:

\begin{methoddescni}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}.  It has
preference over \method{do_\var{tag}()}.  The
\var{attributes} argument has the same meaning as described for
\method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{do_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag} that does
not come with a matching closing tag.  The \var{attributes} argument
has the same meaning as described for \method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{methoddescni}

Note that the parser maintains a stack of open elements for which no
end tag has been found yet.  Only tags processed by
\method{start_\var{tag}()} are pushed on this stack.  Definition of an
\method{end_\var{tag}()} method is optional for these tags.  For tags
processed by \method{do_\var{tag}()} or by \method{unknown_starttag()},
no \method{end_\var{tag}()} method should be defined; if one is defined,
it will not be used.  If both \method{start_\var{tag}()} and
\method{do_\var{tag}()} methods exist for a tag, the
\method{start_\var{tag}()} method takes precedence.
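
The dispatch and stack rules above can be modelled with a short,
self-contained sketch.  It deliberately does not import
\module{sgmllib}: \class{MiniDispatcher}, its
\method{dispatch_starttag()} method and the \class{Demo} subclass are
hypothetical names, and the code reproduces only the lookup order and
the stack rule described here, under the assumption that handlers are
found by attribute lookup on the instance.

```python
class MiniDispatcher:
    """Hypothetical stand-in that mimics only the documented dispatch rule."""

    def __init__(self):
        self.stack = []              # open elements handled by start_<tag>

    def dispatch_starttag(self, tag, attributes):
        tag = tag.lower()            # tag names are case independent
        start = getattr(self, 'start_' + tag, None)
        if start is not None:        # start_<tag> wins over do_<tag>
            self.stack.append(tag)   # only these elements are tracked
            start(attributes)
            return 'start'
        do = getattr(self, 'do_' + tag, None)
        if do is not None:
            do(attributes)           # no closing tag expected, nothing pushed
            return 'do'
        self.unknown_starttag(tag, attributes)
        return 'unknown'

    def unknown_starttag(self, tag, attributes):
        pass                         # meant to be overridden


class Demo(MiniDispatcher):
    def __init__(self):
        MiniDispatcher.__init__(self)
        self.events = []

    def start_a(self, attributes):
        self.events.append(('start_a', attributes))

    def do_br(self, attributes):
        self.events.append(('do_br', attributes))

    def start_p(self, attributes):
        self.events.append(('start_p', attributes))

    def do_p(self, attributes):      # never reached: start_p takes precedence
        self.events.append(('do_p', attributes))

    def unknown_starttag(self, tag, attributes):
        self.events.append(('unknown', tag))


demo = Demo()
kinds = [
    demo.dispatch_starttag('A', [('href', 'http://www.cwi.nl/')]),
    demo.dispatch_starttag('BR', []),
    demo.dispatch_starttag('p', []),
    demo.dispatch_starttag('blink', []),
]
```

In this model only the \samp{a} and \samp{p} elements end up on the
stack, because they are the ones handled by a
\method{start_\var{tag}()} method.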