\section{\module{sgmllib} ---
         Simple SGML parser}

\declaremodule{standard}{sgmllib}
\modulesynopsis{Only as much of an SGML parser as needed to parse HTML.}

\index{SGML}

This module defines a class \class{SGMLParser} which serves as the
basis for parsing text files formatted in SGML (Standard Generalized
Mark-up Language). In fact, it does not provide a full SGML parser;
it only parses SGML insofar as it is used by HTML, and the module
only exists as a base for the \refmodule{htmllib}\refstmodindex{htmllib}
module.
Guido van Rossum86751151995-02-28 17:14:32 +000015
Fred Drake2dde74c1998-03-12 14:42:23 +000016
17\begin{classdesc}{SGMLParser}{}
18The \class{SGMLParser} class is instantiated without arguments.
19The parser is hardcoded to recognize the following
Fred Drake42439ad1996-10-08 21:51:49 +000020constructs:
Guido van Rossum86751151995-02-28 17:14:32 +000021
22\begin{itemize}
Guido van Rossum86751151995-02-28 17:14:32 +000023\item
24Opening and closing tags of the form
Fred Drakeb441eb81998-02-13 14:37:12 +000025\samp{<\var{tag} \var{attr}="\var{value}" ...>} and
26\samp{</\var{tag}>}, respectively.
Guido van Rossum86751151995-02-28 17:14:32 +000027
28\item
Fred Drakeb441eb81998-02-13 14:37:12 +000029Numeric character references of the form \samp{\&\#\var{name};}.
Guido van Rossum86751151995-02-28 17:14:32 +000030
31\item
Fred Drakeb441eb81998-02-13 14:37:12 +000032Entity references of the form \samp{\&\var{name};}.
Guido van Rossum86751151995-02-28 17:14:32 +000033
34\item
Fred Drakeb441eb81998-02-13 14:37:12 +000035SGML comments of the form \samp{<!--\var{text}-->}. Note that
Fred Drake42439ad1996-10-08 21:51:49 +000036spaces, tabs, and newlines are allowed between the trailing
Thomas Woutersf8316632000-07-16 19:01:10 +000037\samp{>} and the immediately preceding \samp{--}.
Guido van Rossum86751151995-02-28 17:14:32 +000038
39\end{itemize}
Fred Drake2dde74c1998-03-12 14:42:23 +000040\end{classdesc}

\class{SGMLParser} instances have the following interface methods:


\begin{methoddesc}{reset}{}
Reset the instance. Loses all unprocessed data. This is called
implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{setnomoretags}{}
Stop processing tags. Treat all following input as literal input
(CDATA). (This is only provided so the HTML tag
\code{<PLAINTEXT>} can be implemented.)
\end{methoddesc}

\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser. It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}

\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class \method{close()}.
\end{methoddesc}

\begin{methoddesc}{get_starttag_text}{}
Return the text of the most recently opened start tag. This should
not normally be needed for structured processing, but may be useful in
dealing with HTML ``as deployed'' or for re-generating input with
minimal changes (whitespace between attributes can be preserved,
etc.).
\end{methoddesc}

\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which either a
\method{start_\var{tag}()} or \method{do_\var{tag}()} method has been
defined. The \var{tag} argument is the name of the tag converted to
lower case, and the \var{method} argument is the bound method which
should be used to support semantic interpretation of the start tag.
The \var{attributes} argument is a list of \code{(\var{name},
\var{value})} pairs containing the attributes found inside the tag's
\code{<>} brackets. The \var{name} has been translated to lower case
and double quotes and backslashes in the \var{value} have been interpreted.
For instance, for the tag \code{<A HREF="http://www.cwi.nl/">}, this
method would be called with \code{'a'} as \var{tag} and \code{[('href',
'http://www.cwi.nl/')]} as \var{attributes}. The base implementation
simply calls \var{method} with \var{attributes} as the only argument.
\end{methoddesc}

\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an
\method{end_\var{tag}()} method has been defined. The
\var{tag} argument is the name of the tag converted to lower case, and
the \var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag. If no
\method{end_\var{tag}()} method is defined for the closing element,
this handler is not called. The base implementation simply calls
\var{method}.
\end{methoddesc}

\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}

\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}. In the base implementation, \var{ref} must
be a decimal number in the
range 0--255. It translates the character to \ASCII{} and calls the
method \method{handle_data()} with the character as argument. If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error. A
subclass must override this method to provide support for named
character entities.
\end{methoddesc}
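
The range check described for \method{handle_charref()} can be
sketched roughly as follows. This is only an illustration of the
documented rule, not code from \module{sgmllib}: the helper name
\code{resolve_charref} is hypothetical, and it returns a value where
the real method would call \method{handle_data()} or
\method{unknown_charref()}.

```python
def resolve_charref(ref):
    """Return the character for a decimal reference in 0-255, else None.

    Hypothetical helper that mirrors the documented base-class rule;
    it is not part of sgmllib.
    """
    try:
        number = int(ref)
    except ValueError:
        # Not a decimal number: the parser would call unknown_charref().
        return None
    if not 0 <= number <= 255:
        # Out of range: the parser would call unknown_charref().
        return None
    # In range: the parser would pass this character to handle_data().
    return chr(number)
```

A derived class that needs a wider range would presumably override
\method{handle_charref()} itself rather than rely on this check.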

\begin{methoddesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the
form \samp{\&\var{ref};} where \var{ref} is the name of a general
entity. It looks for \var{ref} in the instance (or class)
variable \member{entitydefs} which should be a mapping from entity
names to corresponding translations. If a translation is found, it
calls the method \method{handle_data()} with the translation;
otherwise, it calls the method \code{unknown_entityref(\var{ref})}.
The default \member{entitydefs} defines translations for
\code{\&amp;}, \code{\&apos;}, \code{\&gt;}, \code{\&lt;}, and
\code{\&quot;}.
\end{methoddesc}
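
The lookup described for \method{handle_entityref()} amounts to a
dictionary test with a fallback. The sketch below models that
behaviour in a small stand-alone class; the class name
\code{EntityResolver} and the \code{data} and \code{unknown} lists are
hypothetical and exist only to make the two outcomes visible. It does
not import or reproduce \module{sgmllib}.

```python
class EntityResolver(object):
    """Toy model of the documented entity lookup; not sgmllib itself."""

    # The five default names listed in the text, mapped to their characters.
    entitydefs = {"amp": "&", "apos": "'", "gt": ">", "lt": "<", "quot": '"'}

    def __init__(self):
        self.data = []      # translations passed to handle_data()
        self.unknown = []   # names passed to unknown_entityref()

    def handle_entityref(self, ref):
        # Look the name up in the instance (or class) mapping.
        table = self.entitydefs
        if ref in table:
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)
```

Because the mapping is found through the instance, a derived class or a
single instance can supply a larger \member{entitydefs} without
touching the lookup logic.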

\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered. The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves. For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing.
\end{methoddesc}

\begin{methoddesc}{handle_decl}{data}
Method called when an SGML declaration is read by the parser. In
practice, the \code{DOCTYPE} declaration is the only thing observed in
HTML, but the parser does not discriminate among different (or broken)
declarations. Internal subsets in a \code{DOCTYPE} declaration are
not supported. The \var{data} parameter will be the entire contents
of the declaration inside the \code{<!}...\code{>} markup. The
default implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{report_unbalanced}{tag}
This method is called when an end tag is found which does not
correspond to any open element.
\end{methoddesc}

\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references. Refer to \method{handle_charref()} to determine what is
handled by default. It is intended to be overridden by a derived
class; the base class implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods of the following form to define
processing of specific tags. Tag names in the input stream are
case-insensitive; the \var{tag} occurring in method names must be in
lower case:

\begin{methoddescni}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}. It has
preference over \method{do_\var{tag}()}. The
\var{attributes} argument has the same meaning as described for
\method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{do_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag} that does
not come with a matching closing tag. The \var{attributes} argument
has the same meaning as described for \method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{methoddescni}

Note that the parser maintains a stack of open elements for which no
end tag has been found yet. Only tags processed by
\method{start_\var{tag}()} are pushed on this stack. Definition of an
\method{end_\var{tag}()} method is optional for these tags. For tags
processed by \method{do_\var{tag}()} or by \method{unknown_starttag()},
no \method{end_\var{tag}()} method needs to be defined; if one is
defined, it will not be used. If both \method{start_\var{tag}()} and
\method{do_\var{tag}()} methods exist for a tag, the
\method{start_\var{tag}()} method takes precedence.
219\method{start_\var{tag}()} method takes precedence.