\section{\module{sgmllib} ---
         Simple SGML parser}

\declaremodule{standard}{sgmllib}
\modulesynopsis{Only as much of an SGML parser as needed to parse HTML.}

\index{SGML}

This module defines a class \class{SGMLParser} which serves as the
basis for parsing text files formatted in SGML (Standard Generalized
Mark-up Language). It does not provide a full SGML parser: it parses
SGML only insofar as it is used by HTML, and the module exists only
as a base for the \refmodule{htmllib}\refstmodindex{htmllib}
module.

\begin{classdesc}{SGMLParser}{}
The \class{SGMLParser} class is instantiated without arguments.
The parser is hardcoded to recognize the following
constructs:

\begin{itemize}
\item
Opening and closing tags of the form
\samp{<\var{tag} \var{attr}="\var{value}" ...>} and
\samp{</\var{tag}>}, respectively.

\item
Numeric character references of the form \samp{\&\#\var{name};}.

\item
Entity references of the form \samp{\&\var{name};}.

\item
SGML comments of the form \samp{<!--\var{text}-->}. Note that
spaces, tabs, and newlines are allowed between the trailing
\samp{>} and the immediately preceding \samp{--}.

\end{itemize}
\end{classdesc}

\class{SGMLParser} instances have the following interface methods:


\begin{methoddesc}{reset}{}
Reset the instance. Loses all unprocessed data. This is called
implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{setnomoretags}{}
Stop processing tags. Treat all following input as literal input
(CDATA). (This is only provided so the HTML tag
\code{<PLAINTEXT>} can be implemented.)
\end{methoddesc}

\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser. It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}

\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class \method{close()}.
\end{methoddesc}

\begin{methoddesc}{get_starttag_text}{}
Return the text of the most recently opened start tag. This should
not normally be needed for structured processing, but may be useful in
dealing with HTML ``as deployed'' or for re-generating input with
minimal changes (whitespace between attributes can be preserved,
etc.).
\end{methoddesc}

\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which either a
\method{start_\var{tag}()} or \method{do_\var{tag}()} method has been
defined. The \var{tag} argument is the name of the tag converted to
lower case, and the \var{method} argument is the bound method which
should be used to support semantic interpretation of the start tag.
The \var{attributes} argument is a list of \code{(\var{name},
\var{value})} pairs containing the attributes found inside the tag's
\code{<>} brackets. The \var{name} has been translated to lower case
and double quotes and backslashes in the \var{value} have been interpreted.
For instance, for the tag \code{<A HREF="http://www.cwi.nl/">}, and
assuming a \method{start_a()} or \method{do_a()} method is defined, this
method would be called as \samp{handle_starttag('a', \var{method},
[('href', 'http://www.cwi.nl/')])}. The base implementation simply calls
\var{method} with \var{attributes} as the only argument.
\end{methoddesc}

\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an
\method{end_\var{tag}()} method has been defined. The
\var{tag} argument is the name of the tag converted to lower case, and
the \var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag. If no
\method{end_\var{tag}()} method is defined for the closing element,
this handler is not called. The base implementation simply calls
\var{method}.
\end{methoddesc}

\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}

\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}. In the base implementation, \var{ref} must
be a decimal number in the
range 0-255. It translates the character to \ASCII{} and calls the
method \method{handle_data()} with the character as argument. If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error. A
subclass must override this method to provide support for named
character entities.
\end{methoddesc}

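
The rule described for \method{handle_charref()} can be illustrated
with a small stand-alone sketch. It does not import \module{sgmllib}
and is not the module's implementation; the function name
\code{resolve_charref} and its callback parameters are hypothetical,
chosen only for this illustration.

```python
# Stand-alone sketch; does not import sgmllib.
# resolve_charref and its callback parameters are hypothetical names.
def resolve_charref(ref, handle_data, unknown_charref):
    """Dispatch a numeric character reference by the documented rule."""
    try:
        number = int(ref)         # only decimal numbers are accepted
    except ValueError:
        unknown_charref(ref)      # not a number at all
        return
    if not 0 <= number <= 255:
        unknown_charref(ref)      # outside the supported range
        return
    handle_data(chr(number))      # pass the character on as data


handled, rejected = [], []
for ref in ("65", "233", "300", "x41"):
    resolve_charref(ref, handled.append, rejected.append)
```

Here \code{handled} collects the characters for references 65 and 233,
while \code{rejected} collects the out-of-range and non-decimal
references unchanged.
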
\begin{methoddesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the
form \samp{\&\var{ref};} where \var{ref} is a general entity
reference. It looks for \var{ref} in the instance (or class)
variable \member{entitydefs} which should be a mapping from entity
names to corresponding translations. If a translation is found, it
calls the method \method{handle_data()} with the translation;
otherwise, it calls the method \code{unknown_entityref(\var{ref})}.
The default \member{entitydefs} defines translations for
\code{\&amp;}, \code{\&apos;}, \code{\&gt;}, \code{\&lt;}, and
\code{\&quot;}.
\end{methoddesc}

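
The lookup performed by \method{handle_entityref()} can be sketched as
follows. This is a stand-alone illustration that does not import
\module{sgmllib}; the names \code{DEFAULT_ENTITYDEFS} and
\code{resolve_entityref} are hypothetical, and the mapping values are
assumed to be the characters the five entities conventionally stand for.

```python
# Stand-alone sketch; does not import sgmllib.
# DEFAULT_ENTITYDEFS and resolve_entityref are hypothetical names; the
# translations are assumed to be the conventional characters.
DEFAULT_ENTITYDEFS = {
    "amp": "&",
    "apos": "'",
    "gt": ">",
    "lt": "<",
    "quot": '"',
}


def resolve_entityref(ref, handle_data, unknown_entityref,
                      entitydefs=DEFAULT_ENTITYDEFS):
    """Translate a known entity name, or report an unknown one."""
    if ref in entitydefs:
        handle_data(entitydefs[ref])
    else:
        unknown_entityref(ref)


translated, unknown = [], []
for name in ("lt", "amp", "nbsp"):
    resolve_entityref(name, translated.append, unknown.append)

# A subclass-style extension: supply a larger mapping.
extended = dict(DEFAULT_ENTITYDEFS, nbsp="\xa0")
resolve_entityref("nbsp", translated.append, unknown.append, extended)
```

Passing a larger mapping, as in the last two lines, mirrors how a
derived class could extend \member{entitydefs} to recognize more names.
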
\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered. The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves. For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing.
\end{methoddesc}

\begin{methoddesc}{report_unbalanced}{tag}
This method is called when an end tag is found which does not
correspond to any open element.
\end{methoddesc}

\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references. Refer to \method{handle_charref()} to determine what is
handled by default. It is intended to be overridden by a derived
class; the base class implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods of the following form to define
processing of specific tags. Tag names in the input stream are case
independent; the \var{tag} occurring in method names must be in lower
case:

\begin{methoddescni}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}. It has
preference over \method{do_\var{tag}()}. The
\var{attributes} argument has the same meaning as described for
\method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{do_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag} that does
not come with a matching closing tag. The \var{attributes} argument
has the same meaning as described for \method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{methoddescni}

Note that the parser maintains a stack of open elements for which no
end tag has been found yet. Only tags processed by
\method{start_\var{tag}()} are pushed on this stack. Definition of an
\method{end_\var{tag}()} method is optional for these tags. For tags
processed by \method{do_\var{tag}()} or by \method{unknown_starttag()},
no \method{end_\var{tag}()} method needs to be defined; if one is
defined, it will not be used. If both \method{start_\var{tag}()} and
\method{do_\var{tag}()} methods exist for a tag, the
\method{start_\var{tag}()} method takes precedence.
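
The lookup order for start tags described above can be sketched with a
small stand-alone class. It does not import \module{sgmllib} and only
illustrates the precedence and stack rules, not the module's actual
code; \code{TagDispatchSketch}, \code{dispatch_starttag} and the
\code{Demo} subclass are hypothetical names introduced for the example.

```python
# Stand-alone sketch; does not import sgmllib.  All class and method
# names except start_<tag>, do_<tag> and unknown_starttag are hypothetical.
class TagDispatchSketch:
    """Illustrates the start-tag lookup order only."""

    def __init__(self):
        self.stack = []      # open elements handled by start_<tag>()
        self.unknown = []    # tags for which no handler was found

    def unknown_starttag(self, tag, attributes):
        self.unknown.append(tag)

    def dispatch_starttag(self, tag, attributes):
        tag = tag.lower()    # method names use the lower-case tag
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            self.stack.append(tag)   # only start_<tag>() tags are pushed
            method(attributes)
            return
        method = getattr(self, "do_" + tag, None)
        if method is not None:
            method(attributes)       # no closing tag expected, no push
            return
        self.unknown_starttag(tag, attributes)


class Demo(TagDispatchSketch):
    def __init__(self):
        super().__init__()
        self.calls = []

    def start_a(self, attributes):
        self.calls.append(("start_a", attributes))

    def do_a(self, attributes):      # never reached: start_a() wins
        self.calls.append(("do_a", attributes))

    def do_br(self, attributes):
        self.calls.append(("do_br", attributes))


demo = Demo()
demo.dispatch_starttag("A", [("href", "http://www.cwi.nl/")])
demo.dispatch_starttag("BR", [])
demo.dispatch_starttag("blink", [])
```

After these three calls, only \code{'a'} is on the stack, the
\code{do_a()} method has not been called, and \code{'blink'} has been
routed to \method{unknown_starttag()}.
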