\section{\module{sgmllib} ---
         Simple SGML parser}

\declaremodule{standard}{sgmllib}
\modulesynopsis{Only as much of an SGML parser as needed to parse HTML.}

\index{SGML}

This module defines a class \class{SGMLParser} which serves as the
basis for parsing text files formatted in SGML (Standard Generalized
Mark-up Language). In fact, it does not provide a full SGML parser;
it only parses SGML insofar as it is used by HTML, and the module
only exists as a base for the \refmodule{htmllib} module. Another
HTML parser which supports XHTML and offers a somewhat different
interface is available in the \refmodule{HTMLParser} module.


\begin{classdesc}{SGMLParser}{}
The \class{SGMLParser} class is instantiated without arguments.
The parser is hardcoded to recognize the following
constructs:

\begin{itemize}
\item
Opening and closing tags of the form
\samp{<\var{tag} \var{attr}="\var{value}" ...>} and
\samp{</\var{tag}>}, respectively.

\item
Numeric character references of the form \samp{\&\#\var{name};}.

\item
Entity references of the form \samp{\&\var{name};}.

\item
SGML comments of the form \samp{<!--\var{text}-->}. Note that
spaces, tabs, and newlines are allowed between the trailing
\samp{>} and the immediately preceding \samp{--}.

\end{itemize}
\end{classdesc}

\class{SGMLParser} instances have the following interface methods:


\begin{methoddesc}{reset}{}
Reset the instance. Loses all unprocessed data. This is called
implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{setnomoretags}{}
Stop processing tags. Treat all following input as literal input
(CDATA). (This is only provided so the HTML tag
\code{<PLAINTEXT>} can be implemented.)
\end{methoddesc}

\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser. It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}

\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class
\method{close()} method.
\end{methoddesc}

\begin{methoddesc}{get_starttag_text}{}
Return the text of the most recently opened start tag. This should
not normally be needed for structured processing, but may be useful in
dealing with HTML ``as deployed'' or for re-generating input with
minimal changes (whitespace between attributes can be preserved,
etc.).
\end{methoddesc}

\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which either a
\method{start_\var{tag}()} or \method{do_\var{tag}()} method has been
defined. The \var{tag} argument is the name of the tag converted to
lower case, and the \var{method} argument is the bound method which
should be used to support semantic interpretation of the start tag.
The \var{attributes} argument is a list of \code{(\var{name},
\var{value})} pairs containing the attributes found inside the tag's
\code{<>} brackets. The \var{name} has been translated to lower case
and double quotes and backslashes in the \var{value} have been interpreted.
For instance, for the tag \code{<A HREF="http://www.cwi.nl/">}, this
method would be called with \var{tag} set to \code{'a'} and
\var{attributes} set to \code{[('href', 'http://www.cwi.nl/')]}.
The base implementation simply calls
\var{method} with \var{attributes} as the only argument.
\end{methoddesc}

\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an
\method{end_\var{tag}()} method has been defined. The
\var{tag} argument is the name of the tag converted to lower case, and
the \var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag. If no
\method{end_\var{tag}()} method is defined for the closing element,
this handler is not called. The base implementation simply calls
\var{method}.
\end{methoddesc}

\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}

\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}. In the base implementation, \var{ref} must
be a decimal number in the
range 0-255. It translates the character to \ASCII{} and calls the
method \method{handle_data()} with the character as argument. If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error. A
subclass must override this method to provide support for named
character entities.
\end{methoddesc}
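
The rule described for \method{handle_charref()} can be modelled in a
few lines. The following is a standalone sketch rather than the
module's source: the class name \class{CharRefModel} and its
\member{events} list are hypothetical, introduced only so that the
outcome can be inspected without importing \module{sgmllib}.

```python
class CharRefModel:
    """Toy model of the documented handle_charref() rule (hypothetical class)."""

    def __init__(self):
        # Records which handler was called and with what argument.
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def unknown_charref(self, ref):
        self.events.append(("unknown_charref", ref))

    def handle_charref(self, ref):
        # ref is the text found between "&#" and ";".
        try:
            number = int(ref)
        except ValueError:
            # Not a decimal number: report it as unresolvable.
            self.unknown_charref(ref)
            return
        if not 0 <= number <= 255:
            # Outside the range accepted by the base implementation.
            self.unknown_charref(ref)
            return
        self.handle_data(chr(number))


model = CharRefModel()
for ref in ("65", "300", "x41"):
    model.handle_charref(ref)
```

In this model only \code{'65'} reaches \method{handle_data()}; the
other two references are passed to \method{unknown_charref()}, which is
where a derived class could add support for a wider range.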

\begin{methoddesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the
form \samp{\&\var{ref};} where \var{ref} is the name of a general
entity. It looks for \var{ref} in the instance (or class)
variable \member{entitydefs} which should be a mapping from entity
names to corresponding translations. If a translation is found, it
calls the method \method{handle_data()} with the translation;
otherwise, it calls the method \code{unknown_entityref(\var{ref})}.
The default \member{entitydefs} defines translations for
\code{\&amp;}, \code{\&apos;}, \code{\&gt;}, \code{\&lt;}, and
\code{\&quot;}.
\end{methoddesc}
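
The lookup performed by \method{handle_entityref()} can be pictured
with the following standalone sketch. It does not import
\module{sgmllib}; the classes \class{EntityRefModel} and
\class{WithNbsp}, the \member{events} list, and the exact contents of
the mapping are assumptions made for illustration, based only on the
five entity names listed above.

```python
class EntityRefModel:
    """Toy model of the documented entitydefs lookup (hypothetical class)."""

    # Assumed translations for the five names mentioned in the text.
    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    def __init__(self):
        # Records which handler was called and with what argument.
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def unknown_entityref(self, ref):
        self.events.append(("unknown_entityref", ref))

    def handle_entityref(self, ref):
        # Attribute lookup finds an instance variable first, then the class one.
        table = self.entitydefs
        if ref in table:
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)


class WithNbsp(EntityRefModel):
    # A derived class can extend the mapping without touching the base class.
    entitydefs = dict(EntityRefModel.entitydefs, nbsp=" ")


base = EntityRefModel()
base.handle_entityref("amp")
base.handle_entityref("nbsp")

extended = WithNbsp()
extended.handle_entityref("nbsp")
```

Here the base model reports \code{nbsp} through
\method{unknown_entityref()}, while the derived class translates it
because its own \member{entitydefs} contains the name.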

\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered. The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves. For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing.
\end{methoddesc}

\begin{methoddesc}{handle_decl}{data}
Method called when an SGML declaration is read by the parser. In
practice, the \code{DOCTYPE} declaration is the only thing observed in
HTML, but the parser does not discriminate among different (or broken)
declarations. Internal subsets in a \code{DOCTYPE} declaration are
not supported. The \var{data} parameter will be the entire contents
of the declaration inside the \code{<!}...\code{>} markup. The
default implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{report_unbalanced}{tag}
This method is called when an end tag is found which does not
correspond to any open element.
\end{methoddesc}

\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references. Refer to \method{handle_charref()} to determine what is
handled by default. It is intended to be overridden by a derived
class; the base class implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods of the following form to define
processing of specific tags. Tag names in the input stream are case
independent; the \var{tag} occurring in method names must be in lower
case:

\begin{methoddescni}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}. It has
preference over \method{do_\var{tag}()}. The
\var{attributes} argument has the same meaning as described for
\method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{do_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag} that does
not come with a matching closing tag. The \var{attributes} argument
has the same meaning as described for \method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{methoddescni}

Note that the parser maintains a stack of open elements for which no
end tag has been found yet. Only tags processed by
\method{start_\var{tag}()} are pushed on this stack. Definition of an
\method{end_\var{tag}()} method is optional for these tags. For tags
processed by \method{do_\var{tag}()} or by \method{unknown_starttag()},
no \method{end_\var{tag}()} method needs to be defined; if one is
defined, it will not be used. If both \method{start_\var{tag}()} and
\method{do_\var{tag}()} methods exist for a tag, the
\method{start_\var{tag}()} method takes precedence.
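
The precedence and stack rules above can be illustrated with a small
standalone model. This is a sketch under stated assumptions, not the
parser's implementation: the names \class{TagDispatchModel},
\class{Demo}, \method{open_tag()}, \method{close_tag()} and
\member{events} are hypothetical, and the handling of end tags is
deliberately simplified (the real parser's recovery rules for
mismatched end tags are not modelled).

```python
class TagDispatchModel:
    """Toy model of start_/do_/end_ method lookup (hypothetical class)."""

    def __init__(self):
        self.stack = []   # open elements that were handled by start_<tag>()
        self.events = []  # record of handler calls, for inspection

    def unknown_starttag(self, tag, attributes):
        self.events.append(("unknown_starttag", tag))

    def report_unbalanced(self, tag):
        self.events.append(("report_unbalanced", tag))

    def handle_starttag(self, tag, method, attributes):
        method(attributes)

    def handle_endtag(self, tag, method):
        method()

    def open_tag(self, tag, attributes):
        tag = tag.lower()  # method names use the lower-case tag name
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            # Only tags handled by start_<tag>() are pushed on the stack.
            self.stack.append(tag)
        else:
            method = getattr(self, "do_" + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)

    def close_tag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            # Simplification: any end tag without an open element is reported.
            self.report_unbalanced(tag)
            return
        while self.stack:
            top = self.stack.pop()
            method = getattr(self, "end_" + top, None)
            if method is not None:  # end_<tag>() is optional
                self.handle_endtag(top, method)
            if top == tag:
                break


class Demo(TagDispatchModel):
    def start_p(self, attributes):
        self.events.append(("start_p", attributes))

    def do_p(self, attributes):
        # Never chosen, because start_p() takes precedence.
        self.events.append(("do_p", attributes))

    def end_p(self):
        self.events.append(("end_p",))

    def do_br(self, attributes):
        self.events.append(("do_br", attributes))


demo = Demo()
demo.open_tag("P", [("align", "left")])
demo.open_tag("BR", [])
stack_while_open = list(demo.stack)
demo.open_tag("blink", [])
demo.close_tag("p")
demo.close_tag("br")
```

In this model only \code{p} is ever placed on the stack, so the final
\code{</br>} is reported as unbalanced even though a
\method{do_br()} method exists.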