\section{\module{sgmllib} ---
         Simple SGML parser}

\declaremodule{standard}{sgmllib}
\modulesynopsis{Only as much of an SGML parser as needed to parse HTML.}

\index{SGML}

This module defines a class \class{SGMLParser} which serves as the
basis for parsing text files formatted in SGML (Standard Generalized
Mark-up Language). It does not provide a full SGML parser; it only
parses SGML insofar as it is used by HTML, and the module
exists only as a base for the \refmodule{htmllib} module. Another
HTML parser which supports XHTML and offers a somewhat different
interface is available in the \refmodule{HTMLParser} module.

\begin{classdesc}{SGMLParser}{}
The \class{SGMLParser} class is instantiated without arguments.
The parser is hardcoded to recognize the following
constructs:

\begin{itemize}
\item
Opening and closing tags of the form
\samp{<\var{tag} \var{attr}="\var{value}" ...>} and
\samp{</\var{tag}>}, respectively.

\item
Numeric character references of the form \samp{\&\#\var{name};}.

\item
Entity references of the form \samp{\&\var{name};}.

\item
SGML comments of the form \samp{<!--\var{text}-->}. Note that
spaces, tabs, and newlines are allowed between the trailing
\samp{>} and the immediately preceding \samp{--}.

\end{itemize}
\end{classdesc}

A single exception is defined as well:

\begin{excdesc}{SGMLParseError}
Exception raised by the \class{SGMLParser} class when it encounters an
error while parsing.
\versionadded{2.1}
\end{excdesc}


\class{SGMLParser} instances have the following methods:


\begin{methoddesc}{reset}{}
Reset the instance. Loses all unprocessed data. This is called
implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{setnomoretags}{}
Stop processing tags. Treat all following input as literal input
(CDATA). (This is only provided so the HTML tag
\code{<PLAINTEXT>} can be implemented.)
\end{methoddesc}

\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser. It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}
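
The buffering behaviour of \method{feed()} can be pictured with a small
stand-alone model. This is only a sketch under simplifying assumptions,
not the module's implementation: the class \class{ChunkBuffer} and its
\member{events} list are hypothetical names introduced here, and a
``complete element'' is taken to be any text up to the next \samp{<} or
any \samp{<...>} run.

```python
class ChunkBuffer:
    """Toy model of incremental feeding; not sgmllib's actual code."""

    def __init__(self):
        self.rawdata = ""  # unprocessed input carried over between calls
        self.events = []   # complete pieces recognised so far

    def feed(self, data):
        self.rawdata += data
        self._process(final=False)

    def close(self):
        # Treat whatever is still buffered as if end-of-file followed it.
        self._process(final=True)

    def _process(self, final):
        raw = self.rawdata
        pos = 0
        while pos < len(raw):
            lt = raw.find("<", pos)
            if lt == -1:
                # Plain text up to the end of the buffer.
                self.events.append(("data", raw[pos:]))
                pos = len(raw)
                break
            if lt > pos:
                self.events.append(("data", raw[pos:lt]))
                pos = lt
            gt = raw.find(">", pos)
            if gt == -1:
                if final:
                    # No more input will arrive: flush the fragment as data.
                    self.events.append(("data", raw[pos:]))
                    pos = len(raw)
                break  # otherwise keep the incomplete tag buffered
            self.events.append(("tag", raw[pos:gt + 1]))
            pos = gt + 1
        self.rawdata = raw[pos:]
```

In this model, feeding \code{'Hello <a hr'} records only the text before
the tag and keeps \code{'<a hr'} buffered until a later call supplies the
rest of the tag.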

\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class's \method{close()}.
\end{methoddesc}

\begin{methoddesc}{get_starttag_text}{}
Return the text of the most recently opened start tag. This should
not normally be needed for structured processing, but may be useful in
dealing with HTML ``as deployed'' or for re-generating input with
minimal changes (whitespace between attributes can be preserved,
etc.).
\end{methoddesc}

\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which either a
\method{start_\var{tag}()} or \method{do_\var{tag}()} method has been
defined. The \var{tag} argument is the name of the tag converted to
lower case, and the \var{method} argument is the bound method which
should be used to support semantic interpretation of the start tag.
The \var{attributes} argument is a list of \code{(\var{name},
\var{value})} pairs containing the attributes found inside the tag's
\code{<>} brackets.

The \var{name} has been translated to lower case.
Double quotes and backslashes in the \var{value} have been interpreted,
as well as known character references and known entity references
terminated by a semicolon (normally, entity references can be terminated
by any non-alphanumerical character, but this would break the very
common case of \code{<A HREF="url?spam=1\&eggs=2">} when \code{eggs}
is a valid entity name).

For instance, for the tag \code{<A HREF="http://www.cwi.nl/">}, this
method would be called with \code{'a'} as \var{tag} and
\code{[('href', 'http://www.cwi.nl/')]} as \var{attributes}. The base
implementation simply calls
\var{method} with \var{attributes} as the only argument.
\versionadded[Handling of entity and character references within
              attribute values]{2.5}
\end{methoddesc}

\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an
\method{end_\var{tag}()} method has been defined. The
\var{tag} argument is the name of the tag converted to lower case, and
the \var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag. If no
\method{end_\var{tag}()} method is defined for the closing element,
this handler is not called. The base implementation simply calls
\var{method}.
\end{methoddesc}

\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}

\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}. In the base implementation, \var{ref} must
be a decimal number in the
range 0--255. It converts the number to the corresponding character
and calls the
method \method{handle_data()} with the character as argument. If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error. A
subclass must override this method to provide support for named
character entities.
\end{methoddesc}
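
The range check described for \method{handle_charref()} can be sketched
as a stand-alone function. The name \function{resolve_charref} is
hypothetical and not part of the module; a return value of \code{None}
stands for the case in which \method{unknown_charref()} would be called.

```python
def resolve_charref(ref):
    """Return the character for a decimal reference, or None.

    Illustrative model only: None marks the cases that the parser
    would hand to unknown_charref().
    """
    try:
        n = int(ref)  # the reference must be a decimal number
    except ValueError:
        return None
    if not 0 <= n <= 255:
        return None  # outside the range handled by default
    return chr(n)
```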

\begin{methoddesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the
form \samp{\&\var{ref};} where \var{ref} is the name of a general
entity. It looks for \var{ref} in the instance (or class)
variable \member{entitydefs} which should be a mapping from entity
names to corresponding translations. If a translation is found, it
calls the method \method{handle_data()} with the translation;
otherwise, it calls the method \code{unknown_entityref(\var{ref})}.
The default \member{entitydefs} defines translations for
\code{\&amp;}, \code{\&apos;}, \code{\&gt;}, \code{\&lt;}, and
\code{\&quot;}.
\end{methoddesc}
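
The lookup in \member{entitydefs} can be modelled as follows. This is a
sketch, not the module's code: \class{EntityResolver} and
\class{ExtendedResolver} are hypothetical classes that only mimic the
method names described above, and the \code{nbsp} entry is an
illustrative addition. It also shows how a derived class might extend
the table through a class variable.

```python
class EntityResolver:
    """Toy model of the entitydefs lookup; not sgmllib's actual code."""

    # The five translations named in the text above.
    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    def __init__(self):
        self.data = []     # translations passed to handle_data()
        self.unknown = []  # names passed to unknown_entityref()

    def handle_entityref(self, ref):
        table = self.entitydefs  # instance or class variable
        if ref in table:
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)


class ExtendedResolver(EntityResolver):
    # A derived class can supply a larger mapping (assumed example entry).
    entitydefs = dict(EntityResolver.entitydefs, nbsp="\xa0")
```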

\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered. The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves. For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing.
\end{methoddesc}

\begin{methoddesc}{handle_decl}{data}
Method called when an SGML declaration is read by the parser. In
practice, the \code{DOCTYPE} declaration is the only thing observed in
HTML, but the parser does not discriminate among different (or broken)
declarations. Internal subsets in a \code{DOCTYPE} declaration are
not supported. The \var{data} parameter will be the entire contents
of the declaration inside the \code{<!}...\code{>} markup. The
default implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{report_unbalanced}{tag}
This method is called when an end tag is found which does not
correspond to any open element.
\end{methoddesc}

\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references. Refer to \method{handle_charref()} to determine what is
handled by default. It is intended to be overridden by a derived
class; the base class implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods of the following form to define
processing of specific tags. Tag names in the input stream are
case-insensitive; the \var{tag} occurring in method names must be in lower
case:

\begin{methoddescni}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}. It takes
precedence over \method{do_\var{tag}()}. The
\var{attributes} argument has the same meaning as described for
\method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{do_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}
for which no \method{start_\var{tag}()} method is defined.
The \var{attributes} argument
has the same meaning as described for \method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{methoddescni}

Note that the parser maintains a stack of open elements for which no
end tag has been found yet. Only tags processed by
\method{start_\var{tag}()} are pushed on this stack. Definition of an
\method{end_\var{tag}()} method is optional for these tags. For tags
processed by \method{do_\var{tag}()} or by \method{unknown_starttag()}, no
\method{end_\var{tag}()} method needs to be defined; if one is defined, it
will not be used. If both \method{start_\var{tag}()} and
\method{do_\var{tag}()} methods exist for a tag, the
\method{start_\var{tag}()} method takes precedence.
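
The lookup order and the stack rule described above can be sketched with
a small stand-alone model. It is a simplification under stated
assumptions, not the module's code: \class{TagDispatcher},
\class{Demo}, the \member{log} list, and the \code{'unmatched_endtag'}
marker are hypothetical names, and closing every element above the
matching one is an assumption of this sketch.

```python
class TagDispatcher:
    """Toy model of the start_/do_/end_ lookup; not sgmllib's own code."""

    def __init__(self):
        self.stack = []  # open elements that were handled by start_<tag>
        self.log = []    # record of calls, for demonstration only

    def starttag(self, tag, attributes):
        tag = tag.lower()  # method names use the lower-case tag
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            self.stack.append(tag)  # only start_<tag> elements are stacked
            method(attributes)
            return
        method = getattr(self, "do_" + tag, None)
        if method is not None:
            method(attributes)  # handled, but never pushed on the stack
            return
        self.unknown_starttag(tag, attributes)

    def endtag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            # No open element matches, so end_<tag> is not used.
            self.log.append(("unmatched_endtag", tag))
            return
        # Close elements from the top of the stack down to the match.
        while True:
            top = self.stack.pop()
            method = getattr(self, "end_" + top, None)
            if method is not None:
                method()  # end_<tag> is optional
            if top == tag:
                break

    def unknown_starttag(self, tag, attributes):
        self.log.append(("unknown_starttag", tag, attributes))


class Demo(TagDispatcher):
    def start_a(self, attributes):
        self.log.append(("start_a", attributes))

    def end_a(self):
        self.log.append(("end_a",))

    def do_br(self, attributes):
        self.log.append(("do_br", attributes))

    def end_br(self):
        self.log.append(("end_br",))  # never reached: <br> is not stacked
```

In this model, \code{Demo} handles \samp{<A>} through
\method{start_a()} and \method{end_a()}, while \method{end_br()} is
never called because \samp{<BR>} goes through \method{do_br()} and is
therefore not on the stack.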