\section{Standard Module \module{sgmllib}}
\declaremodule{standard}{sgmllib}

\modulesynopsis{Only as much of an SGML parser as needed to parse HTML.}

\index{SGML}

This module defines a class \class{SGMLParser} which serves as the
basis for parsing text files formatted in SGML (Standard Generalized
Markup Language). It does not provide a full SGML parser, however: it
only parses SGML insofar as it is used by HTML, and the module
only exists as a base for the \module{htmllib}\refstmodindex{htmllib}
module.


\begin{classdesc}{SGMLParser}{}
The \class{SGMLParser} class is instantiated without arguments.
The parser is hardcoded to recognize the following
constructs:

\begin{itemize}
\item
Opening and closing tags of the form
\samp{<\var{tag} \var{attr}="\var{value}" ...>} and
\samp{</\var{tag}>}, respectively.

\item
Numeric character references of the form \samp{\&\#\var{number};}.

\item
Entity references of the form \samp{\&\var{name};}.

\item
SGML comments of the form \samp{<!--\var{text}-->}. Note that
spaces, tabs, and newlines are allowed between the trailing
\samp{>} and the immediately preceding \samp{--}.

\end{itemize}
\end{classdesc}

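The constructs listed above can be pictured with a handful of regular
expressions. The following is only a minimal sketch: the patterns and
the \code{classify()} helper are assumptions made for illustration and
are not the expressions used inside the module.

```python
import re

# Illustrative patterns only (assumptions for this sketch); they are
# not taken from the sgmllib source.
PATTERNS = [
    # Comment: whitespace is allowed between the closing "--" and ">".
    ("comment", re.compile(r"<!--(.*?)--\s*>", re.DOTALL)),
    ("end tag", re.compile(r"</([A-Za-z][-.A-Za-z0-9]*)\s*>")),
    ("start tag", re.compile(r"<([A-Za-z][-.A-Za-z0-9]*)(\s[^>]*)?>")),
    ("charref", re.compile(r"&#([0-9]+);")),
    ("entityref", re.compile(r"&([A-Za-z][-.A-Za-z0-9]*);")),
]


def classify(text):
    """Return (kind, captured name or text) for one complete construct."""
    for kind, pattern in PATTERNS:
        match = pattern.match(text)
        # Accept the pattern only if it consumes the whole string.
        if match and match.end() == len(text):
            return kind, match.group(1)
    return "data", text


samples = ['<A HREF="http://www.cwi.nl/">', "</A>", "&#65;", "&amp;",
           "<!--text-- >", "plain text"]
results = [classify(sample) for sample in samples]
```

Anything that matches none of the patterns is treated as ordinary data
in this sketch.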
\class{SGMLParser} instances have the following interface methods:


\begin{methoddesc}{reset}{}
Reset the instance. Loses all unprocessed data. This is called
implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{setnomoretags}{}
Stop processing tags. Treat all following input as literal input
(CDATA). (This is only provided so the HTML tag \code{<PLAINTEXT>}
can be implemented.)
\end{methoddesc}

\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser. It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}

\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class \method{close()}.
\end{methoddesc}

\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which either a
\code{start_\var{tag}()} or \code{do_\var{tag}()} method has been
defined. The \var{tag} argument is the name of the tag converted to
lower case, and the \var{method} argument is the bound method which
should be used to support semantic interpretation of the start tag.
The \var{attributes} argument is a list of \code{(\var{name}, \var{value})}
pairs containing the attributes found inside the tag's \code{<>}
brackets. The \var{name} has been translated to lower case and double
quotes and backslashes in the \var{value} have been interpreted. For
instance, for the tag \code{<A HREF="http://www.cwi.nl/">}, \var{tag}
would be \code{'a'} and \var{attributes} would be
\code{[('href', 'http://www.cwi.nl/')]}. The base implementation simply calls
\var{method} with \var{attributes} as the only argument.
\end{methoddesc}

\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an
\code{end_\var{tag}()} method has been defined. The \var{tag}
argument is the name of the tag converted to lower case, and the
\var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag. If no
\code{end_\var{tag}()} method is defined for the closing element,
this handler is not called. The base implementation simply calls
\var{method}.
\end{methoddesc}

\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}

\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}. In the base implementation, \var{ref} must
be a decimal number in the range 0-255. It converts the number to the
corresponding character and calls the
method \method{handle_data()} with the character as argument. If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error. A
subclass must override this method to provide support for named
character entities.
\end{methoddesc}

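The rule for numeric character references can be modelled in a few
lines. This is a toy sketch of the behaviour described above, not the
module's code; the function name \code{charref_sketch()} and the two
callback parameters are hypothetical stand-ins for the parser's
\method{handle_data()} and \method{unknown_charref()} methods.

```python
def charref_sketch(ref, handle_data, unknown_charref):
    """Toy model: decimal references in 0-255 become one character."""
    try:
        number = int(ref)
    except ValueError:
        # Not a decimal number at all.
        unknown_charref(ref)
        return
    if not 0 <= number <= 255:
        # A number, but outside the supported range.
        unknown_charref(ref)
        return
    handle_data(chr(number))


data, unknown = [], []
for ref in ["65", "300", "x41"]:
    charref_sketch(ref, data.append, unknown.append)
```

In this run only \code{"65"} reaches the data callback; the other two
references are passed to the error callback unchanged.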
\begin{methoddesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the
form \samp{\&\var{ref};} where \var{ref} is the name of a general
entity. It looks for \var{ref} in the instance (or class)
variable \member{entitydefs} which should be a mapping from entity
names to corresponding translations.
If a translation is found, it calls the method \method{handle_data()}
with the translation; otherwise, it calls the method
\code{unknown_entityref(\var{ref})}. The default \member{entitydefs}
defines translations for \code{\&amp;}, \code{\&apos;}, \code{\&gt;},
\code{\&lt;}, and \code{\&quot;}.
\end{methoddesc}

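The entity lookup amounts to a dictionary test. The sketch below
assumes a mapping with the five default names listed above; the names
\code{ENTITYDEFS} and \code{entityref_sketch()} and the callback
parameters are hypothetical and chosen only for this illustration.

```python
# The five default entity names from the text, with their usual
# one-character translations.
ENTITYDEFS = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}


def entityref_sketch(ref, handle_data, unknown_entityref,
                     entitydefs=ENTITYDEFS):
    """Toy model: translate known entity names, report the rest."""
    if ref in entitydefs:
        handle_data(entitydefs[ref])
    else:
        unknown_entityref(ref)


data, unknown = [], []
for ref in ["amp", "lt", "nbsp"]:
    entityref_sketch(ref, data.append, unknown.append)
```

A derived class that needs more entities would supply a larger mapping
rather than change the lookup itself.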
\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered. The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves. For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing.
\end{methoddesc}

\begin{methoddesc}{report_unbalanced}{tag}
This method is called when an end tag is found which does not
correspond to any open element.
\end{methoddesc}

\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references. Refer to \method{handle_charref()} to determine what is
handled by default. It is intended to be overridden by a derived
class; the base class implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods of the following form to define
processing of specific tags. Tag names in the input stream are case
independent; the \var{tag} occurring in method names must be in lower
case:

\begin{methoddescni}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}. It takes
precedence over \code{do_\var{tag}()}. The \var{attributes}
argument has the same meaning as described for
\method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{do_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag} that does
not come with a matching closing tag. The \var{attributes} argument
has the same meaning as described for \method{handle_starttag()} above.
\end{methoddescni}

\begin{methoddescni}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{methoddescni}

Note that the parser maintains a stack of open elements for which no
end tag has been found yet. Only tags processed by
\code{start_\var{tag}()} are pushed on this stack. Definition of an
\code{end_\var{tag}()} method is optional for these tags. For tags
processed by \code{do_\var{tag}()} or by \method{unknown_starttag()}, no
\code{end_\var{tag}()} method needs to be defined; if one is defined, it
will not be used. If both \code{start_\var{tag}()} and \code{do_\var{tag}()}
methods exist for a tag, the \code{start_\var{tag}()} method takes
precedence.
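
The lookup order and the element stack described above can be modelled
without the parser itself. The class below is a toy sketch under
stated assumptions: \code{DispatchSketch}, \code{open_tag()},
\code{close_tag()} and the \code{log} list are hypothetical names, and
the sketch records every unmatched end tag as unbalanced and silently
drops elements left open inside the one being closed.

```python
class DispatchSketch(object):
    """Toy model of the start_/do_/end_ lookup; not the real parser."""

    def __init__(self):
        self.stack = []  # open elements that start_<tag>() handled
        self.log = []    # record of calls, for demonstration only

    def open_tag(self, tag, attributes):
        tag = tag.lower()  # tag names in the input are case independent
        method = getattr(self, "start_" + tag, None)
        if method is not None:  # start_<tag>() takes precedence
            self.stack.append(tag)
            method(attributes)
            return
        method = getattr(self, "do_" + tag, None)
        if method is not None:  # handled, but never pushed on the stack
            method(attributes)
            return
        self.log.append(("unknown_starttag", tag, attributes))

    def close_tag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            self.log.append(("report_unbalanced", tag))
            return
        while self.stack.pop() != tag:
            pass  # simplification: drop elements opened inside this one
        method = getattr(self, "end_" + tag, None)
        if method is not None:  # end_<tag>() is optional
            method()


class Demo(DispatchSketch):
    def start_a(self, attributes):
        self.log.append(("start_a", attributes))

    def end_a(self):
        self.log.append(("end_a",))

    def do_br(self, attributes):
        self.log.append(("do_br", attributes))

    def end_br(self):  # defined, but never used: <br> is not stacked
        self.log.append(("end_br",))


demo = Demo()
demo.open_tag("A", [("href", "http://www.cwi.nl/")])
demo.open_tag("BR", [])
demo.close_tag("br")
demo.close_tag("A")
calls = [entry[0] for entry in demo.log]
```

Because \code{br} is handled by a \code{do_} method it is never pushed,
so its end tag is reported as unbalanced and \code{end_br()} is never
called in this model.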