\section{\module{xmllib} ---
         A parser for XML documents.}
% Author: Sjoerd Mullender
\declaremodule{standard}{xmllib}

\modulesynopsis{A parser for XML documents.}

\index{XML}

This module defines a class \class{XMLParser} which serves as the basis
for parsing text files formatted in XML (Extensible Markup Language).

\begin{classdesc}{XMLParser}{}
The \class{XMLParser} class must be instantiated without arguments.
\end{classdesc}

This class provides the following interface methods:

\begin{methoddesc}{reset}{}
Reset the instance, discarding all unprocessed data. This is called
implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{setnomoretags}{}
Stop processing tags. Treat all following input as literal input
(CDATA).
\end{methoddesc}

\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode). This mode is automatically exited
when the close tag matching the last unclosed open tag is encountered.
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser. It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}

\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class \method{close()}.
\end{methoddesc}

\begin{methoddesc}{translate_references}{data}
Translate all entity and character references in \var{data} and
return the translated string.
\end{methoddesc}

\begin{methoddesc}{handle_xml}{encoding, standalone}
This method is called when the \samp{<?xml ...?>} tag is processed.
The arguments are the values of the encoding and standalone attributes
in the tag. Both encoding and standalone are optional. The values
passed to \method{handle_xml()} default to \code{None} and the string
\code{'no'} respectively.
\end{methoddesc}

\begin{methoddesc}{handle_doctype}{tag, data}
This method is called when the \samp{<!DOCTYPE...>} tag is processed.
The arguments are the name of the root element and the uninterpreted
contents of the tag, starting after the white space after the name of
the root element.
\end{methoddesc}

\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which a
\method{start_\var{tag}()} method has been defined. The \var{tag}
argument is the name of the tag, and the \var{method} argument is the
bound method which should be used to support semantic interpretation
of the start tag. The \var{attributes} argument is a dictionary of
attributes, the key being the \var{name} and the value being the
\var{value} of the attribute found inside the tag's \code{<>} brackets.
Character and entity references in the \var{value} have
been interpreted. For instance, for the tag
\code{<A HREF="http://www.cwi.nl/">}, this method would be called as
\code{handle_starttag('A', self.start_A, \{'HREF': 'http://www.cwi.nl/'\})}.
The base implementation simply calls \var{method} with \var{attributes}
as the only argument.
\end{methoddesc}
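The calling convention of \method{handle_starttag()} can be illustrated
without the parser itself. The following is a minimal standalone sketch,
not the \module{xmllib} source: \class{SketchParser} and
\class{LinkCollector} are hypothetical stand-ins that reproduce only the
documented behavior (call \var{method} with \var{attributes}), and the
override shows one way a subclass might extend it.

```python
# Standalone sketch: SketchParser is a hypothetical stand-in for XMLParser
# that reproduces only the documented handle_starttag() convention.

class SketchParser:
    def handle_starttag(self, tag, method, attributes):
        # Documented base behavior: call the bound method with the
        # attribute dictionary as its only argument.
        method(attributes)


class LinkCollector(SketchParser):
    def __init__(self):
        self.seen = []     # tag names observed by the override
        self.links = []    # HREF values collected by start_A()

    def handle_starttag(self, tag, method, attributes):
        # Extend rather than replace: record the tag, then delegate.
        self.seen.append(tag)
        SketchParser.handle_starttag(self, tag, method, attributes)

    def start_A(self, attributes):
        self.links.append(attributes.get('HREF'))


collector = LinkCollector()
# Mirrors the call given in the text for <A HREF="http://www.cwi.nl/">.
collector.handle_starttag('A', collector.start_A,
                          {'HREF': 'http://www.cwi.nl/'})
```

Delegating to the base method keeps the per-tag handler in charge of the
semantics while the override only observes the call.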

\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an
\method{end_\var{tag}()} method has been defined. The \var{tag}
argument is the name of the tag, and the
\var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag. If no
\method{end_\var{tag}()} method is defined for the closing element, this
handler is not called. The base implementation simply calls
\var{method}.
\end{methoddesc}

\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}

\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}. \var{ref} can either be a decimal number,
or a hexadecimal number when preceded by an \character{x}.
In the base implementation, \var{ref} must be a number in the
range 0-255. It translates the character to \ASCII{} and calls the
method \method{handle_data()} with the character as argument. If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error. A
subclass must override this method to provide support for character
references outside of the \ASCII{} range.
\end{methoddesc}
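The rules for \method{handle_charref()} can be sketched as follows. This is
an illustration written from the description above, not the
\module{xmllib} source; the class name \class{CharrefSketch} and its
recording lists are assumptions made for the example.

```python
# Standalone sketch of the documented handle_charref() rules.
# CharrefSketch is a hypothetical class, not part of xmllib.

class CharrefSketch:
    def __init__(self):
        self.data = []       # characters passed to handle_data()
        self.unknown = []    # references passed to unknown_charref()

    def handle_data(self, data):
        self.data.append(data)

    def unknown_charref(self, ref):
        self.unknown.append(ref)

    def handle_charref(self, ref):
        try:
            if ref[:1] == 'x':
                n = int(ref[1:], 16)   # hexadecimal form, e.g. 'x42'
            else:
                n = int(ref)           # decimal form, e.g. '65'
        except ValueError:
            # Not a number at all: report it as unknown.
            self.unknown_charref(ref)
            return
        if not 0 <= n <= 255:
            # Outside the range the base implementation accepts.
            self.unknown_charref(ref)
            return
        self.handle_data(chr(n))


p = CharrefSketch()
for ref in ('65', 'x42', '8364', 'bogus'):
    p.handle_charref(ref)
```

A subclass that needs references above 255 would replace the range check
with its own translation.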

\begin{methoddesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the
form \samp{\&\var{ref};} where \var{ref} is the name of a general
entity. It looks for \var{ref} in the instance (or class)
variable \member{entitydefs} which should be a mapping from entity
names to corresponding translations.
If a translation is found, it calls the method \method{handle_data()}
with the translation; otherwise, it calls the method
\code{unknown_entityref(\var{ref})}. The default \member{entitydefs}
defines translations for \code{\&amp;}, \code{\&apos;}, \code{\&gt;},
\code{\&lt;}, and \code{\&quot;}.
\end{methoddesc}
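The lookup performed by \method{handle_entityref()} can be sketched as
below. This is a standalone illustration, not the \module{xmllib} source.
The class names are hypothetical, and the table contents are an assumption:
the five predefined names are mapped directly to their characters here,
which may differ from the values the module actually stores.

```python
# Standalone sketch of the documented entitydefs lookup.
# EntitySketch and WithCopyright are hypothetical classes.

class EntitySketch:
    # Assumed table: predefined entity names mapped to literal characters.
    entitydefs = {'amp': '&', 'apos': "'", 'gt': '>', 'lt': '<', 'quot': '"'}

    def __init__(self):
        self.data = []
        self.unknown = []

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)

    def handle_entityref(self, ref):
        # Instance attribute lookup falls back to the class variable.
        table = self.entitydefs
        if ref in table:
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)


class WithCopyright(EntitySketch):
    # Extend the table in a subclass without mutating the base mapping.
    entitydefs = dict(EntitySketch.entitydefs, copy='(c)')


base = EntitySketch()
base.handle_entityref('lt')
base.handle_entityref('copy')

extended = WithCopyright()
extended.handle_entityref('copy')
```

Copying the mapping in the subclass avoids changing the translations seen
by other parsers that share the base class variable.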

\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered. The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves. For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing.
\end{methoddesc}

\begin{methoddesc}{handle_cdata}{data}
This method is called when a CDATA section is encountered. The
\var{data} argument is a string containing the text between the
\samp{<![CDATA[} and \samp{]]>} delimiters, but not the delimiters
themselves. For example, the section \samp{<![CDATA[text]]>} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing, and is intended to be overridden.
\end{methoddesc}

\begin{methoddesc}{handle_proc}{name, data}
This method is called when a processing instruction (PI) is
encountered. The \var{name} is the PI target, and the \var{data}
argument is a string containing the text between the PI target and the
closing delimiter, but not the delimiter itself. For example, the
instruction \samp{<?XML text?>} will cause this method to be called
with the arguments \code{'XML'} and \code{'text'}. The default method
does nothing. Note that if a document starts with \samp{<?xml
...?>}, \method{handle_xml()} is called to handle it.
\end{methoddesc}

\begin{methoddesc}{handle_special}{data}
This method is called when a declaration is encountered. The
\var{data} argument is a string containing the text between the
\samp{<!} and \samp{>} delimiters, but not the delimiters
themselves. For example, the declaration \samp{<!ENTITY text>} will
cause this method to be called with the argument \code{'ENTITY text'}. The
default method does nothing. Note that \samp{<!DOCTYPE ...>} is
handled separately if it is located at the start of the document.
\end{methoddesc}

\begin{methoddesc}{syntax_error}{message}
This method is called when a syntax error is encountered. The
\var{message} is a description of what was wrong. The default method
raises a \exception{RuntimeError} exception. If this method is
overridden, it is permissible for it to return. This method is only
called when the error can be recovered from. Unrecoverable errors
raise a \exception{RuntimeError} without first calling
\method{syntax_error()}.
\end{methoddesc}
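The difference between the default and an overriding
\method{syntax_error()} can be sketched as follows. The classes
\class{StrictSketch} and \class{LenientSketch} are hypothetical stand-ins
written from the description above, not the \module{xmllib} source, and the
message strings are invented for the example.

```python
# Standalone sketch: the default syntax_error() raises, an override may
# record the message and return so that processing can continue.

class StrictSketch:
    def syntax_error(self, message):
        # Documented default: raise RuntimeError.
        raise RuntimeError('Syntax error: ' + message)


class LenientSketch(StrictSketch):
    def __init__(self):
        self.errors = []

    def syntax_error(self, message):
        # Returning normally is allowed for recoverable errors.
        self.errors.append(message)


lenient = LenientSketch()
lenient.syntax_error('attribute value not quoted')

strict_raised = False
try:
    StrictSketch().syntax_error('attribute value not quoted')
except RuntimeError:
    strict_raised = True
```

Because unrecoverable errors raise \exception{RuntimeError} directly,
callers of a lenient parser should still be prepared to catch that
exception.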

\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references. It is intended to be overridden by a derived class; the
base class implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods and variables of the following form to
define processing of specific tags. Tag names in the input stream are
case-sensitive; the \var{tag} occurring in method names must be in the
correct case:

\begin{methoddescni}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}. The
\var{attributes} argument has the same meaning as described for
\method{handle_starttag()} above. In fact, the base implementation of
\method{handle_starttag()} calls this method.
\end{methoddescni}

\begin{methoddescni}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{methoddescni}
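One way to picture how a tag name selects a \method{start_\var{tag}()}
method is the name-based lookup sketched below. This is an assumption
about the mechanism, written for illustration only: \class{DispatchSketch},
\class{TitleParser} and the helper \method{dispatch_start()} are
hypothetical and do not appear in \module{xmllib}.

```python
# Standalone sketch of name-based, case-sensitive tag dispatch.
# dispatch_start() is a hypothetical helper, not an xmllib method.

class DispatchSketch:
    def __init__(self):
        self.events = []

    def handle_starttag(self, tag, method, attributes):
        method(attributes)

    def unknown_starttag(self, tag, attributes):
        self.events.append(('unknown', tag))

    def dispatch_start(self, tag, attributes):
        # Look for a method named exactly 'start_' + tag; the comparison
        # is case-sensitive because attribute names are.
        method = getattr(self, 'start_' + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)


class TitleParser(DispatchSketch):
    def start_title(self, attributes):
        self.events.append(('start', 'title'))


p = TitleParser()
p.dispatch_start('title', {})   # matches start_title
p.dispatch_start('TITLE', {})   # different case, so no match
```

The second call falls through to \method{unknown_starttag()} because only
\method{start_title()} is defined.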

\begin{memberdescni}{\var{tag}_attributes}
If a class or instance variable \member{\var{tag}_attributes} exists, it
should be a list or a dictionary. If a list, the elements of the list
are the valid attributes for the element \var{tag}; if a dictionary,
the keys are the valid attributes for the element \var{tag}, and the
values are the default values of the attributes, or \code{None} if there
is no default.
In addition to the attributes that were present in the tag, the
attribute dictionary that is passed to \method{handle_starttag()} and
\method{unknown_starttag()} contains values for all attributes that
have a default value.
\end{memberdescni}
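The effect of attribute defaults on the dictionary handed to the start tag
handlers can be sketched as below. The function
\function{merge_attributes()} and the \code{img} attribute table are
hypothetical, written from the description above rather than taken from
\module{xmllib}.

```python
# Standalone sketch of how declared defaults could be merged into the
# attributes found in a tag. merge_attributes() is a hypothetical helper.

def merge_attributes(declared, present):
    """Return the attributes present in the tag plus any declared defaults.

    `declared` is a list of valid names or a dict mapping names to default
    values (None meaning "no default"); `present` holds the attributes that
    actually appeared in the tag.
    """
    merged = dict(present)
    if isinstance(declared, dict):
        for name, default in declared.items():
            # Only fill in attributes that are missing and have a default.
            if default is not None and name not in merged:
                merged[name] = default
    return merged


# Dictionary form: 'align' has a default, 'src' does not.
img_attributes = {'src': None, 'align': 'left'}
with_defaults = merge_attributes(img_attributes, {'src': 'logo.png'})

# List form: names only, so nothing is added.
list_form = merge_attributes(['src', 'align'], {'src': 'logo.png'})
```

An attribute given explicitly in the tag always takes precedence over its
declared default in this sketch.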