\section{\module{xmllib} ---
         A parser for XML documents.}
\declaremodule{standard}{xmllib}
\moduleauthor{Sjoerd Mullender}{Sjoerd.Mullender@cwi.nl}
\sectionauthor{Sjoerd Mullender}{Sjoerd.Mullender@cwi.nl}

\modulesynopsis{A parser for XML documents.}

\index{XML}
\index{Extensible Markup Language}

\versionchanged{1.5.2}

This module defines a class \class{XMLParser} which serves as the basis
for parsing text files formatted in XML (Extensible Markup Language).

\begin{classdesc}{XMLParser}{}
The \class{XMLParser} class must be instantiated without arguments.
\end{classdesc}

This class provides the following interface methods and instance variables:

\begin{memberdesc}{attributes}
A mapping of element names to mappings. The latter mapping maps
attribute names that are valid for the element to the default value of
the attribute, or to \code{None} if there is no default. The default
value is the empty dictionary.
\end{memberdesc}

\begin{memberdesc}{elements}
A mapping of element names to tuples. The tuples contain a function
for handling the start and end tag respectively of the element, or
\code{None} if the method \method{unknown_starttag()} or
\method{unknown_endtag()} is to be called. The default value is the
empty dictionary.
\end{memberdesc}

\begin{memberdesc}{entitydefs}
A mapping of entity names to their values. The default value contains
definitions for \code{'lt'}, \code{'gt'}, \code{'amp'}, \code{'quot'},
and \code{'apos'}.
\end{memberdesc}

\begin{methoddesc}{reset}{}
Reset the instance, discarding all unprocessed data. This is called
implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{setnomoretags}{}
Stop processing tags. Treat all following input as literal input
(CDATA).
\end{methoddesc}

\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode). This mode is automatically exited
when the close tag matching the last unclosed open tag is encountered.
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser. It is processed insofar as it consists
of complete tags; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}

\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class
\method{close()}.
\end{methoddesc}

\begin{methoddesc}{translate_references}{data}
Translate all entity and character references in \var{data} and
return the translated string.
\end{methoddesc}

\begin{methoddesc}{handle_xml}{encoding, standalone}
This method is called when the \samp{<?xml ...?>} tag is processed.
The arguments are the values of the encoding and standalone attributes
in the tag. Both encoding and standalone are optional. The values
passed to \method{handle_xml()} default to \code{None} and the string
\code{'no'} respectively.
\end{methoddesc}

\begin{methoddesc}{handle_doctype}{tag, data}
This method is called when the \samp{<!DOCTYPE...>} tag is processed.
The arguments are the name of the root element and the uninterpreted
contents of the tag, starting after the white space after the name of
the root element.
\end{methoddesc}

\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which a start tag
handler is defined in the instance variable \member{elements}. The
\var{tag} argument is the name of the tag, and the \var{method}
argument is the function (method) which should be used to support semantic
interpretation of the start tag. The \var{attributes} argument is a
dictionary of attributes, the key being the \var{name} and the value
being the \var{value} of the attribute found inside the tag's
\code{<>} brackets. Character and entity references in the
\var{value} have been interpreted. For instance, for the start tag
\code{<A HREF="http://www.cwi.nl/">}, this method would be called as
\code{handle_starttag('A', self.elements['A'][0], \{'HREF': 'http://www.cwi.nl/'\})}.
The base implementation simply calls \var{method} with \var{attributes}
as the only argument.
\end{methoddesc}

\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an end tag handler
is defined in the instance variable \member{elements}. The \var{tag}
argument is the name of the tag, and the \var{method} argument is the
function (method) which should be used to support semantic
interpretation of the end tag. For instance, for the end tag
\code{</A>}, this method would be called as \code{handle_endtag('A',
self.elements['A'][1])}. The base implementation simply calls
\var{method}.
\end{methoddesc}

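The dispatch through \member{elements} described above can be sketched
as follows. This is a minimal illustration and not the module's code:
the class \class{DispatchSketch}, its \method{dispatch_starttag()} and
\method{dispatch_endtag()} methods and the \member{unknown} list are
hypothetical names, and treating a tag that is missing from
\member{elements} like a \code{None} handler is an assumption.

```python
class DispatchSketch:
    """Hypothetical stand-in; it mimics only the documented dispatch rules."""

    def __init__(self):
        # Maps a tag name to a (start_handler, end_handler) tuple.
        self.elements = {}
        # Records calls to the unknown_* methods (illustration only).
        self.unknown = []

    def dispatch_starttag(self, tag, attributes):
        # Assumption: a tag absent from elements is treated like None.
        start = self.elements.get(tag, (None, None))[0]
        if start is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, start, attributes)

    def dispatch_endtag(self, tag):
        end = self.elements.get(tag, (None, None))[1]
        if end is None:
            self.unknown_endtag(tag)
        else:
            self.handle_endtag(tag, end)

    def handle_starttag(self, tag, method, attributes):
        # Documented base behaviour: call method with attributes only.
        method(attributes)

    def handle_endtag(self, tag, method):
        # Documented base behaviour: call method without arguments.
        method()

    def unknown_starttag(self, tag, attributes):
        self.unknown.append(('start', tag))

    def unknown_endtag(self, tag):
        self.unknown.append(('end', tag))


seen = []
parser = DispatchSketch()
parser.elements['A'] = (lambda attrs: seen.append(('start', attrs)),
                        lambda: seen.append(('end',)))
parser.dispatch_starttag('A', {'HREF': 'http://www.cwi.nl/'})
parser.dispatch_endtag('A')
parser.dispatch_starttag('B', {})
```

After these calls, \code{seen} holds the two events for \code{'A'}, and
the start tag \code{'B'}, which has no registered handler, is recorded
in \code{parser.unknown}.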
\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}

\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}. \var{ref} can either be a decimal number,
or a hexadecimal number when preceded by an \character{x}.
In the base implementation, \var{ref} must be a number in the
range 0--255. It translates the character to \ASCII{} and calls the
method \method{handle_data()} with the character as argument. If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error. A
subclass must override this method to provide support for character
references outside of the \ASCII{} range.
\end{methoddesc}

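The decoding rule for \var{ref} can be illustrated with a small
standalone function. This is a sketch under stated assumptions rather
than the module's implementation: the name \function{decode_charref()}
is hypothetical, and returning \code{None} stands in for the call to
\method{unknown_charref()}.

```python
import re

# Decimal digits, or 'x' followed by hexadecimal digits.
_CHARREF = re.compile(r'x[0-9a-fA-F]+|[0-9]+')


def decode_charref(ref):
    """Return the character for a reference body such as '65' or 'x41'.

    Returns None when ref is malformed or outside the range 0-255;
    in the parser, that case corresponds to unknown_charref(ref).
    """
    if _CHARREF.fullmatch(ref) is None:
        return None
    if ref[0] == 'x':
        number = int(ref[1:], 16)
    else:
        number = int(ref)
    if number > 255:
        return None
    return chr(number)
```

With this function, \code{decode_charref('65')} and
\code{decode_charref('x41')} both give \code{'A'}, while
\code{decode_charref('256')} gives \code{None}.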
\begin{methoddesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the
form \samp{\&\var{ref};} where \var{ref} is a general entity
reference. It looks for \var{ref} in the instance (or class)
variable \member{entitydefs} which should be a mapping from entity
names to corresponding translations.
If a translation is found, it calls the method \method{handle_data()}
with the translation; otherwise, it calls the method
\code{unknown_entityref(\var{ref})}. The default \member{entitydefs}
defines translations for \code{\&amp;}, \code{\&apos;}, \code{\&gt;},
\code{\&lt;}, and \code{\&quot;}.
\end{methoddesc}

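The lookup performed for an entity reference can be sketched as
follows. The function \function{dispatch_entityref()} and its callback
arguments are hypothetical, and the values in the example mapping are
illustrative assumptions; the text above names the default keys but
does not state the stored values.

```python
def dispatch_entityref(ref, entitydefs, handle_data, unknown_entityref):
    """Route an entity name to handle_data or unknown_entityref."""
    if ref in entitydefs:
        handle_data(entitydefs[ref])
    else:
        unknown_entityref(ref)


# Illustrative mapping only; the real default values may differ.
entitydefs = {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': "'"}

data = []      # collects what handle_data() would receive
unknown = []   # collects what unknown_entityref() would receive
dispatch_entityref('amp', entitydefs, data.append, unknown.append)
dispatch_entityref('nbsp', entitydefs, data.append, unknown.append)
```

A subclass that needs further entities would typically extend
\member{entitydefs} or override \method{unknown_entityref()}.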
\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered. The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves. For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing.
\end{methoddesc}

\begin{methoddesc}{handle_cdata}{data}
This method is called when a CDATA section is encountered. The
\var{data} argument is a string containing the text between the
\samp{<![CDATA[} and \samp{]]>} delimiters, but not the delimiters
themselves. For example, the section \samp{<![CDATA[text]]>} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing, and is intended to be overridden.
\end{methoddesc}

\begin{methoddesc}{handle_proc}{name, data}
This method is called when a processing instruction (PI) is
encountered. The \var{name} is the PI target, and the \var{data}
argument is a string containing the text between the PI target and the
closing delimiter, but not the delimiter itself. For example, the
instruction \samp{<?XML text?>} will cause this method to be called
with the arguments \code{'XML'} and \code{'text'}. The default method
does nothing. Note that if a document starts with
\samp{<?xml ...?>}, \method{handle_xml()} is called to handle it.
\end{methoddesc}

\begin{methoddesc}{handle_special}{data}
This method is called when a declaration is encountered. The
\var{data} argument is a string containing the text between the
\samp{<!} and \samp{>} delimiters, but not the delimiters
themselves. For example, the declaration \samp{<!ENTITY text>} will
cause this method to be called with the argument \code{'ENTITY text'}. The
default method does nothing. Note that \samp{<!DOCTYPE ...>} is
handled separately if it is located at the start of the document.
\end{methoddesc}

\begin{methoddesc}{syntax_error}{message}
This method is called when a syntax error is encountered. The
\var{message} is a description of what was wrong. The default method
raises a \exception{RuntimeError} exception. If this method is
overridden, it is permissible for it to return. This method is only
called when the error can be recovered from. Unrecoverable errors
raise a \exception{RuntimeError} without first calling
\method{syntax_error()}.
\end{methoddesc}

\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references. It is intended to be overridden by a derived class; the
base class implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}

\subsection{XML Namespaces}

This module has support for XML namespaces as defined in the XML
Namespaces proposed recommendation.

Tag and attribute names that are defined in an XML namespace are
handled as if the name of the tag or attribute consisted of the
namespace (i.e. the URL that defines the namespace) followed by a
space and the name of the tag or attribute. For instance, the tag
\code{<html xmlns='http://www.w3.org/TR/REC-html40'>} is treated as if
the tag name were \code{'http://www.w3.org/TR/REC-html40 html'}, and
the tag \code{<html:a href='http://frob.com'>} inside the
above-mentioned element is treated as if the tag name were
\code{'http://www.w3.org/TR/REC-html40 a'} and the attribute name as
if it were \code{'http://www.w3.org/TR/REC-html40 href'}.

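The naming rule can be illustrated with a short function that joins a
namespace URL and a local name. This is only a sketch of the rule as
stated here, not the module's resolution logic: the function
\function{qualify()} and the \var{namespaces} argument (a mapping from
prefix to URL, with the empty string as the key of the default
namespace) are hypothetical, and the scoping of declarations is not
modelled.

```python
def qualify(name, namespaces):
    """Return 'URL localname' for a possibly prefixed tag or attribute name.

    namespaces maps a prefix to its URL; '' is the default namespace.
    Names whose prefix has no known namespace are returned unchanged.
    """
    prefix, colon, local = name.partition(':')
    if not colon:
        # No prefix present: the whole name is the local part.
        prefix, local = '', name
    url = namespaces.get(prefix)
    if url is None:
        return name
    return url + ' ' + local


HTML40 = 'http://www.w3.org/TR/REC-html40'
# Assumed declarations in scope for this illustration.
in_scope = {'': HTML40, 'html': HTML40}
```

With these assumed declarations, \code{qualify('html', in_scope)} and
\code{qualify('html:a', in_scope)} give the two tag names quoted in the
example above.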
An older draft of the XML Namespaces proposal is also recognized, but
triggers a warning.