\section{\module{xmllib} ---
         A parser for XML documents}

\declaremodule{standard}{xmllib}
\modulesynopsis{A parser for XML documents.}
\moduleauthor{Sjoerd Mullender}{Sjoerd.Mullender@cwi.nl}
\sectionauthor{Sjoerd Mullender}{Sjoerd.Mullender@cwi.nl}


\index{XML}
\index{Extensible Markup Language}

\deprecated{2.0}{Use \refmodule{xml.sax} instead. The newer XML
                 package includes full support for XML 1.0.}

\versionchanged[Added namespace support]{1.5.2}

This module defines a class \class{XMLParser} which serves as the basis
for parsing text files formatted in XML (Extensible Markup Language).

\begin{classdesc}{XMLParser}{}
The \class{XMLParser} class must be instantiated without
arguments.\footnote{Actually, a number of keyword arguments are
recognized which influence the parser to accept certain non-standard
constructs. The following keyword arguments are currently
recognized. The default for all of these is \code{0} (false) except
for the last one, for which the default is \code{1} (true).
\var{accept_unquoted_attributes} (accept certain attribute values
without requiring quotes), \var{accept_missing_endtag_name} (accept
end tags that look like \code{</>}), \var{map_case} (map upper case to
lower case in tags and attributes), \var{accept_utf8} (allow UTF-8
characters in input; this is required according to the XML standard,
but Python does not as yet deal properly with these characters, so
this is not the default), \var{translate_attribute_references}
(translate character and entity references in attribute values).}
\end{classdesc}

This class provides the following interface methods and instance variables:

\begin{memberdesc}{attributes}
A mapping of element names to mappings. The latter mapping maps
attribute names that are valid for the element to the default value of
the attribute, or, if there is no default, to \code{None}. The default
value is the empty dictionary. This variable is meant to be
overridden, not extended, since the default is shared by all instances
of \class{XMLParser}.
\end{memberdesc}

\begin{memberdesc}{elements}
A mapping of element names to tuples. The tuples contain a function
for handling the start and end tag respectively of the element, or
\code{None} if the method \method{unknown_starttag()} or
\method{unknown_endtag()} is to be called. The default value is the
empty dictionary. This variable is meant to be overridden, not
extended, since the default is shared by all instances of
\class{XMLParser}.
\end{memberdesc}
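
The advice to override rather than extend these mappings follows from how
Python class attributes behave. The following is a minimal sketch, not
\module{xmllib} code: \class{FakeParser}, \class{Extending} and
\class{Overriding} are hypothetical stand-ins, and the only assumption
carried over from the text is that the default mapping is shared by all
instances.

```python
class FakeParser:
    # Hypothetical stand-in for XMLParser: the default mapping is a
    # class attribute, so every instance sees the same dictionary object.
    elements = {}


class Overriding(FakeParser):
    # Rebinds the name to a fresh dictionary owned by this subclass.
    elements = {'a': (None, None)}


class Extending(FakeParser):
    def __init__(self):
        # Mutates the dictionary inherited from FakeParser, which is
        # the one shared with every other parser instance.
        self.elements['a'] = (None, None)


Overriding()
base_after_override = dict(FakeParser.elements)   # still empty

Extending()
base_after_extend = dict(FakeParser.elements)     # now contains 'a'
```

Overriding leaves the shared default untouched, while extending it in place
changes what every other parser instance sees.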

\begin{memberdesc}{entitydefs}
A mapping of entity names to their values. The default value contains
definitions for \code{'lt'}, \code{'gt'}, \code{'amp'}, \code{'quot'},
and \code{'apos'}.
\end{memberdesc}

\begin{methoddesc}{reset}{}
Reset the instance. Loses all unprocessed data. This is called
implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{setnomoretags}{}
Stop processing tags. Treat all following input as literal input
(CDATA).
\end{methoddesc}

\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode). This mode is automatically exited
when the close tag matching the last unclosed open tag is encountered.
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser. It is processed insofar as it consists
of complete tags; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}

\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class
\method{close()} method.
\end{methoddesc}
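
The same feed-then-close pattern is available in \refmodule{xml.sax}, which
the deprecation note above recommends. The sketch below uses that module
rather than \module{xmllib}; the \class{EventCollector} class and the way the
input is split into chunks are illustrative assumptions.

```python
import xml.sax


class EventCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records the events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("data", content))


parser = xml.sax.make_parser()
handler = EventCollector()
parser.setContentHandler(handler)

# The chunk boundaries deliberately fall in the middle of tags; the
# parser buffers incomplete input until later chunks complete it.
for chunk in ['<doc><a hr', 'ef="x">hi</a', '></doc>']:
    parser.feed(chunk)

# Signal end of input so that any remaining buffered data is processed.
parser.close()
```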

\begin{methoddesc}{translate_references}{data}
Translate all entity and character references in \var{data} and
return the translated string.
\end{methoddesc}

\begin{methoddesc}{getnamespace}{}
Return a mapping of namespace abbreviations to namespace URIs that are
currently in effect.
\end{methoddesc}

\begin{methoddesc}{handle_xml}{encoding, standalone}
This method is called when the \samp{<?xml ...?>} tag is processed.
The arguments are the values of the encoding and standalone attributes
in the tag. Both encoding and standalone are optional. The values
passed to \method{handle_xml()} default to \code{None} and the string
\code{'no'} respectively.
\end{methoddesc}

\begin{methoddesc}{handle_doctype}{tag, pubid, syslit, data}
This\index{DOCTYPE declaration} method is called when the
\samp{<!DOCTYPE...>} declaration is processed. The arguments are the
tag name of the root element, the Formal Public\index{Formal Public
Identifier} Identifier (or \code{None} if not specified), the system
identifier, and the uninterpreted contents of the internal DTD subset
as a string (or \code{None} if not present).
\end{methoddesc}

\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which a start tag
handler is defined in the instance variable \member{elements}. The
\var{tag} argument is the name of the tag, and the
\var{method} argument is the function (method) which should be used to
support semantic interpretation of the start tag. The
\var{attributes} argument is a dictionary of attributes, the key being
the \var{name} and the value being the \var{value} of the attribute
found inside the tag's \code{<>} brackets. Character and entity
references in the \var{value} have been interpreted. For instance,
for the start tag \code{<A HREF="http://www.cwi.nl/">}, this method
would be called as \code{handle_starttag('A', self.elements['A'][0],
\{'HREF': 'http://www.cwi.nl/'\})}. The base implementation simply
calls \var{method} with \var{attributes} as the only argument.
\end{methoddesc}
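
The relationship between \member{elements}, \method{handle_starttag()} and
\method{unknown_starttag()} can be sketched as follows. This is a hypothetical
stand-in, not the \module{xmllib} implementation: \class{MiniDispatcher},
\method{dispatch_start()} and \method{start_a()} are invented names, and the
lookup step is an assumption based on the descriptions in this section.

```python
class MiniDispatcher:
    """Hypothetical stand-in that mirrors only the documented dispatch rule."""

    def __init__(self):
        self.log = []
        # Maps an element name to a (start handler, end handler) tuple.
        self.elements = {'A': (self.start_a, None)}

    def start_a(self, attributes):
        self.log.append(('start_a', attributes))

    def handle_starttag(self, tag, method, attributes):
        # Base behaviour as documented: call method with the attributes.
        method(attributes)

    def unknown_starttag(self, tag, attributes):
        self.log.append(('unknown_starttag', tag, attributes))

    def dispatch_start(self, tag, attributes):
        # Assumed lookup step: use the registered start handler if there
        # is one, otherwise fall back to unknown_starttag().
        start, _end = self.elements.get(tag, (None, None))
        if start is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, start, attributes)


d = MiniDispatcher()
d.dispatch_start('A', {'HREF': 'http://www.cwi.nl/'})   # registered handler
d.dispatch_start('B', {})                               # no handler for 'B'
```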

\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an end tag handler
is defined in the instance variable \member{elements}. The \var{tag}
argument is the name of the tag, and the \var{method} argument is the
function (method) which should be used to support semantic
interpretation of the end tag. For instance, for the end tag
\code{</A>}, this method would be called as \code{handle_endtag('A',
self.elements['A'][1])}. The base implementation simply calls
\var{method}.
\end{methoddesc}

\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}

\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}. \var{ref} can either be a decimal number,
or a hexadecimal number when preceded by an \character{x}.
In the base implementation, \var{ref} must be a number in the
range 0-255. It translates the character to \ASCII{} and calls the
method \method{handle_data()} with the character as argument. If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error. A
subclass must override this method to provide support for character
references outside of the \ASCII{} range.
\end{methoddesc}
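
The decimal and hexadecimal forms and the 0-255 limit can be illustrated with
a small stand-alone function. This is a sketch under assumptions, not the
module's code: \function{resolve_charref()} is a hypothetical helper, and
returning \code{None} stands in for the call to \method{unknown_charref()}.

```python
def resolve_charref(ref):
    """Return the character named by ref, or None if it cannot be resolved.

    ref is the text between '&#' and ';'. Hypothetical helper; None stands
    in for the unknown_charref() fallback described above.
    """
    try:
        if ref[:1] == 'x':
            number = int(ref[1:], 16)   # hexadecimal form, e.g. 'x41'
        else:
            number = int(ref)           # decimal form, e.g. '65'
    except ValueError:
        return None                     # not a number at all
    if not 0 <= number <= 255:
        return None                     # outside the range the base class accepts
    return chr(number)


decimal_a = resolve_charref('65')
hex_a = resolve_charref('x41')
too_large = resolve_charref('256')
garbage = resolve_charref('xyz')
```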

\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered. The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves. For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing.
\end{methoddesc}

\begin{methoddesc}{handle_cdata}{data}
This method is called when a CDATA section is encountered. The
\var{data} argument is a string containing the text between the
\samp{<![CDATA[} and \samp{]]>} delimiters, but not the delimiters
themselves. For example, the section \samp{<![CDATA[text]]>} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing, and is intended to be overridden.
\end{methoddesc}

\begin{methoddesc}{handle_proc}{name, data}
This method is called when a processing instruction (PI) is
encountered. The \var{name} is the PI target, and the \var{data}
argument is a string containing the text between the PI target and the
closing delimiter, but not the delimiter itself. For example, the
instruction \samp{<?XML text?>} will cause this method to be called
with the arguments \code{'XML'} and \code{'text'}. The default method
does nothing. Note that if a document starts with
\samp{<?xml ...?>}, \method{handle_xml()} is called to handle it.
\end{methoddesc}

\begin{methoddesc}{handle_special}{data}
This method is called when a declaration is encountered. The
\var{data} argument is a string containing the text between the
\samp{<!} and \samp{>} delimiters, but not the delimiters
themselves. For example, the \index{ENTITY declaration}entity
declaration \samp{<!ENTITY text>} will cause this method to be called
with the argument \code{'ENTITY text'}. The default method does
nothing. Note that \samp{<!DOCTYPE ...>} is handled separately if it
is located at the start of the document.
\end{methoddesc}

\begin{methoddesc}{syntax_error}{message}
This method is called when a syntax error is encountered. The
\var{message} is a description of what was wrong. The default method
raises a \exception{RuntimeError} exception. If this method is
overridden, it is permissible for it to return. This method is only
called when the error can be recovered from. Unrecoverable errors
raise a \exception{RuntimeError} without first calling
\method{syntax_error()}.
\end{methoddesc}

\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references. It is intended to be overridden by a derived class; the
base class implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference. It is
intended to be overridden by a derived class; the base class
implementation calls \method{syntax_error()} to signal an error.
\end{methoddesc}


\begin{seealso}
  \seetitle[http://www.w3.org/TR/REC-xml]{Extensible Markup Language
            (XML) 1.0}{The XML specification, published by the World
            Wide Web Consortium (W3C), defines the syntax and
            processor requirements for XML. References to additional
            material on XML, including translations of the
            specification, are available at
            \url{http://www.w3.org/XML/}.}

  \seetitle[http://www.python.org/topics/xml/]{Python and XML
            Processing}{The Python XML Topic Guide provides a great
            deal of information on using XML from Python and links to
            other sources of information on XML.}

  \seetitle[http://www.python.org/sigs/xml-sig/]{SIG for XML
            Processing in Python}{The Python XML Special Interest
            Group is developing substantial support for processing XML
            from Python.}
\end{seealso}


\subsection{XML Namespaces \label{xml-namespace}}

This module has support for XML namespaces as defined in the XML
Namespaces proposed recommendation.
\indexii{XML}{namespaces}

Tag and attribute names that are defined in an XML namespace are
handled as if the name of the tag or attribute consisted of the
namespace (the URL that defines the namespace) followed by a
space and the name of the tag or attribute. For instance, the tag
\code{<html xmlns='http://www.w3.org/TR/REC-html40'>} is treated as if
the tag name were \code{'http://www.w3.org/TR/REC-html40 html'}, and
the tag \code{<html:a href='http://frob.com'>} inside the above
mentioned element is treated as if the tag name were
\code{'http://www.w3.org/TR/REC-html40 a'} and the attribute name as
if it were \code{'http://www.w3.org/TR/REC-html40 href'}.
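
The naming rule (namespace URL, a space, then the local name) can be
illustrated with a short function. This is only a sketch of the rule as
stated above, not the module's code: \function{expand_name()}, its
parameters, and leaving a name with an undeclared prefix unchanged are all
assumptions made for the example.

```python
HTML_NS = 'http://www.w3.org/TR/REC-html40'


def expand_name(qname, prefixes, default_uri=None):
    """Return 'URI localname' for a possibly prefixed name.

    prefixes maps namespace abbreviations to URIs; default_uri is the
    namespace in effect for names without a prefix (None if there is none).
    Hypothetical helper illustrating the naming rule only.
    """
    prefix, colon, local = qname.partition(':')
    if not colon:
        # No prefix: apply the default namespace, if one is in effect.
        if default_uri is None:
            return qname
        return default_uri + ' ' + qname
    if prefix not in prefixes:
        # Assumption for this sketch: leave undeclared prefixes alone.
        return qname
    return prefixes[prefix] + ' ' + local


root_name = expand_name('html', {}, default_uri=HTML_NS)
anchor_name = expand_name('html:a', {'html': HTML_NS})
plain_name = expand_name('p', {})
```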

An older draft of the XML Namespaces proposal is also recognized, but
triggers a warning.

\begin{seealso}
  \seetitle[http://www.w3.org/TR/REC-xml-names/]{Namespaces in XML}{
            This World Wide Web Consortium recommendation describes the
            proper syntax and processing requirements for namespaces in
            XML.}
\end{seealso}