\section{\module{xmllib} ---
         A parser for XML documents.}
% Author: Sjoerd Mullender
\declaremodule{standard}{xmllib}

\modulesynopsis{A parser for XML documents.}

\index{XML}

This module defines a class \class{XMLParser} which serves as the basis
for parsing text files formatted in XML (Extensible Markup Language).

\begin{classdesc}{XMLParser}{}
The \class{XMLParser} class must be instantiated without arguments.
\end{classdesc}

This class provides the following interface methods:

\begin{methoddesc}{reset}{}
Reset the instance, discarding all unprocessed data. This is called
implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{setnomoretags}{}
Stop processing tags. Treat all following input as literal input
(CDATA).
\end{methoddesc}

\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser. It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}

\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class's \method{close()}.
\end{methoddesc}

\begin{methoddesc}{translate_references}{data}
Translate all entity and character references in \var{data} and
return the translated string.
\end{methoddesc}

\begin{methoddesc}{handle_xml}{encoding, standalone}
This method is called when the \samp{<?xml ...?>} tag is processed.
The arguments are the values of the encoding and standalone attributes
in the tag. Both encoding and standalone are optional. The values
passed to \method{handle_xml()} default to \code{None} and the string
\code{'no'} respectively.
\end{methoddesc}

\begin{methoddesc}{handle_doctype}{tag, data}
This method is called when the \samp{<!DOCTYPE...>} tag is processed.
The arguments are the name of the root element and the uninterpreted
contents of the tag, starting after the white space after the name of
the root element.
\end{methoddesc}

\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which a
\method{start_\var{tag}()} method has been defined. The \var{tag}
argument is the name of the tag, and the \var{method} argument is the
bound method which should be used to support semantic interpretation
of the start tag. The \var{attributes} argument is a dictionary of
attributes, the key being the \var{name} and the value being the
\var{value} of the attribute found inside the tag's \code{<>} brackets.
Character and entity references in the \var{value} have
been interpreted. For instance, for the tag
\code{<A HREF="http://www.cwi.nl/">}, this method would be called as
\code{handle_starttag('A', self.start_A, \{'HREF': 'http://www.cwi.nl/'\})}.
The base implementation simply calls \var{method} with \var{attributes}
as the only argument.
\end{methoddesc}

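The dispatch described for \method{handle_starttag()} can be pictured
with a small stand-alone sketch. It does not import \module{xmllib};
the class \class{MiniDispatcher} and its \method{dispatch_start()}
helper are hypothetical names used only for illustration, and the
fallback to \method{unknown_starttag()} is an assumption based on the
method descriptions in this section.

\begin{verbatim}
class MiniDispatcher:
    """Hypothetical stand-in; this is not xmllib.XMLParser."""

    def __init__(self):
        self.calls = []

    def handle_starttag(self, tag, method, attributes):
        # Documented base behaviour: call the bound method with the
        # attribute dictionary as its only argument.
        method(attributes)

    def unknown_starttag(self, tag, attributes):
        self.calls.append(('unknown', tag, attributes))

    def dispatch_start(self, tag, attributes):
        # Hypothetical helper: look up start_<tag>, case-sensitively.
        method = getattr(self, 'start_' + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)

    def start_A(self, attributes):
        self.calls.append(('A', attributes))


d = MiniDispatcher()
d.dispatch_start('A', {'HREF': 'http://www.cwi.nl/'})
d.dispatch_start('B', {})
\end{verbatim}

After these two calls, \code{d.calls} records one call to
\method{start_A()} and one call to \method{unknown_starttag()}.
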
\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an
\method{end_\var{tag}()} method has been defined. The \var{tag}
argument is the name of the tag, and the
\var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag. If no
\method{end_\var{tag}()} method is defined for the closing element, this
handler is not called. The base implementation simply calls
\var{method}.
\end{methoddesc}

\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}

\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}. \var{ref} can either be a decimal number,
or a hexadecimal number when preceded by an \character{x}.
In the base implementation, \var{ref} must be a number in the
range 0-255. It translates the character to \ASCII{} and calls the
method \method{handle_data()} with the character as argument. If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error. A
subclass must override this method to provide support for character
references outside of the \ASCII{} range.
\end{methoddesc}

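The rules given for \method{handle_charref()} can be sketched as a
plain function. This is only an illustration under the assumptions
stated in the comments, not the module's implementation; the name
\code{decode_charref} is hypothetical, and a return value of
\code{None} stands for the cases in which
\method{unknown_charref()} would be called.

\begin{verbatim}
def decode_charref(ref):
    """Return the character for ref, or None if it cannot be resolved."""
    try:
        if ref[:1] == 'x':
            # Hexadecimal form: the digits follow the leading 'x'.
            n = int(ref[1:], 16)
        else:
            n = int(ref, 10)
    except ValueError:
        # Not a number at all: treated as invalid.
        return None
    if not 0 <= n <= 255:
        # Outside the range accepted by the base implementation.
        return None
    return chr(n)
\end{verbatim}

For example, \code{decode_charref('65')} and
\code{decode_charref('x41')} both give \code{'A'}, while
\code{decode_charref('256')} gives \code{None}.
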
\begin{methoddesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the
form \samp{\&\var{ref};} where \var{ref} is the name of a general
entity. It looks for \var{ref} in the instance (or class)
variable \member{entitydefs} which should be a mapping from entity
names to corresponding translations.
If a translation is found, it calls the method \method{handle_data()}
with the translation; otherwise, it calls the method
\code{unknown_entityref(\var{ref})}. The default \member{entitydefs}
defines translations for \code{\&amp;}, \code{\&apos;}, \code{\&gt;},
\code{\&lt;}, and \code{\&quot;}.
\end{methoddesc}

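The lookup performed by \method{handle_entityref()} can be sketched
as follows. The class \class{EntityDemo} is a hypothetical stand-in
that does not use \module{xmllib}, and the values in its
\member{entitydefs} table are assumed; only the five entity names are
taken from the description above.

\begin{verbatim}
class EntityDemo:
    # Assumed translations for the five default entity names.
    entitydefs = {'amp': '&', 'apos': "'", 'gt': '>',
                  'lt': '<', 'quot': '"'}

    def __init__(self):
        self.data = []
        self.unknown = []

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)

    def handle_entityref(self, ref):
        # Attribute lookup finds an instance variable first and
        # falls back to the class variable.
        table = self.entitydefs
        if ref in table:
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)


p = EntityDemo()
p.handle_entityref('lt')
p.handle_entityref('nbsp')
\end{verbatim}

Here \code{'lt'} is translated and passed to \method{handle_data()},
while \code{'nbsp'} has no entry in the table and is passed to
\method{unknown_entityref()}.
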
\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered. The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves. For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing.
\end{methoddesc}

\begin{methoddesc}{handle_cdata}{data}
This method is called when a CDATA section is encountered. The
\var{data} argument is a string containing the text between the
\samp{<![CDATA[} and \samp{]]>} delimiters, but not the delimiters
themselves. For example, the section \samp{<![CDATA[text]]>} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing, and is intended to be overridden.
\end{methoddesc}

\begin{methoddesc}{handle_proc}{name, data}
This method is called when a processing instruction (PI) is
encountered. The \var{name} is the PI target, and the \var{data}
argument is a string containing the text between the PI target and the
closing delimiter, but not the delimiter itself. For example, the
instruction \samp{<?XML text?>} will cause this method to be called
with the arguments \code{'XML'} and \code{'text'}. The default method
does nothing. Note that if a document starts with \samp{<?xml
...?>}, \method{handle_xml()} is called to handle it.
\end{methoddesc}

\begin{methoddesc}{handle_special}{data}
This method is called when a declaration is encountered. The
\var{data} argument is a string containing the text between the
\samp{<!} and \samp{>} delimiters, but not the delimiters
themselves. For example, the declaration \samp{<!ENTITY text>} will
cause this method to be called with the argument \code{'ENTITY text'}. The
default method does nothing. Note that \samp{<!DOCTYPE ...>} is
handled separately if it is located at the start of the document.
\end{methoddesc}

\begin{methoddesc}{syntax_error}{message}
This method is called when a syntax error is encountered. The
\var{message} is a description of what was wrong. The default method
raises a \exception{RuntimeError} exception. If this method is
overridden, it is permissible for it to return. This method is only
called when the error can be recovered from. Unrecoverable errors
raise a \exception{RuntimeError} without first calling
\method{syntax_error()}.
\end{methoddesc}

\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references. It is intended to be overridden by a derived class; the
base class implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods and variables of the following form to
define processing of specific tags. Tag names in the input stream are
case sensitive; the \var{tag} occurring in method names must be in the
correct case:

\begin{methoddescni}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}. The
\var{attributes} argument has the same meaning as described for
\method{handle_starttag()} above. In fact, the base implementation of
\method{handle_starttag()} calls this method.
\end{methoddescni}

\begin{methoddescni}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{methoddescni}

\begin{memberdescni}{\var{tag}_attributes}
If a class or instance variable \member{\var{tag}_attributes} exists, it
should be a list or a dictionary. If a list, the elements of the list
are the valid attributes for the element \var{tag}; if a dictionary,
the keys are the valid attributes for the element \var{tag}, and the
values the default values of the attributes, or \code{None} if there
is no default.
In addition to the attributes that were present in the tag, the
attribute dictionary that is passed to \method{handle_starttag()} and
\method{unknown_starttag()} contains values for all attributes that
have a default value.
\end{memberdescni}
Fred Drakefc576191998-04-04 07:15:02 +0000226\end{memberdescni}