\section{Standard Module \module{xmllib}}
% Author: Sjoerd Mullender
\label{module-xmllib}
\stmodindex{xmllib}
\index{XML}

This module defines a class \class{XMLParser} which serves as the basis
for parsing text files formatted in XML (Extensible Markup Language).

\begin{classdesc}{XMLParser}{}
The \class{XMLParser} class must be instantiated without arguments.
\end{classdesc}

This class provides the following interface methods:

\begin{methoddesc}{reset}{}
Reset the instance. All unprocessed data is lost. This is called
implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{setnomoretags}{}
Stop processing tags. Treat all following input as literal input
(CDATA).
\end{methoddesc}

\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser. It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}

\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class
\method{close()} method.
\end{methoddesc}

\begin{methoddesc}{translate_references}{data}
Translate all entity and character references in \var{data} and
return the translated string.
\end{methoddesc}

\begin{methoddesc}{handle_xml}{encoding, standalone}
This method is called when the \samp{<?xml ...?>} tag is processed.
The arguments are the values of the encoding and standalone attributes
in the tag. Both encoding and standalone are optional. The values
passed to \method{handle_xml()} default to \code{None} and the string
\code{'no'} respectively.
\end{methoddesc}

\begin{methoddesc}{handle_doctype}{tag, data}
This method is called when the \samp{<!DOCTYPE...>} tag is processed.
The arguments are the name of the root element and the uninterpreted
contents of the tag, starting after the white space after the name of
the root element.
\end{methoddesc}

\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which a
\method{start_\var{tag}()} method has been defined. The \var{tag}
argument is the name of the tag, and the \var{method} argument is the
bound method which should be used to support semantic interpretation
of the start tag. The \var{attributes} argument is a dictionary of
attributes, the key being the \var{name} and the value being the
\var{value} of the attribute found inside the tag's \code{<>} brackets.
Character and entity references in the \var{value} have
been interpreted. For instance, for the tag
\code{<A HREF="http://www.cwi.nl/">}, this method would be called as
\code{handle_starttag('A', self.start_A, \{'HREF': 'http://www.cwi.nl/'\})}.
The base implementation simply calls \var{method} with \var{attributes}
as the only argument.
\end{methoddesc}

\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an
\method{end_\var{tag}()} method has been defined. The \var{tag}
argument is the name of the tag, and the
\var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag. If no
\method{end_\var{tag}()} method is defined for the closing element, this
handler is not called. The base implementation simply calls
\var{method}.
\end{methoddesc}

\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}

\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}. \var{ref} can either be a decimal number,
or a hexadecimal number when preceded by an \character{x}.
In the base implementation, \var{ref} must be a number in the
range 0-255. It translates the character to \ASCII{} and calls the
method \method{handle_data()} with the character as argument. If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error. A
subclass must override this method to provide support for character
references outside of the \ASCII{} range.
\end{methoddesc}
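
The rules stated for \method{handle_charref()} can be sketched on their
own, without using \module{xmllib}. The function name
\code{decode_charref} is hypothetical and is not part of the module; the
sketch only mirrors the documented behaviour (decimal, or hexadecimal
after an \character{x}, limited to the range 0-255) and returns
\code{None} in the cases where the base class is described as calling
\method{unknown_charref()}.

```python
def decode_charref(ref):
    """Return the character named by a reference body such as '65' or 'x41'.

    Hypothetical helper, not part of xmllib. Returns None when ref is
    malformed or outside 0-255, which is the situation in which the
    documented base class would call unknown_charref(ref).
    """
    try:
        if ref[:1] == 'x':
            # Hexadecimal form: the digits follow the leading 'x'.
            number = int(ref[1:], 16)
        else:
            # Decimal form.
            number = int(ref, 10)
    except ValueError:
        return None
    if not 0 <= number <= 255:
        return None
    return chr(number)
```

A subclass that needs characters beyond this range would replace the
range check with its own mapping.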

\begin{methoddesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the
form \samp{\&\var{ref};} where \var{ref} is the name of a general
entity. It looks for \var{ref} in the instance (or class)
variable \member{entitydefs} which should be a mapping from entity
names to corresponding translations.
If a translation is found, it calls the method \method{handle_data()}
with the translation; otherwise, it calls the method
\code{unknown_entityref(\var{ref})}. The default \member{entitydefs}
defines translations for \code{\&amp;}, \code{\&apos;}, \code{\&gt;},
\code{\&lt;}, and \code{\&quot;}.
\end{methoddesc}
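
The lookup performed by \method{handle_entityref()} amounts to a
dictionary access with a fallback. The sketch below is standalone and
does not import \module{xmllib}; the names \code{ENTITYDEFS} and
\code{resolve_entityref} are hypothetical, and the translations in the
table are assumed from the five entity names listed above rather than
copied from the module.

```python
# Assumed default table: the five entity names listed in the text,
# mapped to the characters they conventionally stand for.
ENTITYDEFS = {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': "'"}


def resolve_entityref(ref, entitydefs=ENTITYDEFS):
    """Classify an entity name the way handle_entityref() is described.

    Returns ('data', translation) when a translation exists, which is
    where the documented method would call handle_data(), and
    ('unknown', ref) otherwise, which is where it would call
    unknown_entityref(ref).
    """
    if ref in entitydefs:
        return ('data', entitydefs[ref])
    return ('unknown', ref)
```

Passing a larger mapping as \code{entitydefs} corresponds to a subclass
that extends the \member{entitydefs} variable.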

\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered. The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves. For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing.
\end{methoddesc}

\begin{methoddesc}{handle_cdata}{data}
This method is called when a CDATA element is encountered. The
\var{data} argument is a string containing the text between the
\samp{<![CDATA[} and \samp{]]>} delimiters, but not the delimiters
themselves. For example, the entity \samp{<![CDATA[text]]>} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing, and is intended to be overridden.
\end{methoddesc}

\begin{methoddesc}{handle_proc}{name, data}
This method is called when a processing instruction (PI) is
encountered. The \var{name} is the PI target, and the \var{data}
argument is a string containing the text between the PI target and the
closing delimiter, but not the delimiter itself. For example, the
instruction \samp{<?XML text?>} will cause this method to be called
with the arguments \code{'XML'} and \code{'text'}. The default method
does nothing. Note that if a document starts with \samp{<?xml
...?>}, \method{handle_xml()} is called to handle it.
\end{methoddesc}

\begin{methoddesc}{handle_special}{data}
This method is called when a declaration is encountered. The
\var{data} argument is a string containing the text between the
\samp{<!} and \samp{>} delimiters, but not the delimiters
themselves. For example, the entity \samp{<!ENTITY text>} will
cause this method to be called with the argument \code{'ENTITY text'}. The
default method does nothing. Note that \samp{<!DOCTYPE ...>} is
handled separately if it is located at the start of the document.
\end{methoddesc}

\begin{methoddesc}{syntax_error}{message}
This method is called when a syntax error is encountered. The
\var{message} is a description of what was wrong. The default method
raises a \exception{RuntimeError} exception. If this method is
overridden, it is permissible for it to return. This method is only
called when the error can be recovered from. Unrecoverable errors
raise a \exception{RuntimeError} without first calling
\method{syntax_error()}.
\end{methoddesc}

\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references. It is intended to be overridden by a derived class; the
base class implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods and variables of the following form to
define processing of specific tags. Tag names in the input stream are
case sensitive; the \var{tag} occurring in method names must be in the
correct case:

\begin{methoddescni}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}. The
\var{attributes} argument has the same meaning as described for
\method{handle_starttag()} above. In fact, the base implementation of
\method{handle_starttag()} calls this method.
\end{methoddescni}

\begin{methoddescni}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{methoddescni}
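
The relationship between \method{start_\var{tag}()},
\method{end_\var{tag}()}, the \method{handle_*()} methods and the
\method{unknown_*()} methods can be illustrated with a small stand-in
class. \code{MiniDispatcher}, \code{dispatch_start()},
\code{dispatch_end()} and \code{LinkCollector} are hypothetical names
invented for this sketch; the code does not parse XML and is not the
implementation of \class{XMLParser}. It assumes only what is stated
above: a tag-specific method is looked up by its case-sensitive name,
and the \method{unknown_*()} method is used when none is defined.

```python
class MiniDispatcher:
    """Hypothetical stand-in that mimics only the documented dispatch."""

    def dispatch_start(self, tag, attributes):
        # Look for a start_<tag> method; the tag name is case sensitive.
        method = getattr(self, 'start_' + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)

    def dispatch_end(self, tag):
        # Look for an end_<tag> method in the same way.
        method = getattr(self, 'end_' + tag, None)
        if method is None:
            self.unknown_endtag(tag)
        else:
            self.handle_endtag(tag, method)

    def handle_starttag(self, tag, method, attributes):
        # Documented base behaviour: call method with the attributes.
        method(attributes)

    def handle_endtag(self, tag, method):
        # Documented base behaviour: call method without arguments.
        method()

    def unknown_starttag(self, tag, attributes):
        pass

    def unknown_endtag(self, tag):
        pass


class LinkCollector(MiniDispatcher):
    """Example subclass that records HREF values of <A> start tags."""

    def __init__(self):
        self.links = []
        self.unknown = []

    def start_A(self, attributes):
        self.links.append(attributes.get('HREF'))

    def unknown_starttag(self, tag, attributes):
        self.unknown.append(tag)
```

With this sketch, \code{dispatch_start('A', \{'HREF': 'http://www.cwi.nl/'\})}
reaches \method{start_A()}, while a lower-case \code{'a'} falls through
to \method{unknown_starttag()} because the names differ in case.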

\begin{memberdescni}{\var{tag}_attributes}
If a class or instance variable \member{\var{tag}_attributes} exists, it
should be a list or a dictionary. If a list, the elements of the list
are the valid attributes for the element \var{tag}; if a dictionary,
the keys are the valid attributes for the element \var{tag}, and the
values the default values of the attributes, or \code{None} if there
is no default.
In addition to the attributes that were present in the tag, the
attribute dictionary that is passed to \method{handle_starttag()} and
\method{unknown_starttag()} contains values for all attributes that
have a default value.
\end{memberdescni}
Fred Drakefc576191998-04-04 07:15:02 +0000223\end{memberdescni}