\section{Standard Module \module{xmllib}}
% Author: Sjoerd Mullender
\declaremodule{standard}{xmllib}

\modulesynopsis{A parser for XML documents.}

\index{XML}

This module defines a class \class{XMLParser} which serves as the basis
for parsing text files formatted in XML (Extensible Markup Language).

\begin{classdesc}{XMLParser}{}
The \class{XMLParser} class must be instantiated without arguments.
\end{classdesc}

This class provides the following interface methods:

\begin{methoddesc}{reset}{}
Reset the instance. Any unprocessed data is lost. This is called
implicitly at instantiation time.
\end{methoddesc}

\begin{methoddesc}{setnomoretags}{}
Stop processing tags. Treat all following input as literal input
(CDATA).
\end{methoddesc}

\begin{methoddesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{methoddesc}

\begin{methoddesc}{feed}{data}
Feed some text to the parser. It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \method{close()} is called.
\end{methoddesc}

\begin{methoddesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call the base class \method{close()}.
\end{methoddesc}

\begin{methoddesc}{translate_references}{data}
Translate all entity and character references in \var{data} and
return the translated string.
\end{methoddesc}

\begin{methoddesc}{handle_xml}{encoding, standalone}
This method is called when the \samp{<?xml ...?>} tag is processed.
The arguments are the values of the encoding and standalone attributes
in the tag. Both encoding and standalone are optional. The values
passed to \method{handle_xml()} default to \code{None} and the string
\code{'no'} respectively.
\end{methoddesc}

\begin{methoddesc}{handle_doctype}{tag, data}
This method is called when the \samp{<!DOCTYPE...>} tag is processed.
The arguments are the name of the root element and the uninterpreted
contents of the tag, starting after the white space after the name of
the root element.
\end{methoddesc}

\begin{methoddesc}{handle_starttag}{tag, method, attributes}
This method is called to handle start tags for which a
\method{start_\var{tag}()} method has been defined. The \var{tag}
argument is the name of the tag, and the \var{method} argument is the
bound method which should be used to support semantic interpretation
of the start tag. The \var{attributes} argument is a dictionary of
attributes, the key being the \var{name} and the value being the
\var{value} of the attribute found inside the tag's \code{<>} brackets.
Character and entity references in the \var{value} have
been interpreted. For instance, for the tag
\code{<A HREF="http://www.cwi.nl/">}, this method would be called as
\code{handle_starttag('A', self.start_A, \{'HREF': 'http://www.cwi.nl/'\})}.
The base implementation simply calls \var{method} with \var{attributes}
as the only argument.
\end{methoddesc}

\begin{methoddesc}{handle_endtag}{tag, method}
This method is called to handle end tags for which an
\method{end_\var{tag}()} method has been defined. The \var{tag}
argument is the name of the tag, and the
\var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag. If no
\method{end_\var{tag}()} method is defined for the closing element, this
handler is not called. The base implementation simply calls
\var{method}.
\end{methoddesc}

\begin{methoddesc}{handle_data}{data}
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{methoddesc}

\begin{methoddesc}{handle_charref}{ref}
This method is called to process a character reference of the form
\samp{\&\#\var{ref};}. \var{ref} can either be a decimal number,
or a hexadecimal number when preceded by an \character{x}.
In the base implementation, \var{ref} must be a number in the
range 0-255. It translates the character to \ASCII{} and calls the
method \method{handle_data()} with the character as argument. If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error. A
subclass must override this method to provide support for character
references outside of the \ASCII{} range.
\end{methoddesc}
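
The interpretation of \var{ref} described for \method{handle_charref()}
can be sketched as follows. This is a minimal stand-alone illustration
and not the module's own code; the helper name \code{decode_charref} is
hypothetical, and returning \code{None} stands in for the point at which
the base class would call \method{unknown_charref()}.

```python
def decode_charref(ref):
    """Return the character for a reference body such as '65' or 'x41'.

    Hypothetical helper, not part of xmllib.  None is returned where
    the base class would fall back to unknown_charref().
    """
    try:
        if ref[:1] == 'x':
            n = int(ref[1:], 16)   # hexadecimal form, as in &#x41;
        else:
            n = int(ref, 10)       # decimal form, as in &#65;
    except ValueError:
        return None                # not a number at all
    if not 0 <= n <= 255:
        return None                # outside the range the base class accepts
    return chr(n)
```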

\begin{methoddesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the
form \samp{\&\var{ref};} where \var{ref} is the name of a general
entity. It looks for \var{ref} in the instance (or class)
variable \member{entitydefs}, which should be a mapping from entity
names to corresponding translations.
If a translation is found, it calls the method \method{handle_data()}
with the translation; otherwise, it calls the method
\code{unknown_entityref(\var{ref})}. The default \member{entitydefs}
defines translations for \code{\&amp;}, \code{\&apos;}, \code{\&gt;},
\code{\&lt;}, and \code{\&quot;}.
\end{methoddesc}
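
The lookup performed by \method{handle_entityref()} can be illustrated
with a small stand-in class. The class name \code{EntityDemo}, the
\code{data} and \code{unknown} lists, and the exact values stored in the
table are assumptions made for this sketch; only the method names and
the five entity names come from the description above.

```python
class EntityDemo:
    """Hypothetical stand-in for the entity handling of a parser."""

    # The five names listed in the text; the values are assumed here.
    entitydefs = {'amp': '&', 'apos': "'", 'gt': '>', 'lt': '<', 'quot': '"'}

    def __init__(self):
        self.data = []      # everything passed to handle_data()
        self.unknown = []   # everything passed to unknown_entityref()

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)

    def handle_entityref(self, ref):
        # Instance attributes shadow the class attribute, so a single
        # attribute lookup covers "instance (or class) variable".
        table = self.entitydefs
        if ref in table:
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)
```

A subclass could extend the table by assigning its own
\member{entitydefs} mapping, or override \method{unknown_entityref()} to
report names that are not defined.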

\begin{methoddesc}{handle_comment}{comment}
This method is called when a comment is encountered. The
\var{comment} argument is a string containing the text between the
\samp{<!--} and \samp{-->} delimiters, but not the delimiters
themselves. For example, the comment \samp{<!--text-->} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing.
\end{methoddesc}

\begin{methoddesc}{handle_cdata}{data}
This method is called when a CDATA element is encountered. The
\var{data} argument is a string containing the text between the
\samp{<![CDATA[} and \samp{]]>} delimiters, but not the delimiters
themselves. For example, the entity \samp{<![CDATA[text]]>} will
cause this method to be called with the argument \code{'text'}. The
default method does nothing, and is intended to be overridden.
\end{methoddesc}

\begin{methoddesc}{handle_proc}{name, data}
This method is called when a processing instruction (PI) is
encountered. The \var{name} is the PI target, and the \var{data}
argument is a string containing the text between the PI target and the
closing delimiter, but not the delimiter itself. For example, the
instruction \samp{<?XML text?>} will cause this method to be called
with the arguments \code{'XML'} and \code{'text'}. The default method
does nothing. Note that if a document starts with \samp{<?xml
...?>}, \method{handle_xml()} is called to handle it.
\end{methoddesc}

\begin{methoddesc}{handle_special}{data}
This method is called when a declaration is encountered. The
\var{data} argument is a string containing the text between the
\samp{<!} and \samp{>} delimiters, but not the delimiters
themselves. For example, the declaration \samp{<!ENTITY text>} will
cause this method to be called with the argument \code{'ENTITY text'}. The
default method does nothing. Note that \samp{<!DOCTYPE ...>} is
handled separately if it is located at the start of the document.
\end{methoddesc}

\begin{methoddesc}{syntax_error}{message}
This method is called when a syntax error is encountered. The
\var{message} is a description of what was wrong. The default method
raises a \exception{RuntimeError} exception. If this method is
overridden, it is permissible for it to return. This method is only
called when the error can be recovered from. Unrecoverable errors
raise a \exception{RuntimeError} without first calling
\method{syntax_error()}.
\end{methoddesc}

\begin{methoddesc}{unknown_starttag}{tag, attributes}
This method is called to process an unknown start tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag. It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references. It is intended to be overridden by a derived class; the
base class implementation does nothing.
\end{methoddesc}

\begin{methoddesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference. It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{methoddesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods and variables of the following form to
define processing of specific tags. Tag names in the input stream are
case-sensitive; the \var{tag} occurring in method names must be in the
correct case:

\begin{methoddescni}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}. The
\var{attributes} argument has the same meaning as described for
\method{handle_starttag()} above. In fact, the base implementation of
\method{handle_starttag()} calls this method.
\end{methoddescni}

\begin{methoddescni}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{methoddescni}
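
The way tag-specific methods are found by name can be sketched with a
stand-alone class that does not depend on \module{xmllib} itself. The
class name \code{DispatchDemo}, the \code{events} list, and the entry
points \code{dispatch_start()} and \code{dispatch_end()} are hypothetical
and exist only for this illustration; the handler names follow the
descriptions above. How the real parser locates its handlers internally
is not specified here, so the use of \code{getattr()} is an assumption.

```python
class DispatchDemo:
    """Hypothetical stand-in showing name-based tag dispatch."""

    def __init__(self):
        self.events = []

    # Generic hooks, behaving as described for the base implementation.
    def handle_starttag(self, tag, method, attributes):
        method(attributes)

    def handle_endtag(self, tag, method):
        method()

    def unknown_starttag(self, tag, attributes):
        self.events.append(('unknown start', tag))

    def unknown_endtag(self, tag):
        self.events.append(('unknown end', tag))

    # Tag-specific handlers; the name must match the tag's case exactly.
    def start_A(self, attributes):
        self.events.append(('start A', attributes))

    def end_A(self):
        self.events.append(('end A',))

    # Hypothetical entry points that a tokenizer would call.
    def dispatch_start(self, tag, attributes):
        method = getattr(self, 'start_' + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)

    def dispatch_end(self, tag):
        method = getattr(self, 'end_' + tag, None)
        if method is None:
            self.unknown_endtag(tag)
        else:
            self.handle_endtag(tag, method)
```

In this sketch a start tag \code{<a>} is reported through
\method{unknown_starttag()} even though \method{start_A()} exists, because
the lookup uses the tag name exactly as written.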

\begin{memberdescni}{\var{tag}_attributes}
If a class or instance variable \member{\var{tag}_attributes} exists, it
should be a list or a dictionary. If a list, the elements of the list
are the valid attributes for the element \var{tag}; if a dictionary,
the keys are the valid attributes for the element \var{tag}, and the
values the default values of the attributes, or \code{None} if there
is no default.
In addition to the attributes that were present in the tag, the
attribute dictionary that is passed to \method{handle_starttag()} and
\method{unknown_starttag()} contains values for all attributes that
have a default value.
\end{memberdescni}
Fred Drakefc576191998-04-04 07:15:02 +0000225\end{memberdescni}