\section{\module{xml.parsers.expat} ---
         Fast XML parsing using Expat}

\declaremodule{standard}{xml.parsers.expat}
\modulesynopsis{An interface to the Expat non-validating XML parser.}
\moduleauthor{Paul Prescod}{paul@prescod.net}
\sectionauthor{A.M. Kuchling}{amk1@bigfoot.com}

\versionadded{2.0}

The \module{xml.parsers.expat} module is a Python interface to the
Expat\index{Expat} non-validating XML parser.
The module provides a single extension type, \class{xmlparser}, that
represents the current state of an XML parser. After an
\class{xmlparser} object has been created, various attributes of the object
can be set to handler functions. When an XML document is then fed to
the parser, the handler functions are called for the character data
and markup in the XML document.

This module uses the \module{pyexpat}\refbimodindex{pyexpat} module to
provide access to the Expat parser. Direct use of the
\module{pyexpat} module is deprecated.

This module provides one exception and one type object:

\begin{excdesc}{error}
  The exception raised when Expat reports an error.
\end{excdesc}

\begin{datadesc}{XMLParserType}
  The type of the return values from the \function{ParserCreate()}
  function.
\end{datadesc}


The \module{xml.parsers.expat} module contains two functions:

\begin{funcdesc}{ErrorString}{errno}
Returns an explanatory string for a given error number \var{errno}.
\end{funcdesc}

\begin{funcdesc}{ParserCreate}{\optional{encoding\optional{,
                               namespace_separator}}}
Creates and returns a new \class{xmlparser} object.
\var{encoding}, if specified, must be a string naming the encoding
used by the XML data. Expat doesn't support as many encodings as
Python does, and its repertoire of encodings can't be extended; it
supports UTF-8, UTF-16, ISO-8859-1 (Latin1), and ASCII.

Expat can optionally do XML namespace processing for you, enabled by
providing a value for \var{namespace_separator}. The value must be a
one-character string; a \exception{ValueError} will be raised if the
string has an illegal length (\code{None} is considered the same as
omission). When namespace processing is enabled, element type names
and attribute names that belong to a namespace will be expanded. The
element name passed to the element handlers
\function{StartElementHandler()} and \function{EndElementHandler()}
will be the concatenation of the namespace URI, the namespace
separator character, and the local part of the name. If the namespace
separator is a zero byte (\code{chr(0)}) then the namespace URI and
the local part will be concatenated without any separator.

For example, if \var{namespace_separator} is set to a space character
(\character{ }) and the following document is parsed:

\begin{verbatim}
<?xml version="1.0"?>
<root xmlns    = "http://default-namespace.org/"
      xmlns:py = "http://www.python.org/ns/">
  <py:elem1 />
  <elem2 xmlns="" />
</root>
\end{verbatim}

\function{StartElementHandler()} will receive the following strings
for each element:

\begin{verbatim}
http://default-namespace.org/ root
http://www.python.org/ns/ elem1
elem2
\end{verbatim}

\end{funcdesc}
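The namespace expansion described for \function{ParserCreate()} can be
tried out with the following minimal sketch. It assumes a current
Python 3 interpreter rather than the Python 2.0 syntax used elsewhere
in this section, and the helper name \code{collect_names} is
hypothetical, introduced only for illustration.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version="1.0"?>
<root xmlns    = "http://default-namespace.org/"
      xmlns:py = "http://www.python.org/ns/">
  <py:elem1 />
  <elem2 xmlns="" />
</root>"""


def collect_names(document, separator=" "):
    """Return the names passed to StartElementHandler, in document order.

    Hypothetical helper; `separator` becomes namespace_separator.
    """
    names = []
    parser = xml.parsers.expat.ParserCreate(namespace_separator=separator)
    # The handler receives the expanded element name and the attributes.
    parser.StartElementHandler = lambda name, attributes: names.append(name)
    parser.Parse(document, True)
    return names


names = collect_names(DOCUMENT)
for name in names:
    print(name)
```

Collecting the names in a list instead of printing them directly makes
it easy to compare the result with the three strings shown above.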

\class{xmlparser} objects have the following methods:

\begin{methoddesc}[xmlparser]{Parse}{data\optional{, isfinal}}
Parses the contents of the string \var{data}, calling the appropriate
handler functions to process the parsed data. \var{isfinal} must be
true on the final call to this method. \var{data} can be the empty
string at any time.
\end{methoddesc}
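Because \method{Parse()} may be called repeatedly, a document can be
fed to the parser in pieces. The sketch below assumes a current
Python 3 interpreter; the helper name \code{parse_in_chunks} and the
sample chunks are hypothetical. The chunk boundaries deliberately fall
in the middle of tags, and the last call passes an empty string with a
true \var{isfinal}.

```python
import xml.parsers.expat


def parse_in_chunks(chunks):
    """Feed the chunks to one parser and return the element names seen."""
    started = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attributes: started.append(name)
    for chunk in chunks:
        # More data follows, so isfinal is false for every chunk.
        parser.Parse(chunk, False)
    # An empty string is acceptable; isfinal is true on the last call.
    parser.Parse("", True)
    return started


started = parse_in_chunks(
    ["<doc><it", "em>one</item>", "<item>two</it", "em></doc>"])
print(started)
```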

\begin{methoddesc}[xmlparser]{ParseFile}{file}
Parses XML data read from the object \var{file}. \var{file} only
needs to provide the \method{read(\var{nbytes})} method, returning the
empty string when there's no more data.
\end{methoddesc}
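As a small illustration of \method{ParseFile()}, the sketch below
reads from an in-memory object instead of a real file. It assumes a
current Python 3 interpreter, where \method{read()} is expected to
return bytes, which is why \class{io.BytesIO} is used; the sample
document is made up.

```python
import io
import xml.parsers.expat

text = []
parser = xml.parsers.expat.ParserCreate()
# Character data may arrive in several pieces, so collect and join them.
parser.CharacterDataHandler = text.append

# Any object with a suitable read() method will do; BytesIO is a stand-in.
source = io.BytesIO(b"<greeting>Hello, world</greeting>")
parser.ParseFile(source)

message = "".join(text)
print(message)
```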

\begin{methoddesc}[xmlparser]{SetBase}{base}
Sets the base to be used for resolving relative URIs in system identifiers in
declarations. Resolving relative identifiers is left to the application:
this value will be passed through as the \var{base} argument to the
\function{ExternalEntityRefHandler()}, \function{NotationDeclHandler()},
and \function{UnparsedEntityDeclHandler()} functions.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{GetBase}{}
Returns a string containing the base set by a previous call to
\method{SetBase()}, or \code{None} if
\method{SetBase()} hasn't been called.
\end{methoddesc}
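The relationship between \method{SetBase()} and \method{GetBase()} can
be checked with a few lines. This is only a sketch for a current
Python 3 interpreter, and the base URI is a made-up value.

```python
import xml.parsers.expat

parser = xml.parsers.expat.ParserCreate()

# No base has been set yet.
before = parser.GetBase()

# Hypothetical base URI, used only for illustration.
parser.SetBase("http://example.org/docs/")
after = parser.GetBase()

print(before, after)
```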
114
Fred Drakeefffe8e2000-10-29 05:10:30 +0000115
Fred Draked79c33a2000-09-25 14:14:30 +0000116\class{xmlparser} objects have the following attributes:
Andrew M. Kuchling0690c862000-08-17 23:15:21 +0000117
Fred Drakeefffe8e2000-10-29 05:10:30 +0000118\begin{memberdesc}[xmlparser]{returns_unicode}
Andrew M. Kuchling0690c862000-08-17 23:15:21 +0000119If this attribute is set to 1, the handler functions will be passed
120Unicode strings. If \member{returns_unicode} is 0, 8-bit strings
121containing UTF-8 encoded data will be passed to the handlers.
Fred Drakeb62966c2000-12-07 00:00:21 +0000122\versionchanged[Can be changed at any time to affect the result
123 type.]{1.6}
Fred Drakeefffe8e2000-10-29 05:10:30 +0000124\end{memberdesc}
Andrew M. Kuchling0690c862000-08-17 23:15:21 +0000125
126The following attributes contain values relating to the most recent
127error encountered by an \class{xmlparser} object, and will only have
128correct values once a call to \method{Parse()} or \method{ParseFile()}
Fred Drake7fbc85c2000-09-23 04:47:56 +0000129has raised a \exception{xml.parsers.expat.error} exception.
Andrew M. Kuchling6b14eeb2000-06-11 02:42:07 +0000130
Fred Drakeefffe8e2000-10-29 05:10:30 +0000131\begin{memberdesc}[xmlparser]{ErrorByteIndex}
Andrew M. Kuchling6b14eeb2000-06-11 02:42:07 +0000132Byte index at which an error occurred.
Fred Drakeefffe8e2000-10-29 05:10:30 +0000133\end{memberdesc}
Andrew M. Kuchling6b14eeb2000-06-11 02:42:07 +0000134
Fred Drakeefffe8e2000-10-29 05:10:30 +0000135\begin{memberdesc}[xmlparser]{ErrorCode}
Andrew M. Kuchling6b14eeb2000-06-11 02:42:07 +0000136Numeric code specifying the problem. This value can be passed to the
137\function{ErrorString()} function, or compared to one of the constants
Fred Drake7fbc85c2000-09-23 04:47:56 +0000138defined in the \module{errors} object.
Fred Drakeefffe8e2000-10-29 05:10:30 +0000139\end{memberdesc}
Andrew M. Kuchling6b14eeb2000-06-11 02:42:07 +0000140
Fred Drakeefffe8e2000-10-29 05:10:30 +0000141\begin{memberdesc}[xmlparser]{ErrorColumnNumber}
Andrew M. Kuchling6b14eeb2000-06-11 02:42:07 +0000142Column number at which an error occurred.
Fred Drakeefffe8e2000-10-29 05:10:30 +0000143\end{memberdesc}
Andrew M. Kuchling6b14eeb2000-06-11 02:42:07 +0000144
Fred Drakeefffe8e2000-10-29 05:10:30 +0000145\begin{memberdesc}[xmlparser]{ErrorLineNumber}
Andrew M. Kuchling6b14eeb2000-06-11 02:42:07 +0000146Line number at which an error occurred.
Fred Drakeefffe8e2000-10-29 05:10:30 +0000147\end{memberdesc}
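The error attributes can be read once \method{Parse()} has raised the
module's \exception{error} exception. The following sketch assumes a
current Python 3 interpreter; the malformed sample document and the
\code{report} dictionary are hypothetical.

```python
import xml.parsers.expat

# The <item> element is never closed, so the </doc> end tag is an error.
BROKEN = "<doc>\n  <item>text</doc>"

parser = xml.parsers.expat.ParserCreate()
try:
    parser.Parse(BROKEN, True)
except xml.parsers.expat.error:
    # The attributes describe the error that was just raised.
    report = {
        "message": xml.parsers.expat.ErrorString(parser.ErrorCode),
        "line": parser.ErrorLineNumber,
        "column": parser.ErrorColumnNumber,
        "byte": parser.ErrorByteIndex,
    }
else:
    report = None

print(report)
```

Passing \member{ErrorCode} through \function{ErrorString()} yields a
readable message, while the line, column and byte values locate the
problem in the input.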

Here is the list of handlers that can be set. To set a handler on an
\class{xmlparser} object \var{o}, use
\code{\var{o}.\var{handlername} = \var{func}}. \var{handlername} must
be taken from the following list, and \var{func} must be a callable
object accepting the correct number of arguments. The arguments are
all strings, unless otherwise stated.

\begin{methoddesc}[xmlparser]{StartElementHandler}{name, attributes}
Called for the start of every element. \var{name} is a string
containing the element name, and \var{attributes} is a dictionary
mapping attribute names to their values.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{EndElementHandler}{name}
Called for the end of every element.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{ProcessingInstructionHandler}{target, data}
Called for every processing instruction.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{CharacterDataHandler}{data}
Called for character data.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{UnparsedEntityDeclHandler}{entityName, base,
                                                         systemId, publicId,
                                                         notationName}
Called for unparsed (NDATA) entity declarations.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{NotationDeclHandler}{notationName, base,
                                                   systemId, publicId}
Called for notation declarations.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{StartNamespaceDeclHandler}{prefix, uri}
Called when an element contains a namespace declaration.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{EndNamespaceDeclHandler}{prefix}
Called when the closing tag is reached for an element
that contained a namespace declaration.
\end{methoddesc}
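The two namespace declaration handlers can be observed with the sketch
below, written for a current Python 3 interpreter. It rests on the
assumption that Expat reports namespace declarations only when
namespace processing is enabled, so a \var{namespace_separator} is
passed to \function{ParserCreate()}; the \code{events} list is
hypothetical.

```python
import xml.parsers.expat

events = []

# Assumption: declarations are only reported with namespace processing on.
parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
parser.StartNamespaceDeclHandler = (
    lambda prefix, uri: events.append(("start", prefix, uri)))
parser.EndNamespaceDeclHandler = (
    lambda prefix: events.append(("end", prefix)))

parser.Parse(
    '<root xmlns:py="http://www.python.org/ns/"><py:elem /></root>', True)

for event in events:
    print(event)
```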

\begin{methoddesc}[xmlparser]{CommentHandler}{data}
Called for comments.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{StartCdataSectionHandler}{}
Called at the start of a CDATA section.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{EndCdataSectionHandler}{}
Called at the end of a CDATA section.
\end{methoddesc}
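The comment and CDATA handlers can be combined with
\function{CharacterDataHandler()} to see where a CDATA section starts
and ends. This is a minimal sketch for a current Python 3 interpreter;
the \code{events} list and the sample document are hypothetical.

```python
import xml.parsers.expat

events = []
parser = xml.parsers.expat.ParserCreate()
parser.CommentHandler = lambda data: events.append(("comment", data))
# The CDATA handlers take no arguments; they only mark the boundaries.
parser.StartCdataSectionHandler = lambda: events.append(("cdata-start",))
parser.EndCdataSectionHandler = lambda: events.append(("cdata-end",))
# The content of the section is delivered as ordinary character data.
parser.CharacterDataHandler = lambda data: events.append(("text", data))

parser.Parse("<doc><!-- note --><![CDATA[a < b]]></doc>", True)

# Character data may be delivered in more than one call, so join it.
cdata_text = "".join(e[1] for e in events if e[0] == "text")
print(events)
```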

\begin{methoddesc}[xmlparser]{DefaultHandler}{data}
Called for any characters in the XML document for
which no applicable handler has been specified. This means
characters that are part of a construct which could be reported, but
for which no handler has been supplied.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{DefaultHandlerExpand}{data}
This is the same as the \function{DefaultHandler},
but doesn't inhibit expansion of internal entities.
The entity reference will not be passed to the default handler.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{NotStandaloneHandler}{}
Called if the XML document hasn't been declared as being a standalone
document.
\end{methoddesc}

\begin{methoddesc}[xmlparser]{ExternalEntityRefHandler}{context, base,
                                                        systemId, publicId}
Called for references to external entities.
\end{methoddesc}


\subsection{Example \label{expat-example}}

The following program defines three handlers that just print out their
arguments.

\begin{verbatim}
import xml.parsers.expat

# 3 handler functions
def start_element(name, attrs):
    print 'Start element:', name, attrs
def end_element(name):
    print 'End element:', name
def char_data(data):
    print 'Character data:', repr(data)

p = xml.parsers.expat.ParserCreate()

p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data

# isfinal is true because this is the only (and so the final) call
p.Parse("""<?xml version="1.0"?>
<parent id="top"><child1 name="paul">Text goes here</child1>
<child2 name="fred">More text</child2>
</parent>""", 1)
\end{verbatim}

The output from this program is:

\begin{verbatim}
Start element: parent {'id': 'top'}
Start element: child1 {'name': 'paul'}
Character data: 'Text goes here'
End element: child1
Character data: '\012'
Start element: child2 {'name': 'fred'}
Character data: 'More text'
End element: child2
Character data: '\012'
End element: parent
\end{verbatim}


\subsection{Expat error constants \label{expat-errors}}
\sectionauthor{A.M. Kuchling}{amk1@bigfoot.com}

The error constants are provided in the
\code{errors} object of the \module{xml.parsers.expat} module. These
constants are useful in interpreting some of the attributes of the
parser object after an error has occurred.

The \code{errors} object has the following attributes:

\begin{datadesc}{XML_ERROR_ASYNC_ENTITY}
\end{datadesc}

\begin{datadesc}{XML_ERROR_ATTRIBUTE_EXTERNAL_ENTITY_REF}
\end{datadesc}

\begin{datadesc}{XML_ERROR_BAD_CHAR_REF}
\end{datadesc}

\begin{datadesc}{XML_ERROR_BINARY_ENTITY_REF}
\end{datadesc}

\begin{datadesc}{XML_ERROR_DUPLICATE_ATTRIBUTE}
An attribute was used more than once in a start tag.
\end{datadesc}

\begin{datadesc}{XML_ERROR_INCORRECT_ENCODING}
\end{datadesc}

\begin{datadesc}{XML_ERROR_INVALID_TOKEN}
\end{datadesc}

\begin{datadesc}{XML_ERROR_JUNK_AFTER_DOC_ELEMENT}
Something other than whitespace occurred after the document element.
\end{datadesc}

\begin{datadesc}{XML_ERROR_MISPLACED_XML_PI}
\end{datadesc}

\begin{datadesc}{XML_ERROR_NO_ELEMENTS}
\end{datadesc}

\begin{datadesc}{XML_ERROR_NO_MEMORY}
Expat was not able to allocate memory internally.
\end{datadesc}

\begin{datadesc}{XML_ERROR_PARAM_ENTITY_REF}
\end{datadesc}

\begin{datadesc}{XML_ERROR_PARTIAL_CHAR}
\end{datadesc}

\begin{datadesc}{XML_ERROR_RECURSIVE_ENTITY_REF}
\end{datadesc}

\begin{datadesc}{XML_ERROR_SYNTAX}
Some unspecified syntax error was encountered.
\end{datadesc}

\begin{datadesc}{XML_ERROR_TAG_MISMATCH}
An end tag did not match the innermost open start tag.
\end{datadesc}

\begin{datadesc}{XML_ERROR_UNCLOSED_TOKEN}
\end{datadesc}

\begin{datadesc}{XML_ERROR_UNDEFINED_ENTITY}
A reference was made to an entity which was not defined.
\end{datadesc}

\begin{datadesc}{XML_ERROR_UNKNOWN_ENCODING}
The document encoding is not supported by Expat.
\end{datadesc}
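To see which constant corresponds to a given failure, the sketch below
parses a few malformed documents and compares the result with the
\code{errors} object. It assumes a current Python 3 interpreter, where
the constants hold message strings, so the comparison goes through
\function{ErrorString()} instead of using \member{ErrorCode} directly.
The helper name \code{error_message} and the sample documents are
hypothetical.

```python
import xml.parsers.expat
from xml.parsers.expat import errors


def error_message(document):
    """Return Expat's message for the document, or None if it parses."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)
    except xml.parsers.expat.error:
        return xml.parsers.expat.ErrorString(parser.ErrorCode)
    return None


# Hypothetical samples; the last one is well formed.
samples = [
    "<a></b>",
    "<a x='1' x='2'/>",
    "<a/><b/>",
    "<a>&missing;</a>",
    "<a/>",
]
results = {document: error_message(document) for document in samples}

for document, message in results.items():
    print(repr(document), "->", message)
```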