<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css">
<!--
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }-->


  </style>
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

<p></p>

<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modeled after the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# class library.</p>

<p>This tutorial presents the key points of this API, with working
examples using both C and the Python bindings.</p>

<p>Table of contents:</p>
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
    node</a></li>
  <li><a href="#Extracting1">Extracting information for the
    attributes</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entities substitution</a></li>
</ul>

<p></p>

<h2><a name="Introducti">Introduction: why a new API</a></h2>

<p>Libxml2's <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>: the parsing operation loads the document completely in
memory and exposes it as a tree of nodes that are all available at the same
time. This is very simple and quite powerful, but has the major limitation
that the size of the document that can be handled is bounded by the size of
the memory available. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and
SAX is not formally defined for C. SAX basically works by registering
callbacks which are called directly by the parser as it progresses through
the document stream. The problem is that this programming model is
relatively complex, is not well standardized, cannot provide validation
directly, and makes entity, namespace and base processing relatively hard.</p>

<p>The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user code keeps control of the progress and simply calls a
Read() function repeatedly to move to each node in sequence, in document
order. There is direct support for namespaces, xml:base and entity handling,
and adding DTD validation on top of it was relatively simple. This API is
really close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and
powerful API than the existing SAX. Moreover, integrating extension features
based on the tree seems relatively easy.</p>

<p>In a nutshell, the XmlTextReader API provides a simpler, more standard and
more extensible interface for handling large documents than the existing SAX
version.</p>

<h2><a name="Walking">Walking a simple tree</a></h2>

<p>Basically the XmlTextReader API is a forward only tree walking interface.
The basic steps are:</p>
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>

<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

/* called once for each node the reader stops on */
void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

/* returns 0 on success and -1 if the file cannot be opened or parsed */
int streamFile(const char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
            return -1;
        }
    } else {
        printf("Unable to open %s\n", filename);
        return -1;
    }
    return 0;
}</pre>

<p>A few things to notice:</p>
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop</li>
  <li>that a negative return value means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>

<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except Exception:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % (filename))
</pre>

<p>The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a>, with the same method names (but the
properties are currently accessed with methods), and that one doesn't need to
free the reader at the end of the processing: it will get garbage collected
once all references have disappeared.</p>
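<p>The same loop can also be packaged as a Python generator, so that the
calling code becomes a plain <code>for</code> loop. This is only a sketch:
<code>iterNodes</code> and <code>ReaderError</code> are names made up for
this illustration and are not part of the bindings, and the only assumption
about the reader is the <code>Read()</code> method used above.</p>
<pre>class ReaderError(Exception):
    """Hypothetical exception for this sketch, not part of libxml2."""

def iterNodes(reader):
    """Yield the reader once for every node, in document order."""
    ret = reader.Read()
    while ret == 1:
        yield reader
        ret = reader.Read()
    # 0 means the end of the document was reached, anything else is an error
    if ret != 0:
        raise ReaderError("failed to parse")
</pre>
<p>With a reader obtained from <code>libxml2.newTextReaderFilename()</code>,
the loop then reads <code>for node in iterNodes(reader):
processNode(node)</code>. The object yielded is always the same reader, so it
has to be queried before the loop advances.</p>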

<h2><a name="Extracting">Extracting information for the current node</a></h2>

<p>So far the example code did not indicate how information is extracted
from the reader; it was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
<em>Property</em> is available at the C level as a function taking a single
xmlTextReaderPtr argument whose name is
<code>xmlTextReader</code><em>Property</em>. If the return type is an
<code>xmlChar *</code> string then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method on the reader class that can be called on the
instance. The list of the properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
  <li><em>NodeType</em>: the node type, 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.</li>
  <li><em>Name</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  <li><em>LocalName</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li><em>Prefix</em>: a shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>NamespaceUri</em>: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>BaseUri</em>: the base URI of the node. See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li><em>Depth</em>: the depth of the node in the tree, starting at 0 for
    the root node.</li>
  <li><em>HasAttributes</em>: whether the node has attributes.</li>
  <li><em>HasValue</em>: whether the node can have a text value.</li>
  <li><em>Value</em>: provides the text value of the node if present.</li>
  <li><em>IsDefault</em>: whether an Attribute node was generated from the
    default value defined in the DTD or schema (<em>not supported
    yet</em>).</li>
  <li><em>XmlLang</em>: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li><em>IsEmptyElement</em>: checks if the current node is empty. This is a
    bit bizarre in the sense that <code>&lt;a/&gt;</code> will be considered
    empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  <li><em>AttributeCount</em>: provides the number of attributes of the
    current node.</li>
</ul>
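<p>Since <em>NodeType</em> is reported as a bare number, a small lookup table
can make debugging output easier to read. The sketch below only restates the
values listed above; the labels and the <code>nodeTypeName</code> helper are
chosen for this illustration and are not provided by the bindings.</p>
<pre>NODE_TYPE_NAMES = {
    1: "start element",
    2: "attribute",
    3: "text",
    4: "CData section",
    5: "entity reference",
    6: "entity declaration",
    7: "processing instruction",
    8: "comment",
    9: "document",
    10: "DTD/Doctype",
    11: "document fragment",
    12: "notation",
    15: "end of element",
}

def nodeTypeName(nodeType):
    """Return a readable label for a NodeType value."""
    return NODE_TYPE_NAMES.get(nodeType, "unknown (%d)" % nodeType)
</pre>
<p>For instance <code>nodeTypeName(reader.NodeType())</code> could replace
the raw number when tracing the nodes visited.</p>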

<p>Let's look first at a small example to put this in practice by redefining
the processNode() function in the Python example:</p>
<pre>def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))</pre>

<p>and look at the result of calling streamFile("tst.xml") for various
contents of the XML test file.</p>

<p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
<pre>0 1 doc 1</pre>

<p>Only one node is found: its depth is 0, type 1 indicates an element start,
its name is "doc" and it is empty. Trying now with
"<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
<pre>0 1 doc 0
0 15 doc 0</pre>

<p>The document root node is not flagged as empty anymore and both a start
and an end of element are detected. The following document shows how
character data are reported:</p>
<pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
&lt;c/&gt;&lt;/doc&gt;</pre>

<p>We modify the processNode() function to also report the node Value:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))</pre>

<p>The result of the test is:</p>
<pre>0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0

1 1 c 1 None
0 15 doc 0 None</pre>

<p>There are a few things to note:</p>
<ul>
  <li>the increase of the depth value (first column) as children nodes are
    explored</li>
  <li>the text node child of the b element, of type 3, and its content</li>
  <li>the text node containing the line return between elements b and c</li>
  <li>that elements have the Value None (or NULL in C)</li>
</ul>
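<p>Text nodes that contain only layout, like the line return between b and c
above, are often of no interest to the application. A possible filter is
sketched below; <code>isBlankText</code> is a name invented for this example,
and it assumes such nodes are reported with type 3 as in the output above.</p>
<pre>def isBlankText(reader):
    """True if the current node is a text node made only of whitespace."""
    if reader.NodeType() != 3:
        return False
    value = reader.Value()
    return value is None or value.strip() == ""

def processNode(reader):
    # skip the layout-only text nodes, report everything else
    if isBlankText(reader):
        return
    print("%d %d %s %s" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.Value()))
</pre>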

<p>The equivalent routine for <code>processNode()</code> as used by
<code>xmllint --stream --debug</code> is the following and can be found in
the xmllint.c module in the source distribution:</p>
<pre>static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
            xmlTextReaderDepth(reader),
            xmlTextReaderNodeType(reader),
            name,
            xmlTextReaderIsEmptyElement(reader));
    xmlFree(name);
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}</pre>

<h2><a name="Extracting1">Extracting information for the attributes</a></h2>

<p>The previous examples don't indicate how attributes are processed. The
simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
result:</p>
<pre>0 1 doc 1 None</pre>

<p>This proves that attribute nodes are not traversed by default. The
<em>HasAttributes</em> property allows detecting their presence. To check
their content the API has special instructions. Basically two kinds of
operations are possible:</p>
<ol>
  <li>to move the reader to the attribute nodes of the current element, in
    that case the cursor is positioned on the attribute node</li>
  <li>to directly query the element node for the attribute value</li>
</ol>

<p>In both cases the attribute can be designated either by its position in
the list of attributes (<em>MoveToAttributeNo</em> or
<em>GetAttributeNo</em>) or by its name (and namespace):</p>
<ul>
  <li><em>GetAttributeNo</em>(no): provides the value of the attribute with
    the specified index no relative to the containing element.</li>
  <li><em>GetAttribute</em>(name): provides the value of the attribute with
    the specified qualified name.</li>
  <li><em>GetAttributeNs</em>(localName, namespaceURI): provides the value of
    the attribute with the specified local name and namespace URI.</li>
  <li><em>MoveToAttributeNo</em>(no): moves the position of the current
    instance to the attribute with the specified index relative to the
    containing element.</li>
  <li><em>MoveToAttribute</em>(name): moves the position of the current
    instance to the attribute with the specified qualified name.</li>
  <li><em>MoveToAttributeNs</em>(localName, namespaceURI): moves the position
    of the current instance to the attribute with the specified local name
    and namespace URI.</li>
  <li><em>MoveToFirstAttribute</em>: moves the position of the current
    instance to the first attribute associated with the current node.</li>
  <li><em>MoveToNextAttribute</em>: moves the position of the current
    instance to the next attribute associated with the current node.</li>
  <li><em>MoveToElement</em>: moves the position of the current instance to
    the node that contains the current Attribute node.</li>
</ul>
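<p>As an illustration of the second kind of operation, the sketch below
queries an attribute directly from the element without moving the cursor. It
assumes that <code>GetAttribute()</code> returns <code>None</code> when the
attribute is absent; <code>getAttributeOr</code> is a hypothetical helper
name, not part of the bindings.</p>
<pre>def getAttributeOr(reader, name, default):
    """Value of the named attribute on the current element, or default."""
    if reader.NodeType() != 1:      # only elements carry attributes
        return default
    value = reader.GetAttribute(name)
    if value is None:
        return default
    return value
</pre>
<p>On the "<code>&lt;doc a="b"/&gt;</code>" test document,
<code>getAttributeOr(reader, "a", "")</code> called on the doc element would
be expected to give <code>b</code>, and the reader stays positioned on the
element.</p>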

<p>After modifying the processNode() function to show attributes:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        # stop on 0 (no more attributes) as well as on -1 (error)
        while reader.MoveToNextAttribute() == 1:
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))</pre>

<p>the output for the same input document reflects the attribute:</p>
<pre>0 1 doc 1 None
-- 1 2 (a) [b]</pre>

<p>There are a couple of things to note on the attribute processing:</p>
<ul>
  <li>their depth is the one of the carrying element plus one</li>
  <li>namespace declarations are seen as attributes, like in DOM</li>
</ul>
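<p>When all the attributes are needed at once, the move operations can be
combined to collect them in a dictionary. This is a sketch under the
assumption that the move methods return 1 on success;
<code>collectAttributes</code> is a name made up for the example.
<em>MoveToElement</em> is called at the end so that the cursor is back on the
element for the rest of the processing.</p>
<pre>def collectAttributes(reader):
    """Return the attributes of the current element as a name to value dict."""
    attrs = {}
    if reader.NodeType() != 1:      # not an element, nothing to collect
        return attrs
    while reader.MoveToNextAttribute() == 1:
        attrs[reader.Name()] = reader.Value()
    # go back to the element carrying the attributes
    reader.MoveToElement()
    return attrs
</pre>
<p>As noted above, namespace declarations would show up in the result like
any other attribute.</p>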

<h2><a name="Validating">Validating a document</a></h2>

<p>The libxml2 implementation adds some extra features on top of the
XmlTextReader API; the main one is the ability to DTD validate the parsed
document progressively. This is simply the activation of the associated
feature of the parser used by the reader structure. There are a few options
available, defined as the enum xmlParserProperties in the libxml/xmlreader.h
header file:</p>
<ul>
  <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
  <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
    loading the DTD)</li>
  <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
    the DTD)</li>
  <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly, entity
    reference nodes are not generated and are replaced by their expanded
    content.</li>
  <li>more settings might be added, those were the ones available at the
    2.5.0 release...</li>
</ul>

<p>The GetParserProp() and SetParserProp() methods can then be used to get
and set the values of those parser properties of the reader. For example:</p>
<pre>def parseAndValidate(file):
    reader = libxml2.newTextReaderFilename(file)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % (file))</pre>

<p>This routine will parse and validate the file. Error messages can be
captured by registering an error handler. See python/tests/reader2.py for
more complete Python examples. At the C level the equivalent call to activate
the validation feature is just:</p>
<pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>

<p>and a return value of 0 indicates success.</p>
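<p>Several properties can be switched on in one go before the first call to
Read(). The helper below is only a sketch: <code>enableParserProps</code> is
a hypothetical name, and it assumes that the Python SetParserProp() method
hands back the same return value as the C function, 0 meaning success.</p>
<pre>def enableParserProps(reader, props):
    """Activate each parser property, return the list of those refused."""
    refused = []
    for prop in props:
        if reader.SetParserProp(prop, 1) != 0:
            refused.append(prop)
    return refused
</pre>
<p>A call such as <code>enableParserProps(reader,
[libxml2.PARSER_VALIDATE])</code> gives back an empty list when the setting
was accepted.</p>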

<h2><a name="Entities">Entities substitution</a></h2>

<p> </p>

<p><a href="mailto:veillard@redhat.com">Daniel Veillard</a></p>

<p>$Id$</p>

<p></p>
</body>
</html>