<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css">
<!--
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }-->
  </style>
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

<p></p>

<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modeled after the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# language.</p>

<p>This tutorial presents the key points of this API, with working
examples using both C and the Python bindings.</p>

<p>Table of contents:</p>
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
    node</a></li>
  <li><a href="#Extracting1">Extracting information for the
    attributes</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entity substitution</a></li>
  <li><a href="#L1142">Relax-NG Validation</a></li>
  <li><a href="#Mixing">Mixing the reader and tree or XPath
    operations</a></li>
</ul>

<p></p>

<h2><a name="Introducti">Introduction: why a new API</a></h2>

<p>Libxml2's <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>: the parsing operation loads the document completely in memory
and exposes it as a tree of nodes that are all available at the same time.
This is very simple and quite powerful, but has the major limitation that the
size of the document that can be handled is limited by the size of the memory
available. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and SAX
is also not formally defined for C. SAX basically works by registering
callbacks which are called directly by the parser as it progresses through the
document stream. The problem is that this programming model is relatively
complex and not well standardized, cannot provide validation directly, and
makes entity, namespace and base processing relatively hard.</p>

<p>The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user code keeps control of the progress and simply calls a Read()
function repeatedly to move to each node in sequence, in document order.
There is direct support for namespaces, xml:base and entity handling, and
adding DTD validation on top of it was relatively simple. This API is really
close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and powerful
API than the existing SAX. Moreover, integrating extension features based on
the tree seems relatively easy.</p>

<p>In a nutshell, the XmlTextReader API provides a simpler, more standard and
more extensible interface to handle large documents than the existing SAX
version.</p>

<h2><a name="Walking">Walking a simple tree</a></h2>

<p>Basically, the XmlTextReader API is a forward-only tree walking interface.
The basic steps are:</p>
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>

<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

/* returns 0 on success, -1 if the file cannot be opened or parsed */
int streamFile(char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
        }
    } else {
        printf("Unable to open %s\n", filename);
        ret = -1;
    }
    return ret;
}</pre>

<p>A few things to notice:</p>
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop</li>
  <li>that a negative return value means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>

<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % (filename))</pre>

<p>The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a> with the same method names (but the
properties are currently accessed with methods) and that one doesn't need to
free the reader at the end of the processing: it will get garbage collected
once all references have disappeared.</p>

<h2><a name="Extracting">Extracting information for the current node</a></h2>

<p>So far the example code did not indicate how information is extracted
from the reader; it was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
<em>Property</em> is available at the C level as a function taking a single
xmlTextReaderPtr argument whose name is
<code>xmlTextReader</code><em>Property</em>. If the return type is an
<code>xmlChar *</code> string, then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method on the reader class that can be called on the
instance. The list of the properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
  <li><em>NodeType</em>: The node type, 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.</li>
  <li><em>Name</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  <li><em>LocalName</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li><em>Prefix</em>: a shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>NamespaceUri</em>: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>BaseUri</em>: the base URI of the node. See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li><em>Depth</em>: the depth of the node in the tree, starting at 0 for
    the root node.</li>
  <li><em>HasAttributes</em>: whether the node has attributes.</li>
  <li><em>HasValue</em>: whether the node can have a text value.</li>
  <li><em>Value</em>: provides the text value of the node if present.</li>
  <li><em>IsDefault</em>: whether an Attribute node was generated from the
    default value defined in the DTD or schema (<em>not supported
    yet</em>).</li>
  <li><em>XmlLang</em>: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li><em>IsEmptyElement</em>: checks whether the current node is empty. This
    is a bit bizarre in the sense that <code>&lt;a/&gt;</code> will be
    considered empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  <li><em>AttributeCount</em>: provides the number of attributes of the
    current node.</li>
</ul>
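
<p>Since <em>NodeType</em> is reported as a bare number, a small lookup table
can make debugging output easier to read. The following is only a sketch: the
codes are the ones listed above, while the labels and the
<code>node_type_name()</code> helper are hypothetical names chosen for this
illustration and are not part of the libxml2 bindings.</p>

```python
# Node type codes as listed in this tutorial. The labels are descriptive
# strings chosen for this sketch, not identifiers from the libxml2 bindings.
NODE_TYPE_NAMES = {
    1: "element start",
    2: "attribute",
    3: "text",
    4: "CDATA section",
    5: "entity reference",
    6: "entity declaration",
    7: "processing instruction",
    8: "comment",
    9: "document",
    10: "DTD/doctype",
    11: "document fragment",
    12: "notation",
    15: "element end",
}


def node_type_name(code):
    """Return a readable label for a NodeType() value (hypothetical helper)."""
    return NODE_TYPE_NAMES.get(code, "unknown (%d)" % code)


# Print the labels for the three codes that appear most often in the examples.
for code in (1, 3, 15):
    print("%d: %s" % (code, node_type_name(code)))
```

<p>Inside a processNode() routine, one could then print
<code>node_type_name(reader.NodeType())</code> instead of the raw number.</p>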

<p>Let's first look at a small example to put this into practice by redefining
the processNode() function in the Python example:</p>
<pre>def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))</pre>

<p>and look at the result of calling streamFile("tst.xml") for various
contents of the XML test file.</p>

<p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
<pre>0 1 doc 1</pre>

<p>Only one node is found: its depth is 0, type 1 indicates an element start,
its name is "doc" and it is empty. Trying now with
"<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
<pre>0 1 doc 0
0 15 doc 0</pre>

<p>The document root node is not flagged as empty anymore and both a start
and an end of element are detected. The following document shows how
character data are reported:</p>
<pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
&lt;c/&gt;&lt;/doc&gt;</pre>

<p>We now modify the processNode() function to also report the node Value:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))</pre>

<p>The result of the test is:</p>
<pre>0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0 

1 1 c 1 None
0 15 doc 0 None</pre>

<p>There are a few things to note:</p>
<ul>
  <li>the increase of the depth value (first column) as children nodes are
    explored</li>
  <li>the text node child of the b element, of type 3, and its content</li>
  <li>the text node containing the line return between elements b and c</li>
  <li>that elements have the Value None (or NULL in C)</li>
</ul>
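
<p>To see how the depth and node type together describe the nesting, the rows
of the output above can be replayed without any parser. This is only an
illustrative sketch: the rows are copied from the sample output (without the
Value column), and the <code>outline()</code> helper and its labels are
hypothetical names introduced here.</p>

```python
# Rows copied from the sample output above:
# (Depth, NodeType, Name, IsEmptyElement).
ROWS = [
    (0, 1, "doc", 0),
    (1, 1, "a", 1),
    (1, 1, "b", 0),
    (2, 3, "#text", 0),
    (1, 15, "b", 0),
    (1, 3, "#text", 0),
    (1, 1, "c", 1),
    (0, 15, "doc", 0),
]

ELEMENT, END_ELEMENT = 1, 15


def outline(rows):
    """Return one indented line per reported node (hypothetical helper)."""
    lines = []
    for depth, node_type, name, is_empty in rows:
        if node_type == ELEMENT:
            # An empty element such as a or c gets no matching end node.
            label = "empty element " + name if is_empty else "start of " + name
        elif node_type == END_ELEMENT:
            label = "end of " + name
        else:
            label = name
        # Indent by two spaces per level of depth.
        lines.append("  " * depth + label)
    return lines


for line in outline(ROWS):
    print(line)
```

<p>Note that the empty elements a and c produce a single row each, while b
and doc produce both a start and an end row.</p>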

<p>The equivalent routine for <code>processNode()</code> as used by
<code>xmllint --stream --debug</code> is the following and can be found in
the xmllint.c module in the source distribution:</p>
<pre>static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
            xmlTextReaderDepth(reader),
            xmlTextReaderNodeType(reader),
            name,
            xmlTextReaderIsEmptyElement(reader));
    xmlFree(name);
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}</pre>

<h2><a name="Extracting1">Extracting information for the attributes</a></h2>

<p>The previous examples don't indicate how attributes are processed. The
simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
result:</p>
<pre>0 1 doc 1 None</pre>

<p>This proves that attribute nodes are not traversed by default. The
<em>HasAttributes</em> property makes it possible to detect their presence.
To check their content the API has special instructions. Basically two kinds
of operations are possible:</p>
<ol>
  <li>to move the reader to the attribute nodes of the current element, in
    which case the cursor is positioned on the attribute node</li>
  <li>to directly query the element node for the attribute value</li>
</ol>

<p>In both cases the attribute can be designated either by its position in the
list of attributes (<em>MoveToAttributeNo</em> or <em>GetAttributeNo</em>) or
by its name (and namespace):</p>
<ul>
  <li><em>GetAttributeNo</em>(no): provides the value of the attribute with
    the specified index no relative to the containing element.</li>
  <li><em>GetAttribute</em>(name): provides the value of the attribute with
    the specified qualified name.</li>
  <li><em>GetAttributeNs</em>(localName, namespaceURI): provides the value of
    the attribute with the specified local name and namespace URI.</li>
  <li><em>MoveToAttributeNo</em>(no): moves the position of the current
    instance to the attribute with the specified index relative to the
    containing element.</li>
  <li><em>MoveToAttribute</em>(name): moves the position of the current
    instance to the attribute with the specified qualified name.</li>
  <li><em>MoveToAttributeNs</em>(localName, namespaceURI): moves the position
    of the current instance to the attribute with the specified local name
    and namespace URI.</li>
  <li><em>MoveToFirstAttribute</em>: moves the position of the current
    instance to the first attribute associated with the current node.</li>
  <li><em>MoveToNextAttribute</em>: moves the position of the current
    instance to the next attribute associated with the current node.</li>
  <li><em>MoveToElement</em>: moves the position of the current instance to
    the node that contains the current Attribute node.</li>
</ul>

<p>After modifying the processNode() function to show attributes:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        while reader.MoveToNextAttribute():
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))</pre>

<p>the output for the same input document reflects the attribute:</p>
<pre>0 1 doc 1 None
-- 1 2 (a) [b]</pre>

<p>There are a couple of things to note on the attribute processing:</p>
<ul>
  <li>their depth is the one of the carrying element plus one</li>
  <li>namespace declarations are seen as attributes like in DOM</li>
</ul>

<h2><a name="Validating">Validating a document</a></h2>

<p>The libxml2 implementation adds some extra features on top of the
XmlTextReader API; the main one is the ability to validate the parsed document
against its DTD progressively. This is simply the activation of the associated
feature of the parser used by the reader structure. There are a few options
available, defined as the enum xmlParserProperties in the libxml/xmlreader.h
header file:</p>
<ul>
  <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
  <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
    loading the DTD)</li>
  <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
    the DTD)</li>
  <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly; entity
    reference nodes are not generated and are replaced by their expanded
    content.</li>
  <li>more settings might be added; those were the ones available at the 2.5.0
    release...</li>
</ul>

<p>The GetParserProp() and SetParserProp() methods can then be used to get
and set the values of those parser properties of the reader. For example:</p>
<pre>def parseAndValidate(file):
    reader = libxml2.newTextReaderFilename(file)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % (file))</pre>

<p>This routine will parse and validate the file. Error messages can be
captured by registering an error handler. See python/tests/reader2.py for
more complete Python examples. At the C level the equivalent call to activate
the validation feature is just:</p>
<pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>

<p>and a return value of 0 indicates success.</p>

<h2><a name="Entities">Entity substitution</a></h2>

<p>By default the xmlReader will report entities as such and not replace them
with their content. This default behaviour can however be overridden using:</p>

<p><code>reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)</code></p>

<h2><a name="L1142">Relax-NG Validation</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>Libxml2 can now validate the document being read by the xmlReader against
Relax-NG schemas. While the Relax NG validator can't always work in a
streamable mode, only subsets which cannot be reduced to regular expressions
need to have their subtree expanded for validation. In practice it means
that, as long as the schema for the top level element content can be expressed
as a regexp, only chunks of the document need to be parsed while
validating.</p>

<p>The steps to do so are:</p>
<ul>
  <li>create a reader working on a document as usual</li>
  <li>before any call to Read(), associate it with a Relax NG schema, either
    a preparsed schema or the URL of the schema to use</li>
  <li>errors will be reported the usual way, and the validity status can be
    obtained using the IsValid() interface of the reader like for DTDs.</li>
</ul>

<p>Example, assuming the reader has already been created and that the schema
string contains the Relax-NG schema:</p>

<pre>rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
rngs = rngp.relaxNGParse()
reader.RelaxNGSetSchema(rngs)
ret = reader.Read()
while ret == 1:
    ret = reader.Read()
if ret != 0:
    print("Error parsing the document")
if reader.IsValid() != 1:
    print("Document failed to validate")</pre>

<p>See <code>reader6.py</code> in the sources or documentation for a complete
example.</p>

<h2><a name="Mixing">Mixing the reader and tree or XPath operations</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>While the reader is a streaming interface, its underlying implementation
is based on the DOM builder of libxml2. As a result it is relatively simple
to mix operations based on both models under some constraints. To do so the
reader has an Expand() operation which grows the subtree under the
current node. It returns a pointer to a standard node which can be manipulated
in the usual ways. The node will get all its ancestors and the full subtree
available. Usual operations like XPath queries can be used on that reduced
view of the document. Here is an example extracted from reader5.py in the
sources which extracts and prints the bibliography for the "Dragon" compiler
book from the XML 1.0 recommendation:</p>
<pre>import libxml2

f = open('../../test/valid/REC-xml-19980210.xml')
input = libxml2.inputBuffer(f)
reader = input.newTextReader("REC")
res=""
while reader.Read() == 1:                 # stop at the end or on error
    while reader.Name() == 'bibl':
        node = reader.Expand()            # expand the subtree
        if node.xpathEval("@id = 'Aho'"): # use XPath on it
            res = res + node.serialize()
        if reader.Next() != 1:            # skip the subtree
            break</pre>
464
465<p>Note however that the node instance returned by the Expand() call is only
466valid until the next Read() operation. The Expand() operation does not
467affects the Read() ones, however usually once processed the full subtree is
468not useful anymore, and the Next() operation allows to skip it completely and
469process to the successor or return 0 if the document end is reached. </p>
Daniel Veillard66b82892003-01-04 00:44:13 +0000470
471<p><a href="mailto:veillard@redhat.com">Daniel Veillard</a></p>
472
473<p>$Id$</p>
474
475<p></p>
476</body>
477</html>