<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css">
<!--
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }-->
  </style>
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modeled after the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# language.</p>

<p>This tutorial presents the key points of this API, with working
examples using both C and the Python bindings.</p>

<p>Table of contents:</p>
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
    node</a></li>
  <li><a href="#Extracting1">Extracting information for the
    attributes</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entities substitution</a></li>
  <li><a href="#L1142">Relax-NG Validation</a></li>
  <li><a href="#Mixing">Mixing the reader and tree or XPath
    operations</a></li>
</ul>

<h2><a name="Introducti">Introduction: why a new API</a></h2>

<p>Libxml2's <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>: the parsing operation results in a document loaded
completely in memory and exposed as a tree of nodes all available at the
same time. This is very simple and quite powerful, but has the major
limitation that the size of the document that can be handled is limited by
the size of the memory available. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a>-based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and SAX
is also not formally defined for C. SAX basically works by registering
callbacks which are called directly by the parser as it progresses through the
document stream. The problem is that this programming model is relatively
complex, is not well standardized, cannot provide validation directly, and
makes entity, namespace and base processing relatively hard.</p>

<p>The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user code keeps control of the progress and simply calls a
Read() function repeatedly to progress to each node in sequence in document
order. There is direct support for namespaces, xml:base and entity handling,
and adding DTD validation on top of it was relatively simple. This API is
really close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and powerful
API than the existing SAX. Moreover, integrating extension features based on
the tree seems relatively easy.</p>

<p>In a nutshell, the XmlTextReader API provides a simpler, more standard and
more extensible interface to handle large documents than the existing SAX
version.</p>

<h2><a name="Walking">Walking a simple tree</a></h2>

<p>Basically the XmlTextReader API is a forward-only tree walking interface.
The basic steps are:</p>
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>

<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

void streamFile(const char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
        }
    } else {
        printf("Unable to open %s\n", filename);
    }
}</pre>

<p>A few things to notice:</p>
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop</li>
  <li>that a negative return value means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>

<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % (filename))</pre>

<p>The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a> with the same method names (but the
properties are currently accessed with methods) and that one doesn't need to
free the reader at the end of the processing: it will get garbage collected
once all references have disappeared.</p>

<h2><a name="Extracting">Extracting information for the current node</a></h2>

<p>So far the example code did not indicate how information was extracted
from the reader; it was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
<em>Property</em> is available at the C level as a function taking a single
xmlTextReaderPtr argument whose name is
<code>xmlTextReader</code><em>Property</em>. If the return type is an
<code>xmlChar *</code> string then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method on the reader class that can be called on the
instance. The list of the properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
  <li><em>NodeType</em>: the node type, 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.</li>
  <li><em>Name</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  <li><em>LocalName</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li><em>Prefix</em>: a shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>NamespaceUri</em>: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>BaseUri</em>: the base URI of the node. See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li><em>Depth</em>: the depth of the node in the tree, starting at 0 for the
    root node.</li>
  <li><em>HasAttributes</em>: whether the node has attributes.</li>
  <li><em>HasValue</em>: whether the node can have a text value.</li>
  <li><em>Value</em>: provides the text value of the node if present.</li>
  <li><em>IsDefault</em>: whether an Attribute node was generated from the
    default value defined in the DTD or schema (<em>not supported
    yet</em>).</li>
  <li><em>XmlLang</em>: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li><em>IsEmptyElement</em>: checks if the current node is empty; this is a
    bit bizarre in the sense that <code>&lt;a/&gt;</code> will be considered
    empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  <li><em>AttributeCount</em>: provides the number of attributes of the
    current node.</li>
</ul>
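
<p>Since NodeType is reported as a bare integer, a small lookup table can make
debugging output easier to read. The following is only a sketch: the codes are
the ones listed above, while the labels and the describeNodeType() and
describeNode() helper names are made up for this illustration and are not part
of libxml2. The <code>reader</code> argument is assumed to be a reader
instance from the Python bindings.</p>

```python
# Node type codes as listed in this tutorial. The labels are informal names
# chosen for this sketch, not identifiers defined by libxml2.
NODE_TYPE_LABELS = {
    1: "element start",
    2: "attribute",
    3: "text",
    4: "CDATA section",
    5: "entity reference",
    6: "entity declaration",
    7: "processing instruction",
    8: "comment",
    9: "document",
    10: "DTD/Doctype",
    11: "document fragment",
    12: "notation",
    15: "element end",
}

def describeNodeType(code):
    """Return a readable label for a NodeType() value, or a fallback."""
    return NODE_TYPE_LABELS.get(code, "unknown (%d)" % code)

def describeNode(reader):
    """Summarize the node the reader is currently stopped on,
    indented by two spaces per level of Depth()."""
    return "%s%s [%s]" % ("  " * reader.Depth(), reader.Name(),
                          describeNodeType(reader.NodeType()))
```

<p>Calling <code>print(describeNode(reader))</code> from processNode() then
gives an indented outline of the document as it streams by.</p>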

<p>Let's look first at a small example to put this into practice by redefining
the processNode() function in the Python example:</p>
<pre>def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))</pre>

<p>and look at the result of calling streamFile("tst.xml") for various
contents of the XML test file.</p>

<p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
<pre>0 1 doc 1</pre>

<p>Only one node is found: its depth is 0, type 1 indicates an element start,
its name is "doc" and it is empty. Trying now with
"<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
<pre>0 1 doc 0
0 15 doc 0</pre>

<p>The document root node is not flagged as empty anymore and both a start
and an end of element are detected. The following document shows how
character data are reported:</p>
<pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
&lt;c/&gt;&lt;/doc&gt;</pre>

<p>We modify the processNode() function to also report the node Value:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))</pre>

<p>The result of the test is:</p>
<pre>0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0 

1 1 c 1 None
0 15 doc 0 None</pre>

<p>There are a few things to note:</p>
<ul>
  <li>the increase of the depth value (first column) as children nodes are
    explored</li>
  <li>the text node child of the b element, of type 3, and its content</li>
  <li>the text node containing the line return between elements b and c</li>
  <li>that elements have the Value None (or NULL in C)</li>
</ul>
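
<p>Text nodes that only hold layout whitespace, like the line return between
b and c above, are often not interesting to the application. A minimal sketch
of a filter follows; isBlankText() is a hypothetical helper name, and the
extra codes 13 and 14 (Whitespace and SignificantWhitespace in the C#
XmlNodeType enumeration) are accepted on the assumption that some libxml2
versions may report whitespace-only text with them rather than with type
3.</p>

```python
# Type 3 is the text node code from the list above. 13 and 14 are the
# Whitespace and SignificantWhitespace codes of the C# enumeration,
# accepted here as an assumption about other libxml2 versions.
TEXT_LIKE_TYPES = (3, 13, 14)

def isBlankText(reader):
    """True if the current node is a text node holding only whitespace."""
    if reader.NodeType() not in TEXT_LIKE_TYPES:
        return False
    value = reader.Value()
    return value is None or value.strip() == ""

def processNode(reader):
    # Skip layout-only text nodes, report everything else as before.
    if isBlankText(reader):
        return
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
```

<p>With this version of processNode() the whitespace-only row disappears from
the output while the "some text" row is still reported.</p>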

<p>The equivalent routine for <code>processNode()</code> as used by
<code>xmllint --stream --debug</code> is the following and can be found in
the xmllint.c module in the source distribution:</p>
<pre>static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
            xmlTextReaderDepth(reader),
            xmlTextReaderNodeType(reader),
            name,
            xmlTextReaderIsEmptyElement(reader));
    xmlFree(name);
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}</pre>

<h2><a name="Extracting1">Extracting information for the attributes</a></h2>

<p>The previous examples don't indicate how attributes are processed. The
simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
result:</p>
<pre>0 1 doc 1 None</pre>

<p>This proves that attribute nodes are not traversed by default. The
<em>HasAttributes</em> property allows detecting their presence. To check
their content the API has special instructions. Basically two kinds of
operations are possible:</p>
<ol>
  <li>to move the reader to the attribute nodes of the current element, in
    which case the cursor is positioned on the attribute node</li>
  <li>to directly query the element node for the attribute value</li>
</ol>

<p>In both cases the attribute can be designated either by its position in the
list of attributes (<em>MoveToAttributeNo</em> or <em>GetAttributeNo</em>) or
by its name (and namespace):</p>
<ul>
  <li><em>GetAttributeNo</em>(no): provides the value of the attribute with
    the specified index no relative to the containing element.</li>
  <li><em>GetAttribute</em>(name): provides the value of the attribute with
    the specified qualified name.</li>
  <li><em>GetAttributeNs</em>(localName, namespaceURI): provides the value of
    the attribute with the specified local name and namespace URI.</li>
  <li><em>MoveToAttributeNo</em>(no): moves the position of the current
    instance to the attribute with the specified index relative to the
    containing element.</li>
  <li><em>MoveToAttribute</em>(name): moves the position of the current
    instance to the attribute with the specified qualified name.</li>
  <li><em>MoveToAttributeNs</em>(localName, namespaceURI): moves the position
    of the current instance to the attribute with the specified local name
    and namespace URI.</li>
  <li><em>MoveToFirstAttribute</em>: moves the position of the current
    instance to the first attribute associated with the current node.</li>
  <li><em>MoveToNextAttribute</em>: moves the position of the current
    instance to the next attribute associated with the current node.</li>
  <li><em>MoveToElement</em>: moves the position of the current instance to
    the node that contains the current Attribute node.</li>
</ul>
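
<p>The second kind of operation, querying the element directly, leaves the
cursor where it is. Here is a minimal sketch using GetAttribute();
attributeOrDefault() is a hypothetical helper name, and the code assumes that
the Python binding returns None when the attribute is absent, in the same way
Value() returns None for elements.</p>

```python
def attributeOrDefault(reader, name, default=None):
    """Value of the named attribute on the current element, else default.

    The reader stays positioned on the element: nothing is traversed.
    """
    # Only element start nodes (type 1) carry attributes.
    if reader.NodeType() != 1 or not reader.HasAttributes():
        return default
    value = reader.GetAttribute(name)
    if value is None:          # assumed result for a missing attribute
        return default
    return value
```

<p>For the test document "<code>&lt;doc a="b"/&gt;</code>", calling
<code>attributeOrDefault(reader, "a")</code> while stopped on doc is expected
to give the string b, and any other name gives the default.</p>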

<p>After modifying the processNode() function to show attributes:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        while reader.MoveToNextAttribute():
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))</pre>

<p>the output for the same input document reflects the attribute:</p>
<pre>0 1 doc 1 None
-- 1 2 (a) [b]</pre>

<p>There are a couple of things to note on the attribute processing:</p>
<ul>
  <li>their depth is that of the carrying element plus one</li>
  <li>namespace declarations are seen as attributes, like in DOM</li>
</ul>
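
<p>When the attributes are needed as data and not just printed, the traversal
can be wrapped in a helper that also puts the cursor back on the element with
MoveToElement(), so that later queries such as IsEmptyElement() refer to the
element again. This is a sketch under that design assumption;
collectAttributes() is a hypothetical name, and the loop compares against 1
on the assumption that the binding relays the 1, 0 or negative codes of the C
function.</p>

```python
def collectAttributes(reader):
    """Return a list of (name, value) pairs for the current element.

    Namespace declarations show up in the list too, since the reader
    reports them as attributes.
    """
    pairs = []
    if reader.NodeType() != 1:      # only element start nodes carry attributes
        return pairs
    # Stop on 0 (no more attributes) as well as on a negative error code.
    while reader.MoveToNextAttribute() == 1:
        pairs.append((reader.Name(), reader.Value()))
    if pairs:
        reader.MoveToElement()      # put the cursor back on the element
    return pairs
```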

<h2><a name="Validating">Validating a document</a></h2>

<p>The libxml2 implementation adds some extra features on top of the
XmlTextReader API; the main one is the ability to DTD validate the parsed
document progressively. This is simply the activation of the associated
feature of the parser used by the reader structure. There are a few options
available, defined as the enum xmlParserProperties in the libxml/xmlreader.h
header file:</p>
<ul>
  <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
  <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
    loading the DTD)</li>
  <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
    the DTD)</li>
  <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly; entity
    reference nodes are not generated and are replaced by their expanded
    content.</li>
  <li>more settings might be added; those were the ones available at the 2.5.0
    release...</li>
</ul>

<p>The GetParserProp() and SetParserProp() methods can then be used to get
and set the values of those parser properties of the reader. For example:</p>
<pre>def parseAndValidate(file):
    reader = libxml2.newTextReaderFilename(file)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % (file))</pre>

<p>This routine will parse and validate the file. Error messages can be
captured by registering an error handler. See python/tests/reader2.py for
more complete Python examples. At the C level the equivalent call to activate
the validation feature is just:</p>
<pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>

<p>and a return value of 0 indicates success.</p>
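
<p>To tell a parsing failure apart from a validity failure, the final Read()
result can be combined with the IsValid() query that the reader also offers
for DTD validation. The sketch below assumes that the Python SetParserProp()
method relays the C return code, 0 meaning success; readAllValidating() is a
hypothetical name, and the property constant is passed in as an argument so
that the caller decides which one to use, for instance
<code>libxml2.PARSER_VALIDATE</code>.</p>

```python
def readAllValidating(reader, validateProp):
    """Turn on validation, read the whole document and report the outcome.

    Returns a (parsed_ok, valid) pair of booleans. The property must be
    set before the first Read() call.
    """
    if reader.SetParserProp(validateProp, 1) != 0:
        return (False, False)       # the property could not be activated
    ret = reader.Read()
    while ret == 1:                 # 1 means a node was read
        ret = reader.Read()
    # 0 means the end of the document was reached without parsing error.
    return (ret == 0, reader.IsValid() == 1)
```

<p>A possible call is
<code>readAllValidating(libxml2.newTextReaderFilename("tst.xml"),
libxml2.PARSER_VALIDATE)</code>.</p>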

<h2><a name="Entities">Entities substitution</a></h2>

<p>By default the xmlReader will report entities as such and not replace them
with their content. This default behaviour can however be overridden using:</p>

<p><code>reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)</code></p>

<h2><a name="L1142">Relax-NG Validation</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>Libxml2 can now validate the document being read with the xmlReader against
Relax-NG schemas. While the Relax-NG validator can't always work in a
streamable mode, only the subsets which cannot be reduced to regular
expressions need to have their subtree expanded for validation. In practice it
means that, unless the schema for the top-level element content cannot be
expressed as a regexp, only chunks of the document need to be expanded while
validating.</p>

<p>The steps to do so are:</p>
<ul>
  <li>create a reader working on a document as usual</li>
  <li>before any call to read, associate it with a Relax-NG schema, either the
    preparsed schema or the URL of the schema to use</li>
  <li>errors will be reported the usual way, and the validity status can be
    obtained using the IsValid() interface of the reader, like for DTDs.</li>
</ul>

<p>Example, assuming the reader has already been created and that the schema
string contains the Relax-NG schema:</p>
<pre>rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
rngs = rngp.relaxNGParse()
reader.RelaxNGSetSchema(rngs)
ret = reader.Read()
while ret == 1:
    ret = reader.Read()
if ret != 0:
    print("Error parsing the document")
if reader.IsValid() != 1:
    print("Document failed to validate")</pre>

<p>See <code>reader6.py</code> in the sources or documentation for a complete
example.</p>

<h2><a name="Mixing">Mixing the reader and tree or XPath operations</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>While the reader is a streaming interface, its underlying implementation
is based on the DOM builder of libxml2. As a result it is relatively simple
to mix operations based on both models under some constraints. To do so the
reader has an Expand() operation which grows the subtree under the
current node. It returns a pointer to a standard node which can be
manipulated in the usual ways. The node will get all its ancestors and the
full subtree available. Usual operations like XPath queries can be used on
that reduced view of the document. Here is an example extracted from
reader5.py in the sources which extracts and prints the bibliography entry for
the "Dragon" compiler book from the XML 1.0 recommendation:</p>
<pre>f = open('../../test/valid/REC-xml-19980210.xml')
input = libxml2.inputBuffer(f)
reader = input.newTextReader("REC")
res=""
while reader.Read() == 1:
    while reader.Name() == 'bibl':
        node = reader.Expand()            # expand the subtree
        if node.xpathEval("@id = 'Aho'"): # use XPath on it
            res = res + node.serialize()
        if reader.Next() != 1:            # skip the subtree
            break</pre>

<p>Note however that the node instance returned by the Expand() call is only
valid until the next Read() operation. The Expand() operation does not
affect the Read() ones; however, once processed the full subtree is usually
not useful anymore, and the Next() operation allows skipping it completely and
proceeding to the successor, or returns 0 if the document end is reached.</p>
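
<p>Because an expanded node is only valid until the reader moves on, a safe
pattern is to turn it into plain data, such as its serialization, before
calling Read() or Next() again. The following sketch generalizes the loop
above under that assumption; collectSubtrees() is a hypothetical helper name,
and it relies only on the Read(), Next(), NodeType(), Name() and Expand()
operations described in this tutorial.</p>

```python
def collectSubtrees(reader, elementName):
    """Serialize every element named elementName found in the stream.

    Returns (found, parsed_ok) where found is a list of strings and
    parsed_ok is True if the end of the document was reached.
    """
    found = []
    ret = reader.Read()
    while ret == 1:
        if reader.NodeType() == 1 and reader.Name() == elementName:
            node = reader.Expand()          # build the subtree in memory
            found.append(node.serialize())  # copy out before moving on
            ret = reader.Next()             # skip the subtree just handled
        else:
            ret = reader.Read()
    return (found, ret == 0)
```

<p>Only strings are kept, so nothing refers to reader-owned nodes after the
cursor has advanced.</p>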

<p><a href="mailto:veillard@redhat.com">Daniel Veillard</a></p>

<p>$Id$</p>
</body>
</html>