<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css">
<!--
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }-->


  </style>
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

<p></p>

<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modelled on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# class library.</p>

<p>This tutorial presents the key points of this API, with working
examples using both C and the Python bindings.</p>

<p>Table of contents:</p>
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
    node</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entity substitution</a></li>
</ul>

<p></p>

<h2><a name="Introducti">Introduction: why a new API</a></h2>

<p>Libxml2's <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>: the parsing operation loads the document completely in
memory and exposes it as a tree of nodes, all available at the same time.
This is very simple and quite powerful, but has the major limitation that
the size of the document that can be handled is bounded by the amount of
memory available. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and
SAX is not formally defined for C. SAX basically works by registering
callbacks which are called directly by the parser as it progresses through
the document stream. The problem is that this programming model is
relatively complex and not well standardized, cannot provide validation
directly, and makes entity, namespace and base processing relatively
hard.</p>

<p>The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user code keeps control of the progress and simply calls a
Read() function repeatedly to advance to each node in sequence, in document
order. There is direct support for namespaces, xml:base and entity handling,
and adding DTD validation on top of it was relatively simple. This API is
really close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and
powerful API than the existing SAX. Moreover, integrating extension features
based on the tree seems relatively easy.</p>

<p>In a nutshell, the XmlTextReader API provides a simpler, more standard and
more extensible interface for handling large documents than the existing SAX
version.</p>

<h2><a name="Walking">Walking a simple tree</a></h2>

<p>Basically the XmlTextReader API is a forward-only tree walking interface.
The basic steps are:</p>
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>

<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

/* Returns 0 on success, a negative value if opening or parsing failed. */
int streamFile(const char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
        }
    } else {
        printf("Unable to open %s\n", filename);
        ret = -1;
    }
    return ret;
}</pre>

<p>A few things to notice:</p>
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated call to xmlTextReaderRead(), and how any return value
    different from 1 should stop the loop</li>
  <li>that a negative return value means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>

<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    # handling of a node in the tree
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except libxml2.libxmlError:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()
    if ret != 0:
        print("%s : failed to parse" % (filename))
</pre>

<p>The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a>, with the same method names (but the
properties are currently accessed with methods), and that one doesn't need
to free the reader at the end of the processing: it will get garbage
collected once all references have disappeared.</p>

<h2><a name="Extracting">Extracting information for the current node</a></h2>

<p>So far the example code did not indicate how information is extracted
from the reader; it was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
<em>Property</em> is available at the C level as a function taking a single
xmlTextReaderPtr argument, whose name is
<code>xmlTextReader</code><em>Property</em>. If the return type is an
<code>xmlChar *</code> string then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method on the reader class that can be called on the
instance. The list of the properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
  <li><em>NodeType</em>: the type of the node: 1 for start element, 15 for
    end of element, 2 for attributes, 3 for text nodes, 4 for CData sections,
    5 for entity references, 6 for entity declarations, 7 for PIs, 8 for
    comments, 9 for the document nodes, 10 for DTD/Doctype nodes, 11 for
    document fragment and 12 for notation nodes.</li>
  <li><em>Name</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  <li><em>LocalName</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li><em>Prefix</em>: a shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>NamespaceUri</em>: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>BaseUri</em>: the base URI of the node. See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li><em>Depth</em>: the depth of the node in the tree, starting at 0 for
    the root node.</li>
  <li><em>HasAttributes</em>: whether the node has attributes.</li>
  <li><em>HasValue</em>: whether the node can have a text value.</li>
  <li><em>Value</em>: provides the text value of the node if present.</li>
  <li><em>IsDefault</em>: whether an Attribute node was generated from the
    default value defined in the DTD or schema (<em>not yet
    supported</em>).</li>
  <li><em>XmlLang</em>: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li><em>IsEmptyElement</em>: whether the current node is empty. This is a
    bit bizarre in the sense that <code>&lt;a/&gt;</code> is considered
    empty while <code>&lt;a&gt;&lt;/a&gt;</code> is not.</li>
  <li><em>AttributeCount</em>: provides the number of attributes of the
    current node.</li>
</ul>
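
<p>As an illustration of how these properties might be combined, here is a
minimal Python sketch that turns the node the reader is stopped on into a
one-line summary. The names <code>NODE_TYPE_NAMES</code> and
<code>describe_node</code> are hypothetical helpers introduced for this
example only; they are not part of libxml2. The sketch assumes the reader
object exposes the properties as the methods NodeType(), Depth(), Name(),
HasValue() and Value(), following the naming convention described above, and
the table only covers the node type codes listed above.</p>

```python
# Hypothetical helpers for illustration; they are not part of libxml2.
# The codes are the NodeType values listed in this tutorial.
NODE_TYPE_NAMES = {
    1: "start element",
    2: "attribute",
    3: "text",
    4: "CData section",
    5: "entity reference",
    6: "entity declaration",
    7: "PI",
    8: "comment",
    9: "document",
    10: "DTD/Doctype",
    11: "document fragment",
    12: "notation",
    15: "end of element",
}


def describe_node(reader):
    """Build a one-line summary of the node the reader is stopped on."""
    # Any code missing from the table above is reported as "other".
    kind = NODE_TYPE_NAMES.get(reader.NodeType(), "other")
    summary = "%d %s %s" % (reader.Depth(), kind, reader.Name())
    # Only query Value() when the node can carry a text value.
    if reader.HasValue():
        summary = summary + " value=%s" % (reader.Value(),)
    return summary
```

<p>Under those assumptions, the body of processNode() could simply be
<code>print(describe_node(reader))</code>. Keeping the formatting in a
function that only calls methods on the reader also makes it easy to try out
with a stand-in object before running it on a real document.</p>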

<p></p>

<h2><a name="Validating">Validating a document</a></h2>

<h2><a name="Entities">Entity substitution</a></h2>

<p> </p>

<p><a href="mailto:veillard@redhat.com">Daniel Veillard</a></p>

<p>$Id$</p>

<p></p>
</body>
</html>