<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<!-- Daniel Veillard, commit 66b8289, 2003-01-04 00:44:13 +0000 -->
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css">
<!--
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }-->
  </style>
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

<p></p>

<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modelled on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# language.</p>

<p>This tutorial presents the key points of this API, with working
examples using both C and the Python bindings.</p>

<p>Table of contents:</p>
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
    node</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entity substitution</a></li>
</ul>

<p></p>

<h2><a name="Introducti">Introduction: why a new API</a></h2>

<p>Libxml2's <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>: the parsing operation loads the document completely in memory
and exposes it as a tree of nodes that are all available at the same time.
This is very simple and quite powerful, but has the major limitation that the
size of the document that can be handled is bounded by the size of the
available memory. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and
SAX is not formally defined for C. SAX basically works by registering
callbacks which the parser calls directly as it progresses through the
document stream. The problem is that this programming model is relatively
complex and not well standardized, cannot provide validation directly, and
makes entity, namespace and base processing relatively hard.</p>

<p>The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor moving forward on the document stream and stopping at each node on the
way. The user code keeps control of the progress and simply calls a
Read() function repeatedly to advance to each node in sequence, in document
order. There is direct support for namespaces, xml:base and entity handling,
and adding DTD validation on top of it was relatively simple. This API is
really close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and powerful
API than the existing SAX. Moreover, integrating extension features based on
the tree seems relatively easy.</p>

<p>In a nutshell, the XmlTextReader API provides a simpler, more standard and
more extensible interface for handling large documents than the existing SAX
version.</p>

<h2><a name="Walking">Walking a simple tree</a></h2>

<p>Basically the XmlTextReader API is a forward-only tree walking interface.
The basic steps are:</p>
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>

<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

/* returns 0 on success, -1 if the file cannot be opened or parsed */
int streamFile(const char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader == NULL) {
        printf("Unable to open %s\n", filename);
        return -1;
    }
    /* xmlTextReaderRead() returns 1 while a node is available */
    ret = xmlTextReaderRead(reader);
    while (ret == 1) {
        processNode(reader);
        ret = xmlTextReaderRead(reader);
    }
    xmlFreeTextReader(reader);
    if (ret != 0) {
        /* a negative value signals a parsing error */
        printf("%s : failed to parse\n", filename);
        return -1;
    }
    return 0;
}</pre>

<p>A few things to notice:</p>
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated calls to xmlTextReaderRead(), and how any return value
    different from 1 should stop the loop</li>
  <li>that a negative return value means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>

<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    # handling of a node in the tree
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except libxml2.libxmlError:
        print("unable to open %s" % (filename))
        return

    # Read() returns 1 while a node is available, 0 at the end of the
    # document and a negative value on a parsing error
    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()
    if ret != 0:
        print("%s : failed to parse" % (filename))
</pre>

<p>The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a>, with the same method names (but the
properties are currently accessed with methods), and that one doesn't need
to free the reader at the end of the processing: it will get garbage
collected once all references have disappeared.</p>
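
<p>The return-value convention of Read() can also be wrapped once and reused.
The sketch below is not part of the libxml2 bindings:
<code>iter_nodes</code> and <code>stream_file</code> are hypothetical helper
names, and the code assumes only that the reader object exposes the Read()
method used above, returning 1 while a node is available, 0 at the end of the
document and a negative value on error.</p>

```python
def iter_nodes(reader):
    """Yield the reader once per node, in document order.

    Hypothetical helper: it relies only on reader.Read() returning 1 while
    a node is available, 0 at the end and a negative value on error.
    """
    ret = reader.Read()
    while ret == 1:
        yield reader
        ret = reader.Read()
    if ret != 0:
        # Design choice of this sketch: report parse errors as exceptions.
        raise ValueError("failed to parse (Read() returned %d)" % ret)


def stream_file(filename, handler):
    """Call handler(reader) for every node of the file (hypothetical helper)."""
    # Imported here so that iter_nodes() stays usable with any object
    # that provides a Read() method.
    import libxml2

    reader = libxml2.newTextReaderFilename(filename)
    for node in iter_nodes(reader):
        handler(node)
```

<p>A caller could then write <code>stream_file("doc.xml", processNode)</code>,
where the file name is only a placeholder. Raising an exception on a negative
return value is a choice made for this sketch; printing a message, as in the
sample above, works equally well.</p>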

<h2><a name="Extracting">Extracting information for the current node</a></h2>

<p>So far the example code did not indicate how information is extracted
from the reader; it was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
<em>Property</em> is available at the C level as a function named
<code>xmlTextReader</code><em>Property</em> that takes a single
xmlTextReaderPtr argument. If the return type is an
<code>xmlChar *</code> string then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method on the reader class that can be called on the
instance. The list of the properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
  <li><em>NodeType</em>: the node type, 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.</li>
  <li><em>Name</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  <li><em>LocalName</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li><em>Prefix</em>: a shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>NamespaceUri</em>: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>BaseUri</em>: the base URI of the node. See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li><em>Depth</em>: the depth of the node in the tree, starting at 0 for the
    root node.</li>
  <li><em>HasAttributes</em>: whether the node has attributes.</li>
  <li><em>HasValue</em>: whether the node can have a text value.</li>
  <li><em>Value</em>: provides the text value of the node if present.</li>
  <li><em>IsDefault</em>: whether an Attribute node was generated from the
    default value defined in the DTD or schema (<em>not supported
    yet</em>).</li>
  <li><em>XmlLang</em>: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li><em>IsEmptyElement</em>: checks whether the current node is empty. This
    is a bit bizarre in the sense that <code>&lt;a/&gt;</code> will be
    considered empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  <li><em>AttributeCount</em>: provides the number of attributes of the
    current node.</li>
</ul>
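
<p>As an illustration of the properties listed above, here is a minimal
Python sketch of what a processNode() body might compute. The
<code>NODE_TYPE_NAMES</code> table and the <code>describe_node</code> function
are hypothetical names introduced for this example; the sketch assumes only
that the reader exposes the NodeType(), Depth(), Name(), IsEmptyElement(),
HasValue() and Value() methods corresponding to the properties of the same
names.</p>

```python
# Numeric node types as listed above, mapped to readable labels.
NODE_TYPE_NAMES = {
    1: "element start",
    2: "attribute",
    3: "text",
    4: "CDATA section",
    5: "entity reference",
    6: "entity declaration",
    7: "processing instruction",
    8: "comment",
    9: "document",
    10: "DTD/doctype",
    11: "document fragment",
    12: "notation",
    15: "element end",
}


def describe_node(reader):
    """Return a one-line, depth-indented description of the current node.

    Hypothetical helper; 'reader' is any object exposing the property
    methods described in the list above.
    """
    node_type = reader.NodeType()
    kind = NODE_TYPE_NAMES.get(node_type, "other")
    line = "%s%s: %s" % ("  " * reader.Depth(), kind, reader.Name())
    # IsEmptyElement is only meaningful on a start-element node.
    if node_type == 1 and reader.IsEmptyElement():
        line += " (empty)"
    # Only some node types (text, comments, ...) carry a text value.
    if reader.HasValue():
        line += " = %r" % reader.Value()
    return line
```

<p>A processNode() routine could simply print the returned line for each
node. Node types missing from the table are labelled "other" rather than
guessed.</p>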

<p></p>

<h2><a name="Validating">Validating a document</a></h2>

<h2><a name="Entities">Entity substitution</a></h2>

<p> </p>

<p><a href="mailto:veillard@redhat.com">Daniel Veillard</a></p>

<p>$Id$</p>

<p></p>
</body>
</html>