<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css">
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
  </style>
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

<p></p>

<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modeled after the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# language.</p>

<p>This tutorial presents the key points of this API, with working
examples using both C and the Python bindings.</p>

<p>Table of contents:</p>
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
    node</a></li>
  <li><a href="#Extracting1">Extracting information for the
    attributes</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entity substitution</a></li>
  <li><a href="#L1142">Relax-NG Validation</a></li>
  <li><a href="#Mixing">Mixing the reader and tree or XPath
    operations</a></li>
</ul>

<p></p>

<h2><a name="Introducti">Introduction: why a new API</a></h2>

<p>Libxml2's <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>: the parsing operation loads the document completely in
memory and exposes it as a tree of nodes all available at the same time.
This is very simple and quite powerful, but has the major limitation that
the size of the document that can be handled is limited by the size of the
memory available. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and
SAX is not formally defined for C. SAX basically works by registering
callbacks which are called directly by the parser as it progresses through
the document stream. The problem is that this programming model is
relatively complex, not well standardized, cannot provide validation
directly, and makes entity, namespace and base processing relatively
hard.</p>

<p>The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node along
the way. The user's code keeps control of the progress and simply calls a
Read() function repeatedly to progress to each node in sequence in document
order. There is direct support for namespaces, xml:base and entity handling,
and adding DTD validation on top of it was relatively simple. This API is
really close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and
powerful API than the existing SAX. Moreover, integrating extension features
based on the tree seems relatively easy.</p>

<p>In a nutshell the XmlTextReader API provides a simpler, more standard and
more extensible interface to handle large documents than the existing SAX
version.</p>

<h2><a name="Walking">Walking a simple tree</a></h2>

<p>Basically the XmlTextReader API is a forward only tree walking interface.
The basic steps are:</p>
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>

<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

/* returns 0 on success, a negative value on failure */
int streamFile(char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
        }
    } else {
        printf("Unable to open %s\n", filename);
        ret = -1;
    }
    return ret;
}</pre>

<p>A few things to notice:</p>
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop</li>
  <li>that a negative return means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>

<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except Exception:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % (filename))</pre>

<p>The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a> with the same method names (but the
properties are currently accessed with methods) and that one doesn't need to
free the reader at the end of the processing. It will get garbage collected
once all references have disappeared.</p>

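<p>The read loop above can also be wrapped once and reused. The following is
only a sketch, not part of the bindings: <code>iter_nodes</code> is a
hypothetical helper name, and it assumes nothing more than a reader object
whose Read() method returns 1, 0 or a negative value as described above.</p>

```python
def iter_nodes(reader):
    """Yield the reader once per node, in document order.

    `reader` is assumed to be any object whose Read() method returns 1 when
    a node is available, 0 at the end of the document and a negative value
    on a parsing error.
    """
    ret = reader.Read()
    while ret == 1:
        yield reader
        ret = reader.Read()
    if ret != 0:
        # Hypothetical choice for this sketch: report errors as ValueError.
        raise ValueError("failed to parse the document")
```

<p>With such a helper the processing becomes a plain loop, for instance
<code>for node in iter_nodes(libxml2.newTextReaderFilename("tst.xml")):
processNode(node)</code>.</p>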
<h2><a name="Extracting">Extracting information for the current node</a></h2>

<p>So far the example code did not indicate how information was extracted
from the reader. It was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
<em>Property</em> is available at the C level as a function named
<code>xmlTextReader</code><em>Property</em> taking a single
xmlTextReaderPtr argument. If the return type is an
<code>xmlChar *</code> string then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method on the reader class that can be called on the
instance. The list of the properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
  <li><em>NodeType</em>: The node type, 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.</li>
  <li><em>Name</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  <li><em>LocalName</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li><em>Prefix</em>: a shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>NamespaceUri</em>: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>BaseUri</em>: the base URI of the node. See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li><em>Depth</em>: the depth of the node in the tree, starts at 0 for the
    root node.</li>
  <li><em>HasAttributes</em>: whether the node has attributes.</li>
  <li><em>HasValue</em>: whether the node can have a text value.</li>
  <li><em>Value</em>: provides the text value of the node if present.</li>
  <li><em>IsDefault</em>: whether an Attribute node was generated from the
    default value defined in the DTD or schema (<em>not supported
    yet</em>).</li>
  <li><em>XmlLang</em>: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li><em>IsEmptyElement</em>: checks if the current node is empty. This is a
    bit bizarre in the sense that <code>&lt;a/&gt;</code> will be considered
    empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  <li><em>AttributeCount</em>: provides the number of attributes of the
    current node.</li>
</ul>

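<p>Since NodeType is reported as a bare number, a small lookup table can make
debugging output easier to read. This is a minimal sketch: the numbers are
the ones listed above, while the labels and the <code>node_type_name</code>
helper are informal names chosen for illustration, not identifiers from the
library.</p>

```python
# Node type numbers as listed above; the labels are informal names chosen
# for this sketch, not identifiers defined by libxml2.
NODE_TYPE_NAMES = {
    1: "element start",
    2: "attribute",
    3: "text",
    4: "CDATA section",
    5: "entity reference",
    6: "entity declaration",
    7: "processing instruction",
    8: "comment",
    9: "document",
    10: "DTD/doctype",
    11: "document fragment",
    12: "notation",
    15: "element end",
}


def node_type_name(node_type):
    """Return a readable label for a NodeType value, or 'unknown (N)'."""
    return NODE_TYPE_NAMES.get(node_type, "unknown (%d)" % node_type)
```

<p>It would typically be called as
<code>node_type_name(reader.NodeType())</code>.</p>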
<p>Let's look first at a small example to put this in practice by redefining
the processNode() function in the Python example:</p>
<pre>def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))</pre>

<p>and look at the result of calling streamFile("tst.xml") for various
content of the XML test file.</p>

<p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
<pre>0 1 doc 1</pre>

<p>Only one node is found, its depth is 0, type 1 indicates an element start,
of name "doc" and it is empty. Trying now with
"<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
<pre>0 1 doc 0
0 15 doc 0</pre>

<p>The document root node is not flagged as empty anymore and both a start
and an end of element are detected. The following document shows how
character data are reported:</p>
<pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
&lt;c/&gt;&lt;/doc&gt;</pre>

<p>We modify the processNode() function to also report the node Value:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))</pre>

<p>The result of the test is:</p>
<pre>0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0 

1 1 c 1 None
0 15 doc 0 None</pre>

<p>There are a few things to note:</p>
<ul>
  <li>the increase of the depth value (first column) as children nodes are
    explored</li>
  <li>the text node child of the b element, of type 3, and its content</li>
  <li>the text node containing the line return between elements b and c</li>
  <li>that elements have the Value None (or NULL in C)</li>
</ul>

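<p>One practical consequence of the output above is that an empty element
reports a start but never an end, so code tracking the current position has
to treat the two cases differently. The sketch below replays the events
printed above through a hypothetical <code>element_paths</code> helper; it
works on plain tuples rather than on a live reader, purely for
illustration.</p>

```python
ELEMENT_START = 1
ELEMENT_END = 15


def element_paths(events):
    """Return the path of every element start found in `events`.

    Each event is a (depth, node_type, name, is_empty) tuple, i.e. the first
    four columns printed by processNode() above.
    """
    stack = []
    paths = []
    for _depth, node_type, name, is_empty in events:
        if node_type == ELEMENT_START:
            stack.append(name)
            paths.append("/" + "/".join(stack))
            if is_empty:
                # An empty element gets no end-of-element event, so it has
                # to be closed right away.
                stack.pop()
        elif node_type == ELEMENT_END:
            stack.pop()
    return paths


# The events reported for the sample document above.
events = [
    (0, 1, "doc", 0),
    (1, 1, "a", 1),
    (1, 1, "b", 0),
    (2, 3, "#text", 0),
    (1, 15, "b", 0),
    (1, 3, "#text", 0),
    (1, 1, "c", 1),
    (0, 15, "doc", 0),
]
print(element_paths(events))
```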
<p>The equivalent routine for <code>processNode()</code> as used by
<code>xmllint --stream --debug</code> is the following and can be found in
the xmllint.c module in the source distribution:</p>
<pre>static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
            xmlTextReaderDepth(reader),
            xmlTextReaderNodeType(reader),
            name,
            xmlTextReaderIsEmptyElement(reader));
    xmlFree(name);
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}</pre>

<h2><a name="Extracting1">Extracting information for the attributes</a></h2>

<p>The previous examples don't indicate how attributes are processed. The
simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
result:</p>
<pre>0 1 doc 1 None</pre>

<p>This shows that attribute nodes are not traversed by default. The
<em>HasAttributes</em> property allows detecting their presence. To check
their content the API has special instructions. Basically two kinds of
operations are possible:</p>
<ol>
  <li>to move the reader to the attribute nodes of the current element, in
    that case the cursor is positioned on the attribute node</li>
  <li>to directly query the element node for the attribute value</li>
</ol>

<p>In both cases the attribute can be designated either by its position in
the list of attributes (<em>MoveToAttributeNo</em> or
<em>GetAttributeNo</em>) or by its name (and namespace):</p>
<ul>
  <li><em>GetAttributeNo</em>(no): provides the value of the attribute with
    the specified index no relative to the containing element.</li>
  <li><em>GetAttribute</em>(name): provides the value of the attribute with
    the specified qualified name.</li>
  <li><em>GetAttributeNs</em>(localName, namespaceURI): provides the value of
    the attribute with the specified local name and namespace URI.</li>
  <li><em>MoveToAttributeNo</em>(no): moves the position of the current
    instance to the attribute with the specified index relative to the
    containing element.</li>
  <li><em>MoveToAttribute</em>(name): moves the position of the current
    instance to the attribute with the specified qualified name.</li>
  <li><em>MoveToAttributeNs</em>(localName, namespaceURI): moves the position
    of the current instance to the attribute with the specified local name
    and namespace URI.</li>
  <li><em>MoveToFirstAttribute</em>: moves the position of the current
    instance to the first attribute associated with the current node.</li>
  <li><em>MoveToNextAttribute</em>: moves the position of the current
    instance to the next attribute associated with the current node.</li>
  <li><em>MoveToElement</em>: moves the position of the current instance to
    the node that contains the current Attribute node.</li>
</ul>

<p>After modifying the processNode() function to show attributes:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        while reader.MoveToNextAttribute():
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))</pre>

<p>The output for the same input document reflects the attribute:</p>
<pre>0 1 doc 1 None
-- 1 2 (a) [b]</pre>

<p>There are a couple of things to note on the attribute processing:</p>
<ul>
  <li>Their depth is the one of the carrying element plus one.</li>
  <li>Namespace declarations are seen as attributes, as in DOM.</li>
</ul>

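<p>When all attributes of an element are needed at once, the cursor-moving
operations can be combined into a small helper. This is a sketch under the
assumption that the reader is currently positioned on an element;
<code>attributes_of</code> is a hypothetical name, and only the methods
listed above are used.</p>

```python
def attributes_of(reader):
    """Collect the attributes of the current element into a dict.

    `reader` is assumed to be positioned on an element node and to offer
    the MoveToNextAttribute(), Name(), Value() and MoveToElement() methods
    listed above.
    """
    attrs = {}
    while reader.MoveToNextAttribute() == 1:
        attrs[reader.Name()] = reader.Value()
    # Put the cursor back on the element that carries the attributes.
    reader.MoveToElement()
    return attrs
```

<p>Moving back with MoveToElement() keeps the helper free of side effects on
the cursor, so the caller can go on querying the element afterwards.</p>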
<h2><a name="Validating">Validating a document</a></h2>

<p>The libxml2 implementation adds some extra features on top of the
XmlTextReader API. The main one is the ability to DTD validate the parsed
document progressively. This is simply the activation of the associated
feature of the parser used by the reader structure. There are a few options
available, defined as the enum xmlParserProperties in the libxml/xmlreader.h
header file:</p>
<ul>
  <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
  <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
    loading the DTD)</li>
  <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
    the DTD)</li>
  <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly, entity
    reference nodes are not generated and are replaced by their expanded
    content.</li>
  <li>more settings might be added, these were the ones available at the
    2.5.0 release...</li>
</ul>

<p>The GetParserProp() and SetParserProp() methods can then be used to get
and set the values of those parser properties of the reader. For example:</p>
<pre>def parseAndValidate(file):
    reader = libxml2.newTextReaderFilename(file)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % (file))</pre>

<p>This routine will parse and validate the file. Error messages can be
captured by registering an error handler. See python/tests/reader2.py for
more complete Python examples. At the C level the equivalent call to activate
the validation feature is just:</p>
<pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>

<p>and a return value of 0 indicates success.</p>

<h2><a name="Entities">Entity substitution</a></h2>

<p>By default the xmlReader will report entities as such and not replace them
with their content. This default behaviour can however be overridden
using:</p>

<p><code>reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)</code></p>

<h2><a name="L1142">Relax-NG Validation</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>Libxml2 can now validate the document being read by the xmlReader against
Relax-NG schemas. While the Relax-NG validator can't always work in a
streamable mode, only the subsets of a schema which cannot be reduced to
regular expressions need to have their subtree expanded for validation. In
practice it means that, unless the schema for the top-level element content
cannot be expressed as a regexp, only chunks of the document need to be
expanded while validating.</p>

<p>The steps to do so are:</p>
<ul>
  <li>create a reader working on a document as usual</li>
  <li>before any call to Read(), associate it with a Relax-NG schema, either
    the preparsed schema or the URL of the schema to use</li>
  <li>errors will be reported the usual way, and the validity status can be
    obtained using the IsValid() interface of the reader like for DTDs.</li>
</ul>

<p>Example, assuming the reader has already been created and that the
<code>schema</code> string contains the Relax-NG schema:</p>
<pre>rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
rngs = rngp.relaxNGParse()
reader.RelaxNGSetSchema(rngs)
ret = reader.Read()
while ret == 1:
    ret = reader.Read()
if ret != 0:
    print("Error parsing the document")
if reader.IsValid() != 1:
    print("Document failed to validate")</pre>

<p>See <code>reader6.py</code> in the sources or documentation for a complete
example.</p>

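<p>Parse errors and validity errors are reported through two different
channels, the Read() return value and IsValid(), and it can help to fold them
into a single result. The following is a minimal sketch:
<code>read_and_check</code> and its three result strings are hypothetical,
and the reader is assumed to have DTD or Relax-NG validation already set
up.</p>

```python
def read_and_check(reader):
    """Read the whole document and report how it went.

    Returns "ok", "parse error" or "invalid". `reader` is assumed to have
    validation (DTD or Relax-NG) already set up, and to offer the Read()
    and IsValid() methods used above.
    """
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        return "parse error"
    if reader.IsValid() != 1:
        return "invalid"
    return "ok"
```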
<h2><a name="Mixing">Mixing the reader and tree or XPath operations</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>While the reader is a streaming interface, its underlying implementation
is based on the DOM builder of libxml2. As a result it is relatively simple
to mix operations based on both models under some constraints. To do so the
reader has an Expand() operation which grows the subtree under the
current node. It returns a pointer to a standard node which can be
manipulated in the usual ways. The node will get all its ancestors and the
full subtree available. Usual operations like XPath queries can be used on
that reduced view of the document. Here is an example extracted from
reader5.py in the sources which extracts and prints the bibliography entry
for the "Dragon" compiler book from the XML 1.0 recommendation:</p>
<pre>f = open('../../test/valid/REC-xml-19980210.xml')
buf = libxml2.inputBuffer(f)
reader = buf.newTextReader("REC")
res = ""
while reader.Read() == 1:
    while reader.Name() == 'bibl':
        node = reader.Expand()            # expand the subtree
        if node.xpathEval("@id = 'Aho'"): # use XPath on it
            res = res + node.serialize()
        if reader.Next() != 1:            # skip the subtree
            break</pre>

<p>Note, however, that the node instance returned by the Expand() call is
only valid until the next Read() operation. The Expand() operation does not
affect the Read() ones. However, once processed, the full subtree is usually
not useful anymore, and the Next() operation allows skipping it completely
and proceeding to the successor; it returns 0 if the document end is
reached.</p>

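<p>The Expand()/Next() pattern can be packaged as a generator so that the
validity constraint is easier to respect. This is only a sketch:
<code>iter_subtrees</code> is a hypothetical helper, and it relies solely on
the Read(), Next(), NodeType(), Name() and Expand() methods described in this
tutorial.</p>

```python
ELEMENT_START = 1


def iter_subtrees(reader, name):
    """Yield the expanded node of each element called `name`.

    `reader` is assumed to offer Read(), Next(), NodeType(), Name() and
    Expand() as described above. Each yielded node must be used before the
    generator is resumed, because resuming moves the reader forward and the
    node is no longer valid after that.
    """
    ret = reader.Read()
    while ret == 1:
        if reader.NodeType() == ELEMENT_START and reader.Name() == name:
            yield reader.Expand()
            ret = reader.Next()  # skip the subtree just processed
        else:
            ret = reader.Read()
    if ret != 0:
        # Hypothetical choice for this sketch: report errors as ValueError.
        raise ValueError("failed to parse the document")
```

<p>Because matching subtrees are skipped with Next(), elements of the same
name nested inside a match are not reported separately; they remain reachable
through the expanded node itself.</p>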
<p><a href="mailto:xml@gnome.org">Daniel Veillard</a></p>

<p>$Id$</p>

<p></p>
</body>
</html>