<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html">
<style type="text/css"></style>
<!--
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
</style>
-->
<title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

<p></p>

<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modeled after the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# language.</p>

<p>This tutorial presents the key points of this API, with working
examples using both C and the Python bindings.</p>

<p>Table of contents:</p>
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
    node</a></li>
  <li><a href="#Extracting1">Extracting information for the
    attributes</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entity substitution</a></li>
  <li><a href="#L1142">Relax-NG Validation</a></li>
  <li><a href="#Mixing">Mixing the reader and tree or XPath
    operations</a></li>
</ul>

<p></p>

<h2><a name="Introducti">Introduction: why a new API</a></h2>

<p>The libxml2 <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>: the parsing operation loads the document completely in
memory and exposes it as a tree of nodes, all available at the same time.
This is very simple and quite powerful, but has the major limitation that
the size of the document that can be handled is limited by the size of the
available memory. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and
SAX is not formally defined for C. SAX basically works by registering
callbacks which are called directly by the parser as it progresses through
the document stream. The problem is that this programming model is
relatively complex, is not well standardized, cannot provide validation
directly, and makes entity, namespace and base processing relatively
hard.</p>

<p>The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user's code keeps control of the progress and simply calls a
Read() function repeatedly to progress to each node in sequence in document
order. There is direct support for namespaces, xml:base and entity handling,
and adding DTD validation on top of it was relatively simple. This API is
really close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and
powerful API than the existing SAX. Moreover, integrating extension features
based on the tree seems relatively easy.</p>

<p>In a nutshell, the XmlTextReader API provides a simpler, more standard and
more extensible interface to handle large documents than the existing SAX
version.</p>

<h2><a name="Walking">Walking a simple tree</a></h2>

<p>Basically the XmlTextReader API is a forward-only tree walking interface.
The basic steps are:</p>
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>

<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

int streamFile(char *filename) {
    xmlTextReaderPtr reader;
    int ret = -1;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
        }
    } else {
        printf("Unable to open %s\n", filename);
    }
    return ret; /* 0 on success, negative on failure */
}</pre>

<p>A few things to notice:</p>
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop</li>
  <li>that a negative return value means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>

<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % (filename))</pre>

<p>The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a> with the same method names (but the
properties are currently accessed with methods) and that one doesn't need to
free the reader at the end of the processing. It will get garbage collected
once all references have disappeared.</p>

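<p>The read loop used in both samples can be written once and reused. The
following is a minimal sketch, not part of libxml2: <code>walkReader()</code>
is a hypothetical helper name, and it only assumes an object with a
<code>Read()</code> method that returns 1 while a node is available, 0 at the
end of the document and a negative value on a parsing error.</p>

```python
def walkReader(reader, callback):
    """Call callback(reader) on every node and return how many were seen.

    The reader can be any object whose Read() method returns 1 while a
    node is available, 0 at the end of the document and a negative
    value on a parsing error.
    """
    count = 0
    ret = reader.Read()
    while ret == 1:
        callback(reader)
        count += 1
        ret = reader.Read()
    if ret != 0:
        # Anything other than 0 after the loop means the parse failed.
        raise ValueError("parsing failed, Read() returned %d" % ret)
    return count
```

<p>With the bindings it would be called as
<code>walkReader(libxml2.newTextReaderFilename("tst.xml"), processNode)</code>.
Raising an exception instead of printing a message is a choice made for this
sketch only.</p>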
<h2><a name="Extracting">Extracting information for the current node</a></h2>

<p>So far the example code did not indicate how information was extracted
from the reader. It was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query the properties of that node.
Each <em>Property</em> is available at the C level as a function taking a
single xmlTextReaderPtr argument whose name is
<code>xmlTextReader</code><em>Property</em>. If the return type is an
<code>xmlChar *</code> string then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method on the reader class that can be called on the
instance. The list of the properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
  <li><em>NodeType</em>: the node type, 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.</li>
  <li><em>Name</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  <li><em>LocalName</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li><em>Prefix</em>: a shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>NamespaceUri</em>: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>BaseUri</em>: the base URI of the node. See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li><em>Depth</em>: the depth of the node in the tree, starting at 0 for
    the root node.</li>
  <li><em>HasAttributes</em>: whether the node has attributes.</li>
  <li><em>HasValue</em>: whether the node can have a text value.</li>
  <li><em>Value</em>: provides the text value of the node if present.</li>
  <li><em>IsDefault</em>: whether an Attribute node was generated from the
    default value defined in the DTD or schema (<em>not supported
    yet</em>).</li>
  <li><em>XmlLang</em>: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li><em>IsEmptyElement</em>: checks whether the current node is empty. This
    is a bit bizarre in the sense that <code>&lt;a/&gt;</code> will be
    considered empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  <li><em>AttributeCount</em>: provides the number of attributes of the
    current node.</li>
</ul>

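<p>When printing or debugging, the numeric NodeType values are easier to read
with names attached. The sketch below builds a lookup table from the values
listed above and nothing else; <code>NODE_TYPE_NAMES</code> and
<code>nodeTypeName()</code> are illustrative names, not part of the bindings,
and any value outside that list is simply reported as unknown.</p>

```python
# Names for the NodeType values listed in this section.
NODE_TYPE_NAMES = {
    1: "element start",
    2: "attribute",
    3: "text",
    4: "CDATA section",
    5: "entity reference",
    6: "entity declaration",
    7: "processing instruction",
    8: "comment",
    9: "document",
    10: "DTD/doctype",
    11: "document fragment",
    12: "notation",
    15: "element end",
}

def nodeTypeName(nodeType):
    """Return a readable name for a NodeType value."""
    return NODE_TYPE_NAMES.get(nodeType, "unknown (%d)" % nodeType)
```

<p>Inside processNode() this could be used as
<code>nodeTypeName(reader.NodeType())</code>.</p>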
<p>Let's first look at a small example to put this in practice by redefining
the processNode() function in the Python example:</p>
<pre>def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))</pre>

<p>and look at the result of calling streamFile("tst.xml") for various
contents of the XML test file.</p>

<p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
<pre>0 1 doc 1</pre>

<p>Only one node is found: its depth is 0, type 1 indicates an element start,
its name is "doc" and it is empty. Trying now with
"<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
<pre>0 1 doc 0
0 15 doc 0</pre>

<p>The document root node is not flagged as empty anymore and both a start
and an end of element are detected. The following document shows how
character data are reported:</p>
<pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
&lt;c/&gt;&lt;/doc&gt;</pre>

<p>We modify the processNode() function to also report the node Value:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))</pre>

<p>The result of the test is:</p>
<pre>0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0

1 1 c 1 None
0 15 doc 0 None</pre>

<p>There are a few things to note:</p>
<ul>
  <li>the increase of the depth value (first column) as children nodes are
    explored</li>
  <li>the text node child of the b element, of type 3, and its content</li>
  <li>the text node containing the line return between elements b and c</li>
  <li>that elements have the Value None (or NULL in C)</li>
</ul>

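<p>The text node holding only the line return between b and c is often of no
interest to the application. A minimal sketch of a filter follows;
<code>isIgnorableText()</code> is a hypothetical helper, it relies only on
the type 3 and Value shown in the output above, and whether such whitespace
can really be ignored depends on the document format. Other libxml2 versions
may report whitespace-only nodes with a different node type, which this
sketch does not cover.</p>

```python
def isIgnorableText(nodeType, value):
    """True for a text node (type 3) whose value is only whitespace."""
    return nodeType == 3 and value is not None and value.strip() == ""
```

<p>A processNode() routine could start with
<code>if isIgnorableText(reader.NodeType(), reader.Value()): return</code>
to skip such nodes.</p>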
<p>The equivalent routine for <code>processNode()</code> as used by
<code>xmllint --stream --debug</code> is the following and can be found in
the xmllint.c module in the source distribution:</p>
<pre>static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
           xmlTextReaderDepth(reader),
           xmlTextReaderNodeType(reader),
           name,
           xmlTextReaderIsEmptyElement(reader));
    xmlFree(name);
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}</pre>

<h2><a name="Extracting1">Extracting information for the attributes</a></h2>

<p>The previous examples don't indicate how attributes are processed. The
simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
result:</p>
<pre>0 1 doc 1 None</pre>

<p>This proves that attribute nodes are not traversed by default. The
<em>HasAttributes</em> property makes it possible to detect their presence.
To check their content the API has special instructions. Basically two kinds
of operations are possible:</p>
<ol>
  <li>to move the reader to the attribute nodes of the current element, in
    which case the cursor is positioned on the attribute node</li>
  <li>to directly query the element node for the attribute value</li>
</ol>

<p>In both cases the attribute can be designated either by its position in
the list of attributes (<em>MoveToAttributeNo</em> or
<em>GetAttributeNo</em>) or by its name (and namespace):</p>
<ul>
  <li><em>GetAttributeNo</em>(no): provides the value of the attribute with
    the specified index no relative to the containing element.</li>
  <li><em>GetAttribute</em>(name): provides the value of the attribute with
    the specified qualified name.</li>
  <li><em>GetAttributeNs</em>(localName, namespaceURI): provides the value of
    the attribute with the specified local name and namespace URI.</li>
  <li><em>MoveToAttributeNo</em>(no): moves the position of the current
    instance to the attribute with the specified index relative to the
    containing element.</li>
  <li><em>MoveToAttribute</em>(name): moves the position of the current
    instance to the attribute with the specified qualified name.</li>
  <li><em>MoveToAttributeNs</em>(localName, namespaceURI): moves the position
    of the current instance to the attribute with the specified local name
    and namespace URI.</li>
  <li><em>MoveToFirstAttribute</em>: moves the position of the current
    instance to the first attribute associated with the current node.</li>
  <li><em>MoveToNextAttribute</em>: moves the position of the current
    instance to the next attribute associated with the current node.</li>
  <li><em>MoveToElement</em>: moves the position of the current instance to
    the node that contains the current Attribute node.</li>
</ul>

<p>After modifying the processNode() function to show attributes:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        while reader.MoveToNextAttribute():
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))</pre>

<p>The output for the same input document reflects the attribute:</p>
<pre>0 1 doc 1 None
-- 1 2 (a) [b]</pre>

<p>There are a couple of things to note on the attribute processing:</p>
<ul>
  <li>Their depth is that of the carrying element plus one.</li>
  <li>Namespace declarations are seen as attributes, as in DOM.</li>
</ul>

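<p>The cursor-moving operations can be combined to gather all attributes of
an element at once. The following is a minimal sketch rather than an API of
the bindings: <code>collectAttributes()</code> is a hypothetical helper that
uses only the NodeType, HasAttributes, MoveToNextAttribute, Name, Value and
MoveToElement operations described above, and compares their results with 1
so that an error return does not keep the loop running.</p>

```python
def collectAttributes(reader):
    """Return a dict mapping qualified names to values for the attributes
    of the element the reader is currently positioned on."""
    attrs = {}
    if reader.NodeType() == 1 and reader.HasAttributes() == 1:
        while reader.MoveToNextAttribute() == 1:
            attrs[reader.Name()] = reader.Value()
        # Put the cursor back on the element carrying the attributes.
        reader.MoveToElement()
    return attrs
```

<p>Moving back with MoveToElement() leaves the reader on the element, so the
caller can keep querying element properties such as Name or Depth after
the call.</p>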
<h2><a name="Validating">Validating a document</a></h2>

<p>The libxml2 implementation adds some extra features on top of the
XmlTextReader API. The main one is the ability to DTD validate the parsed
document progressively. This is simply the activation of the associated
feature of the parser used by the reader structure. There are a few options
available, defined as the enum xmlParserProperties in the
libxml/xmlreader.h header file:</p>
<ul>
  <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
  <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
    loading the DTD)</li>
  <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
    the DTD)</li>
  <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly; entity
    reference nodes are not generated and are replaced by their expanded
    content.</li>
  <li>more settings might be added; those were the ones available at the
    2.5.0 release.</li>
</ul>

<p>The GetParserProp() and SetParserProp() methods can then be used to get
and set the values of those parser properties of the reader. For example:</p>
<pre>def parseAndValidate(filename):
    reader = libxml2.newTextReaderFilename(filename)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % (filename))</pre>

<p>This routine will parse and validate the file. Error messages can be
captured by registering an error handler. See python/tests/reader2.py for
more complete Python examples. At the C level the equivalent call to activate
the validation feature is just:</p>
<pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>

<p>and a return value of 0 indicates success.</p>

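<p>Several properties can be switched on in one go before the first Read().
This is a small sketch under stated assumptions:
<code>enableParserProps()</code> is a hypothetical helper, and it assumes the
Python SetParserProp() method passes through the C return value, so that 0
means success as described above.</p>

```python
def enableParserProps(reader, props):
    """Set each parser property in props to 1 on the reader.

    Returns the list of properties that could not be set, which is
    empty when every SetParserProp() call reported success (0).
    """
    failed = []
    for prop in props:
        if reader.SetParserProp(prop, 1) != 0:
            failed.append(prop)
    return failed
```

<p>For instance
<code>enableParserProps(reader, [libxml2.PARSER_VALIDATE,
libxml2.PARSER_SUBST_ENTITIES])</code> would request both DTD validation and
entity substitution, using the two constants that appear in this
tutorial.</p>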
<h2><a name="Entities">Entity substitution</a></h2>

<p>By default the xmlReader will report entities as such and not replace them
with their content. This default behaviour can however be overridden
using:</p>

<p><code>reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)</code></p>

<h2><a name="L1142">Relax-NG Validation</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>Libxml2 can now validate the document being read by the xmlReader against
Relax-NG schemas. While the Relax-NG validator can't always work in a
streamable mode, only subsets which cannot be reduced to regular expressions
need to have their subtree expanded for validation. In practice it means
that, unless the schema for the top level element content cannot be
expressed as a regexp, only chunks of the document need to be parsed while
validating.</p>

<p>The steps to do so are:</p>
<ul>
  <li>create a reader working on a document as usual</li>
  <li>before any call to read, associate it with a Relax-NG schema, either
    the preparsed schema or the URL of the schema to use</li>
  <li>errors will be reported the usual way, and the validity status can be
    obtained using the IsValid() interface of the reader, as for DTDs.</li>
</ul>

<p>Example, assuming the reader has already been created and that the schema
string contains the Relax-NG schema:</p>
<pre><code>rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
rngs = rngp.relaxNGParse()
reader.RelaxNGSetSchema(rngs)
ret = reader.Read()
while ret == 1:
    ret = reader.Read()
if ret != 0:
    print("Error parsing the document")
if reader.IsValid() != 1:
    print("Document failed to validate")</code></pre>

<p>See <code>reader6.py</code> in the sources or documentation for a complete
example.</p>

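<p>The steps above also mention giving the URL of the schema instead of a
preparsed schema. The sketch below illustrates that variant under an
explicit assumption: that the Python reader exposes the C function
<code>xmlTextReaderRelaxNGValidate()</code> as a
<code>RelaxNGValidate(url)</code> method returning 0 on success. The helper
name <code>validateWithSchemaUrl()</code> is hypothetical.</p>

```python
def validateWithSchemaUrl(reader, schemaUrl):
    """Read the whole document and return True when it is valid.

    Assumes reader.RelaxNGValidate(url) returns 0 once the schema is
    associated with the reader; it must be called before any Read().
    """
    if reader.RelaxNGValidate(schemaUrl) != 0:
        raise ValueError("could not load the Relax-NG schema")
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        raise ValueError("error parsing the document")
    return reader.IsValid() == 1
```

<p>Parsing problems raise an exception while a validity failure is returned
as False, which keeps the two kinds of error apart as in the example
above.</p>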
<h2><a name="Mixing">Mixing the reader and tree or XPath operations</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>While the reader is a streaming interface, its underlying implementation
is based on the DOM builder of libxml2. As a result it is relatively simple
to mix operations based on both models under some constraints. To do so the
reader has an Expand() operation which grows the subtree under the
current node. It returns a pointer to a standard node which can be
manipulated in the usual ways. The node will get all its ancestors and the
full subtree available. Usual operations like XPath queries can be used on
that reduced view of the document. Here is an example extracted from
reader5.py in the sources which extracts and prints the bibliography entry
for the "Dragon" compiler book from the XML 1.0 recommendation:</p>
<pre>f = open('../../test/valid/REC-xml-19980210.xml')
input = libxml2.inputBuffer(f)
reader = input.newTextReader("REC")
res=""
while reader.Read() == 1:
    while reader.Name() == 'bibl':
        node = reader.Expand()            # expand the subtree
        if node.xpathEval("@id = 'Aho'"): # use XPath on it
            res = res + node.serialize()
        if reader.Next() != 1:            # skip the subtree
            break</pre>

<p>Note, however, that the node instance returned by the Expand() call is
only valid until the next Read() operation. The Expand() operation does not
affect the Read() ones; however, once processed, the full subtree is usually
not useful anymore, and the Next() operation skips it completely and
proceeds to the successor, or returns 0 if the document end is reached.</p>

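<p>The Expand() then Next() pattern can be factored into a reusable loop. The
following is a minimal sketch, not taken from the libxml2 sources:
<code>extractSubtrees()</code> and its <code>handle</code> callback are
hypothetical names, and the only reader operations used are Read(),
NodeType(), Name(), Expand() and Next() as described in this tutorial.</p>

```python
def extractSubtrees(reader, name, handle):
    """Call handle(node) on the expanded subtree of each element called name.

    Returns the last status from the reader: 0 when the end of the
    document was reached, a negative value on a parsing error.
    """
    ret = reader.Read()
    while ret == 1:
        if reader.NodeType() == 1 and reader.Name() == name:
            # The expanded node is only valid until the reader moves on,
            # so handle() must copy whatever it wants to keep.
            handle(reader.Expand())
            ret = reader.Next()   # skip the subtree just processed
        else:
            ret = reader.Read()
    return ret
```

<p>Because the node does not outlive the next move of the reader, a typical
callback stores a copy, for example the string returned by
<code>node.serialize()</code>, rather than the node itself.</p>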
<p><a href="mailto:xml@gnome.org">Daniel Veillard</a></p>

<p>$Id$</p>

<p></p>
</body>
</html>