<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html">
<style type="text/css">
<!--
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }-->
</style>
<title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

<p></p>

<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modeled after the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# language.</p>

<p>This tutorial presents the key points of this API, with working
examples using both C and the Python bindings.</p>

<p>Table of contents:</p>
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
    node</a></li>
  <li><a href="#Extracting1">Extracting information for the
    attributes</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entities substitution</a></li>
</ul>

<p></p>

<h2><a name="Introducti">Introduction: why a new API</a></h2>

<p>Libxml2 <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>: the parsing operation loads the document completely in
memory and exposes it as a tree of nodes all available at the same time.
This is very simple and quite powerful, but has the major limitation that
the size of the document that can be handled is limited by the size of the
memory available. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and
SAX is not formally defined for C. SAX basically works by registering
callbacks which are called directly by the parser as it progresses through
the document stream. The problem is that this programming model is
relatively complex, is not well standardized, cannot provide validation
directly, and makes entity, namespace and base processing relatively
hard.</p>
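
<p>To make the callback model concrete, here is a minimal sketch of SAX-style
processing. It uses the <code>xml.sax</code> module from the Python standard
library rather than libxml2's own SAX interface, and the
<code>EventCollector</code> class and <code>collect_events()</code> helper are
hypothetical names chosen for illustration. The point to notice is that the
parser, not the user code, decides when each method runs:</p>

```python
import xml.sax


class EventCollector(xml.sax.ContentHandler):
    """Hypothetical handler recording events in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Invoked by the parser: user code never asks for "the next node".
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Text may be delivered in several chunks; blank chunks are skipped here.
        if content.strip():
            self.events.append(("text", content))


def collect_events(xml_bytes):
    """Parse a document held in memory and return the recorded events."""
    handler = EventCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events


if __name__ == "__main__":
    for event in collect_events(b"<doc><a/><b>some text</b></doc>"):
        print(event)
```

<p>Any state needed across callbacks (current depth, enclosing element, and so
on) has to be stored on the handler object, which is part of what makes this
model harder to use than a cursor.</p>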

<p>The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user code keeps control of the progress and simply calls a
Read() function repeatedly to advance to each node in sequence, in document
order. There is direct support for namespaces, xml:base and entity handling,
and adding DTD validation on top of it was relatively simple. This API is
really close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and powerful
API than the existing SAX. Moreover, integrating extension features based on
the tree seems relatively easy.</p>

<p>In a nutshell, the XmlTextReader API provides a simpler, more standard and
more extensible interface to handle large documents than the existing SAX
version.</p>

<h2><a name="Walking">Walking a simple tree</a></h2>

<p>Basically the XmlTextReader API is a forward-only tree walking interface.
The basic steps are:</p>
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>

<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

int streamFile(char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        /* xmlTextReaderRead() returns 1 while nodes remain, 0 at the end */
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
        }
    } else {
        printf("Unable to open %s\n", filename);
        ret = -1;
    }
    /* 0 on success, negative if the file could not be opened or parsed */
    return ret;
}</pre>

<p>A few things to notice:</p>
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop</li>
  <li>that a negative return value means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>

<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % (filename))
</pre>

<p>The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a> with the same method names (but the
properties are currently accessed with methods), and that one doesn't need to
free the reader at the end of the processing: it will get garbage collected
once all references have disappeared.</p>

<h2><a name="Extracting">Extracting information for the current node</a></h2>

<p>So far the example code did not indicate how information is extracted
from the reader; it was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
<em>Property</em> is available at the C level as a function taking a single
xmlTextReaderPtr argument whose name is
<code>xmlTextReader</code><em>Property</em>. If the return type is an
<code>xmlChar *</code> string then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method on the reader class that can be called on the
instance. The list of the properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
  <li><em>NodeType</em>: the node type, 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.</li>
  <li><em>Name</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  <li><em>LocalName</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li><em>Prefix</em>: a shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>NamespaceUri</em>: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>BaseUri</em>: the base URI of the node. See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li><em>Depth</em>: the depth of the node in the tree, starting at 0 for the
    root node.</li>
  <li><em>HasAttributes</em>: whether the node has attributes.</li>
  <li><em>HasValue</em>: whether the node can have a text value.</li>
  <li><em>Value</em>: provides the text value of the node if present.</li>
  <li><em>IsDefault</em>: whether an Attribute node was generated from the
    default value defined in the DTD or schema (<em>not supported
    yet</em>).</li>
  <li><em>XmlLang</em>: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li><em>IsEmptyElement</em>: checks whether the current node is empty. This
    is a bit bizarre in the sense that <code>&lt;a/&gt;</code> will be
    considered empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  <li><em>AttributeCount</em>: provides the number of attributes of the
    current node.</li>
</ul>
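
<p>Since <em>NodeType</em> is reported as a bare integer, a small lookup table
can make debugging output easier to read. The following is a minimal sketch:
the numeric codes are the ones listed above, while the label strings and the
<code>node_type_label()</code> helper are illustrative names, not identifiers
provided by libxml2.</p>

```python
# Numeric node types as listed above, mapped to descriptive labels.
# The labels are illustrative choices, not identifiers from libxml2.
NODE_TYPE_LABELS = {
    1: "element start",
    2: "attribute",
    3: "text",
    4: "CDATA section",
    5: "entity reference",
    6: "entity declaration",
    7: "processing instruction",
    8: "comment",
    9: "document",
    10: "DTD/doctype",
    11: "document fragment",
    12: "notation",
    15: "element end",
}


def node_type_label(node_type):
    """Return a readable label for a NodeType code, or a fallback for unlisted codes."""
    return NODE_TYPE_LABELS.get(node_type, "unknown (%d)" % node_type)


if __name__ == "__main__":
    for code in (1, 3, 15, 99):
        print(code, node_type_label(code))
```

<p>Inside a processNode() routine this could be called as
<code>node_type_label(reader.NodeType())</code>.</p>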

<p>Let's look first at a small example to put this in practice by redefining
the processNode() function in the Python example:</p>
<pre>def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))</pre>

<p>and look at the result of calling streamFile("tst.xml") for various
contents of the XML test file.</p>

<p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
<pre>0 1 doc 1</pre>

<p>Only one node is found: its depth is 0, type 1 indicates an element start,
its name is "doc" and it is empty. Trying now with
"<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
<pre>0 1 doc 0
0 15 doc 0</pre>

<p>The document root node is not flagged as empty anymore and both a start
and an end of element are detected. The following document shows how
character data are reported:</p>
<pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
&lt;c/&gt;&lt;/doc&gt;</pre>

<p>We modify the processNode() function to also report the node Value:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))</pre>

<p>The result of the test is:</p>
<pre>0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0 

1 1 c 1 None
0 15 doc 0 None</pre>

<p>There are a few things to note:</p>
<ul>
  <li>the increase of the depth value (first column) as children nodes are
    explored</li>
  <li>the text node child of the b element, of type 3, and its content</li>
  <li>the text node containing the line return between elements b and c</li>
  <li>that elements have the Value None (or NULL in C)</li>
</ul>
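
<p>As the output above shows, the line return between elements b and c is
reported as a text node of its own. Code that only cares about meaningful text
may want to skip such nodes. The following is a minimal sketch of one way to
do that, relying only on the <em>NodeType</em> and <em>Value</em> methods
described earlier. The <code>is_blank_text()</code> helper is a hypothetical
name, and <code>StubReader</code> is a stand-in object used so that the sketch
runs without the libxml2 bindings installed; with the real bindings the reader
instance would be passed instead.</p>

```python
TEXT_NODE = 3  # NodeType value for text nodes, per the list of properties


def is_blank_text(reader):
    """Return True when the reader sits on a text node holding only whitespace."""
    if reader.NodeType() != TEXT_NODE:
        return False
    value = reader.Value()
    return value is not None and value.strip() == ""


class StubReader:
    """Hypothetical stand-in exposing just NodeType() and Value() for the demo."""

    def __init__(self, node_type, value):
        self._node_type = node_type
        self._value = value

    def NodeType(self):
        return self._node_type

    def Value(self):
        return self._value


if __name__ == "__main__":
    print(is_blank_text(StubReader(3, "\n")))         # whitespace-only text node
    print(is_blank_text(StubReader(3, "some text")))  # text node with content
    print(is_blank_text(StubReader(1, None)))         # element start, no value
```

<p>A processNode() routine could then start with
<code>if is_blank_text(reader): return</code> before printing anything.</p>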

<p>The equivalent routine for <code>processNode()</code> as used by
<code>xmllint --stream --debug</code> is the following and can be found in
the xmllint.c module in the source distribution:</p>
<pre>static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
            xmlTextReaderDepth(reader),
            xmlTextReaderNodeType(reader),
            name,
            xmlTextReaderIsEmptyElement(reader));
    /* strings returned by the reader must be released with xmlFree() */
    xmlFree(name);
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}</pre>

<h2><a name="Extracting1">Extracting information for the attributes</a></h2>

<p>The previous examples don't indicate how attributes are processed. The
simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
result:</p>
<pre>0 1 doc 1 None</pre>

<p>This proves that attribute nodes are not traversed by default. The
<em>HasAttributes</em> property allows detecting their presence. To check
their content the API has special instructions. Basically two kinds of
operations are possible:</p>
<ol>
  <li>to move the reader to the attribute nodes of the current element, in
    which case the cursor is positioned on the attribute node</li>
  <li>to directly query the element node for the attribute value</li>
</ol>

<p>In both cases the attribute can be designated either by its position in the
list of attributes (<em>MoveToAttributeNo</em> or <em>GetAttributeNo</em>) or
by its name (and namespace):</p>
<ul>
  <li><em>GetAttributeNo</em>(no): provides the value of the attribute with
    the specified index no relative to the containing element.</li>
  <li><em>GetAttribute</em>(name): provides the value of the attribute with
    the specified qualified name.</li>
  <li><em>GetAttributeNs</em>(localName, namespaceURI): provides the value of
    the attribute with the specified local name and namespace URI.</li>
  <li><em>MoveToAttributeNo</em>(no): moves the position of the current
    instance to the attribute with the specified index relative to the
    containing element.</li>
  <li><em>MoveToAttribute</em>(name): moves the position of the current
    instance to the attribute with the specified qualified name.</li>
  <li><em>MoveToAttributeNs</em>(localName, namespaceURI): moves the position
    of the current instance to the attribute with the specified local name
    and namespace URI.</li>
  <li><em>MoveToFirstAttribute</em>: moves the position of the current
    instance to the first attribute associated with the current node.</li>
  <li><em>MoveToNextAttribute</em>: moves the position of the current
    instance to the next attribute associated with the current node.</li>
  <li><em>MoveToElement</em>: moves the position of the current instance to
    the node that contains the current Attribute node.</li>
</ul>

<p>After modifying the processNode() function to show attributes:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        while reader.MoveToNextAttribute():
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))</pre>

<p>the output for the same input document reflects the attribute:</p>
<pre>0 1 doc 1 None
-- 1 2 (a) [b]</pre>

<p>There are a couple of things to note on the attribute processing:</p>
<ul>
  <li>their depth is the one of the carrying element plus one</li>
  <li>namespace declarations are seen as attributes, like in DOM</li>
</ul>
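
<p>Once the cursor has been moved onto attribute nodes, it stays there until
it is moved again, so code that inspects the element afterwards usually wants
to call <em>MoveToElement</em> first. The following is a minimal sketch that
gathers all attributes of the current element into a dictionary and then
restores the cursor. It uses only the methods listed above; the
<code>attributes_of()</code> helper is a hypothetical name, and
<code>StubReader</code> is a simplified stand-in (its return values are
assumed to mirror the 1/0 convention of the real reader) so that the sketch
runs without the libxml2 bindings.</p>

```python
def attributes_of(reader):
    """Collect the attributes of the current element as a {name: value} dict."""
    attrs = {}
    if reader.HasAttributes():
        # From an element, the first call moves to the first attribute.
        while reader.MoveToNextAttribute() == 1:
            attrs[reader.Name()] = reader.Value()
        # Put the cursor back on the element that carries the attributes.
        reader.MoveToElement()
    return attrs


class StubReader:
    """Hypothetical stand-in for a reader positioned on one element."""

    def __init__(self, element_name, attrs):
        self._element = element_name
        self._attrs = list(attrs)  # list of (name, value) pairs
        self._pos = -1             # -1 means "on the element itself"

    def HasAttributes(self):
        return 1 if self._attrs else 0

    def MoveToNextAttribute(self):
        if self._pos + 1 < len(self._attrs):
            self._pos += 1
            return 1
        return 0

    def MoveToElement(self):
        moved = 1 if self._pos >= 0 else 0
        self._pos = -1
        return moved

    def Name(self):
        return self._element if self._pos < 0 else self._attrs[self._pos][0]

    def Value(self):
        return None if self._pos < 0 else self._attrs[self._pos][1]


if __name__ == "__main__":
    reader = StubReader("doc", [("a", "b")])
    print(attributes_of(reader))
    print(reader.Name())
```

<p>When only one attribute is of interest, querying it directly with
<em>GetAttribute</em>(name) avoids moving the cursor at all.</p>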

<h2><a name="Validating">Validating a document</a></h2>

<p>The libxml2 implementation adds some extra features on top of the
XmlTextReader API. The main one is the ability to DTD validate the parsed
document progressively. This is simply the activation of the associated
feature of the parser used by the reader structure. A few options are
available, defined as the enum xmlParserProperties in the libxml/xmlreader.h
header file:</p>
<ul>
  <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
  <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
    loading the DTD)</li>
  <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
    the DTD)</li>
  <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly; entity
    reference nodes are not generated and are replaced by their expanded
    content.</li>
  <li>more settings might be added; those were the ones available at the 2.5.0
    release...</li>
</ul>

<p>The GetParserProp() and SetParserProp() methods can then be used to get
and set the values of those parser properties of the reader. For example:</p>
<pre>def parseAndValidate(file):
    reader = libxml2.newTextReaderFilename(file)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % (file))</pre>

<p>This routine will parse and validate the file. Error messages can be
captured by registering an error handler. See python/tests/reader2.py for
more complete Python examples. At the C level the equivalent call to activate
the validation feature is just:</p>
<pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>

<p>and a return value of 0 indicates success.</p>
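
<p>When several properties are enabled at once it can be convenient to check
every return value in one place. The following is a minimal sketch, assuming
that the Python SetParserProp() method passes through the C return value (0 on
success). The <code>enable_parser_props()</code> helper is a hypothetical
name, and <code>StubReader</code> with its numeric property codes is a
placeholder used only so that the sketch runs without the libxml2
bindings.</p>

```python
def enable_parser_props(reader, props):
    """Turn on each listed parser property, failing loudly on a nonzero return."""
    for prop in props:
        ret = reader.SetParserProp(prop, 1)
        if ret != 0:
            raise RuntimeError("SetParserProp(%r, 1) returned %d" % (prop, ret))


class StubReader:
    """Hypothetical stand-in accepting placeholder property codes 1 to 4."""

    def __init__(self):
        self._props = {}

    def SetParserProp(self, prop, value):
        if prop not in (1, 2, 3, 4):
            return -1  # mimic a failure for an unknown property
        self._props[prop] = value
        return 0

    def GetParserProp(self, prop):
        return self._props.get(prop, 0)


if __name__ == "__main__":
    reader = StubReader()
    enable_parser_props(reader, [1, 3])
    print(reader.GetParserProp(1), reader.GetParserProp(3))
```

<p>With the real bindings the call would look like
<code>enable_parser_props(reader, [libxml2.PARSER_VALIDATE])</code>, made
before the first Read().</p>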

<h2><a name="Entities">Entities substitution</a></h2>

<p></p>

<p><a href="mailto:veillard@redhat.com">Daniel Veillard</a></p>

<p>$Id$</p>

<p></p>
</body>
</html>