<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html">
<style type="text/css">
<!--
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }-->
</style>
<title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

<p></p>
<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modelled on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# class library.</p>

<p>This tutorial presents the key points of this API, with working
examples using both C and the Python bindings.</p>
<p>Table of contents:</p>
<ul>
<li><a href="#Introducti">Introduction: why a new API</a></li>
<li><a href="#Walking">Walking a simple tree</a></li>
<li><a href="#Extracting">Extracting information for the current
node</a></li>
<li><a href="#Validating">Validating a document</a></li>
<li><a href="#Entities">Entity substitution</a></li>
</ul>

<p></p>
<h2><a name="Introducti">Introduction: why a new API</a></h2>

<p>Libxml2's <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>: the parsing operation loads the document completely in
memory and exposes it as a tree of nodes that are all available at the
same time. This is very simple and quite powerful, but it has the major
limitation that the size of the document that can be handled is limited by
the amount of memory available. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a>-based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and
SAX is not formally defined for C. SAX basically works by registering
callbacks which the parser calls directly as it progresses through the
document stream. The problem is that this programming model is relatively
complex and not well standardized, cannot provide validation directly, and
makes entity, namespace and base URI processing relatively hard.</p>
<p>The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model: the API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user code keeps control of the progress and simply calls a
Read() function repeatedly to advance to each node in sequence, in document
order. There is direct support for namespaces, xml:base and entity handling,
and adding DTD validation on top of it was relatively simple. This API is
really close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. It provides a far more standard, easy-to-use and powerful
API than the existing SAX. Moreover, integrating extension features based on
the tree seems relatively easy.</p>
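<p>The difference between the two programming models can be sketched
without any XML at all. The toy code below is not libxml2 code: the
"document" is a fixed array of names, and every identifier in it
(<code>toyPushParse</code>, <code>toyRead</code>, <code>toyReader</code>
and so on) is hypothetical. It only illustrates who owns the loop: the
parser in the callback model, the caller in the cursor model.</p>

```c
#include <stddef.h>

/* Hypothetical toy "document": a fixed sequence of node names. */
static const char *const toyNodes[] = { "doc", "title", "para" };
#define TOY_COUNT (sizeof(toyNodes) / sizeof(toyNodes[0]))

/* Push (SAX-like) model: the parser owns the loop and calls back. */
typedef void (*toyCallback)(void *userData, const char *name);

static void toyPushParse(toyCallback cb, void *userData) {
    size_t i;

    for (i = 0; i < TOY_COUNT; i++)
        cb(userData, toyNodes[i]);
}

/* Pull (reader-like) model: the caller owns the loop. */
typedef struct {
    size_t next;          /* index of the next node to deliver */
    const char *current;  /* name of the node the cursor is on */
} toyReader;

/* Returns 1 if the cursor moved to a node, 0 at the end of the input. */
static int toyRead(toyReader *r) {
    if (r->next >= TOY_COUNT)
        return 0;
    r->current = toyNodes[r->next++];
    return 1;
}

/* With callbacks, state has to be threaded through userData. */
static void countCallback(void *userData, const char *name) {
    (void) name;
    (*(int *) userData)++;
}

static int countWithPush(void) {
    int count = 0;

    toyPushParse(countCallback, &count);
    return count;
}

/* With a cursor, state lives in ordinary local variables. */
static int countWithPull(void) {
    toyReader r = { 0, NULL };
    int count = 0;

    while (toyRead(&r) == 1)
        count++;
    return count;
}
```

<p>Both functions count the same three nodes. The pull version needs no
callback type and no user-data pointer, which is the simplification the
paragraph above attributes to the reader interface.</p>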
<p>In a nutshell, the XmlTextReader API provides a simpler, more standard and
more extensible interface for handling large documents than the existing SAX
version.</p>

<h2><a name="Walking">Walking a simple tree</a></h2>

<p>Basically the XmlTextReader API is a forward-only tree walking interface.
The basic steps are:</p>
<ol>
<li>prepare a reader context operating on some input</li>
<li>run a loop iterating over all nodes in the document</li>
<li>free up the reader context</li>
</ol>

<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

/* Returns 0 on success, -1 if the file cannot be opened or parsed. */
int streamFile(const char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        /* xmlTextReaderRead() returns 1 while a node is available */
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        /* 0 is the end of the document, a negative value is an error */
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
            return -1;
        }
    } else {
        printf("Unable to open %s\n", filename);
        return -1;
    }
    return 0;
}</pre>
<p>A few things to notice:</p>
<ul>
<li>the include file needed: <code>libxml/xmlreader.h</code></li>
<li>the creation of the reader using a filename</li>
<li>the repeated calls to xmlTextReaderRead(), and how any return value
other than 1 should stop the loop</li>
<li>that a negative return value means a parsing error</li>
<li>how xmlFreeTextReader() should be used to free up the resources used by
the reader.</li>
</ul>
<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    # handling of a node in the tree
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()
    if ret != 0:
        print("%s : failed to parse" % (filename))
</pre>
<p>The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a>, with the same method names (but the
properties are currently accessed with methods), and that one does not need
to free the reader at the end of the processing: it will be garbage
collected once all references have disappeared.</p>
<h2><a name="Extracting">Extracting information for the current node</a></h2>

<p>So far the example code did not show how information is extracted
from the reader; this was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
<em>Property</em> is available at the C level as a function taking a single
xmlTextReaderPtr argument, whose name is
<code>xmlTextReader</code><em>Property</em>. If the return type is an
<code>xmlChar *</code> string then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method on the reader class that can be called on the
instance. The list of properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
<li><em>NodeType</em>: the node type: 1 for start element, 15 for end of
element, 2 for attributes, 3 for text nodes, 4 for CDATA sections, 5 for
entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
9 for the document node, 10 for DTD/Doctype nodes, 11 for document
fragments and 12 for notation nodes.</li>
<li><em>Name</em>: the <a
href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
<li><em>LocalName</em>: the <a
href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
the node.</li>
<li><em>Prefix</em>: a shorthand reference to the <a
href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
the node.</li>
<li><em>NamespaceUri</em>: the URI defining the <a
href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
the node.</li>
<li><em>BaseUri</em>: the base URI of the node. See the <a
href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
<li><em>Depth</em>: the depth of the node in the tree, starting at 0 for the
root node.</li>
<li><em>HasAttributes</em>: whether the node has attributes.</li>
<li><em>HasValue</em>: whether the node can have a text value.</li>
<li><em>Value</em>: provides the text value of the node if present.</li>
<li><em>IsDefault</em>: whether an Attribute node was generated from the
default value defined in the DTD or schema (<em>not yet
supported</em>).</li>
<li><em>XmlLang</em>: the <a
href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
within which the node resides.</li>
<li><em>IsEmptyElement</em>: checks whether the current node is empty. This
is a bit bizarre in the sense that <code>&lt;a/&gt;</code> will be
considered empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
<li><em>AttributeCount</em>: provides the number of attributes of the
current node.</li>
</ul>
<p></p>
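<p>The numeric NodeType codes are easier to read in debugging output once
they are given names. The helper below is a small sketch:
<code>nodeTypeName()</code> is a hypothetical function and the labels are
illustrative choices, but the numbers are the ones listed for the NodeType
property. It is meant to be passed the result of
<code>xmlTextReaderNodeType(reader)</code>.</p>

```c
/*
 * Map a NodeType code to a readable label.
 * nodeTypeName() is a hypothetical helper; the labels are illustrative.
 */
const char *nodeTypeName(int type) {
    switch (type) {
        case 1:  return "Element";
        case 2:  return "Attribute";
        case 3:  return "Text";
        case 4:  return "CDATA";
        case 5:  return "EntityReference";
        case 6:  return "EntityDeclaration";
        case 7:  return "ProcessingInstruction";
        case 8:  return "Comment";
        case 9:  return "Document";
        case 10: return "DocumentType";
        case 11: return "DocumentFragment";
        case 12: return "Notation";
        case 15: return "EndElement";
        default: return "Unknown";  /* codes not covered by the list */
    }
}
```

<p>Any code outside the list falls through to <code>"Unknown"</code>
rather than being guessed at.</p>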
<h2><a name="Validating">Validating a document</a></h2>

<h2><a name="Entities">Entity substitution</a></h2>

<p> </p>

<p><a href="mailto:veillard@redhat.com">Daniel Veillard</a></p>

<p>$Id$</p>

<p></p>
</body>
</html>