<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
<style type="text/css"><!--
TD {font-size: 10pt; font-family: Verdana,Arial,Helvetica}
BODY {font-size: 10pt; font-family: Verdana,Arial,Helvetica; margin-top: 5pt; margin-left: 0pt; margin-right: 0pt}
H1 {font-size: 16pt; font-family: Verdana,Arial,Helvetica}
H2 {font-size: 14pt; font-family: Verdana,Arial,Helvetica}
H3 {font-size: 12pt; font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
--></style>
<title>Encodings support</title>
</head>
<body bgcolor="#8b7765" text="#000000" link="#000000" vlink="#000000">
<table border="0" width="100%" cellpadding="5" cellspacing="0" align="center"><tr>
<td width="180">
<a href="http://www.gnome.org/"><img src="smallfootonly.gif" alt="Gnome Logo"></a><a href="http://www.w3.org/Status"><img src="w3c.png" alt="W3C Logo"></a><a href="http://www.redhat.com/"><img src="redhat.gif" alt="Red Hat Logo"></a>
</td>
<td><table border="0" width="90%" cellpadding="2" cellspacing="0" align="center" bgcolor="#000000"><tr><td><table width="100%" border="0" cellspacing="1" cellpadding="3" bgcolor="#fffacd"><tr><td align="center">
<h1>The XML C library for Gnome</h1>
<h2>Encodings support</h2>
</td></tr></table></td></tr></table></td>
</tr></table>
<table border="0" cellpadding="4" cellspacing="0" width="100%" align="center"><tr><td bgcolor="#8b7765"><table border="0" cellspacing="0" cellpadding="2" width="100%"><tr>
<td valign="top" width="200" bgcolor="#8b7765"><table border="0" cellspacing="0" cellpadding="1" width="100%" bgcolor="#000000"><tr><td>
<table width="100%" border="0" cellspacing="1" cellpadding="3">
<tr><td colspan="1" bgcolor="#eecfa1" align="center"><center><b>Main Menu</b></center></td></tr>
<tr><td bgcolor="#fffacd"><ul style="margin-left: -2pt">
<li><a href="index.html">Home</a></li>
<li><a href="intro.html">Introduction</a></li>
<li><a href="FAQ.html">FAQ</a></li>
<li><a href="docs.html">Documentation</a></li>
<li><a href="bugs.html">Reporting bugs and getting help</a></li>
<li><a href="help.html">How to help</a></li>
<li><a href="downloads.html">Downloads</a></li>
<li><a href="news.html">News</a></li>
<li><a href="XML.html">XML</a></li>
<li><a href="XSLT.html">XSLT</a></li>
<li><a href="architecture.html">libxml architecture</a></li>
<li><a href="tree.html">The tree output</a></li>
<li><a href="interface.html">The SAX interface</a></li>
<li><a href="xmldtd.html">Validation &amp; DTDs</a></li>
<li><a href="xmlmem.html">Memory Management</a></li>
<li><a href="encoding.html">Encodings support</a></li>
<li><a href="xmlio.html">I/O Interfaces</a></li>
<li><a href="catalog.html">Catalog support</a></li>
<li><a href="library.html">The parser interfaces</a></li>
<li><a href="entities.html">Entities or no entities</a></li>
<li><a href="namespaces.html">Namespaces</a></li>
<li><a href="upgrade.html">Upgrading 1.x code</a></li>
<li><a href="DOM.html">DOM Principles</a></li>
<li><a href="example.html">A real example</a></li>
<li><a href="contribs.html">Contributions</a></li>
<li>
<a href="xml.html">flat page</a>, <a href="site.xsl">stylesheet</a>
</li>
</ul></td></tr>
</table>
<table width="100%" border="0" cellspacing="1" cellpadding="3">
<tr><td colspan="1" bgcolor="#eecfa1" align="center"><center><b>Related links</b></center></td></tr>
<tr><td bgcolor="#fffacd"><ul style="margin-left: -2pt">
<li><a href="http://mail.gnome.org/archives/xml/">Mail archive</a></li>
<li><a href="http://xmlsoft.org/XSLT/">XSLT libxslt</a></li>
<li><a href="http://www.cs.unibo.it/~casarini/gdome2/">DOM gdome2</a></li>
<li><a href="ftp://xmlsoft.org/">FTP</a></li>
<li><a href="http://www.fh-frankfurt.de/~igor/projects/libxml/">Windows binaries</a></li>
<li><a href="http://pages.eidosnet.co.uk/~garypen/libxml/">Solaris binaries</a></li>
<li><a href="http://bugzilla.gnome.org/buglist.cgi?product=libxml">Bug Tracker</a></li>
</ul></td></tr>
</table>
</td></tr></table></td>
<td valign="top" bgcolor="#8b7765"><table border="0" cellspacing="0" cellpadding="1" width="100%"><tr><td><table border="0" cellspacing="0" cellpadding="1" width="100%" bgcolor="#000000"><tr><td><table border="0" cellpadding="3" cellspacing="1" width="100%"><tr><td bgcolor="#fffacd">
<p>Table of Contents:</p>
<ol>
<li><a href="encoding.html#What">What does internationalization support
mean?</a></li>
<li><a href="encoding.html#internal">The internal encoding, how and
why</a></li>
<li><a href="encoding.html#implemente">How is it implemented?</a></li>
<li><a href="encoding.html#Default">Default supported encodings</a></li>
<li><a href="encoding.html#extend">How to extend the existing
support</a></li>
</ol>
<h3><a name="What">What does internationalization support mean?</a></h3>
<p>XML was designed from the start to support any character set by using
Unicode. Any conformant XML parser has to support the UTF-8 and UTF-16
default encodings, which can both express the full Unicode range. UTF-8 is a
variable-length encoding whose main strengths are that it reuses the same
encoding for ASCII and saves space for Western languages, but it is a bit
more complex to handle in practice. UTF-16 uses 2 bytes per character (and
sometimes combines two 2-byte units into a surrogate pair); it makes
implementation easier, but looks a bit like overkill for Western languages.
Moreover, the XML specification allows documents to be encoded in other
encodings on the condition that they are clearly labelled as such. For
example, the following is a well-formed XML document encoded in ISO-8859-1
and using the accented letters that we French like, for both markup and
content:</p>
<pre>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;très&gt;là&lt;/très&gt;</pre>
<p>Having internationalization support in libxml means the following:</p>
<ul>
<li>the document is properly parsed</li>
<li>information about its encoding is saved</li>
<li>it can be modified</li>
<li>it can be saved in its original encoding</li>
<li>it can also be saved in another encoding supported by libxml (for
example straight UTF-8 or even an ASCII form)</li>
</ul>
<p>Another very important point is that the whole libxml API, with the
exception of a few routines to read with a specific encoding or save to a
specific encoding, is completely agnostic about the original encoding of the
document.</p>
<p>Note that the HTML parser embedded in libxml now obeys the same rules: as
of 2.2.2, the following document will be handled in an internationalized
fashion by libxml too:</p>
<pre>&lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html lang="fr"&gt;
&lt;head&gt;
&lt;META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1"&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;p&gt;W3C crée des standards pour le Web.&lt;/body&gt;
&lt;/html&gt;</pre>
<h3><a name="internal">The internal encoding, how and why</a></h3>
<p>One of the core decisions was to force all documents to be converted to a
default internal encoding, and for that encoding to be UTF-8. Here is the
rationale for those choices:</p>
<ul>
<li>Keeping the native encoding in the internal form would force libxml
users (or the associated code) to be fully aware of the encoding of the
original document. For example, when adding a text node to a document,
the content would have to be provided in the document encoding, i.e. the
client code would have to check it beforehand, make sure it conforms
to the encoding, etc. This is very hard in practice, though in some specific
cases it may make sense.</li>
<li>The second decision was which encoding. From the XML spec, only UTF-8 and
UTF-16 really make sense, being the only two encodings for which support
is mandatory. UCS-4 (a 32-bit fixed-size encoding) could be
considered an intelligent choice too, since it is a direct mapping of
Unicode. I selected UTF-8 on the basis of efficiency and compatibility
with surrounding software:
<ul>
<li>UTF-8, while a bit more complex to convert from/to (i.e. slightly
more costly to import and export CPU-wise), is also far more compact
than UTF-16 (and UCS-4) for the majority of the documents I see it used
for right now (RPM RDF catalogs, advogato data, various configuration
file formats, etc.), and the key point for today's computer
architectures is efficient use of caches. If one nearly doubles the
memory requirement to store the same amount of data, this will thrash
caches (main memory/external caches/internal caches), and my take is
that this harms the system far more than the CPU requirements needed
for the conversion to UTF-8.</li>
<li>Most libxml version 1 users were using it with straight ASCII
most of the time; switching to an internal encoding that required all
their code to be rewritten was a serious show-stopper
for using UTF-16 or UCS-4.</li>
<li>UTF-8 is being used as the de facto internal encoding standard for
related code like <a href="http://www.pango.org/">pango</a>, the
upcoming Gnome text widget, and a lot of Unix code (yep, another place
where the Unix programmer base takes a different approach from Microsoft,
which uses UTF-16).</li>
</ul>
</li>
</ul>
<p>What does this mean in practice for the libxml user:</p>
<ul>
<li>xmlChar, the libxml data type, is a byte; those bytes must be assembled
as valid UTF-8 strings. The proper way to terminate an xmlChar * string
is simply to append a 0 byte, as usual.</li>
<li>One just needs to make sure that when using characters outside the ASCII
set, the values have been properly converted to UTF-8.</li>
</ul>
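<p>As an illustration of the two points above, here is a minimal sketch of
how a Unicode code point can be appended to a zero-terminated UTF-8 buffer.
It is not libxml code: the xmlChar typedef is a local stand-in so the example
compiles without the libxml headers, and utf8_append() is a hypothetical
helper name.</p>

```c
#include <stddef.h>

/* Local stand-in for libxml's xmlChar (a byte); an assumption made so that
 * the sketch compiles without libxml headers. */
typedef unsigned char xmlChar;

/*
 * Append the UTF-8 form of code point cp to buf at offset pos, always
 * leaving room for the terminating 0 byte.
 * Returns the new offset, or 0 if cp cannot be encoded or buf is too small.
 */
static size_t utf8_append(xmlChar *buf, size_t size, size_t pos,
                          unsigned long cp)
{
    size_t need;

    if (cp < 0x80) need = 1;
    else if (cp < 0x800) need = 2;
    else if (cp < 0x10000) need = 3;
    else if (cp < 0x110000) need = 4;
    else return 0;                              /* beyond the Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are not chars */
    if (pos + need + 1 > size) return 0;        /* no room for data + 0 byte */

    switch (need) {
    case 1:
        buf[pos++] = (xmlChar) cp;
        break;
    case 2:
        buf[pos++] = (xmlChar) (0xC0 | (cp >> 6));
        buf[pos++] = (xmlChar) (0x80 | (cp & 0x3F));
        break;
    case 3:
        buf[pos++] = (xmlChar) (0xE0 | (cp >> 12));
        buf[pos++] = (xmlChar) (0x80 | ((cp >> 6) & 0x3F));
        buf[pos++] = (xmlChar) (0x80 | (cp & 0x3F));
        break;
    default:
        buf[pos++] = (xmlChar) (0xF0 | (cp >> 18));
        buf[pos++] = (xmlChar) (0x80 | ((cp >> 12) & 0x3F));
        buf[pos++] = (xmlChar) (0x80 | ((cp >> 6) & 0x3F));
        buf[pos++] = (xmlChar) (0x80 | (cp & 0x3F));
        break;
    }
    buf[pos] = 0;   /* the string stays zero-terminated after every append */
    return pos;
}
```

<p>Appending 'l' and then U+00E0 (à) this way yields the bytes 0x6C 0xC3 0xA0
followed by the terminating 0, which is the form libxml expects for the text
"là".</p>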
<h3><a name="implemente">How is it implemented?</a></h3>
<p>Let's describe how all this works within libxml. Basically, the I18N
(internationalization) support gets triggered only during I/O operations, i.e.
when reading a document or saving one. Let's look first at the reading
sequence:</p>
<ol>
<li>When a document is processed, we usually don't know the encoding. A
simple heuristic allows UTF-16 and UCS-4 to be distinguished from those
encodings where the ASCII range (0-0x7F) maps to ASCII.</li>
<li>The XML declaration, if available, is parsed, including the encoding
declaration. At that point, if the autodetected encoding is different
from the one declared, a call to xmlSwitchEncoding() is issued.</li>
<li>If there is no encoding declaration, then the input has to be in either
UTF-8 or UTF-16. If it is not, then at some point when processing the
input, the converter/checker of UTF-8 form will raise an encoding error.
You may end up with a garbled document, or no document at all! Example:
<pre>~/XML -&gt; ./xmllint err.xml
err.xml:1: error: Input is not proper UTF-8, indicate encoding !
&lt;très&gt;là&lt;/très&gt;
^
err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
&lt;très&gt;là&lt;/très&gt;
^</pre>
</li>
<li>xmlSwitchEncoding() does an encoding name lookup, canonicalizes it, and
then searches the default registered encoding converters for that encoding.
If it's not within the default set and iconv() support has been compiled
in, it will ask iconv for such an encoder. If this fails, then the parser
will report an error and stop processing:
<pre>~/XML -&gt; ./xmllint err2.xml
err2.xml:1: error: Unsupported encoding UnsupportedEnc
&lt;?xml version="1.0" encoding="UnsupportedEnc"?&gt;
^</pre>
</li>
<li>From that point on, the encoder progressively processes the input (it is
plugged in as a front end to the I/O module) for that entity. It captures
and converts on the fly the document to be parsed to UTF-8. The parser
itself just does UTF-8 checking of this input and processes it
transparently. The only difference is that the encoding information has
been added to the parsing context (more precisely, to the input
corresponding to this entity).</li>
<li>The result (when using DOM) is an internal form completely in UTF-8,
with just the encoding information kept on the document node.</li>
</ol>
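<p>The heuristic used in the first step of the reading sequence can be
sketched as follows. This is not the libxml implementation: it is a minimal
example modelled on the byte patterns listed in Appendix F of the XML 1.0
specification, and the enum and function names are hypothetical.</p>

```c
#include <stddef.h>
#include <string.h>

/* Possible outcomes of the guess; names are hypothetical. */
typedef enum {
    GUESS_UCS4_BE,
    GUESS_UCS4_LE,
    GUESS_UTF16_BE,
    GUESS_UTF16_LE,
    GUESS_UTF8_BOM,
    GUESS_ASCII_COMPATIBLE  /* UTF-8, ISO-8859-x, ...: read the declaration */
} encoding_guess;

/*
 * Look at the first bytes of an entity and guess the encoding family.
 * Four-byte patterns are tested first because the UCS-4 little endian
 * byte order mark starts with the UTF-16 little endian one.
 */
static encoding_guess guess_encoding(const unsigned char *in, size_t len)
{
    if (len >= 4) {
        if (memcmp(in, "\x00\x00\xFE\xFF", 4) == 0) return GUESS_UCS4_BE;
        if (memcmp(in, "\xFF\xFE\x00\x00", 4) == 0) return GUESS_UCS4_LE;
        /* no byte order mark: '<' encoded on 4 bytes */
        if (memcmp(in, "\x00\x00\x00\x3C", 4) == 0) return GUESS_UCS4_BE;
        if (memcmp(in, "\x3C\x00\x00\x00", 4) == 0) return GUESS_UCS4_LE;
        /* no byte order mark: "<?" encoded on 2 bytes per character */
        if (memcmp(in, "\x00\x3C\x00\x3F", 4) == 0) return GUESS_UTF16_BE;
        if (memcmp(in, "\x3C\x00\x3F\x00", 4) == 0) return GUESS_UTF16_LE;
    }
    if (len >= 3 && memcmp(in, "\xEF\xBB\xBF", 3) == 0)
        return GUESS_UTF8_BOM;
    if (len >= 2) {
        if (in[0] == 0xFE && in[1] == 0xFF) return GUESS_UTF16_BE;
        if (in[0] == 0xFF && in[1] == 0xFE) return GUESS_UTF16_LE;
    }
    /* Anything else is assumed to map 0-0x7F to ASCII, so the XML
     * declaration can be parsed to find the declared encoding. */
    return GUESS_ASCII_COMPATIBLE;
}
```

<p>A document starting with the bytes of "&lt;?xml" falls in the last
category: the declaration is then readable as ASCII, and the declared
encoding decides whether a switch of converter is needed.</p>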
<p>OK, then what happens when saving the document (assuming you
collected/built an xmlDoc DOM-like structure)? It depends on the function
called: xmlSaveFile() will just try to save in the original encoding, while
xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given
encoding:</p>
<ol>
<li>If no encoding is given, libxml will look for an encoding value
associated with the document and, if it exists, will try to save to that
encoding;
<p>otherwise everything is written in the internal form, i.e. UTF-8.</p>
</li>
<li>If an encoding was specified, either at the API level or on the
document, libxml will again canonicalize the encoding name and look up a
converter in the registered set or through iconv. If none is found, the
function will return an error code.</li>
<li>The converter is placed before the I/O buffer layer, as another kind of
buffer; libxml then simply pushes the UTF-8 serialization through
that buffer, which is progressively converted and pushed onto
the I/O layer.</li>
<li>It is possible that the converter code fails on some input; for example,
trying to push a UTF-8 encoded Chinese character through the UTF-8 to
ISO-8859-1 converter won't work. Since the encoders are progressive, they
will just report the error and the number of bytes converted. At that
point libxml will decode the offending character, remove it from the
buffer, replace it with the associated character reference (of the form
&amp;#123;), and resume the conversion. This guarantees that any document
will be saved without losses (except for markup names, where this is not
legal; this is a problem in the current version, so in practice avoid using
non-ASCII characters for tag or attribute names). A special "ascii" encoding
name can be used to save documents in a pure ASCII form when
portability is really crucial.</li>
</ol>
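<p>The character reference fallback described in the last step can be
sketched as follows. This is a simplified, one-pass illustration and not the
libxml code (which works progressively on buffers): characters that fit in
ISO-8859-1 are copied as a single byte, and any other character is written as
a decimal &amp;#NNN; reference. The function names are hypothetical, and the
decoder assumes input that libxml would already have checked as UTF-8.</p>

```c
#include <stddef.h>
#include <stdio.h>

/* Decode one UTF-8 sequence into *cp. Returns the number of bytes used,
 * or 0 on truncated or malformed input (overlong forms are not rejected
 * in this sketch). */
static size_t utf8_decode(const unsigned char *s, size_t len,
                          unsigned long *cp)
{
    size_t need, i;

    if (len == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; need = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; need = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; need = 4; }
    else return 0;
    if (len < need) return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return need;
}

/*
 * Convert UTF-8 to ISO-8859-1, replacing every character above U+00FF
 * with a decimal character reference. The output is zero-terminated.
 * Returns the number of bytes written (without the 0 byte), or -1 on
 * malformed input or when out is too small.
 */
static int utf8_to_latin1_charref(const unsigned char *in, size_t inlen,
                                  char *out, size_t outsize)
{
    size_t ipos = 0, opos = 0;

    while (ipos < inlen) {
        unsigned long cp;
        size_t used = utf8_decode(in + ipos, inlen - ipos, &cp);

        if (used == 0) return -1;
        if (cp <= 0xFF) {
            if (opos + 2 > outsize) return -1;  /* byte + final 0 */
            out[opos++] = (char) (unsigned char) cp;
        } else {
            /* not representable: emit &#NNN; instead of losing data */
            int n = snprintf(out + opos, outsize - opos, "&#%lu;", cp);
            if (n < 0 || (size_t) n >= outsize - opos) return -1;
            opos += (size_t) n;
        }
        ipos += used;
    }
    if (opos >= outsize) return -1;
    out[opos] = 0;
    return (int) opos;
}
```

<p>Note that this substitution is only legal in character data and attribute
values, which is why the restriction on markup names mentioned above
applies.</p>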
<p>Here are a few examples based on the same test document:</p>
<pre>~/XML -&gt; ./xmllint isolat1
&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;très&gt;là&lt;/très&gt;
~/XML -&gt; ./xmllint --encode UTF-8 isolat1
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;trÃ¨s&gt;lÃ&nbsp;&lt;/trÃ¨s&gt;
~/XML -&gt; </pre>
<p>In the second run each accented letter shows up as two characters because
the UTF-8 output is displayed here as ISO-8859-1.</p>
<p>The same processing is applied (and most of the code reused) for HTML I18N
processing. Looking up and modifying the content encoding is a bit more
difficult since it is located in a &lt;meta&gt; tag under the &lt;head&gt;,
so a couple of functions, htmlGetMetaEncoding() and htmlSetMetaEncoding(), have
been provided. The parser also attempts to switch encoding on the fly when
detecting such a tag on input. Except for that, the processing is the
same.</p>
<h3><a name="Default">Default supported encodings</a></h3>
<p>libxml has a set of default converters for the following encodings
(located in encoding.c):</p>
<ol>
<li>UTF-8 is supported by default (null handlers)</li>
<li>UTF-16, both little and big endian</li>
<li>ISO-Latin-1 (ISO-8859-1) covering most western languages</li>
<li>ASCII, useful mostly for saving</li>
<li>HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML
predefined entities like &amp;copy; for the copyright sign.</li>
</ol>
<p>Moreover, when compiled on a Unix platform with iconv support, the full set
of encodings supported by iconv can be used directly by libxml. On a
Linux machine with glibc-2.1, the list of supported encodings and aliases fills
3 full pages, and includes UCS-4, the full set of ISO-Latin encodings, and the
various Japanese ones.</p>
<h4>Encoding aliases</h4>
<p>From 2.2.3, libxml supports registering encoding name aliases. The
goal is to be able to parse documents whose encoding is supported but where
the name differs (for example from the default set of names accepted by
iconv). The following functions allow registering and handling new aliases for
existing encodings. Once registered, libxml will automatically look up the
aliases when handling a document:</p>
<ul>
<li>int xmlAddEncodingAlias(const char *name, const char *alias);</li>
<li>int xmlDelEncodingAlias(const char *alias);</li>
<li>const char * xmlGetEncodingAlias(const char *alias);</li>
<li>void xmlCleanupEncodingAliases(void);</li>
</ul>
<h3><a name="extend">How to extend the existing support</a></h3>
<p>Adding support for a new encoding, or overriding one of the encoders
(assuming it is buggy), should not be hard: just write input and output
conversion routines to/from UTF-8, and register them using
xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx). They will be
called automatically if the parser(s) encounter such an encoding name
(register it in uppercase, this will help). The encoders,
their arguments and expected return values are described in the encoding.h
header.</p>
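<p>To give an idea of what such a routine looks like, here is a minimal
sketch of an input converter from ISO-8859-15 to UTF-8. The function takes
the four arguments (output buffer and size, input buffer and size) used by
the converters in encoding.h, but the exact return convention chosen here
(number of bytes written, -1 on invalid arguments) is an assumption: check
the header of your libxml version before relying on it. The names
Latin9ToUTF8 and latin9_to_unicode are hypothetical.</p>

```c
#include <stddef.h>

/* Map one ISO-8859-15 byte to its Unicode code point. Only eight
 * positions differ from ISO-8859-1. */
static unsigned int latin9_to_unicode(unsigned char c)
{
    switch (c) {
    case 0xA4: return 0x20AC;   /* euro sign */
    case 0xA6: return 0x0160;   /* S with caron */
    case 0xA8: return 0x0161;   /* s with caron */
    case 0xB4: return 0x017D;   /* Z with caron */
    case 0xB8: return 0x017E;   /* z with caron */
    case 0xBC: return 0x0152;   /* OE ligature */
    case 0xBD: return 0x0153;   /* oe ligature */
    case 0xBE: return 0x0178;   /* Y with diaeresis */
    default:   return c;        /* same as ISO-8859-1 */
    }
}

/*
 * Progressive converter: consumes as many input bytes as fit in out.
 * On return *inlen is the number of bytes consumed and *outlen the
 * number of bytes produced.
 * Return value (assumed convention): bytes written, or -1 on bad arguments.
 */
static int Latin9ToUTF8(unsigned char *out, int *outlen,
                        const unsigned char *in, int *inlen)
{
    int ipos = 0, opos = 0;

    if (out == NULL || outlen == NULL || in == NULL || inlen == NULL)
        return -1;

    while (ipos < *inlen) {
        unsigned int cp = latin9_to_unicode(in[ipos]);
        int need = (cp < 0x80) ? 1 : (cp < 0x800) ? 2 : 3;

        if (opos + need > *outlen)
            break;  /* output full: stop here, the caller can call again */
        if (need == 1) {
            out[opos++] = (unsigned char) cp;
        } else if (need == 2) {
            out[opos++] = (unsigned char) (0xC0 | (cp >> 6));
            out[opos++] = (unsigned char) (0x80 | (cp & 0x3F));
        } else {
            out[opos++] = (unsigned char) (0xE0 | (cp >> 12));
            out[opos++] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[opos++] = (unsigned char) (0x80 | (cp & 0x3F));
        }
        ipos++;
    }
    *inlen = ipos;
    *outlen = opos;
    return opos;
}
```

<p>Together with a matching output routine (say UTF8ToLatin9, not shown), it
would be registered with a call of the form
xmlNewCharEncodingHandler("ISO-8859-15", Latin9ToUTF8, UTF8ToLatin9).</p>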
<p>A quick note on the topic of subverting the parser to use a different
internal encoding than UTF-8: in some cases people will absolutely want to
keep the internal encoding different. I think it's still possible (but the
encoding must be compatible with ASCII on the same subrange), though I haven't
tried it. The key is to override the default conversion routines (by
registering null encoders/decoders for your charsets), and bypass the UTF-8
checking of the parser by setting the parser context charset
(ctxt-&gt;charset) to something other than XML_CHAR_ENCODING_UTF8, but
there is no guarantee that this will work. You may also have some trouble
saving back.</p>
<p>Basically, proper I18N support is important. It requires at least
libxml-2.0.0, but a lot of features and corrections are really available only
starting with 2.2.</p>
<p><a href="mailto:daniel@veillard.com">Daniel Veillard</a></p>
</td></tr></table></td></tr></table></td></tr></table></td>
</tr></table></td></tr></table>
</body>
</html>