<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
               "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
  <title>Libxml Internationalization support</title>
  <meta name="GENERATOR" content="amaya V3.2">
  <meta http-equiv="Content-Type" content="text/html">
</head>

<body bgcolor="#ffffff">
<h1 align="center">Libxml Internationalization support</h1>

<p>Location: <a
href="http://xmlsoft.org/encoding.html">http://xmlsoft.org/encoding.html</a></p>

<p>Libxml home page: <a href="http://xmlsoft.org/">http://xmlsoft.org/</a></p>

<p>Mailing-list archive: <a
href="http://xmlsoft.org/messages/">http://xmlsoft.org/messages/</a></p>

<p>Version: $Revision$</p>

<p>Table of Contents:</p>
<ol>
  <li><a href="#What">What does internationalization support mean?</a></li>
  <li><a href="#internal">The internal encoding, how and why</a></li>
  <li><a href="#implemente">How is it implemented?</a></li>
  <li><a href="#Default">Default supported encodings</a></li>
  <li><a href="#extend">How to extend the existing support</a></li>
</ol>

<h2><a name="What">What does internationalization support mean?</a></h2>

<p>XML was designed from the start to support any character set
by using Unicode. Any conformant XML parser has to support the UTF-8 and
UTF-16 default encodings, which can both express the full Unicode range. UTF-8
is a variable-length encoding whose greatest strengths are that it reuses the
same encoding as ASCII and saves space for Western texts, but it is a bit
more complex to handle in practice. UTF-16 uses 2 bytes per character (and
sometimes combines two pairs); this makes implementation easier, but looks like
a bit of overkill for encoding Western languages. Moreover, the XML
specification allows documents to be encoded in other encodings on the
condition that they are clearly labelled as such. For example, the following is
a well-formed XML document encoded in ISO Latin 1 and using the accented
letters that we French like for both markup and content:</p>
<pre>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;très&gt;là&lt;/très&gt;</pre>

<p>Having internationalization support in libxml means the following:</p>
<ul>
  <li>the document is properly parsed</li>
  <li>information about its encoding is saved</li>
  <li>it can be modified</li>
  <li>it can be saved in its original encoding</li>
  <li>it can also be saved in another encoding supported by libxml (for
    example straight UTF-8 or even an ASCII form)</li>
</ul>

<p>Another very important point is that the whole libxml API, with the
exception of a few routines to read with a specific encoding or save to a
specific encoding, is completely agnostic about the original encoding of the
document.</p>

<p>It should also be noted that the HTML parser embedded in libxml now obeys
the same rules; the following document will (as of 2.2.2) be handled in
an internationalized fashion by libxml too:</p>
<pre>&lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
               "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html lang="fr"&gt;
&lt;head&gt;
  &lt;META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-latin-1"&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;p&gt;W3C crée des standards pour le Web.&lt;/body&gt;
&lt;/html&gt;</pre>

<h2><a name="internal">The internal encoding, how and why</a></h2>

<p>One of the core decisions was to force all documents to be converted to a
default internal encoding, and for that encoding to be UTF-8; here is the
rationale for those choices:</p>
<ul>
  <li>keeping the native encoding in the internal form would force the libxml
    users (or the associated code) to be fully aware of the encoding of the
    original document; for example, when adding a text node to a document, the
    content would have to be provided in the document encoding, i.e. the
    client code would have to check it beforehand, make sure it's conformant
    to the encoding, etc. Very hard in practice, though in some specific
    cases this may make sense.</li>
  <li>the second decision was which encoding. From the XML spec only UTF-8 and
    UTF-16 really make sense, as they are the only two encodings for which
    support is mandatory. UCS-4 (a 32-bit fixed-size encoding) could be
    considered an intelligent choice too, since it is a direct mapping of
    Unicode code points. I selected UTF-8 on the basis of efficiency and
    compatibility with surrounding software:
    <ul>
      <li>UTF-8, while a bit more complex to convert from/to (i.e. slightly
        more costly to import and export CPU-wise), is also far more compact
        than UTF-16 (and UCS-4) for the majority of the documents I see it used
        for right now (RPM RDF catalogs, advogato data, various configuration
        file formats, etc.), and the key point for today's computer
        architectures is efficient use of caches. If one nearly doubles the
        memory requirement to store the same amount of data, this will thrash
        caches (main memory/external caches/internal caches), and my take is
        that this harms the system far more than the CPU requirements needed
        for the conversion to UTF-8</li>
      <li>Most libxml version 1 users were using it with straight ASCII
        most of the time; an internal encoding whose conversion would have
        required all their code to be rewritten was a serious show-stopper,
        ruling out UTF-16 or UCS-4.</li>
      <li>UTF-8 is being used as the de facto internal encoding standard for
        related code like the upcoming
        <a href="http://www.pango.org/">pango</a> Gnome text widget, and a lot
        of Unix code (yep, another place where the Unix programmer base takes a
        different approach from Microsoft - they are using UTF-16)</li>
    </ul>
  </li>
</ul>

<p>What does this mean in practice for the libxml user:</p>
<ul>
  <li>xmlChar, the libxml data type, is a byte; those bytes must be assembled
    into valid UTF-8 strings. The proper way to terminate an xmlChar * string
    is simply to append a 0 byte, as usual.</li>
  <li>One just needs to make sure that when using chars outside the ASCII set,
    the values have been properly converted to UTF-8</li>
</ul>
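<p>As a concrete illustration of that last point, here is a minimal,
self-contained sketch (not libxml code, just the underlying UTF-8 arithmetic)
of how a character outside ASCII, such as 'è' (U+00E8), must be turned into a
two-byte sequence before being handed to libxml in an xmlChar string:</p>
<pre>#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;

/* Encode a Unicode code point below U+0800 as UTF-8 (1 or 2 bytes).
   Returns the number of bytes written to out. */
static int to_utf8(unsigned int code, unsigned char *out)
{
    if (code &lt; 0x80) {                  /* ASCII maps to itself */
        out[0] = (unsigned char) code;
        return 1;
    }
    out[0] = 0xC0 | (code &gt;&gt; 6);        /* leading byte: 110xxxxx */
    out[1] = 0x80 | (code &amp; 0x3F);      /* continuation: 10xxxxxx */
    return 2;
}

int main(void)
{
    unsigned char buf[8];
    int len = to_utf8(0xE8, buf);       /* 'è', U+00E8 */
    buf[len] = 0;                       /* terminate like an xmlChar string */
    assert(len == 2 &amp;&amp; buf[0] == 0xC3 &amp;&amp; buf[1] == 0xA8);
    printf("%02X %02X\n", buf[0], buf[1]);
    return 0;
}</pre>
<p>The resulting buffer, 0xC3 0xA8 followed by a 0 byte, is what libxml
expects wherever an xmlChar * is taken.</p>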

<h2><a name="implemente">How is it implemented?</a></h2>

<p>Let's describe how all this works within libxml. Basically, the I18N
(internationalization) support gets triggered only during I/O operations, i.e.
when reading a document or saving one. Let's look first at the reading
sequence:</p>
<ol>
  <li>when a document is processed, we usually don't know the encoding; a
    simple heuristic allows detecting UTF-16 and UCS-4 from encodings where
    the ASCII range (0-0x7F) maps to ASCII</li>
  <li>the XML declaration, if available, is parsed, including the encoding
    declaration. At that point, if the autodetected encoding is different from
    the one declared, a call to xmlSwitchEncoding() is issued.</li>
  <li>If there is no encoding declaration, then the input has to be in either
    UTF-8 or UTF-16; if it is not, then at some point when processing the
    input, the converter/checker of the UTF-8 form will raise an encoding
    error. You may end up with a garbled document, or no document at all!
    Example:
    <pre>~/XML -&gt; ./xmllint err.xml
err.xml:1: error: Input is not proper UTF-8, indicate encoding !
&lt;très&gt;là&lt;/très&gt;
   ^
err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
&lt;très&gt;là&lt;/très&gt;
   ^</pre>
  </li>
  <li>xmlSwitchEncoding() does an encoding name lookup, canonicalizes it, and
    then searches the default registered encoding converters for that
    encoding. If it's not within the default set and iconv() support has been
    compiled in, it will ask iconv for such an encoder. If this fails then the
    parser will report an error and stop processing:
    <pre>~/XML -&gt; ./xmllint err2.xml
err2.xml:1: error: Unsupported encoding UnsupportedEnc
&lt;?xml version="1.0" encoding="UnsupportedEnc"?&gt;
                                             ^</pre>
  </li>
  <li>From that point the encoder progressively processes the input (it is
    plugged as a front-end to the I/O module) for that entity, converting the
    document to be parsed to UTF-8 on the fly. The parser itself
    just does UTF-8 checking of this input and processes it transparently. The
    only difference is that the encoding information has been added to the
    parsing context (more precisely to the input corresponding to this
    entity).</li>
  <li>The result (when using DOM) is an internal form completely in UTF-8 with
    just the encoding information on the document node.</li>
</ol>
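<p>The UTF-8 checking mentioned in step 3 can be sketched as a small,
self-contained routine (an illustration of the idea, not libxml's actual
code, and limited to 1- to 3-byte sequences for brevity). It walks the byte
stream and rejects sequences such as the stray 0xE8 byte from the err.xml
example above:</p>
<pre>#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;

/* Return 1 if the byte string is well-formed UTF-8 (1- to 3-byte
   sequences only, for brevity), 0 otherwise. */
static int is_utf8(const unsigned char *s, int len)
{
    int i = 0;
    while (i &lt; len) {
        unsigned char c = s[i];
        int follow;
        if (c &lt; 0x80)                follow = 0;   /* ASCII */
        else if ((c &amp; 0xE0) == 0xC0) follow = 1;   /* 110xxxxx */
        else if ((c &amp; 0xF0) == 0xE0) follow = 2;   /* 1110xxxx */
        else return 0;       /* stray continuation or longer form: reject */
        for (i++; follow-- &gt; 0; i++)
            if (i &gt;= len || (s[i] &amp; 0xC0) != 0x80)
                return 0;    /* missing 10xxxxxx continuation byte */
    }
    return 1;
}

int main(void)
{
    /* "&lt;très&gt;" in ISO-8859-1: 0xE8 is not valid UTF-8 on its own */
    const unsigned char latin1[] = { '&lt;', 't', 'r', 0xE8, 's', '&gt;' };
    /* the same text properly encoded in UTF-8 (è = 0xC3 0xA8) */
    const unsigned char utf8[]   = { '&lt;', 't', 'r', 0xC3, 0xA8, 's', '&gt;' };
    assert(!is_utf8(latin1, sizeof latin1));
    assert(is_utf8(utf8, sizeof utf8));
    printf("ok\n");
    return 0;
}</pre>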

<p>OK, then what happens when saving the document (assuming you
collected/built an xmlDoc DOM-like structure)? It depends on the function
called: xmlSaveFile() will just try to save in the original encoding, while
xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given
encoding:</p>
<ol>
  <li>if no encoding is given, libxml will look for an encoding value
    associated with the document and, if it exists, will try to save to that
    encoding;
    <p>otherwise everything is written in the internal form, i.e. UTF-8</p>
  </li>
  <li>if an encoding was specified, either at the API level or on the
    document, libxml will again canonicalize the encoding name and look up a
    converter in the registered set or through iconv. If none is found, the
    function will return an error code</li>
  <li>the converter is placed before the I/O buffer layer, as another kind of
    buffer; libxml will then simply push the UTF-8 serialization through
    that buffer, which will progressively convert it and push it onto
    the I/O layer.</li>
  <li>It is possible that the converter code fails on some input; for example,
    trying to push a UTF-8 encoded Chinese character through the UTF-8 to
    ISO-Latin-1 converter won't work. Since the encoders are progressive, they
    will just report the error and the number of bytes converted; at that
    point libxml will decode the offending character, remove it from the
    buffer, replace it with the associated character reference (of the form
    &amp;#123;) and resume the conversion. This guarantees that any document
    will be saved without losses. A special "ascii" encoding name, which saves
    documents to a pure ASCII form, can be used when portability is really
    crucial</li>
</ol>
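<p>The fallback in step 4 can be illustrated with a self-contained sketch
(again, not libxml's actual converter): decode the UTF-8 input, emit Latin-1
bytes where possible, and fall back to a numeric character reference for
anything above U+00FF:</p>
<pre>#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/* Serialize a UTF-8 string to ISO-8859-1, replacing any code point
   above U+00FF with a numeric character reference, as step 4 describes.
   (1- to 3-byte UTF-8 input only, for brevity.) */
static void utf8_to_latin1(const unsigned char *in, char *out)
{
    while (*in) {
        unsigned int cp;
        if (*in &lt; 0x80) {                       /* ASCII */
            cp = *in++;
        } else if ((*in &amp; 0xE0) == 0xC0) {      /* 2-byte sequence */
            cp = (in[0] &amp; 0x1F) &lt;&lt; 6 | (in[1] &amp; 0x3F);
            in += 2;
        } else {                                /* 3-byte sequence */
            cp = (in[0] &amp; 0x0F) &lt;&lt; 12 | (in[1] &amp; 0x3F) &lt;&lt; 6
                 | (in[2] &amp; 0x3F);
            in += 3;
        }
        if (cp &lt;= 0xFF)                         /* representable in Latin-1 */
            *out++ = (char) cp;
        else                                    /* fall back to a char ref */
            out += sprintf(out, "&amp;#x%X;", cp);
    }
    *out = 0;
}

int main(void)
{
    /* "là" plus U+4E2D in UTF-8: l, à (0xC3 0xA0), 0xE4 0xB8 0xAD */
    const unsigned char in[] = { 'l', 0xC3, 0xA0, 0xE4, 0xB8, 0xAD, 0 };
    char out[64];
    utf8_to_latin1(in, out);
    assert(out[0] == 'l' &amp;&amp; (unsigned char) out[1] == 0xE0);
    assert(strcmp(out + 2, "&amp;#x4E2D;") == 0);
    printf("%s\n", out + 2);
    return 0;
}</pre>
<p>The 'à' fits in Latin-1 and is emitted as the single byte 0xE0, while the
Chinese character survives as a character reference, so no information is
lost.</p>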

<p>Here are a few examples based on the same test document:</p>
<pre>~/XML -&gt; ./xmllint isolat1
&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;très&gt;là&lt;/très&gt;
~/XML -&gt; ./xmllint --encode UTF-8 isolat1
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;très&gt;là&lt;/très&gt;
~/XML -&gt; ./xmllint --encode ascii isolat1
&lt;?xml version="1.0" encoding="ascii"?&gt;
&lt;tr&amp;#xE8;s&gt;l&amp;#xE0;&lt;/tr&amp;#xE8;s&gt;
~/XML -&gt; </pre>

<p>The same processing is applied (and reuses most of the code) for HTML I18N
processing. Looking up and modifying the content encoding is a bit more
difficult, since it is located in a &lt;meta&gt; tag under the &lt;head&gt;, so
a couple of functions, htmlGetMetaEncoding() and htmlSetMetaEncoding(), have
been provided. The parser also attempts to switch encoding on the fly when
detecting such a tag on input. Except for that, the processing is the same (and
again reuses the same code).</p>

<h2><a name="Default">Default supported encodings</a></h2>

<p>libxml has a set of default converters for the following encodings (located
in encoding.c):</p>
<ol>
  <li>UTF-8 is supported by default (null handlers)</li>
  <li>UTF-16, both little and big endian</li>
  <li>ISO-Latin-1 (ISO-8859-1), covering most western languages</li>
  <li>ASCII, useful mostly for saving</li>
  <li>HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML
    predefined entities like &amp;copy; for the copyright sign.</li>
</ol>
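<p>The idea behind that last handler can be sketched with a toy,
self-contained lookup (the real table and handler live in libxml's sources;
this three-entry table is purely illustrative): known code points get their
HTML 4.0 predefined entity, everything else a numeric reference:</p>
<pre>#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/* A tiny stand-in for the kind of table the HTML handler relies on:
   a few Unicode code points and their HTML 4.0 predefined entities. */
struct ent { unsigned int cp; const char *name; };
static const struct ent entities[] = {
    { 0xA9, "copy" },   /* copyright sign */
    { 0xE8, "egrave" },
    { 0xE0, "agrave" },
};

/* Write the named entity for cp into out if one is known,
   else a numeric character reference. */
static void entity_ref(unsigned int cp, char *out)
{
    size_t i;
    for (i = 0; i &lt; sizeof entities / sizeof entities[0]; i++)
        if (entities[i].cp == cp) {
            sprintf(out, "&amp;%s;", entities[i].name);
            return;
        }
    sprintf(out, "&amp;#x%X;", cp);
}

int main(void)
{
    char buf[32];
    entity_ref(0xA9, buf);
    assert(strcmp(buf, "&amp;copy;") == 0);
    entity_ref(0x4E2D, buf);
    assert(strcmp(buf, "&amp;#x4E2D;") == 0);
    printf("%s\n", buf);
    return 0;
}</pre>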

<p>Moreover, when compiled on a Unix platform with iconv support, the full set
of encodings supported by iconv can instantly be used by libxml. On a Linux
machine with glibc-2.1 the list of supported encodings and aliases fills 3 full
pages, and includes UCS-4, the full set of ISO-Latin encodings, and the
various Japanese ones.</p>

<h2><a name="extend">How to extend the existing support</a></h2>

<p>Well, adding support for a new encoding, or overriding one of the encoders
(assuming it is buggy), should not be hard: just write input and output
conversion routines to/from UTF-8 and register them using
xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx); they will then be
called automatically if the parser(s) encounter such an encoding name
(register it uppercase - this will help). The encoders, their
arguments and expected return values are described in the encoding.h
header.</p>
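<p>As a rough sketch of what such an xxxToUTF8 input routine could look like,
here is a self-contained ISO-8859-1 converter, assuming a
(out, outlen, in, inlen) calling pattern with both lengths updated in place -
check encoding.h for the exact prototypes your version expects:</p>
<pre>#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;

/* Sketch of an input conversion routine: convert ISO-8859-1 bytes in
   in[0..*inlen) to UTF-8 in out[0..*outlen), updating both lengths to
   the amounts actually consumed and produced. Returns bytes produced. */
static int latin1ToUTF8(unsigned char *out, int *outlen,
                        const unsigned char *in, int *inlen)
{
    int produced = 0, consumed = 0;

    while (consumed &lt; *inlen) {
        unsigned char c = in[consumed];
        int need = (c &lt; 0x80) ? 1 : 2;
        if (produced + need &gt; *outlen)
            break;                       /* output buffer full: stop early */
        if (c &lt; 0x80) {
            out[produced++] = c;         /* ASCII passes through */
        } else {                         /* Latin-1 maps to 2-byte UTF-8 */
            out[produced++] = 0xC0 | (c &gt;&gt; 6);
            out[produced++] = 0x80 | (c &amp; 0x3F);
        }
        consumed++;
    }
    *outlen = produced;
    *inlen = consumed;
    return produced;
}

int main(void)
{
    const unsigned char in[] = { 't', 'r', 0xE8, 's' };  /* Latin-1 input */
    unsigned char out[16];
    int inlen = sizeof in, outlen = sizeof out;

    latin1ToUTF8(out, &amp;outlen, in, &amp;inlen);
    assert(inlen == 4 &amp;&amp; outlen == 5);
    assert(out[2] == 0xC3 &amp;&amp; out[3] == 0xA8);   /* 0xE8 -&gt; 0xC3 0xA8 */
    printf("consumed %d, produced %d\n", inlen, outlen);
    return 0;
}</pre>
<p>Reporting the consumed/produced counts even on a partial conversion is
what lets the progressive saving machinery described above resume after
handling an unrepresentable character.</p>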

<p>A quick note on the topic of subverting the parser to use a different
internal encoding than UTF-8: in some cases people will absolutely want to
keep the internal encoding different. I think it's still possible (but the
encoding must be compliant with ASCII on the same subrange), though I haven't
tried it. The key is to override the default conversion routines (by
registering null encoders/decoders for your charsets), and bypass the UTF-8
checking of the parser by setting the parser context charset
(ctxt-&gt;charset) to something other than XML_CHAR_ENCODING_UTF8, but there
is no guarantee that this will work. You may also have some trouble saving
back.</p>

<p>Basically, proper I18N support is important; it requires at least
libxml-2.0.0, but a lot of features and corrections are really available only
starting with 2.2.</p>

<p><a href="mailto:Daniel.Veillard@w3.org">Daniel Veillard</a></p>

<p>$Id$</p>
</body>
</html>