<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" /><link rel="SHORTCUT ICON" href="/favicon.ico" /><style type="text/css">
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
</style><title>Encodings support</title></head><body bgcolor="#8b7765" text="#000000" link="#a06060" vlink="#000000"><table border="0" width="100%" cellpadding="5" cellspacing="0" align="center"><tr><td width="120"><a href="http://swpat.ffii.org/"><img src="epatents.png" alt="Action against software patents" /></a></td><td width="180"><a href="http://www.gnome.org/"><img src="gnome2.png" alt="Gnome2 Logo" /></a><a href="http://www.w3.org/Status"><img src="w3c.png" alt="W3C Logo" /></a><a href="http://www.redhat.com/"><img src="redhat.gif" alt="Red Hat Logo" /></a><div align="left"><a href="http://xmlsoft.org/"><img src="Libxml2-Logo-180x168.gif" alt="Made with Libxml2 Logo" /></a></div></td><td><table border="0" width="90%" cellpadding="2" cellspacing="0" align="center" bgcolor="#000000"><tr><td><table width="100%" border="0" cellspacing="1" cellpadding="3" bgcolor="#fffacd"><tr><td align="center"><h1>The XML C parser and toolkit of Gnome</h1><h2>Encodings support</h2></td></tr></table></td></tr></table></td></tr></table><table border="0" cellpadding="4" cellspacing="0" width="100%" align="center"><tr><td bgcolor="#8b7765"><table border="0" cellspacing="0" cellpadding="2" width="100%"><tr><td valign="top" width="200" bgcolor="#8b7765"><table border="0" cellspacing="0" cellpadding="1" width="100%" bgcolor="#000000"><tr><td><table width="100%" border="0" cellspacing="1" cellpadding="3"><tr><td colspan="1" bgcolor="#eecfa1" align="center"><center><b>Main Menu</b></center></td></tr><tr><td bgcolor="#fffacd"><form action="search.php" enctype="application/x-www-form-urlencoded" method="get"><input name="query" type="text" size="20" value="" /><input name="submit" type="submit" value="Search ..." 
/></form><ul><li><a href="index.html">Home</a></li><li><a href="html/index.html">Reference Manual</a></li><li><a href="intro.html">Introduction</a></li><li><a href="FAQ.html">FAQ</a></li><li><a href="docs.html" style="font-weight:bold">Developer Menu</a></li><li><a href="bugs.html">Reporting bugs and getting help</a></li><li><a href="help.html">How to help</a></li><li><a href="downloads.html">Downloads</a></li><li><a href="news.html">Releases</a></li><li><a href="XMLinfo.html">XML</a></li><li><a href="XSLT.html">XSLT</a></li><li><a href="xmldtd.html">Validation &amp; DTDs</a></li><li><a href="encoding.html">Encodings support</a></li><li><a href="catalog.html">Catalog support</a></li><li><a href="namespaces.html">Namespaces</a></li><li><a href="contribs.html">Contributions</a></li><li><a href="examples/index.html" style="font-weight:bold">Code Examples</a></li><li><a href="html/index.html" style="font-weight:bold">API Menu</a></li><li><a href="guidelines.html">XML Guidelines</a></li><li><a href="ChangeLog.html">Recent Changes</a></li></ul></td></tr></table><table width="100%" border="0" cellspacing="1" cellpadding="3"><tr><td colspan="1" bgcolor="#eecfa1" align="center"><center><b>Related links</b></center></td></tr><tr><td bgcolor="#fffacd"><ul><li><a href="http://mail.gnome.org/archives/xml/">Mail archive</a></li><li><a href="http://xmlsoft.org/XSLT/">XSLT libxslt</a></li><li><a href="http://phd.cs.unibo.it/gdome2/">DOM gdome2</a></li><li><a href="http://www.aleksey.com/xmlsec/">XML-DSig xmlsec</a></li><li><a href="ftp://xmlsoft.org/">FTP</a></li><li><a href="http://www.zlatkovic.com/projects/libxml/">Windows binaries</a></li><li><a href="http://www.blastwave.org/packages.php/libxml2">Solaris binaries</a></li><li><a href="http://www.explain.com.au/oss/libxml2xslt.html">MacOsX binaries</a></li><li><a href="http://libxmlplusplus.sourceforge.net/">C++ bindings</a></li><li><a href="http://www.zend.com/php5/articles/php5-xmlphp.php#Heading4">PHP bindings</a></li><li><a 
href="http://sourceforge.net/projects/libxml2-pas/">Pascal bindings</a></li><li><a href="http://libxml.rubyforge.org/">Ruby bindings</a></li><li><a href="http://tclxml.sourceforge.net/">Tcl bindings</a></li><li><a href="http://bugzilla.gnome.org/buglist.cgi?product=libxml2">Bug Tracker</a></li></ul></td></tr></table></td></tr></table></td><td valign="top" bgcolor="#8b7765"><table border="0" cellspacing="0" cellpadding="1" width="100%"><tr><td><table border="0" cellspacing="0" cellpadding="1" width="100%" bgcolor="#000000"><tr><td><table border="0" cellpadding="3" cellspacing="1" width="100%"><tr><td bgcolor="#fffacd"><p>If you are not really familiar with Internationalization (usual
shortcut is I18N), Unicode, characters and glyphs, I suggest you read a <a href="http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode">presentation</a> by Tim
Bray on Unicode and why you should care about it.</p><p>If you don't understand why <b>it does not make sense to have
a string without knowing what encoding it uses</b>, then as Joel Spolsky said
<a href="http://www.joelonsoftware.com/articles/Unicode.html">please do
not write another line of code until you finish reading that article.</a> It
is a prerequisite for understanding this page, and it will help you avoid a lot of
problems with libxml2, XML or text processing in general.</p><p>Table of Contents:</p><ol><li><a href="encoding.html#What">What does internationalization
 support mean?</a></li>
 <li><a href="encoding.html#internal">The internal encoding,
 how and why</a></li>
 <li><a href="encoding.html#implemente">How is it implemented?</a></li>
 <li><a href="encoding.html#Default">Default supported encodings</a></li>
 <li><a href="encoding.html#extend">How to extend the existing support</a></li>
</ol><h3><a name="What" id="What">What does internationalization support mean?</a></h3><p>XML was designed from the start to support any character set by
using Unicode. Any conformant XML parser has to support the UTF-8 and UTF-16
default encodings, which can both express the full Unicode range. UTF-8 is a
variable-length encoding whose main strengths are that it reuses the same encoding
for ASCII and saves space for Western languages, but it is a bit more complex
to handle in practice. UTF-16 uses 2 bytes per character (and sometimes combines
two such 2-byte units, a surrogate pair), which makes implementation easier but looks a bit like overkill for
Western languages. Moreover, the XML specification allows the document
to be encoded in other encodings on the condition that they are clearly labeled
as such. For example, the following is a well-formed XML document encoded in
ISO-8859-1 and using accented letters that we French like, for both markup
and content:</p><pre>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;très&gt;là&lt;/très&gt;</pre><p>Having internationalization support in libxml2 means the following:</p><ul><li>the document is properly parsed</li>
 <li>information about its encoding is saved</li>
 <li>it can be modified</li>
 <li>it can be saved in its original encoding</li>
 <li>it can also be saved in another encoding supported by
 libxml2 (for example straight UTF-8 or even an ASCII form)</li>
</ul><p>Another very important point is that the whole libxml2 API,
with the exception of a few routines to read with a specific encoding or save
to a specific encoding, is completely agnostic about the original encoding
of the document.</p><p>It should also be noted that the HTML parser embedded in libxml2 now obeys the
same rules; the following document will be (as of 2.2.2) handled in an
internationalized fashion by libxml2 too:</p><pre>&lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
 "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html lang="fr"&gt;
&lt;head&gt;
 &lt;META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1"&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;p&gt;W3C crée des standards pour le Web.&lt;/body&gt;
&lt;/html&gt;</pre><h3><a name="internal" id="internal">The internal encoding, how and why</a></h3><p>One of the core decisions was to force all documents to be converted
to a default internal encoding, and that encoding to be UTF-8. Here
is the rationale for those choices:</p><ul><li>keeping the native encoding in the internal form would force
 the libxml users (or the code associated) to be fully aware of the encoding
 of the original document. For example, when adding a text node to
 a document, the content would have to be provided in the document
 encoding, i.e. the client code would have to check it beforehand, make
 sure it conforms to the encoding, etc. This is very hard in practice, though
 in some specific cases it may make sense.</li>
 <li>the second decision was which encoding. From the XML spec only
 UTF-8 and UTF-16 really make sense, being the only two encodings for
 which support is mandatory. UCS-4 (a 32-bit fixed-size encoding)
 could be considered an intelligent choice too since it is a direct
 mapping of Unicode. I selected UTF-8 on the basis of efficiency
 and compatibility with surrounding software:
 <ul><li>UTF-8, while a bit more complex to convert from/to (i.e. slightly more
 costly to import and export CPU-wise), is also far more compact than
 UTF-16 (and UCS-4) for a majority of the documents I see it used for
 right now (RPM RDF catalogs, advogato data, various configuration file
 formats, etc.), and the key point for today's computer architecture is
 efficient use of caches. If one nearly doubles the memory requirement
 to store the same amount of data, this will thrash caches (main
 memory/external caches/internal caches), and my take is that this harms
 the system far more than the CPU requirements needed for the conversion
 to UTF-8</li>
 <li>Most libxml2 version 1 users were using it with
 straight ASCII most of the time; switching to an
 internal encoding that required all their code to be rewritten was a
 serious show-stopper for using UTF-16 or UCS-4.</li>
 <li>UTF-8 is being used as the de facto internal encoding
 standard for related code like the <a href="http://www.pango.org/">pango</a> upcoming Gnome text widget,
 and a lot of Unix code (yet another place where the Unix programmer base
 takes a different approach from Microsoft, which uses UTF-16)</li>
 </ul></li>
</ul><p>What does this mean in practice for the libxml2 user:</p><ul><li>xmlChar, the libxml2 data type, is a byte; those bytes must
 be assembled as valid UTF-8 strings. The proper way to terminate an xmlChar
 * string is simply to append a 0 byte, as usual.</li>
 <li>One just needs to make sure that when using chars outside the
 ASCII set, the values have been properly converted to UTF-8</li>
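 <li>As a minimal sketch of such a conversion, assuming the application holds
 ISO-8859-1 data: the helper latin1_to_utf8() below is hypothetical (it is not a
 libxml2 function) and only shows what the conversion amounts to for that one
 encoding. The encoding module also provides isolat1ToUTF8() for the same job.
 <pre>#include &lt;stddef.h&gt;

/*
 * Hypothetical helper (not part of libxml2): convert a 0-terminated
 * ISO-8859-1 string to UTF-8. Returns the number of bytes written,
 * not counting the final 0, or -1 if the output buffer is too small.
 */
static int latin1_to_utf8(const unsigned char *in, unsigned char *out,
                          size_t outsize)
{
    size_t o = 0;

    for (; *in != 0; in++) {
        if (*in &lt; 0x80) {
            /* ASCII is encoded identically in UTF-8 */
            if (o + 1 &gt;= outsize) return -1;
            out[o++] = *in;
        } else {
            /* 0x80-0xFF become two bytes: 110xxxxx 10xxxxxx */
            if (o + 2 &gt;= outsize) return -1;
            out[o++] = (unsigned char)(0xC0 | (*in &gt;&gt; 6));
            out[o++] = (unsigned char)(0x80 | (*in &amp; 0x3F));
        }
    }
    if (o &gt;= outsize) return -1;   /* no room for the terminating 0 */
    out[o] = 0;
    return (int)o;
}</pre>
 The resulting 0-terminated buffer can then be handed to the API as an
 xmlChar * string.</li>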
</ul><h3><a name="implemente" id="implemente">How is it implemented?</a></h3><p>Let's describe how all this works within libxml. Basically,
the I18N (internationalization) support gets triggered only during I/O
operations, i.e. when reading a document or saving one. Let's look first at
the reading sequence:</p><ol><li>when a document is processed, we usually don't know the
 encoding; a simple heuristic allows UTF-16 and UCS-4 to be distinguished from
 encodings where the ASCII range (0-0x7F) maps to ASCII</li>
 <li>the XML declaration, if available, is parsed, including
 the encoding declaration. At that point, if the autodetected encoding
 is different from the one declared, a call to xmlSwitchEncoding()
 is issued.</li>
 <li>If there is no encoding declaration, then the input has to be
 in either UTF-8 or UTF-16. If it is not, then at some point when
 processing the input, the converter/checker of UTF-8 form will raise an
 encoding error. You may end up with a garbled document, or no document at
 all! Example:
 <pre>~/XML -&gt; ./xmllint err.xml
err.xml:1: error: Input is not proper UTF-8, indicate encoding !
&lt;très&gt;là&lt;/très&gt;
   ^
err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
&lt;très&gt;là&lt;/très&gt;
   ^</pre>
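 <p>The kind of check that fails here can be sketched as follows. This is a
 minimal illustration, not libxml2's own checker, and first_invalid_utf8() is a
 hypothetical name. Applied to the ISO-8859-1 bytes of the document above, it
 stops at offset 3, the 0xE8 byte reported by xmllint.</p>
 <pre>#include &lt;stddef.h&gt;

/*
 * Hypothetical checker (not libxml2's own code): return the offset of the
 * first byte that does not belong to a well-formed UTF-8 sequence, or -1
 * if the whole buffer is valid UTF-8.
 */
static long first_invalid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i &lt; len) {
        unsigned char c = buf[i];
        size_t extra;       /* number of continuation bytes expected */
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest value allowed for this length */
        size_t k;

        if (c &lt; 0x80) { i++; continue; }
        else if ((c &amp; 0xE0) == 0xC0) { extra = 1; cp = c &amp; 0x1F; min = 0x80; }
        else if ((c &amp; 0xF0) == 0xE0) { extra = 2; cp = c &amp; 0x0F; min = 0x800; }
        else if ((c &amp; 0xF8) == 0xF0) { extra = 3; cp = c &amp; 0x07; min = 0x10000; }
        else return (long)i;   /* stray continuation or invalid lead byte */

        if (extra &gt; len - i - 1) return (long)i;   /* truncated sequence */
        for (k = 1; k &lt;= extra; k++) {
            if ((buf[i + k] &amp; 0xC0) != 0x80) return (long)i;
            cp = (cp &lt;&lt; 6) | (buf[i + k] &amp; 0x3F);
        }
        /* reject overlong forms, surrogates and values above U+10FFFF */
        if (cp &lt; min || cp &gt; 0x10FFFF || (cp &gt;= 0xD800 &amp;&amp; cp &lt;= 0xDFFF))
            return (long)i;
        i += extra + 1;
    }
    return -1;
}</pre>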
 </li>
 <li>xmlSwitchEncoding() does an encoding name lookup, canonicalizes
 it, and then searches the default registered encoding converters for
 that encoding. If it's not within the default set and iconv() support has
 been compiled in, it will ask iconv for such an encoder. If this fails then
 the parser will report an error and stop processing:
 <pre>~/XML -&gt; ./xmllint err2.xml
err2.xml:1: error: Unsupported encoding UnsupportedEnc
&lt;?xml version="1.0" encoding="UnsupportedEnc"?&gt;
 ^</pre>
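 <p>The first part of that lookup can be sketched as follows.
 find_default_encoding() and its table are hypothetical and only mirror the
 default converters listed on this page; the real code also resolves aliases
 and falls back to iconv when the name is not found.</p>
 <pre>#include &lt;ctype.h&gt;
#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

/* Names of the default converters listed on this page; illustrative only. */
static const char *const default_encodings[] = {
    "UTF-8", "UTF-16", "ISO-8859-1", "ASCII", "HTML"
};

/*
 * Hypothetical lookup (not xmlSwitchEncoding() itself): upper-case the
 * name, then search the default table. Returns the table index, or -1
 * when the name is unknown and iconv would have to be asked instead.
 */
static int find_default_encoding(const char *name)
{
    char upper[64];
    size_t i, n = strlen(name);

    if (n &gt;= sizeof upper) return -1;
    for (i = 0; i &lt; n; i++)
        upper[i] = (char)toupper((unsigned char)name[i]);
    upper[n] = 0;

    for (i = 0; i &lt; sizeof default_encodings / sizeof default_encodings[0]; i++)
        if (strcmp(upper, default_encodings[i]) == 0)
            return (int)i;
    return -1;
}</pre>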
 </li>
 <li>From that point the encoder progressively processes the input
 (it is plugged in as a front end to the I/O module) for that entity.
 It captures the document to be parsed and converts it on the fly to UTF-8.
 The parser itself just does UTF-8 checking of this input and
 processes it transparently. The only difference is that the encoding
 information has been added to the parsing context (more precisely to
 the input corresponding to this entity).</li>
 <li>The result (when using DOM) is an internal form completely in
 UTF-8, with just the encoding information recorded on the document node.</li>
</ol><p>OK, then what happens when saving the document (assuming
you collected/built an xmlDoc DOM-like structure)? It depends on the
function called: xmlSaveFile() will just try to save in the original
encoding, while xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to
a given encoding:</p><ol><li>if no encoding is given, libxml2 will look for an
 encoding value associated with the document and, if it exists, will try to save
 to that encoding,
 <p>otherwise everything is written in the internal form, i.e. UTF-8</p>
 </li>
 <li>so if an encoding was specified, either at the API level or
 on the document, libxml2 will again canonicalize the encoding name and
 look up a converter in the registered set or through iconv. If none is
 found the function will return an error code</li>
 <li>the converter is placed before the I/O buffer layer, as another
 kind of buffer; libxml2 will then simply push the UTF-8 serialization
 through that buffer, which will progressively be converted and
 pushed onto the I/O layer.</li>
 <li>It is possible that the converter code fails on some input;
 for example, trying to push a UTF-8 encoded Chinese character through
 the UTF-8 to ISO-8859-1 converter won't work. Since the encoders
 are progressive, they will just report the error and the number of
 bytes converted. At that point libxml2 will decode the offending
 character, remove it from the buffer, replace it with the associated
 charRef encoding &amp;#123; and resume the conversion. This guarantees that
 any document will be saved without losses (except for markup names, where
 this is not legal; this is a problem in the current version, so in practice
 avoid using non-ASCII characters for tag or attribute names). A special
 "ascii" encoding name can be used to save documents in a pure ASCII form
 when portability is really crucial</li>
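 <li>As an illustration of the charRef fallback in the previous step, here is a
 minimal sketch for the pure ASCII case. utf8_to_ascii_charrefs() is a
 hypothetical name, not libxml2's own code, and it assumes its input is already
 valid UTF-8, as the internal form is.
 <pre>#include &lt;stdio.h&gt;
#include &lt;stddef.h&gt;

/*
 * Hypothetical serializer (not libxml2's own code): copy a valid,
 * 0-terminated UTF-8 string to "out" as pure ASCII, replacing every
 * non-ASCII character with a decimal character reference.
 * Returns the output length, or -1 if "out" is too small.
 */
static int utf8_to_ascii_charrefs(const unsigned char *in, char *out,
                                  size_t outsize)
{
    size_t o = 0;

    while (*in != 0) {
        unsigned long cp;
        int extra, n;

        /* decode one UTF-8 sequence into a code point */
        if (*in &lt; 0x80)      { cp = *in;            extra = 0; }
        else if (*in &lt; 0xE0) { cp = *in &amp; 0x1F; extra = 1; }
        else if (*in &lt; 0xF0) { cp = *in &amp; 0x0F; extra = 2; }
        else                    { cp = *in &amp; 0x07; extra = 3; }
        in++;
        while (extra-- &gt; 0 &amp;&amp; *in != 0)
            cp = (cp &lt;&lt; 6) | (*in++ &amp; 0x3F);

        if (cp &lt; 0x80) {
            if (o + 1 &gt;= outsize) return -1;
            out[o++] = (char)cp;
        } else {
            /* ASCII cannot represent cp: emit &amp;#NNN; instead */
            n = snprintf(out + o, outsize - o, "&amp;#%lu;", cp);
            if (n &lt; 0 || (size_t)n &gt;= outsize - o) return -1;
            o += (size_t)n;
        }
    }
    if (o &gt;= outsize) return -1;
    out[o] = 0;
    return (int)o;
}</pre>
 A reader that parses the result gets the original characters back, which is
 why content survives the conversion while markup names cannot.</li>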
</ol><p>Here are a few examples based on the same test document:</p><pre>~/XML -&gt; ./xmllint isolat1
&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;très&gt;là&lt;/très&gt;
~/XML -&gt; ./xmllint --encode UTF-8 isolat1
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;très&gt;là  &lt;/très&gt;
~/XML -&gt; </pre><p>The same processing is applied (and reuses most of the code) for
HTML I18N processing. Looking up and modifying the content encoding is a
bit more difficult since it is located in a &lt;meta&gt; tag under
the &lt;head&gt;, so a couple of functions, htmlGetMetaEncoding()
and htmlSetMetaEncoding(), have been provided. The parser also attempts to
switch encoding on the fly when detecting such a tag on input. Except for that,
the processing is the same (and again reuses the same code).</p><h3><a name="Default" id="Default">Default supported encodings</a></h3><p>libxml2 has a set of default converters for the following encodings (located
in encoding.c):</p><ol><li>UTF-8 is supported by default (null handlers)</li>
 <li>UTF-16, both little and big endian</li>
 <li>ISO-Latin-1 (ISO-8859-1) covering most western languages</li>
 <li>ASCII, useful mostly for saving</li>
 <li>HTML, a specific handler for the conversion of UTF-8 to ASCII
 with HTML predefined entities like &amp;copy; for the Copyright sign.</li>
</ol><p>Moreover, when compiled on a Unix platform with iconv support, the
full set of encodings supported by iconv can instantly be used by libxml. On
a Linux machine with glibc-2.1 the list of supported encodings and aliases
fills 3 full pages, and includes UCS-4, the full set of ISO-Latin encodings, and
the various Japanese ones.</p><p>To convert the UTF-8 values returned from the API to
another encoding, it is possible to use the functions provided by <a href="html/libxml-encoding.html">the encoding module</a>, like <a href="html/libxml-encoding.html#UTF8Toisolat1">UTF8Toisolat1</a>, or
use the POSIX <a href="http://www.opengroup.org/onlinepubs/009695399/functions/iconv.html">iconv()</a> API directly.</p><h4>Encoding aliases</h4><p>From 2.2.3, libxml2 has support for registering encoding name aliases. The goal
is to be able to parse documents whose encoding is supported but where the name
differs (for example from the default set of names accepted by iconv). The
following functions allow registering and handling new aliases for existing
encodings. Once registered, libxml2 will automatically look up the aliases when
handling a document:</p><ul><li>int xmlAddEncodingAlias(const char *name, const char *alias);</li>
 <li>int xmlDelEncodingAlias(const char *alias);</li>
 <li>const char * xmlGetEncodingAlias(const char *alias);</li>
 <li>void xmlCleanupEncodingAliases(void);</li>
</ul><h3><a name="extend" id="extend">How to extend the existing support</a></h3><p>Adding support for a new encoding, or overriding one of
the encoders (assuming it is buggy), should not be hard: just write input
and output conversion routines to/from UTF-8, and register
them using xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx). They
will then be called automatically if the parser(s) encounter such an
encoding name (register it in uppercase, this will help). The
encoders, their arguments and expected return values are described in
the encoding.h header.</p><p><a href="bugs.html">Daniel Veillard</a></p></td></tr></table></td></tr></table></td></tr></table></td></tr></table></td></tr></table></body></html>