Daniel Veillard | 43d3f61 | 2001-11-10 11:57:23 +0000 | [diff] [blame] | 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd"> |
Daniel Veillard | be40c8b | 2000-07-14 12:10:59 +0000 | [diff] [blame] | 2 | <html> |
| 3 | <head> |
Daniel Veillard | 7216cfd | 2002-11-08 15:10:00 +0000 | [diff] [blame^] | 4 | <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> |
Daniel Veillard | c332dab | 2002-03-29 14:08:27 +0000 | [diff] [blame] | 5 | <link rel="SHORTCUT ICON" href="/favicon.ico"> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 6 | <style type="text/css"><!-- |
Daniel Veillard | 373a475 | 2002-02-21 14:46:29 +0000 | [diff] [blame] | 7 | TD {font-family: Verdana,Arial,Helvetica} |
| 8 | BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em} |
| 9 | H1 {font-family: Verdana,Arial,Helvetica} |
| 10 | H2 {font-family: Verdana,Arial,Helvetica} |
| 11 | H3 {font-family: Verdana,Arial,Helvetica} |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 12 | A:link, A:visited, A:active { text-decoration: underline } |
| 13 | --></style> |
| 14 | <title>Encodings support</title> |
Daniel Veillard | be40c8b | 2000-07-14 12:10:59 +0000 | [diff] [blame] | 15 | </head> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 16 | <body bgcolor="#8b7765" text="#000000" link="#000000" vlink="#000000"> |
| 17 | <table border="0" width="100%" cellpadding="5" cellspacing="0" align="center"><tr> |
| 18 | <td width="180"> |
Daniel Veillard | 8f40f1e | 2002-08-28 21:18:45 +0000 | [diff] [blame] | 19 | <a href="http://www.gnome.org/"><img src="gnome2.png" alt="Gnome2 Logo"></a><a href="http://www.w3.org/Status"><img src="w3c.png" alt="W3C Logo"></a><a href="http://www.redhat.com/"><img src="redhat.gif" alt="Red Hat Logo"></a><div align="left"><a href="http://xmlsoft.org/"><img src="Libxml2-Logo-180x168.gif" alt="Made with Libxml2 Logo"></a></div> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 20 | </td> |
| 21 | <td><table border="0" width="90%" cellpadding="2" cellspacing="0" align="center" bgcolor="#000000"><tr><td><table width="100%" border="0" cellspacing="1" cellpadding="3" bgcolor="#fffacd"><tr><td align="center"> |
| 22 | <h1>The XML C library for Gnome</h1> |
| 23 | <h2>Encodings support</h2> |
| 24 | </td></tr></table></td></tr></table></td> |
| 25 | </tr></table> |
| 26 | <table border="0" cellpadding="4" cellspacing="0" width="100%" align="center"><tr><td bgcolor="#8b7765"><table border="0" cellspacing="0" cellpadding="2" width="100%"><tr> |
| 27 | <td valign="top" width="200" bgcolor="#8b7765"><table border="0" cellspacing="0" cellpadding="1" width="100%" bgcolor="#000000"><tr><td> |
| 28 | <table width="100%" border="0" cellspacing="1" cellpadding="3"> |
| 29 | <tr><td colspan="1" bgcolor="#eecfa1" align="center"><center><b>Main Menu</b></center></td></tr> |
Daniel Veillard | 8acca11 | 2002-01-21 09:52:27 +0000 | [diff] [blame] | 30 | <tr><td bgcolor="#fffacd"><ul> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 31 | <li><a href="index.html">Home</a></li> |
| 32 | <li><a href="intro.html">Introduction</a></li> |
| 33 | <li><a href="FAQ.html">FAQ</a></li> |
| 34 | <li><a href="docs.html">Documentation</a></li> |
| 35 | <li><a href="bugs.html">Reporting bugs and getting help</a></li> |
| 36 | <li><a href="help.html">How to help</a></li> |
| 37 | <li><a href="downloads.html">Downloads</a></li> |
| 38 | <li><a href="news.html">News</a></li> |
Daniel Veillard | 7b602b4 | 2002-01-08 13:26:00 +0000 | [diff] [blame] | 39 | <li><a href="XMLinfo.html">XML</a></li> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 40 | <li><a href="XSLT.html">XSLT</a></li> |
Daniel Veillard | 6dbcaf8 | 2002-02-20 14:37:47 +0000 | [diff] [blame] | 41 | <li><a href="python.html">Python and bindings</a></li> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 42 | <li><a href="architecture.html">libxml architecture</a></li> |
| 43 | <li><a href="tree.html">The tree output</a></li> |
| 44 | <li><a href="interface.html">The SAX interface</a></li> |
| 45 | <li><a href="xmldtd.html">Validation & DTDs</a></li> |
| 46 | <li><a href="xmlmem.html">Memory Management</a></li> |
| 47 | <li><a href="encoding.html">Encodings support</a></li> |
| 48 | <li><a href="xmlio.html">I/O Interfaces</a></li> |
| 49 | <li><a href="catalog.html">Catalog support</a></li> |
| 50 | <li><a href="library.html">The parser interfaces</a></li> |
| 51 | <li><a href="entities.html">Entities or no entities</a></li> |
| 52 | <li><a href="namespaces.html">Namespaces</a></li> |
| 53 | <li><a href="upgrade.html">Upgrading 1.x code</a></li> |
Daniel Veillard | 52dcab3 | 2001-10-30 12:51:17 +0000 | [diff] [blame] | 54 | <li><a href="threads.html">Thread safety</a></li> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 55 | <li><a href="DOM.html">DOM Principles</a></li> |
| 56 | <li><a href="example.html">A real example</a></li> |
| 57 | <li><a href="contribs.html">Contributions</a></li> |
Daniel Veillard | fc59c09 | 2002-06-05 14:48:26 +0000 | [diff] [blame] | 58 | <li><a href="tutorial/index.html">Tutorial</a></li> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 59 | <li> |
| 60 | <a href="xml.html">flat page</a>, <a href="site.xsl">stylesheet</a> |
| 61 | </li> |
| 62 | </ul></td></tr> |
| 63 | </table> |
| 64 | <table width="100%" border="0" cellspacing="1" cellpadding="3"> |
Daniel Veillard | 3bf65be | 2002-01-23 12:36:34 +0000 | [diff] [blame] | 65 | <tr><td colspan="1" bgcolor="#eecfa1" align="center"><center><b>API Indexes</b></center></td></tr> |
Daniel Veillard | 5ede35e | 2002-10-01 11:37:35 +0000 | [diff] [blame] | 66 | <tr><td bgcolor="#fffacd"> |
Daniel Veillard | 595978c | 2002-10-09 18:46:35 +0000 | [diff] [blame] | 67 | <form action="search.php" enctype="application/x-www-form-urlencoded" method="GET"> |
Daniel Veillard | 5ede35e | 2002-10-01 11:37:35 +0000 | [diff] [blame] | 68 | <input name="query" type="TEXT" size="20" value=""><input name="submit" type="submit" value="Search ..."> |
| 69 | </form> |
| 70 | <ul> |
Daniel Veillard | f859256 | 2002-01-23 17:58:17 +0000 | [diff] [blame] | 71 | <li><a href="APIchunk0.html">Alphabetic</a></li> |
Daniel Veillard | 3bf65be | 2002-01-23 12:36:34 +0000 | [diff] [blame] | 72 | <li><a href="APIconstructors.html">Constructors</a></li> |
| 73 | <li><a href="APIfunctions.html">Functions/Types</a></li> |
| 74 | <li><a href="APIfiles.html">Modules</a></li> |
| 75 | <li><a href="APIsymbols.html">Symbols</a></li> |
Daniel Veillard | 5ede35e | 2002-10-01 11:37:35 +0000 | [diff] [blame] | 76 | </ul> |
| 77 | </td></tr> |
Daniel Veillard | 3bf65be | 2002-01-23 12:36:34 +0000 | [diff] [blame] | 78 | </table> |
| 79 | <table width="100%" border="0" cellspacing="1" cellpadding="3"> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 80 | <tr><td colspan="1" bgcolor="#eecfa1" align="center"><center><b>Related links</b></center></td></tr> |
Daniel Veillard | 8acca11 | 2002-01-21 09:52:27 +0000 | [diff] [blame] | 81 | <tr><td bgcolor="#fffacd"><ul> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 82 | <li><a href="http://mail.gnome.org/archives/xml/">Mail archive</a></li> |
| 83 | <li><a href="http://xmlsoft.org/XSLT/">XSLT libxslt</a></li> |
Daniel Veillard | 4a85920 | 2002-01-08 11:49:22 +0000 | [diff] [blame] | 84 | <li><a href="http://phd.cs.unibo.it/gdome2/">DOM gdome2</a></li> |
Daniel Veillard | 2d347fa | 2002-03-17 10:34:11 +0000 | [diff] [blame] | 85 | <li><a href="http://www.aleksey.com/xmlsec/">XML-DSig xmlsec</a></li> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 86 | <li><a href="ftp://xmlsoft.org/">FTP</a></li> |
| 87 | <li><a href="http://www.fh-frankfurt.de/~igor/projects/libxml/">Windows binaries</a></li> |
Daniel Veillard | db9dfd9 | 2001-11-26 17:25:02 +0000 | [diff] [blame] | 88 | <li><a href="http://garypennington.net/libxml2/">Solaris binaries</a></li> |
Daniel Veillard | cb7543b | 2002-09-09 10:54:06 +0000 | [diff] [blame] | 89 | <li><a href="http://www.zveno.com/open_source/libxml2xslt.html">MacOsX binaries</a></li> |
Daniel Veillard | e6d8e20 | 2002-05-02 06:11:10 +0000 | [diff] [blame] | 90 | <li><a href="http://sourceforge.net/projects/libxml2-pas/">Pascal bindings</a></li> |
Daniel Veillard | 2d347fa | 2002-03-17 10:34:11 +0000 | [diff] [blame] | 91 | <li><a href="http://bugzilla.gnome.org/buglist.cgi?product=libxml&product=libxml2">Bug Tracker</a></li> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 92 | </ul></td></tr> |
| 93 | </table> |
| 94 | </td></tr></table></td> |
| 95 | <td valign="top" bgcolor="#8b7765"><table border="0" cellspacing="0" cellpadding="1" width="100%"><tr><td><table border="0" cellspacing="0" cellpadding="1" width="100%" bgcolor="#000000"><tr><td><table border="0" cellpadding="3" cellspacing="1" width="100%"><tr><td bgcolor="#fffacd"> |
<p>Table of Contents:</p>
| 97 | <ol> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 98 | <li><a href="encoding.html#What">What does internationalization support |
| 99 | mean ?</a></li> |
Daniel Veillard | 0b28e88 | 2002-07-24 23:47:05 +0000 | [diff] [blame] | 100 | <li><a href="encoding.html#internal">The internal encoding, how and |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 101 | why</a></li> |
Daniel Veillard | 0b28e88 | 2002-07-24 23:47:05 +0000 | [diff] [blame] | 102 | <li><a href="encoding.html#implemente">How is it implemented ?</a></li> |
| 103 | <li><a href="encoding.html#Default">Default supported encodings</a></li> |
| 104 | <li><a href="encoding.html#extend">How to extend the existing |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 105 | support</a></li> |
Daniel Veillard | be40c8b | 2000-07-14 12:10:59 +0000 | [diff] [blame] | 106 | </ol> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 107 | <h3><a name="What">What does internationalization support mean ?</a></h3> |
<p>XML was designed from the start to support any character set by using
Unicode. Any conformant XML parser has to support the UTF-8 and UTF-16
encodings, both of which can express the full Unicode range. UTF-8 is a
variable-length encoding whose main strengths are that it encodes ASCII
unchanged and saves space for Western text, but it is a bit more complex to
handle in practice. UTF-16 uses 2 bytes per character (and sometimes combines
two such 2-byte units into a surrogate pair); it makes implementation easier,
but is overkill for Western languages. Moreover, the XML specification allows
documents to be encoded in other encodings, provided that they are clearly
labeled as such. For example, the following is a well-formed XML document
encoded in ISO-8859-1 and using the accented letters that we French like, in
both markup and content:</p>
<pre>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;très&gt;là&lt;/très&gt;</pre>
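<p>To make the difference between the two encodings concrete, here is a
minimal sketch of a UTF-8 encoder for a single code point. It is not taken
from libxml; the function name utf8_encode is an assumption made for this
illustration. It shows that ASCII characters stay on one byte, while an
accented letter such as U+00E8, a single byte (0xE8) in ISO-8859-1, takes two
bytes in UTF-8.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of libxml: encode one Unicode code point
 * as UTF-8. Writes up to 4 bytes to out and returns the number of bytes
 * written, or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* includes Latin-1 accented letters */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

<p>With this sketch, the letter "t" (U+0074) stays on one byte, U+00E8 becomes
the two bytes 0xC3 0xA8, and a Chinese character such as U+4E2D needs three
bytes.</p>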
Daniel Veillard | 63d8314 | 2002-05-20 06:51:05 +0000 | [diff] [blame] | 122 | <p>Having internationalization support in libxml means the following:</p> |
Daniel Veillard | be40c8b | 2000-07-14 12:10:59 +0000 | [diff] [blame] | 123 | <ul> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 124 | <li>the document is properly parsed</li> |
<li>information about its encoding is saved</li>
| 126 | <li>it can be modified</li> |
| 127 | <li>it can be saved in its original encoding</li> |
| 128 | <li>it can also be saved in another encoding supported by libxml (for |
example straight UTF-8 or even an ASCII form)</li>
| 130 | </ul> |
Daniel Veillard | be40c8b | 2000-07-14 12:10:59 +0000 | [diff] [blame] | 131 | <p>Another very important point is that the whole libxml API, with the |
| 132 | exception of a few routines to read with a specific encoding or save to a |
| 133 | specific encoding, is completely agnostic about the original encoding of the |
| 134 | document.</p> |
<p>Note that the HTML parser embedded in libxml now obeys the same rules:
the following document will be (as of 2.2.2) handled in an internationalized
fashion by libxml as well:</p>
<pre>&lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html lang="fr"&gt;
&lt;head&gt;
&lt;META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1"&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;p&gt;W3C crée des standards pour le Web.&lt;/body&gt;
&lt;/html&gt;</pre>
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 147 | <h3><a name="internal">The internal encoding, how and why</a></h3> |
<p>One of the core decisions was to force all documents to be converted to a
default internal encoding, and to make that encoding UTF-8. Here is the
rationale for those choices:</p>
| 151 | <ul> |
<li>keeping the native encoding in the internal form would force libxml
users (or the associated code) to be fully aware of the encoding of the
original document. For example, when adding a text node to a document, the
content would have to be provided in the document encoding, i.e. the client
code would have to check the encoding beforehand, make sure the content
conforms to it, etc. This is very hard in practice, though in some specific
cases it may make sense.</li>
<li>the second decision was which encoding to use. From the XML spec, only
UTF-8 and UTF-16 really make sense, being the only two encodings for which
support is mandatory. UCS-4 (a 32-bit fixed-size encoding) could be
considered an intelligent choice too, since it is a direct mapping of
Unicode. I selected UTF-8 on the basis of efficiency and compatibility
with surrounding software:
| 165 | <ul> |
<li>UTF-8, while a bit more complex to convert from/to (i.e. slightly
more costly in CPU terms to import and export), is also far more compact
than UTF-16 (and UCS-4) for the majority of the documents I see it used
for right now (RPM RDF catalogs, advogato data, various configuration
file formats, etc.), and the key point for today's computer
architectures is efficient use of caches. If one nearly doubles the
memory required to store the same amount of data, this will thrash the
caches (main memory/external caches/internal caches), and my take is
that this harms the system far more than the CPU cost of the conversion
to UTF-8.</li>
<li>Most users of libxml version 1 were using it with straight ASCII
most of the time; switching to an internal encoding that would require
all their code to be rewritten was a serious show-stopper for using
UTF-16 or UCS-4.</li>
<li>UTF-8 is being used as the de facto internal encoding standard for
related code like <a href="http://www.pango.org/">pango</a>, the
upcoming Gnome text widget, and a lot of Unix code (yep, another place
where the Unix programmer base takes a different approach from Microsoft,
which uses UTF-16).</li>
Daniel Veillard | 0b28e88 | 2002-07-24 23:47:05 +0000 | [diff] [blame] | 185 | </ul> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 186 | </li> |
| 187 | </ul> |
Daniel Veillard | be40c8b | 2000-07-14 12:10:59 +0000 | [diff] [blame] | 188 | <p>What does this mean in practice for the libxml user:</p> |
| 189 | <ul> |
<li>xmlChar, the libxml data type, is a byte; those bytes must be assembled
into valid UTF-8 strings. The proper way to terminate an xmlChar * string
is simply to append a 0 byte, as usual.</li>
<li>One just needs to make sure that, when using characters outside the
ASCII set, the values have been properly converted to UTF-8.</li>
| 195 | </ul> |
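<p>The two points above can be illustrated with a small checker, written
here as a sketch using only the C standard library. The name utf8_char_count
is hypothetical and this is not how libxml itself validates input. It walks a
zero-terminated byte string, verifies that the bytes form valid UTF-8, and
counts characters rather than bytes.</p>

```c
#include <stddef.h>

/* Hypothetical helper, not part of libxml: count the characters (code
 * points) in a zero-terminated byte string while checking that the bytes
 * form valid UTF-8. Returns the character count, or -1 if invalid. */
long utf8_char_count(const unsigned char *s)
{
    long count = 0;

    while (*s != 0) {
        unsigned long cp;
        int extra;                          /* continuation bytes expected */

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return -1;                     /* stray continuation or bad lead byte */
        s++;

        for (int i = 0; i < extra; i++, s++) {
            if ((*s & 0xC0) != 0x80)        /* also stops at the terminating 0 */
                return -1;
            cp = (cp << 6) | (unsigned long)(*s & 0x3F);
        }

        /* reject overlong forms, surrogates and values above U+10FFFF */
        if ((extra == 1 && cp < 0x80) ||
            (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return -1;
        count++;
    }
    return count;
}
```

<p>A string holding the ISO-8859-1 byte 0xE8 followed by "s" is rejected by
this sketch, because 0xE8 announces a three-byte sequence and "s" is not a
continuation byte. The same word converted to UTF-8 (0xC3 0xA8 for the
accented letter) is accepted and counted as four characters on five
bytes.</p>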
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 196 | <h3><a name="implemente">How is it implemented ?</a></h3> |
<p>Let's describe how all this works within libxml. Basically, the I18N
(internationalization) support gets triggered only during I/O operations,
i.e. when reading a document or saving one. Let's look first at the reading
sequence:</p>
| 201 | <ol> |
<li>when a document is processed, we usually don't know the encoding; a
simple heuristic allows libxml to distinguish UTF-16 and UCS-4 from those
encodings where the ASCII range (0-0x7F) maps to ASCII</li>
<li>the XML declaration, if available, is parsed, including the encoding
declaration. At that point, if the autodetected encoding is different
from the one declared, a call to xmlSwitchEncoding() is issued.</li>
<li>If there is no encoding declaration, then the input has to be in either
UTF-8 or UTF-16. If it is not, then at some point while processing the
input, the converter/checker of UTF-8 form will raise an encoding error.
You may end up with a garbled document, or no document at all! Example:
<pre>~/XML -&gt; ./xmllint err.xml
err.xml:1: error: Input is not proper UTF-8, indicate encoding !
&lt;très&gt;là&lt;/très&gt;
^
err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
&lt;très&gt;là&lt;/très&gt;
^</pre>
Daniel Veillard | 0b28e88 | 2002-07-24 23:47:05 +0000 | [diff] [blame] | 219 | </li> |
<li>xmlSwitchEncoding() does an encoding name lookup, canonicalizes the name,
and then searches the default registered encoding converters for that
encoding. If it's not within the default set and iconv() support has been
compiled in, it will ask iconv for such an encoder. If this fails, then the
parser will report an error and stop processing:
<pre>~/XML -&gt; ./xmllint err2.xml
err2.xml:1: error: Unsupported encoding UnsupportedEnc
&lt;?xml version="1.0" encoding="UnsupportedEnc"?&gt;
^</pre>
Daniel Veillard | 0b28e88 | 2002-07-24 23:47:05 +0000 | [diff] [blame] | 229 | </li> |
<li>From that point on, the encoder progressively processes the input (it is
plugged in as a front-end to the I/O module) for that entity. It captures
the document to be parsed and converts it on the fly to UTF-8. The parser
itself just does UTF-8 checking of this input and processes it
transparently. The only difference is that the encoding information has
been added to the parsing context (more precisely, to the input
corresponding to this entity).</li>
<li>The result (when using DOM) is an internal form completely in UTF-8,
with just the encoding information on the document node.</li>
Daniel Veillard | be40c8b | 2000-07-14 12:10:59 +0000 | [diff] [blame] | 239 | </ol> |
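<p>The first step of the sequence above, the detection heuristic, can be
sketched as follows. This is a simplified illustration along the lines of the
autodetection appendix of the XML 1.0 specification, not libxml's actual
code; the enum and the function name guess_encoding are assumptions made for
this example. It only looks at the first bytes of the input, searching for a
byte order mark or for the way the opening "&lt;" or "&lt;?" is laid out.</p>

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    ENC_UNKNOWN,    /* ASCII-compatible: rely on the encoding declaration */
    ENC_UTF8,       /* UTF-8 byte order mark seen */
    ENC_UTF16LE,
    ENC_UTF16BE,
    ENC_UCS4LE,
    ENC_UCS4BE
} guessed_encoding;

/* Guess the encoding family from the first bytes of a document. */
guessed_encoding guess_encoding(const unsigned char *in, size_t len)
{
    if (len >= 4) {
        /* UCS-4 byte order marks (checked before the UTF-16 ones,
         * since FF FE 00 00 also starts with the UTF-16LE mark) */
        if (in[0] == 0x00 && in[1] == 0x00 && in[2] == 0xFE && in[3] == 0xFF)
            return ENC_UCS4BE;
        if (in[0] == 0xFF && in[1] == 0xFE && in[2] == 0x00 && in[3] == 0x00)
            return ENC_UCS4LE;
        /* no mark: "<" stored as one 32-bit unit */
        if (in[0] == 0x00 && in[1] == 0x00 && in[2] == 0x00 && in[3] == 0x3C)
            return ENC_UCS4BE;
        if (in[0] == 0x3C && in[1] == 0x00 && in[2] == 0x00 && in[3] == 0x00)
            return ENC_UCS4LE;
        /* no mark: "<?" stored as two 16-bit units */
        if (in[0] == 0x00 && in[1] == 0x3C && in[2] == 0x00 && in[3] == 0x3F)
            return ENC_UTF16BE;
        if (in[0] == 0x3C && in[1] == 0x00 && in[2] == 0x3F && in[3] == 0x00)
            return ENC_UTF16LE;
    }
    if (len >= 3 && in[0] == 0xEF && in[1] == 0xBB && in[2] == 0xBF)
        return ENC_UTF8;
    if (len >= 2) {
        if (in[0] == 0xFE && in[1] == 0xFF)
            return ENC_UTF16BE;
        if (in[0] == 0xFF && in[1] == 0xFE)
            return ENC_UTF16LE;
    }
    return ENC_UNKNOWN;
}
```

<p>When the result is ENC_UNKNOWN, the input is ASCII-compatible at the start,
so the XML declaration can be read byte by byte and its encoding attribute
decides which converter to use, as described in steps 2 to 4.</p>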
<p>OK, then what happens when saving the document (assuming you
collected/built an xmlDoc DOM-like structure)? It depends on the function
called: xmlSaveFile() will just try to save in the original encoding, while
xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given
encoding:</p>
| 245 | <ol> |
<li>if no encoding is given, libxml will look for an encoding value
associated with the document and, if it exists, will try to save to that
encoding,
<p>otherwise everything is written in the internal form, i.e. UTF-8</p>
Daniel Veillard | 0b28e88 | 2002-07-24 23:47:05 +0000 | [diff] [blame] | 250 | </li> |
<li>so if an encoding was specified, either at the API level or on the
document, libxml will again canonicalize the encoding name and look up a
converter in the registered set or through iconv. If none is found, the
function will return an error code</li>
<li>the converter is placed before the I/O buffer layer, as another kind of
buffer; libxml will then simply push the UTF-8 serialization through
that buffer, which will progressively be converted and pushed onto
the I/O layer.</li>
<li>It is possible that the converter code fails on some input; for example,
trying to push a UTF-8 encoded Chinese character through the UTF-8 to
ISO-8859-1 converter won't work. Since the encoders are progressive, they
will just report the error and the number of bytes converted. At that
point libxml will decode the offending character, remove it from the
buffer, replace it with the associated charRef encoding &amp;#123; and
resume the conversion. This guarantees that any document will be saved
without losses (except for markup names, where this is not legal; this is
a problem in the current version, so in practice avoid using non-ASCII
characters for tag or attribute names). A special "ascii" encoding
name can be used to save documents in a pure ASCII form when
portability is really crucial</li>
Daniel Veillard | be40c8b | 2000-07-14 12:10:59 +0000 | [diff] [blame] | 271 | </ol> |
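<p>The character reference fallback described in the last step can be
sketched in a few lines of standard C. This is only an illustration of the
idea, assuming valid UTF-8 input and character data only (not markup names);
the function name utf8_to_latin1_charref is hypothetical, and libxml's own
output path works progressively on buffers rather than on whole strings.</p>

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper, not part of libxml: convert a zero-terminated UTF-8
 * string to ISO-8859-1. Characters above U+00FF, which ISO-8859-1 cannot
 * represent, are written as decimal character references.
 * Returns the number of bytes written (excluding the terminating 0),
 * or -1 if out is too small or the input is malformed. */
int utf8_to_latin1_charref(const unsigned char *in,
                           unsigned char *out, size_t outsize)
{
    size_t used = 0;

    while (*in != 0) {
        unsigned long cp;
        int extra;                          /* continuation bytes expected */

        if (*in < 0x80)                { cp = *in;        extra = 0; }
        else if ((*in & 0xE0) == 0xC0) { cp = *in & 0x1F; extra = 1; }
        else if ((*in & 0xF0) == 0xE0) { cp = *in & 0x0F; extra = 2; }
        else if ((*in & 0xF8) == 0xF0) { cp = *in & 0x07; extra = 3; }
        else return -1;
        in++;

        while (extra-- > 0) {
            if ((*in & 0xC0) != 0x80)
                return -1;
            cp = (cp << 6) | (unsigned long)(*in++ & 0x3F);
        }

        if (cp <= 0xFF) {
            /* representable: one output byte, keeping room for the 0 */
            if (used + 1 >= outsize)
                return -1;
            out[used++] = (unsigned char)cp;
        } else {
            /* not representable: fall back to a character reference */
            int n = snprintf((char *)out + used, outsize - used, "&#%lu;", cp);
            if (n < 0 || (size_t)n >= outsize - used)
                return -1;
            used += (size_t)n;
        }
    }
    if (used >= outsize)
        return -1;
    out[used] = 0;
    return (int)used;
}
```

<p>With this sketch the UTF-8 bytes 0xC3 0xA0 come out as the single byte
0xE0, while the Chinese character U+4E2D, which has no ISO-8859-1 form, comes
out as its decimal character reference, so no information is lost.</p>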
<p>Here are a few examples based on the same test document:</p>
<pre>~/XML -&gt; ./xmllint isolat1
&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;très&gt;là&lt;/très&gt;
~/XML -&gt; ./xmllint --encode UTF-8 isolat1
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;très&gt;là &lt;/très&gt;
~/XML -&gt; </pre>
<p>The same processing is applied (and reuses most of the code) for HTML I18N
processing. Looking up and modifying the content encoding is a bit more
difficult since it is located in a &lt;meta&gt; tag under the &lt;head&gt;,
so a couple of functions, htmlGetMetaEncoding() and htmlSetMetaEncoding(),
have been provided. The parser also attempts to switch encoding on the fly
when detecting such a tag on input. Except for that, the processing is the
same (and again reuses the same code).</p>
| 287 | <h3><a name="Default">Default supported encodings</a></h3> |
| 288 | <p>libxml has a set of default converters for the following encodings |
| 289 | (located in encoding.c):</p> |
Daniel Veillard | be40c8b | 2000-07-14 12:10:59 +0000 | [diff] [blame] | 290 | <ol> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 291 | <li>UTF-8 is supported by default (null handlers)</li> |
Daniel Veillard | 0b28e88 | 2002-07-24 23:47:05 +0000 | [diff] [blame] | 292 | <li>UTF-16, both little and big endian</li> |
| 293 | <li>ISO-Latin-1 (ISO-8859-1) covering most western languages</li> |
| 294 | <li>ASCII, useful mostly for saving</li> |
<li>HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML
predefined entities like &amp;copy; for the Copyright sign.</li>
| 297 | </ol> |
<p>Moreover, when compiled on a Unix platform with iconv support, the full
set of encodings supported by iconv can instantly be used by libxml. On a
Linux machine with glibc-2.1, the list of supported encodings and aliases
fills 3 full pages, and includes UCS-4, the full set of ISO-Latin encodings,
and the various Japanese ones.</p>
| 303 | <h4>Encoding aliases</h4> |
<p>From 2.2.3, libxml has support for registering encoding name aliases. The
goal is to be able to parse documents whose encoding is supported but where
the name differs (for example from the default set of names accepted by
iconv). The following functions allow you to register and handle new aliases
for existing encodings. Once registered, libxml will automatically look up
the aliases when handling a document:</p>
Daniel Veillard | 088f428 | 2000-08-25 23:46:50 +0000 | [diff] [blame] | 310 | <ul> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 311 | <li>int xmlAddEncodingAlias(const char *name, const char *alias);</li> |
Daniel Veillard | 0b28e88 | 2002-07-24 23:47:05 +0000 | [diff] [blame] | 312 | <li>int xmlDelEncodingAlias(const char *alias);</li> |
| 313 | <li>const char * xmlGetEncodingAlias(const char *alias);</li> |
| 314 | <li>void xmlCleanupEncodingAliases(void);</li> |
Daniel Veillard | 088f428 | 2000-08-25 23:46:50 +0000 | [diff] [blame] | 315 | </ul> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 316 | <h3><a name="extend">How to extend the existing support</a></h3> |
<p>Adding support for a new encoding, or overriding one of the encoders
(assuming it is buggy), should not be hard: just write input and output
conversion routines to/from UTF-8, and register them using
xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx). They will be
called automatically if the parser(s) encounter such an encoding name
(register it in uppercase, this will help). The encoders, their arguments
and their expected return values are described in the encoding.h
header.</p>
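<p>As an illustration of what an input conversion routine might look like,
here is a sketch of an ISO-8859-1 to UTF-8 converter. The four-argument shape
(output buffer, output length, input buffer, input length, with both lengths
updated on return) follows the converter prototype declared in encoding.h as
far as I can tell; the function name latin1_to_utf8 and the return convention
used here (number of bytes written) are assumptions, so check the header of
your libxml version before relying on them.</p>

```c
/* Sketch of an xxxToUTF8-style routine for ISO-8859-1 input.
 * On entry *inlen and *outlen hold the sizes of the two buffers; on return
 * they hold the number of bytes consumed and produced. The converter is
 * progressive: it stops cleanly when the output buffer is full.
 * Returns the number of bytes written (assumed convention, see encoding.h). */
int latin1_to_utf8(unsigned char *out, int *outlen,
                   const unsigned char *in, int *inlen)
{
    int i = 0;      /* bytes consumed from in */
    int o = 0;      /* bytes produced into out */

    while (i < *inlen) {
        unsigned char c = in[i];
        int need = (c < 0x80) ? 1 : 2;

        if (o + need > *outlen)             /* never emit a partial character */
            break;
        if (c < 0x80) {
            out[o++] = c;                   /* ASCII passes through unchanged */
        } else {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
        i++;
    }
    *inlen = i;
    *outlen = o;
    return o;
}
```

<p>A routine of this shape, together with its UTF-8 to ISO-8859-1
counterpart, is the kind of pair one would pass as the xxxToUTF8 and
UTF8Toxxx arguments of xmlNewCharEncodingHandler(). Reporting the consumed
and produced byte counts is what lets the caller resume the conversion with
the remaining input on the next call.</p>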
<p>A quick note on the topic of subverting the parser to use an internal
encoding other than UTF-8: in some cases people will absolutely want to
keep the internal encoding different. I think it's still possible (but the
encoding must be compatible with ASCII on the same subrange), though I
haven't tried it. The key is to override the default conversion routines (by
registering null encoders/decoders for your charsets), and to bypass the
UTF-8 checking of the parser by setting the parser context charset
(ctxt-&gt;charset) to something other than XML_CHAR_ENCODING_UTF8, but
there is no guarantee that this will work. You may also have some trouble
saving back.</p>
<p>Basically, proper I18N support is important. It requires at least
libxml-2.0.0, but a lot of features and corrections are really available only
starting with 2.2.</p>
Daniel Veillard | 3f4c40f | 2002-02-13 09:19:28 +0000 | [diff] [blame] | 338 | <p><a href="bugs.html">Daniel Veillard</a></p> |
Daniel Veillard | b8cfbd1 | 2001-10-25 10:53:28 +0000 | [diff] [blame] | 339 | </td></tr></table></td></tr></table></td></tr></table></td> |
| 340 | </tr></table></td></tr></table> |
Daniel Veillard | be40c8b | 2000-07-14 12:10:59 +0000 | [diff] [blame] | 341 | </body> |
| 342 | </html> |