<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
               "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
  <title>Libxml Internationalization support</title>
  <meta name="GENERATOR" content="amaya V3.2">
  <meta http-equiv="Content-Type" content="text/html">
</head>

<body bgcolor="#ffffff">
<h1 align="center">Libxml Internationalization support</h1>

<p>Location: <a
href="http://xmlsoft.org/encoding.html">http://xmlsoft.org/encoding.html</a></p>

<p>Libxml home page: <a href="http://xmlsoft.org/">http://xmlsoft.org/</a></p>

<p>Mailing-list archive: <a
href="http://xmlsoft.org/messages/">http://xmlsoft.org/messages/</a></p>

<p>Version: $Revision$</p>

<p>Table of Contents:</p>
<ol>
  <li><a href="#What">What does internationalization support mean?</a></li>
  <li><a href="#internal">The internal encoding, how and why</a></li>
  <li><a href="#implemente">How is it implemented?</a></li>
  <li><a href="#Default">Default supported encodings</a></li>
  <li><a href="#extend">How to extend the existing support</a></li>
</ol>

<h2><a name="What">What does internationalization support mean?</a></h2>

<p>XML was designed from the start to support any character set
by using Unicode. Any conformant XML parser has to support the UTF-8 and
UTF-16 default encodings, which can both express the full Unicode range. UTF-8
is a variable-length encoding whose greatest strengths are that it reuses the
same encoding as ASCII and saves space for Western texts, but it is a bit
more complex to handle in practice. UTF-16 uses 2 bytes per character (and
sometimes combines two pairs); this makes implementation easier, but looks like
a bit of overkill for encoding Western languages. Moreover, the XML
specification allows documents to be encoded in other encodings on the
condition that they are clearly labelled as such. For example, the following is
a well-formed XML document encoded in ISO Latin 1 and using the accented
letters that we French like for both markup and content:</p>
<pre>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;très&gt;là&lt;/très&gt;</pre>

<p>Having internationalization support in libxml means the following:</p>
<ul>
  <li>the document is properly parsed</li>
  <li>information about its encoding is saved</li>
  <li>it can be modified</li>
  <li>it can be saved in its original encoding</li>
  <li>it can also be saved in another encoding supported by libxml (for
    example straight UTF-8 or even an ASCII form)</li>
</ul>

<p>Another very important point is that the whole libxml API, with the
exception of a few routines to read with a specific encoding or save to a
specific encoding, is completely agnostic about the original encoding of the
document.</p>

<p>It should also be noted that the HTML parser embedded in libxml now obeys
the same rules; the following document will (as of 2.2.2) be handled in
an internationalized fashion by libxml too:</p>
<pre>&lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
               "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html lang="fr"&gt;
&lt;head&gt;
  &lt;META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-latin-1"&gt;
&lt;/head&gt;
&lt;body&gt;
&lt;p&gt;W3C crée des standards pour le Web.&lt;/body&gt;
&lt;/html&gt;</pre>

<h2><a name="internal">The internal encoding, how and why</a></h2>

<p>One of the core decisions was to force all documents to be converted to a
default internal encoding, and for that encoding to be UTF-8; here is the
rationale for those choices:</p>
<ul>
  <li>keeping the native encoding in the internal form would force the libxml
    users (or the associated code) to be fully aware of the encoding of the
    original document; for example, when adding a text node to a document, the
    content would have to be provided in the document encoding, i.e. the
    client code would have to check it beforehand, make sure it's conformant
    to the encoding, etc. Very hard in practice, though in some specific
    cases this may make sense.</li>
  <li>the second decision was which encoding. From the XML spec only UTF-8 and
    UTF-16 really make sense, as they are the only two encodings for which
    support is mandatory. UCS-4 (a 32-bit fixed-size encoding) could be
    considered an intelligent choice too, since it is a direct mapping of
    Unicode code points. I selected UTF-8 on the basis of efficiency and
    compatibility with surrounding software:
    <ul>
      <li>UTF-8, while a bit more complex to convert from/to (i.e. slightly
        more costly to import and export CPU-wise), is also far more compact
        than UTF-16 (and UCS-4) for the majority of the documents I see it used
        for right now (RPM RDF catalogs, advogato data, various configuration
        file formats, etc.), and the key point for today's computer
        architectures is efficient use of caches. If one nearly doubles the
        memory requirement to store the same amount of data, this will thrash
        caches (main memory/external caches/internal caches), and my take is
        that this harms the system far more than the CPU requirements needed
        for the conversion to UTF-8</li>
      <li>Most libxml version 1 users were using it with straight ASCII
        most of the time; an internal encoding whose conversion would have
        required all their code to be rewritten was a serious show-stopper,
        ruling out UTF-16 or UCS-4.</li>
      <li>UTF-8 is being used as the de facto internal encoding standard for
        related code like the upcoming
        <a href="http://www.pango.org/">pango</a> Gnome text widget, and a lot
        of Unix code (yep, another place where the Unix programmer base takes a
        different approach from Microsoft - they are using UTF-16)</li>
    </ul>
  </li>
</ul>

<p>What does this mean in practice for the libxml user:</p>
<ul>
  <li>xmlChar, the libxml data type, is a byte; those bytes must be assembled
    into valid UTF-8 strings. The proper way to terminate an xmlChar * string
    is simply to append a 0 byte, as usual.</li>
  <li>One just needs to make sure that when using chars outside the ASCII set,
    the values have been properly converted to UTF-8</li>
</ul>
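<p>As a concrete illustration of that last point, here is a minimal,
self-contained sketch (not libxml code, just the underlying UTF-8 arithmetic)
of how a character outside ASCII, such as 'è' (U+00E8), must be turned into a
two-byte sequence before being handed to libxml in an xmlChar string:</p>
<pre>#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;

/* Encode a Unicode code point below U+0800 as UTF-8 (1 or 2 bytes).
   Returns the number of bytes written to out. */
static int to_utf8(unsigned int code, unsigned char *out)
{
    if (code &lt; 0x80) {                  /* ASCII maps to itself */
        out[0] = (unsigned char) code;
        return 1;
    }
    out[0] = 0xC0 | (code &gt;&gt; 6);        /* leading byte: 110xxxxx */
    out[1] = 0x80 | (code &amp; 0x3F);      /* continuation: 10xxxxxx */
    return 2;
}

int main(void)
{
    unsigned char buf[8];
    int len = to_utf8(0xE8, buf);       /* 'è', U+00E8 */
    buf[len] = 0;                       /* terminate like an xmlChar string */
    assert(len == 2 &amp;&amp; buf[0] == 0xC3 &amp;&amp; buf[1] == 0xA8);
    printf("%02X %02X\n", buf[0], buf[1]);
    return 0;
}</pre>
<p>The resulting buffer, 0xC3 0xA8 followed by a 0 byte, is what libxml
expects wherever an xmlChar * is taken.</p>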

<h2><a name="implemente">How is it implemented?</a></h2>

<p>Let's describe how all this works within libxml. Basically, the I18N
(internationalization) support gets triggered only during I/O operations, i.e.
when reading a document or saving one. Let's look first at the reading
sequence:</p>
<ol>
  <li>when a document is processed, we usually don't know the encoding; a
    simple heuristic allows detecting UTF-16 and UCS-4 from encodings where
    the ASCII range (0-0x7F) maps to ASCII</li>
  <li>the XML declaration, if available, is parsed, including the encoding
    declaration. At that point, if the autodetected encoding is different from
    the one declared, a call to xmlSwitchEncoding() is issued.</li>
  <li>If there is no encoding declaration, then the input has to be in either
    UTF-8 or UTF-16; if it is not, then at some point when processing the
    input, the converter/checker of the UTF-8 form will raise an encoding
    error. You may end up with a garbled document, or no document at all!
    Example:
    <pre>~/XML -&gt; ./xmllint err.xml
err.xml:1: error: Input is not proper UTF-8, indicate encoding !
&lt;très&gt;là&lt;/très&gt;
   ^
err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
&lt;très&gt;là&lt;/très&gt;
   ^</pre>
  </li>
  <li>xmlSwitchEncoding() does an encoding name lookup, canonicalizes it, and
    then searches the default registered encoding converters for that
    encoding. If it's not within the default set and iconv() support has been
    compiled in, it will ask iconv for such an encoder. If this fails then the
    parser will report an error and stop processing:
    <pre>~/XML -&gt; ./xmllint err2.xml
err2.xml:1: error: Unsupported encoding UnsupportedEnc
&lt;?xml version="1.0" encoding="UnsupportedEnc"?&gt;
                                             ^</pre>
  </li>
  <li>From that point the encoder progressively processes the input (it is
    plugged as a front-end to the I/O module) for that entity, converting the
    document to be parsed to UTF-8 on the fly. The parser itself
    just does UTF-8 checking of this input and processes it transparently. The
    only difference is that the encoding information has been added to the
    parsing context (more precisely to the input corresponding to this
    entity).</li>
  <li>The result (when using DOM) is an internal form completely in UTF-8 with
    just the encoding information on the document node.</li>
</ol>
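<p>The UTF-8 checking mentioned in step 3 can be sketched as a small,
self-contained routine (an illustration of the idea, not libxml's actual
code, and limited to 1- to 3-byte sequences for brevity). It walks the byte
stream and rejects sequences such as the stray 0xE8 byte from the err.xml
example above:</p>
<pre>#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;

/* Return 1 if the byte string is well-formed UTF-8 (1- to 3-byte
   sequences only, for brevity), 0 otherwise. */
static int is_utf8(const unsigned char *s, int len)
{
    int i = 0;
    while (i &lt; len) {
        unsigned char c = s[i];
        int follow;
        if (c &lt; 0x80)                follow = 0;   /* ASCII */
        else if ((c &amp; 0xE0) == 0xC0) follow = 1;   /* 110xxxxx */
        else if ((c &amp; 0xF0) == 0xE0) follow = 2;   /* 1110xxxx */
        else return 0;       /* stray continuation or longer form: reject */
        for (i++; follow-- &gt; 0; i++)
            if (i &gt;= len || (s[i] &amp; 0xC0) != 0x80)
                return 0;    /* missing 10xxxxxx continuation byte */
    }
    return 1;
}

int main(void)
{
    /* "&lt;très&gt;" in ISO-8859-1: 0xE8 is not valid UTF-8 on its own */
    const unsigned char latin1[] = { '&lt;', 't', 'r', 0xE8, 's', '&gt;' };
    /* the same text properly encoded in UTF-8 (è = 0xC3 0xA8) */
    const unsigned char utf8[]   = { '&lt;', 't', 'r', 0xC3, 0xA8, 's', '&gt;' };
    assert(!is_utf8(latin1, sizeof latin1));
    assert(is_utf8(utf8, sizeof utf8));
    printf("ok\n");
    return 0;
}</pre>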

<p>OK, then what happens when saving the document (assuming you
collected/built an xmlDoc DOM-like structure)? It depends on the function
called: xmlSaveFile() will just try to save in the original encoding, while
xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given
encoding:</p>
<ol>
  <li>if no encoding is given, libxml will look for an encoding value
    associated with the document and, if it exists, will try to save to that
    encoding;
    <p>otherwise everything is written in the internal form, i.e. UTF-8</p>
  </li>
  <li>if an encoding was specified, either at the API level or on the
    document, libxml will again canonicalize the encoding name and look up a
    converter in the registered set or through iconv. If none is found, the
    function will return an error code</li>
  <li>the converter is placed before the I/O buffer layer, as another kind of
    buffer; libxml will then simply push the UTF-8 serialization through
    that buffer, which will progressively convert it and push it onto
    the I/O layer.</li>
  <li>It is possible that the converter code fails on some input; for example,
    trying to push a UTF-8 encoded Chinese character through the UTF-8 to
    ISO-Latin-1 converter won't work. Since the encoders are progressive, they
    will just report the error and the number of bytes converted; at that
    point libxml will decode the offending character, remove it from the
    buffer, replace it with the associated character reference (of the form
    &amp;#123;) and resume the conversion. This guarantees that any document
    will be saved without losses. A special "ascii" encoding name, which saves
    documents to a pure ASCII form, can be used when portability is really
    crucial</li>
</ol>
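<p>The fallback in step 4 can be illustrated with a self-contained sketch
(again, not libxml's actual converter): decode the UTF-8 input, emit Latin-1
bytes where possible, and fall back to a numeric character reference for
anything above U+00FF:</p>
<pre>#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/* Serialize a UTF-8 string to ISO-8859-1, replacing any code point
   above U+00FF with a numeric character reference, as step 4 describes.
   (1- to 3-byte UTF-8 input only, for brevity.) */
static void utf8_to_latin1(const unsigned char *in, char *out)
{
    while (*in) {
        unsigned int cp;
        if (*in &lt; 0x80) {                       /* ASCII */
            cp = *in++;
        } else if ((*in &amp; 0xE0) == 0xC0) {      /* 2-byte sequence */
            cp = (in[0] &amp; 0x1F) &lt;&lt; 6 | (in[1] &amp; 0x3F);
            in += 2;
        } else {                                /* 3-byte sequence */
            cp = (in[0] &amp; 0x0F) &lt;&lt; 12 | (in[1] &amp; 0x3F) &lt;&lt; 6
                 | (in[2] &amp; 0x3F);
            in += 3;
        }
        if (cp &lt;= 0xFF)                         /* representable in Latin-1 */
            *out++ = (char) cp;
        else                                    /* fall back to a char ref */
            out += sprintf(out, "&amp;#x%X;", cp);
    }
    *out = 0;
}

int main(void)
{
    /* "là" plus U+4E2D in UTF-8: l, à (0xC3 0xA0), 0xE4 0xB8 0xAD */
    const unsigned char in[] = { 'l', 0xC3, 0xA0, 0xE4, 0xB8, 0xAD, 0 };
    char out[64];
    utf8_to_latin1(in, out);
    assert(out[0] == 'l' &amp;&amp; (unsigned char) out[1] == 0xE0);
    assert(strcmp(out + 2, "&amp;#x4E2D;") == 0);
    printf("%s\n", out + 2);
    return 0;
}</pre>
<p>The 'à' fits in Latin-1 and is emitted as the single byte 0xE0, while the
Chinese character survives as a character reference, so no information is
lost.</p>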

<p>Here are a few examples based on the same test document:</p>
<pre>~/XML -&gt; ./xmllint isolat1
&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;très&gt;là&lt;/très&gt;
~/XML -&gt; ./xmllint --encode UTF-8 isolat1
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;très&gt;là&lt;/très&gt;
~/XML -&gt; ./xmllint --encode ascii isolat1
&lt;?xml version="1.0" encoding="ascii"?&gt;
&lt;tr&amp;#xE8;s&gt;l&amp;#xE0;&lt;/tr&amp;#xE8;s&gt;
~/XML -&gt; </pre>

<p>The same processing is applied (and reuses most of the code) for HTML I18N
processing. Looking up and modifying the content encoding is a bit more
difficult, since it is located in a &lt;meta&gt; tag under the &lt;head&gt;, so
a couple of functions, htmlGetMetaEncoding() and htmlSetMetaEncoding(), have
been provided. The parser also attempts to switch encoding on the fly when
detecting such a tag on input. Except for that, the processing is the same (and
again reuses the same code).</p>

<h2><a name="Default">Default supported encodings</a></h2>

<p>libxml has a set of default converters for the following encodings (located
in encoding.c):</p>
<ol>
  <li>UTF-8 is supported by default (null handlers)</li>
  <li>UTF-16, both little and big endian</li>
  <li>ISO-Latin-1 (ISO-8859-1), covering most western languages</li>
  <li>ASCII, useful mostly for saving</li>
  <li>HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML
    predefined entities like &amp;copy; for the copyright sign.</li>
</ol>
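<p>The idea behind that last handler can be sketched with a toy,
self-contained lookup (the real table and handler live in libxml's sources;
this three-entry table is purely illustrative): known code points get their
HTML 4.0 predefined entity, everything else a numeric reference:</p>
<pre>#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/* A tiny stand-in for the kind of table the HTML handler relies on:
   a few Unicode code points and their HTML 4.0 predefined entities. */
struct ent { unsigned int cp; const char *name; };
static const struct ent entities[] = {
    { 0xA9, "copy" },   /* copyright sign */
    { 0xE8, "egrave" },
    { 0xE0, "agrave" },
};

/* Write the named entity for cp into out if one is known,
   else a numeric character reference. */
static void entity_ref(unsigned int cp, char *out)
{
    size_t i;
    for (i = 0; i &lt; sizeof entities / sizeof entities[0]; i++)
        if (entities[i].cp == cp) {
            sprintf(out, "&amp;%s;", entities[i].name);
            return;
        }
    sprintf(out, "&amp;#x%X;", cp);
}

int main(void)
{
    char buf[32];
    entity_ref(0xA9, buf);
    assert(strcmp(buf, "&amp;copy;") == 0);
    entity_ref(0x4E2D, buf);
    assert(strcmp(buf, "&amp;#x4E2D;") == 0);
    printf("%s\n", buf);
    return 0;
}</pre>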

<p>Moreover, when compiled on a Unix platform with iconv support, the full set
of encodings supported by iconv can instantly be used by libxml. On a Linux
machine with glibc-2.1 the list of supported encodings and aliases fills 3 full
pages, and includes UCS-4, the full set of ISO-Latin encodings, and the
various Japanese ones.</p>

<h2><a name="extend">How to extend the existing support</a></h2>

<p>Well, adding support for a new encoding, or overriding one of the encoders
(assuming it is buggy), should not be hard: just write input and output
conversion routines to/from UTF-8 and register them using
xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx); they will then be
called automatically if the parser(s) encounter such an encoding name
(register it uppercase - this will help). The encoders, their
arguments and expected return values are described in the encoding.h
header.</p>
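<p>As a rough sketch of what such an xxxToUTF8 input routine could look like,
here is a self-contained ISO-8859-1 converter, assuming a
(out, outlen, in, inlen) calling pattern with both lengths updated in place -
check encoding.h for the exact prototypes your version expects:</p>
<pre>#include &lt;assert.h&gt;
#include &lt;stdio.h&gt;

/* Sketch of an input conversion routine: convert ISO-8859-1 bytes in
   in[0..*inlen) to UTF-8 in out[0..*outlen), updating both lengths to
   the amounts actually consumed and produced. Returns bytes produced. */
static int latin1ToUTF8(unsigned char *out, int *outlen,
                        const unsigned char *in, int *inlen)
{
    int produced = 0, consumed = 0;

    while (consumed &lt; *inlen) {
        unsigned char c = in[consumed];
        int need = (c &lt; 0x80) ? 1 : 2;
        if (produced + need &gt; *outlen)
            break;                       /* output buffer full: stop early */
        if (c &lt; 0x80) {
            out[produced++] = c;         /* ASCII passes through */
        } else {                         /* Latin-1 maps to 2-byte UTF-8 */
            out[produced++] = 0xC0 | (c &gt;&gt; 6);
            out[produced++] = 0x80 | (c &amp; 0x3F);
        }
        consumed++;
    }
    *outlen = produced;
    *inlen = consumed;
    return produced;
}

int main(void)
{
    const unsigned char in[] = { 't', 'r', 0xE8, 's' };  /* Latin-1 input */
    unsigned char out[16];
    int inlen = sizeof in, outlen = sizeof out;

    latin1ToUTF8(out, &amp;outlen, in, &amp;inlen);
    assert(inlen == 4 &amp;&amp; outlen == 5);
    assert(out[2] == 0xC3 &amp;&amp; out[3] == 0xA8);   /* 0xE8 -&gt; 0xC3 0xA8 */
    printf("consumed %d, produced %d\n", inlen, outlen);
    return 0;
}</pre>
<p>Reporting the consumed/produced counts even on a partial conversion is
what lets the progressive saving machinery described above resume after
handling an unrepresentable character.</p>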

<p>A quick note on the topic of subverting the parser to use a different
internal encoding than UTF-8: in some cases people will absolutely want to
keep the internal encoding different. I think it's still possible (but the
encoding must be compliant with ASCII on the same subrange), though I haven't
tried it. The key is to override the default conversion routines (by
registering null encoders/decoders for your charsets), and bypass the UTF-8
checking of the parser by setting the parser context charset
(ctxt-&gt;charset) to something other than XML_CHAR_ENCODING_UTF8, but there
is no guarantee that this will work. You may also have some trouble saving
back.</p>

<p>Basically, proper I18N support is important; it requires at least
libxml-2.0.0, but a lot of features and corrections are really available only
starting with 2.2.</p>

<p><a href="mailto:Daniel.Veillard@w3.org">Daniel Veillard</a></p>

<p>$Id$</p>
</body>
</html>