First version of the encoding doc, Daniel.
diff --git a/doc/encoding.html b/doc/encoding.html
new file mode 100644
index 0000000..6135bfc
--- /dev/null
+++ b/doc/encoding.html
@@ -0,0 +1,273 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
+                      "http://www.w3.org/TR/REC-html40/loose.dtd">
+<html>
+<head>
+  <title>Libxml Internationalization support</title>
+  <meta name="GENERATOR" content="amaya V3.2">
+  <meta http-equiv="Content-Type" content="text/html">
+</head>
+
+<body bgcolor="#ffffff">
+<h1 align="center">Libxml Internationalization support</h1>
+
+<p>Location: <a
+href="http://xmlsoft.org/encoding.html">http://xmlsoft.org/encoding.html</a></p>
+
+<p>Libxml home page: <a href="http://xmlsoft.org/">http://xmlsoft.org/</a></p>
+
+<p>Mailing-list archive:  <a
+href="http://xmlsoft.org/messages/">http://xmlsoft.org/messages/</a></p>
+
+<p>Version: $Revision$</p>
+
+<p>Table of Contents:</p>
+<ol>
+  <li><a href="#What">What does internationalization support mean ?</a></li>
+  <li><a href="#internal">The internal encoding, how and why</a></li>
+  <li><a href="#implemente">How is it implemented ?</a></li>
+  <li><a href="#Default">Default supported encodings</a></li>
+  <li><a href="#extend">How to extend the existing support</a></li>
+</ol>
+
+<h2><a name="What">What does internationalization support mean ?</a></h2>
+
+<p>XML was designed from the start to allow the support of any character set
+by using Unicode. Any conformant XML parser has to support the UTF-8 and
+UTF-16 default encodings, which can both express the full Unicode range.
+UTF-8 is a variable-length encoding whose greatest advantages are that it
+reuses the same encoding as ASCII and saves space for Western texts, but it
+is a bit more complex to handle in practice. UTF-16 uses 2 bytes per
+character (and sometimes combines two pairs); it makes implementation
+easier, but looks a bit overkill for encoding Western languages. Moreover,
+the XML specification allows documents to be encoded in other encodings on
+the condition that they are clearly labelled as such. For example, the
+following is a well-formed XML document encoded in ISO Latin-1, using the
+accented letters that we French like for both markup and content:</p>
+<pre>&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
+&lt;très&gt;là&lt;/très&gt;</pre>
+
+<p>Having internationalization support in libxml means the following (a
+small code sketch follows the list):</p>
+<ul>
+  <li>the document is properly parsed</li>
+  <li>information about its encoding is saved</li>
+  <li>it can be modified</li>
+  <li>it can be saved in its original encoding</li>
+  <li>it can also be saved in another encoding supported by libxml (for
+    example straight UTF-8 or even an ASCII form)</li>
+</ul>
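+
+<p>As a concrete illustration, here is a minimal sketch of that round trip,
+assuming the ISO-8859-1 document above is stored in a hypothetical file
+tst.xml:</p>
+<pre>#include &lt;stdio.h&gt;
+#include &lt;libxml/parser.h&gt;
+#include &lt;libxml/tree.h&gt;
+
+int main(void) {
+    /* parsing converts the content to UTF-8 internally */
+    xmlDocPtr doc = xmlParseFile("tst.xml");
+    if (doc == NULL) return 1;
+
+    /* the original encoding is recorded on the document node */
+    if (doc-&gt;encoding != NULL)
+        printf("encoding: %s\n", (const char *) doc-&gt;encoding);
+
+    /* saving without an explicit encoding reuses the recorded one,
+       here ISO-8859-1 */
+    xmlSaveFile("tst2.xml", doc);
+    xmlFreeDoc(doc);
+    return 0;
+}</pre>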
+
+<p>Another very important point is that the whole libxml API, with the
+exception of a few routines to read with a specific encoding or save to a
+specific encoding, is completely agnostic about the original encoding of the
+document.</p>
+
+<p>It should also be noted that the HTML parser embedded in libxml now obeys
+the same rules; as of 2.2.2, the following document is also handled in an
+internationalized fashion by libxml:</p>
+<pre>&lt;!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
+                      "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
+&lt;html lang="fr"&gt;
+&lt;head&gt;
+  &lt;META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-latin-1"&gt;
+&lt;/head&gt;
+&lt;body&gt;
+&lt;p&gt;W3C crée des standards pour le Web.&lt;/body&gt;
+&lt;/html&gt;</pre>
+
+<h2><a name="internal">The internal encoding, how and why</a></h2>
+
+<p>One of the core decisions was to force all documents to be converted to a
+default internal encoding, and for that encoding to be UTF-8. Here is the
+rationale for those choices:</p>
+<ul>
+  <li>keeping the native encoding in the internal form would force the libxml
+    users (or the associated code) to be fully aware of the encoding of the
+    original document; for example, when adding a text node to a document,
+    the content would have to be provided in the document encoding, i.e. the
+    client code would have to check it beforehand, make sure it conforms to
+    the encoding, etc. This is very hard in practice, though in some
+    specific cases it may make sense.</li>
+  <li>the second decision was which encoding to pick. From the XML spec, only
+    UTF-8 and UTF-16 really make sense, as they are the only two encodings
+    for which support is mandatory. UCS-4 (a 32-bit fixed-size encoding)
+    could be considered an intelligent choice too, since it is a direct
+    mapping of the Unicode code points. I selected UTF-8 on the basis of
+    efficiency and compatibility with surrounding software:
+    <ul>
+      <li>UTF-8, while a bit more complex to convert from/to (i.e. slightly
+        more costly to import and export CPU-wise), is also far more compact
+        than UTF-16 (and UCS-4) for a majority of the documents I see it used
+        for right now (RPM RDF catalogs, advogato data, various configuration
+        file formats, etc.), and the key point for today's computer
+        architectures is efficient use of caches. If one nearly doubles the
+        memory requirement to store the same amount of data, this will thrash
+        caches (main memory/external caches/internal caches), and my take is
+        that this harms the system far more than the CPU requirements needed
+        for the conversion to UTF-8</li>
+      <li>Most libxml version 1 users were using it with straight ASCII most
+        of the time; an internal encoding that required all their code to be
+        rewritten to do conversions was a serious show-stopper for adopting
+        UTF-16 or UCS-4.</li>
+      <li>UTF-8 is being used as the de facto internal encoding standard for
+        related code like <a href="http://www.pango.org/">pango</a>, the
+        upcoming Gnome text widget, and a lot of Unix code (yes, another
+        place where the Unix programmer base takes a different approach from
+        Microsoft, which uses UTF-16)</li>
+    </ul>
+  </li>
+</ul>
+
+<p>What does this mean in practice for the libxml user?</p>
+<ul>
+  <li>xmlChar, the libxml data type, is a byte; those bytes must be assembled
+    as valid UTF-8 strings. The proper way to terminate an xmlChar * string
+    is simply to append a 0 byte, as usual.</li>
+  <li>One just needs to make sure that when using characters outside the
+    ASCII set, the values have been properly converted to UTF-8 (see the
+    sketch after this list)</li>
+</ul>
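+
+<p>A minimal sketch of that second point (the variable name is arbitrary):
+the ISO-8859-1 byte 0xE8 for 'è' must become the two-byte UTF-8 sequence
+0xC3 0xA8 before being handed to libxml:</p>
+<pre>#include &lt;libxml/tree.h&gt;
+
+/* "très" as a valid UTF-8 xmlChar string, terminated by a 0 byte */
+static const xmlChar tres[] = { 't', 'r', 0xC3, 0xA8, 's', 0 };
+
+/* wrong: { 't', 'r', 0xE8, 's', 0 } is the raw ISO-8859-1 form and
+   would later trigger a UTF-8 encoding error */
+
+/* usable anywhere an xmlChar * is expected, e.g.:
+   xmlNodePtr text = xmlNewText(tres); */</pre>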
+
+<h2><a name="implemente">How is it implemented ?</a></h2>
+
+<p>Let's describe how all this works within libxml. Basically, the I18N
+(internationalization) support gets triggered only during I/O operations,
+i.e. when reading a document or saving one. Let's look first at the reading
+sequence:</p>
+<ol>
+  <li>when a document is processed, we usually don't know the encoding; a
+    simple heuristic allows detecting UTF-16 and UCS-4 and telling them
+    apart from encodings where the ASCII range (0-0x7F) maps to ASCII</li>
+  <li>the XML declaration, if present, is parsed, including the encoding
+    declaration. At that point, if the autodetected encoding is different
+    from the declared one, a call to xmlSwitchEncoding() is issued.</li>
+  <li>If there is no encoding declaration, then the input has to be in
+    either UTF-8 or UTF-16; if it is not, then at some point while
+    processing the input the UTF-8 converter/checker will raise an encoding
+    error. You may end up with a garbled document, or no document at all!
+    Example:
+    <pre>~/XML -&gt; ./xmllint err.xml 
+err.xml:1: error: Input is not proper UTF-8, indicate encoding !
+&lt;très&gt;là&lt;/très&gt;
+   ^
+err.xml:1: error: Bytes: 0xE8 0x73 0x3E 0x6C
+&lt;très&gt;là&lt;/très&gt;
+   ^</pre>
+  </li>
+  <li>xmlSwitchEncoding() does an encoding name lookup, canonicalizes the
+    name, and then searches the default registered encoding converters for
+    that encoding. If it is not within the default set and iconv() support
+    has been compiled in, it will ask iconv for such an encoder. If this
+    fails, the parser reports an error and stops processing:
+    <pre>~/XML -&gt; ./xmllint err2.xml 
+err2.xml:1: error: Unsupported encoding UnsupportedEnc
+&lt;?xml version="1.0" encoding="UnsupportedEnc"?&gt;
+                                             ^</pre>
+  </li>
+  <li>From that point on, the encoder progressively processes the input for
+    that entity (it is plugged in as a front-end to the I/O module). It
+    captures and converts on the fly the document to be parsed into UTF-8.
+    The parser itself just does UTF-8 checking of this input and processes
+    it transparently. The only difference is that the encoding information
+    has been added to the parsing context (more precisely to the input
+    corresponding to this entity).</li>
+  <li>The result (when using DOM) is an internal form completely in UTF-8,
+    with just the encoding information on the document node (see the sketch
+    below).</li>
+</ol>
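+
+<p>As a hedged sketch of that result (using the isolat1 test file shown in
+the examples below): whatever the input encoding was, the content handed
+back by the API is UTF-8:</p>
+<pre>#include &lt;stdio.h&gt;
+#include &lt;libxml/parser.h&gt;
+#include &lt;libxml/tree.h&gt;
+
+int main(void) {
+    xmlDocPtr doc = xmlParseFile("isolat1");
+    xmlNodePtr root;
+    xmlChar *content;
+
+    if (doc == NULL) return 1;
+    root = xmlDocGetRootElement(doc);
+    /* the content comes back as UTF-8, even though the file was
+       ISO-8859-1: 'à' is returned as the two bytes 0xC3 0xA0 */
+    content = xmlNodeGetContent(root);
+    printf("%s\n", (const char *) content);
+    xmlFree(content);
+    xmlFreeDoc(doc);
+    return 0;
+}</pre>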
+
+<p>OK, then what happens when saving the document (assuming you
+collected/built an xmlDoc DOM-like structure)? It depends on the function
+called: xmlSaveFile() will just try to save in the original encoding, while
+xmlSaveFileTo() and xmlSaveFileEnc() can optionally save to a given
+encoding:</p>
+<ol>
+  <li>if no encoding is given, libxml will look for an encoding value
+    associated with the document and, if one exists, will try to save to
+    that encoding;
+    <p>otherwise everything is written in the internal form, i.e. UTF-8</p>
+  </li>
+  <li>if an encoding was specified, either at the API level or on the
+    document, libxml will again canonicalize the encoding name and look up a
+    converter in the registered set or through iconv. If none is found, the
+    function returns an error code</li>
+  <li>the converter is placed before the I/O buffer layer, as another kind
+    of buffer; libxml then simply pushes the UTF-8 serialization through
+    that buffer, where it is progressively converted and pushed onto the
+    I/O layer.</li>
+  <li>It is possible that the converter code fails on some input, for
+    example trying to push a UTF-8 encoded Chinese character through the
+    UTF-8 to ISO-Latin-1 converter won't work. Since the encoders are
+    progressive, they will just report the error and the number of bytes
+    converted; at that point libxml will decode the offending character,
+    remove it from the buffer, replace it with the associated character
+    reference encoding such as &amp;#123;, and resume the conversion. This
+    guarantees that any document can be saved without losses. A special
+    "ascii" encoding name, which saves documents to a pure ASCII form, can
+    be used when portability is really crucial (see the sketch after this
+    list)</li>
+</ol>
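+
+<p>A minimal sketch of saving to explicit encodings (the output file names
+are hypothetical; the isolat1 input is the test file used below):</p>
+<pre>#include &lt;stdio.h&gt;
+#include &lt;libxml/parser.h&gt;
+#include &lt;libxml/tree.h&gt;
+
+int main(void) {
+    xmlDocPtr doc = xmlParseFile("isolat1");
+    if (doc == NULL) return 1;
+
+    /* force the serialization to UTF-8, regardless of the encoding
+       recorded on the document */
+    if (xmlSaveFileEnc("out-utf8.xml", doc, "UTF-8") &lt; 0)
+        fprintf(stderr, "saving to UTF-8 failed\n");
+
+    /* the special "ascii" name gives a pure ASCII serialization,
+       escaping non-ASCII content as character references */
+    xmlSaveFileEnc("out-ascii.xml", doc, "ascii");
+
+    xmlFreeDoc(doc);
+    return 0;
+}</pre>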
+
+<p>Here are a few examples based on the same test document:</p>
+<pre>~/XML -&gt; ./xmllint isolat1 
+&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
+&lt;très&gt;là&lt;/très&gt;
+~/XML -&gt; ./xmllint --encode UTF-8 isolat1 
+&lt;?xml version="1.0" encoding="UTF-8"?&gt;
+&lt;très&gt;là  &lt;/très&gt;
+~/XML -&gt; ./xmllint --encode ascii isolat1 
+&lt;?xml version="1.0" encoding="ascii"?&gt;
+&lt;tr&amp;#xE8;s&gt;l&amp;#xE0;&lt;/tr&amp;#xE8;s&gt;
+~/XML -&gt; </pre>
+
+<p>The same processing is applied (and reuses most of the code) for HTML
+I18N processing. Looking up and modifying the content encoding is a bit
+more difficult, since it is located in a &lt;meta&gt; tag under the
+&lt;head&gt;, so a couple of functions, htmlGetMetaEncoding() and
+htmlSetMetaEncoding(), have been provided. The parser also attempts to
+switch encodings on the fly when detecting such a tag on input. Except for
+that, the processing is the same (and again reuses the same code), as the
+sketch below illustrates.</p>
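+
+<p>A minimal sketch of those two helpers (the file names are
+hypothetical):</p>
+<pre>#include &lt;stdio.h&gt;
+#include &lt;libxml/HTMLparser.h&gt;
+#include &lt;libxml/HTMLtree.h&gt;
+
+int main(void) {
+    htmlDocPtr doc = htmlParseFile("test.html", NULL);
+    const xmlChar *enc;
+
+    if (doc == NULL) return 1;
+    /* read the charset recorded in the &lt;meta&gt; tag, if any */
+    enc = htmlGetMetaEncoding(doc);
+    printf("meta encoding: %s\n", enc ? (const char *) enc : "(none)");
+
+    /* rewrite the &lt;meta&gt; tag, then save accordingly */
+    htmlSetMetaEncoding(doc, (const xmlChar *) "UTF-8");
+    htmlSaveFile("out.html", doc);
+    xmlFreeDoc(doc);
+    return 0;
+}</pre>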
+
+<h2><a name="Default">Default supported encodings</a></h2>
+
+<p>libxml has a set of default converters for the following encodings (located
+in encoding.c):</p>
+<ol>
+  <li>UTF-8 is supported by default (null handlers)</li>
+  <li>UTF-16, both little and big endian</li>
+  <li>ISO-Latin-1 (ISO-8859-1) covering most western languages</li>
+  <li>ASCII, useful mostly for saving</li>
+  <li>HTML, a specific handler for the conversion of UTF-8 to ASCII with HTML
+    predefined entities like &amp;copy; for the Copyright sign.</li>
+</ol>
+
+<p>Moreover, when compiled on a Unix platform with iconv support, the full
+set of encodings supported by iconv can instantly be used by libxml. On a
+Linux machine with glibc-2.1, the list of supported encodings and aliases
+fills 3 full pages, and includes UCS-4, the full set of ISO-Latin encodings,
+and the various Japanese ones.</p>
+
+<h2><a name="extend">How to extend the existing support</a></h2>
+
+<p>Well, adding support for a new encoding, or overriding one of the
+encoders (assuming it is buggy), should not be hard: just write input and
+output conversion routines to/from UTF-8 and register them using
+xmlNewCharEncodingHandler(name, xxxToUTF8, UTF8Toxxx); they will be called
+automatically if the parser(s) encounter such an encoding name (register it
+uppercase, this will help). The encoders, their arguments and their
+expected return values are described in the encoding.h header; a sketch
+follows.</p>
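+
+<p>As a hedged sketch (the charset name is made up, and the conversion
+logic assumes an 8-bit charset whose codes match the first 256 Unicode code
+points, like ISO-8859-1; check encoding.h for the exact contract):</p>
+<pre>#include &lt;libxml/encoding.h&gt;
+
+/* input side: convert from the 8-bit charset to UTF-8 */
+static int
+xxxToUTF8(unsigned char *out, int *outlen,
+          const unsigned char *in, int *inlen) {
+    int i = 0, o = 0;
+
+    while ((i &lt; *inlen) &amp;&amp; (o + 2 &lt;= *outlen)) {
+        unsigned int c = in[i++];
+        if (c &lt; 0x80) {            /* ASCII maps unchanged */
+            out[o++] = c;
+        } else {                     /* two-byte UTF-8 sequence */
+            out[o++] = 0xC0 | (c &gt;&gt; 6);
+            out[o++] = 0x80 | (c &amp; 0x3F);
+        }
+    }
+    *inlen = i;    /* input bytes consumed */
+    *outlen = o;   /* output bytes produced */
+    return o;
+}
+
+/* register it at startup; the output (UTF8Toxxx) side is left NULL
+   here, so documents in this charset can be read but not saved */
+void register_my_charset(void) {
+    xmlNewCharEncodingHandler("MY-CHARSET", xxxToUTF8, NULL);
+}</pre>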
+
+<p>A quick note on the topic of subverting the parser to use an internal
+encoding different from UTF-8: in some cases people will absolutely want to
+keep the internal encoding different. I think it is still possible (but the
+encoding must be compatible with ASCII on the same subrange), though I
+haven't tried it. The key is to override the default conversion routines (by
+registering null encoders/decoders for your charsets) and to bypass the
+UTF-8 checking of the parser by setting the parser context charset
+(ctxt-&gt;charset) to something different from XML_CHAR_ENCODING_UTF8, but
+there is no guarantee that this will work. You may also have some trouble
+saving back.</p>
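+
+<p>Untested, but a minimal sketch of the idea for ISO-8859-1 (ctxt is a
+parser context you created for the document):</p>
+<pre>#include &lt;libxml/parser.h&gt;
+#include &lt;libxml/encoding.h&gt;
+
+/* keep ISO-8859-1 as the in-memory encoding: register pass-through
+   (null) converters and flag the context so the parser skips its
+   UTF-8 checking; untested, and saving back may misbehave */
+void subvert_internal_encoding(xmlParserCtxtPtr ctxt) {
+    xmlNewCharEncodingHandler("ISO-8859-1", NULL, NULL);
+    ctxt-&gt;charset = XML_CHAR_ENCODING_8859_1;
+}</pre>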
+
+<p>Basically, proper I18N support is important. It requires at least
+libxml-2.0.0, but a lot of features and corrections are really available
+only starting with 2.2.</p>
+
+<p><a href="mailto:Daniel.Veillard@w3.org">Daniel Veillard</a></p>
+
+<p>$Id$</p>
+</body>
+</html>