MDT 2004 John Fleck | 4c3bb7d | 2004-08-25 02:51:27 +0000 | [diff] [blame^] | 1 | <html><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>Encoding Conversion</title><meta name="generator" content="DocBook XSL Stylesheets V1.61.2"><link rel="home" href="index.html" title="Libxml Tutorial"><link rel="up" href="index.html" title="Libxml Tutorial"><link rel="previous" href="ar01s08.html" title="Retrieving Attributes"><link rel="next" href="apa.html" title="A. Compilation"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">Encoding Conversion</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="ar01s08.html">Prev</a> </td><th width="60%" align="center"> </th><td width="20%" align="right"> <a accesskey="n" href="apa.html">Next</a></td></tr></table><hr></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="xmltutorialconvert"></a>Encoding Conversion</h2></div></div><div></div></div><p><a class="indexterm" name="id2587348"></a> |
MDT 2003 John Fleck | bc6734a | 2003-08-28 15:01:40 +0000 | [diff] [blame] | 2 | Data encoding compatibility problems are one of the most common |
| 3 | difficulties encountered by programmers new to <span class="acronym">XML</span> in |
| 4 | general and <span class="application">libxml</span> in particular. Thinking |
| 5 | through the design of your application in light of this issue will help |
| 6 | avoid difficulties later. Internally, <span class="application">libxml</span> |
| 7 | stores and manipulates data in the UTF-8 format. Data used by your program |
| 8 | in other formats, such as the commonly used ISO-8859-1 encoding, must be |
| 9 | converted to UTF-8 before passing it to <span class="application">libxml</span> |
| 10 | functions. If you want your program's output in an encoding other than |
| 11 | UTF-8, you also must convert it.</p><p><span class="application">Libxml</span> uses |
| 12 | <span class="application">iconv</span> if it is available to convert |
| 13 | data. Without <span class="application">iconv</span>, only UTF-8, UTF-16 and |
| 14 | ISO-8859-1 can be used as external formats. With |
| 15 | <span class="application">iconv</span>, any format can be used provided |
| 16 | <span class="application">iconv</span> is able to convert it to and from |
| 17 | UTF-8. Currently <span class="application">iconv</span> supports about 150 |
| 18 | different character formats and can convert between any two of them. While |
| 19 | the actual number of supported formats varies between implementations, nearly |
| 20 | every <span class="application">iconv</span> implementation supports the |
MST 2004 John Fleck | d14bccc | 2004-02-15 01:57:42 +0000 | [diff] [blame] | 21 | formats in common use.</p><div class="warning" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Warning"><tr><td rowspan="2" align="center" valign="top" width="25"><img alt="[Warning]" src="images/warning.png"></td><th align="left">Warning</th></tr><tr><td colspan="2" align="left" valign="top"><p>A common mistake is to use different formats for the internal data |
MDT 2003 John Fleck | bc6734a | 2003-08-28 15:01:40 +0000 | [diff] [blame] | 22 | in different parts of one's code. The most common case is an application |
| 23 | that assumes ISO-8859-1 to be the internal data format, combined with |
| 24 | <span class="application">libxml</span>, which assumes UTF-8 to be the |
| 25 | internal data format. The result is an application that treats internal |
| 26 | data differently, depending on which code section is executing. One part |
| 27 | of the code or the other will then inevitably misinterpret the data. |
MST 2004 John Fleck | d14bccc | 2004-02-15 01:57:42 +0000 | [diff] [blame] | 28 | </p></td></tr></table></div><p>This example constructs a simple document, then adds content provided |
MDT 2003 John Fleck | bc6734a | 2003-08-28 15:01:40 +0000 | [diff] [blame] | 29 | at the command line to the document's root element and outputs the results |
| 30 | to <tt class="filename">stdout</tt> in the proper encoding. For this example, we |
| 31 | use ISO-8859-1 encoding. The encoding of the string input at the command |
| 32 | line is converted from ISO-8859-1 to UTF-8. Full code: <a href="aph.html" title="H. Code for Encoding Conversion Example">Appendix H, <i>Code for Encoding Conversion Example</i></a></p><p>The conversion, encapsulated in the example code in the |
| 33 | <tt class="function">convert</tt> function, uses |
| 34 | <span class="application">libxml's</span> |
| 35 | <tt class="function">xmlFindCharEncodingHandler</tt> function: |
| 36 | </p><pre class="programlisting"> |
| 37 | <a name="handlerdatatype"></a><img src="images/callouts/1.png" alt="1" border="0">xmlCharEncodingHandlerPtr handler; |
| 38 | <a name="calcsize"></a><img src="images/callouts/2.png" alt="2" border="0">size = (int)strlen(in)+1; |
| 39 | out_size = size*2-1; |
| 40 | out = malloc((size_t)out_size); |
| 41 | |
| 42 | … |
| 43 | <a name="findhandlerfunction"></a><img src="images/callouts/3.png" alt="3" border="0">handler = xmlFindCharEncodingHandler(encoding); |
| 44 | … |
| 45 | <a name="callconversionfunction"></a><img src="images/callouts/4.png" alt="4" border="0">handler->input(out, &out_size, in, &temp); |
| 46 | … |
| 47 | <a name="outputencoding"></a><img src="images/callouts/5.png" alt="5" border="0">xmlSaveFormatFileEnc("-", doc, encoding, 1); |
| 48 | </pre><p> |
| 49 | </p><div class="calloutlist"><table border="0" summary="Callout list"><tr><td width="5%" valign="top" align="left"><a href="#handlerdatatype"><img src="images/callouts/1.png" alt="1" border="0"></a> </td><td valign="top" align="left"><p><tt class="varname">handler</tt> is declared as a pointer to an |
| 50 | <tt class="function">xmlCharEncodingHandler</tt> structure, which holds the conversion functions for an encoding.</p></td></tr><tr><td width="5%" valign="top" align="left"><a href="#calcsize"><img src="images/callouts/2.png" alt="2" border="0"></a> </td><td valign="top" align="left"><p>The conversion function held in <tt class="varname">handler</tt> needs |
| 51 | to be given the sizes of the input and output buffers, which are |
| 52 | calculated here for strings <tt class="varname">in</tt> and |
| 53 | <tt class="varname">out</tt>.</p></td></tr><tr><td width="5%" valign="top" align="left"><a href="#findhandlerfunction"><img src="images/callouts/3.png" alt="3" border="0"></a> </td><td valign="top" align="left"><p><tt class="function">xmlFindCharEncodingHandler</tt> takes as its |
| 54 | argument the data's initial encoding and searches |
| 55 | <span class="application">libxml's</span> built-in set of conversion |
| 56 | handlers, returning a pointer to the matching handler, or NULL if none is |
| 57 | found.</p></td></tr><tr><td width="5%" valign="top" align="left"><a href="#callconversionfunction"><img src="images/callouts/4.png" alt="4" border="0"></a> </td><td valign="top" align="left"><p>The conversion function identified by <tt class="varname">handler</tt> |
| 58 | requires as its arguments pointers to the input and output strings, |
| 59 | along with pointers to the length of each. The lengths must be determined |
| 60 | separately by the application.</p></td></tr><tr><td width="5%" valign="top" align="left"><a href="#outputencoding"><img src="images/callouts/5.png" alt="5" border="0"></a> </td><td valign="top" align="left"><p>To output in a specified encoding rather than UTF-8, we use |
| 61 | <tt class="function">xmlSaveFormatFileEnc</tt>, specifying the |
| 62 | encoding.</p></td></tr></table></div><p> |
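The buffer calculation in the listing above (<tt class="varname">out_size = size*2-1</tt>) works for ISO-8859-1 input because each ISO-8859-1 byte becomes at most two bytes in UTF-8. The following is a minimal sketch of that arithmetic, not the code the tutorial uses: the helper name <tt class="function">latin1_to_utf8</tt> is hypothetical and is not part of <span class="application">libxml</span>, which performs the conversion through <tt class="varname">handler</tt> instead.
</p>

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper for illustration only; not a libxml function.
 * Converts a NUL-terminated ISO-8859-1 string to a newly allocated
 * UTF-8 string. Returns NULL if allocation fails; the caller frees
 * the result. */
char *latin1_to_utf8(const char *in)
{
    size_t size = strlen(in) + 1;    /* input bytes plus terminator */
    size_t out_size = size * 2 - 1;  /* worst case: 2 bytes per input byte,
                                        plus 1 for the terminator */
    unsigned char *out = malloc(out_size);
    size_t i, j = 0;

    if (out == NULL)
        return NULL;

    for (i = 0; in[i] != '\0'; i++) {
        unsigned char c = (unsigned char)in[i];
        if (c < 0x80) {
            /* ASCII range: identical in both encodings */
            out[j++] = c;
        } else {
            /* 0x80..0xFF: two-byte UTF-8 sequence */
            out[j++] = (unsigned char)(0xC0 | (c >> 6));
            out[j++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    out[j] = '\0';
    return (char *)out;
}
```

<p>
With this sketch, the four-byte ISO-8859-1 string "caf\xE9" becomes the five-byte UTF-8 string "caf\xC3\xA9". Other source encodings can need more than two bytes per character, so the same sizing rule should not be assumed for them.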
| 63 | </p></div><div class="navfooter"><hr><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="ar01s08.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="index.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="apa.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Retrieving Attributes </td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> A. Compilation</td></tr></table></div></body></html> |