Unicode HOWTO
================

**Version 1.02**

This HOWTO discusses Python's support for Unicode, and explains various
problems that people commonly encounter when trying to work with Unicode.

Introduction to Unicode
------------------------------

History of Character Codes
''''''''''''''''''''''''''''''

In 1968, the American Standard Code for Information Interchange,
better known by its acronym ASCII, was standardized.  ASCII defined
numeric codes for various characters, with the numeric values running from 0 to
127.  For example, the lowercase letter 'a' is assigned 97 as its code
value.

ASCII was an American-developed standard, so it only defined
unaccented characters.  There was an 'e', but no 'é' or 'Í'.  This
meant that languages which required accented characters couldn't be
faithfully represented in ASCII.  (Actually the missing accents matter
for English, too, which contains words such as 'naïve' and 'café', and some
publications have house styles which require spellings such as
'coöperate'.)

For a while people just wrote programs that didn't display accents.  I
remember looking at Apple ][ BASIC programs, published in French-language
publications in the mid-1980s, that had lines like these::

    PRINT "FICHIER EST COMPLETE."
    PRINT "CARACTERE NON ACCEPTE."

Those messages should contain accents, and they just look wrong to
someone who can read French.

In the 1980s, almost all personal computers were 8-bit, meaning that
bytes could hold values ranging from 0 to 255.  ASCII codes only went
up to 127, so some machines assigned values between 128 and 255 to
accented characters.  Different machines had different codes, however,
which led to problems exchanging files.  Eventually various commonly
used sets of values for the 128-255 range emerged.  Some were true
standards, defined by the International Organization for
Standardization, and some were **de facto** conventions that were
invented by one company or another and managed to catch on.

255 characters aren't very many.  For example, you can't fit
both the accented characters used in Western Europe and the Cyrillic
alphabet used for Russian into the 128-255 range because there are more than
127 such characters.

You could write files using different codes (all your Russian
files in a coding system called KOI8, all your French files in
a different coding system called Latin1), but what if you wanted
to write a French document that quotes some Russian text?  In the
1980s people began to want to solve this problem, and the Unicode
standardization effort began.

Unicode started out using 16-bit characters instead of 8-bit characters.  16
bits means you have 2^16 = 65,536 distinct values available, making it
possible to represent many different characters from many different
alphabets; an initial goal was to have Unicode contain the alphabets for
every single human language.  It turns out that even 16 bits isn't enough to
meet that goal, and the modern Unicode specification uses a wider range of
codes, 0-1,114,111 (0x10ffff in base-16).

There's a related ISO standard, ISO 10646.  Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with
the 1.1 revision of Unicode.

(This discussion of Unicode's history is highly simplified.  I don't
think the average Python programmer needs to worry about the
historical details; consult the Unicode consortium site listed in the
References for more information.)


Definitions
''''''''''''''''''''''''

A **character** is the smallest possible component of a text.  'A',
'B', 'C', etc., are all different characters.  So are 'È' and
'Í'.  Characters are abstractions, and vary depending on the
language or context you're talking about.  For example, the symbol for
ohms (Ω) is usually drawn much like the capital letter
omega (Ω) in the Greek alphabet (they may even be the same in
some fonts), but these are two different characters that have
different meanings.

The Unicode standard describes how characters are represented by
**code points**.  A code point is an integer value, usually denoted in
base 16.  In the standard, a code point is written using the notation
U+12ca to mean the character with value 0x12ca (4810 decimal).  The
Unicode standard contains a lot of tables listing characters and their
corresponding code points::

    0061    'a'; LATIN SMALL LETTER A
    0062    'b'; LATIN SMALL LETTER B
    0063    'c'; LATIN SMALL LETTER C
    ...
    007B    '{'; LEFT CURLY BRACKET

Strictly, these definitions imply that it's meaningless to say 'this is
character U+12ca'.  U+12ca is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'.
In informal contexts, this distinction between code points and characters will
sometimes be forgotten.

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**.  The glyph for an uppercase A, for
example, is two diagonal strokes and a horizontal stroke, though the exact
details will depend on the font being used.  Most Python code doesn't need
to worry about glyphs; figuring out the correct glyph to display is
generally the job of a GUI toolkit or a terminal's font renderer.


Encodings
'''''''''

To summarize the previous section:
a Unicode string is a sequence of code points, which are
numbers from 0 to 0x10ffff.  This sequence needs to be represented as
a set of bytes (meaning, values from 0-255) in memory.  The rules for
translating a Unicode string into a sequence of bytes are called an
**encoding**.

The first encoding you might think of is an array of 32-bit integers.
In this representation, the string "Python" would look like this::

       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using
it presents a number of problems.

1. It's not portable; different processors order the bytes
   differently.

2. It's very wasteful of space.  In most texts, the majority of the code
   points are less than 127, or less than 255, so a lot of space is occupied
   by zero bytes.  The above string takes 24 bytes compared to the 6
   bytes needed for an ASCII representation.  Increased RAM usage doesn't
   matter too much (desktop computers have megabytes of RAM, and strings
   aren't usually that large), but expanding our usage of disk and
   network bandwidth by a factor of 4 is intolerable.

3. It's not compatible with existing C functions such as ``strlen()``,
   so a new family of wide string functions would need to be used.

4. Many Internet standards are defined in terms of textual data, and
   can't handle content with embedded zero bytes.

Generally people don't use this encoding, choosing other encodings
that are more efficient and convenient.
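
The layout above can be reproduced with the ``struct`` module.  The
following is a minimal sketch (the helper name ``utf32le`` is invented
for illustration) that packs each code point as a 32-bit little-endian
integer:

```python
import struct

def utf32le(code_points):
    # Pack each code point as a 32-bit little-endian integer,
    # giving exactly the byte layout shown above.
    return b''.join(struct.pack('<I', cp) for cp in code_points)

data = utf32le([ord(c) for c in u"Python"])
# len(data) is 24: four bytes for each of the six characters
```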

Encodings don't have to handle every possible Unicode character, and
most encodings don't.  For example, Python's default encoding is the
'ascii' encoding.  The rules for converting a Unicode string into the
ASCII encoding are simple; for each code point:

1. If the code point is <128, each byte is the same as the value of the
   code point.

2. If the code point is 128 or greater, the Unicode string can't
   be represented in this encoding.  (Python raises a
   ``UnicodeEncodeError`` exception in this case.)

Latin-1, also known as ISO-8859-1, is a similar encoding.  Unicode
code points 0-255 are identical to the Latin-1 values, so converting
to this encoding simply requires converting code points to byte
values; if a code point larger than 255 is encountered, the string
can't be encoded into Latin-1.
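
Both sets of rules can be written out in a few lines of Python;
``encode_latin1`` here is a hypothetical helper for illustration, not
the codec machinery Python actually uses:

```python
def encode_latin1(code_points):
    # Latin-1: code points 0-255 map directly to byte values;
    # anything larger can't be represented in this encoding.
    # (The ASCII rules are the same, with 127 as the limit.)
    out = bytearray()
    for cp in code_points:
        if cp > 255:
            raise ValueError("code point %d can't be encoded in Latin-1" % cp)
        out.append(cp)
    return bytes(out)
```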

Encodings don't have to be simple one-to-one mappings like Latin-1.
Consider IBM's EBCDIC, which was used on IBM mainframes.  Letter
values weren't in one block: 'a' through 'i' had values from 129 to
137, but 'j' through 'r' were 145 through 153.  If you wanted to use
EBCDIC as an encoding, you'd probably use some sort of lookup table to
perform the conversion, but this is largely an internal detail.

UTF-8 is one of the most commonly used encodings.  UTF stands for
"Unicode Transformation Format", and the '8' means that 8-bit numbers
are used in the encoding.  (There's also a UTF-16 encoding, but it's
less frequently used than UTF-8.)  UTF-8 uses the following rules:

1. If the code point is <128, it's represented by the corresponding byte value.
2. If the code point is between 128 and 0x7ff, it's turned into two byte values
   between 128 and 255.
3. Code points >0x7ff are turned into three- or four-byte sequences, where
   each byte of the sequence is between 128 and 255.

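As a rough sketch of how these rules split a code point's bits across
lead and continuation bytes (``utf8_bytes`` is an invented name, and a
real codec also rejects surrogates and overlong forms):

```python
def utf8_bytes(cp):
    # Split a code point's bits across one to four bytes; continuation
    # bytes always fall in the range 0x80-0xBF (128-191).
    if cp < 0x80:
        return [cp]
    elif cp <= 0x7ff:
        return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]
    elif cp <= 0xffff:
        return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f),
                0x80 | (cp & 0x3f)]
    else:
        return [0xf0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3f),
                0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)]
```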
UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a string of bytes containing no
   embedded zero bytes.  This avoids byte-ordering issues, and means
   UTF-8 strings can be processed by C functions such as ``strcpy()``
   and sent through protocols that can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of code points are turned
   into two bytes, and values less than 128 occupy only a single byte.
5. If bytes are corrupted or lost, it's possible to determine the start
   of the next UTF-8-encoded code point and resynchronize.  It's also
   unlikely that random 8-bit data will look like valid UTF-8.



References
''''''''''''''

The Unicode Consortium site at <http://www.unicode.org> has character
charts, a glossary, and PDF versions of the Unicode specification.  Be
prepared for some difficult reading.
<http://www.unicode.org/history/> is a chronology of the origin and
development of Unicode.

To help understand the standard, Jukka Korpela has written an
introductory guide to reading the Unicode character tables,
available at <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.

Roman Czyborra wrote another explanation of Unicode's basic principles;
it's at <http://czyborra.com/unicode/characters.html>.
Czyborra has written a number of other Unicode-related documents,
available from <http://www.czyborra.com>.

Two other good introductory articles were written by Joel Spolsky
<http://www.joelonsoftware.com/articles/Unicode.html> and Jason
Orendorff <http://www.jorendorff.com/articles/unicode/>.  If this
introduction didn't make things clear to you, you should try reading
one of these alternate articles before continuing.

Wikipedia entries are often helpful; see the entries for "character
encoding" <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
<http://en.wikipedia.org/wiki/UTF-8>, for example.


Python's Unicode Support
------------------------

Now that you've learned the rudiments of Unicode, we can look at
Python's Unicode features.


The Unicode Type
'''''''''''''''''''

Unicode strings are expressed as instances of the ``unicode`` type,
one of Python's repertoire of built-in types.  It derives from an
abstract type called ``basestring``, which is also an ancestor of the
``str`` type; you can therefore check if a value is a string type with
``isinstance(value, basestring)``.  Under the hood, Python represents
Unicode strings as either 16- or 32-bit integers, depending on how the
Python interpreter was compiled.

The ``unicode()`` constructor has the signature ``unicode(string[, encoding, errors])``.
All of its arguments should be 8-bit strings.  The first argument is converted
to Unicode using the specified encoding; if you leave off the ``encoding`` argument,
the ASCII encoding is used for the conversion, so characters greater than 127 will
be treated as errors::

    >>> unicode('abcdef')
    u'abcdef'
    >>> s = unicode('abcdef')
    >>> type(s)
    <type 'unicode'>
    >>> unicode('abcdef' + chr(255))
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
                        ordinal not in range(128)

The ``errors`` argument specifies the response when the input string can't be
converted according to the encoding's rules.  Legal values for this argument
are 'strict' (raise a ``UnicodeDecodeError`` exception),
'replace' (add U+FFFD, 'REPLACEMENT CHARACTER'),
or 'ignore' (just leave the character out of the Unicode result).
The following examples show the differences::

    >>> unicode('\x80abc', errors='strict')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
                        ordinal not in range(128)
    >>> unicode('\x80abc', errors='replace')
    u'\ufffdabc'
    >>> unicode('\x80abc', errors='ignore')
    u'abc'

Encodings are specified as strings containing the encoding's name.
Python 2.4 comes with roughly 100 different encodings; see the Python
Library Reference at
<http://docs.python.org/lib/standard-encodings.html> for a list.  Some
encodings have multiple names; for example, 'latin-1', 'iso_8859_1'
and '8859' are all synonyms for the same encoding.

One-character Unicode strings can also be created with the
``unichr()`` built-in function, which takes integers and returns a
Unicode string of length 1 that contains the corresponding code point.
The reverse operation is the built-in ``ord()`` function that takes a
one-character Unicode string and returns the code point value::

    >>> unichr(40960)
    u'\ua000'
    >>> ord(u'\ua000')
    40960

Instances of the ``unicode`` type have many of the same methods as
the 8-bit string type for operations such as searching and formatting::

    >>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
    >>> s.count('e')
    5
    >>> s.find('feather')
    9
    >>> s.find('bird')
    -1
    >>> s.replace('feather', 'sand')
    u'Was ever sand so lightly blown to and fro as this multitude?'
    >>> s.upper()
    u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'

Note that the arguments to these methods can be Unicode strings or 8-bit strings.
8-bit strings will be converted to Unicode before carrying out the operation;
Python's default ASCII encoding will be used, so characters greater than 127
will cause an exception::

    >>> s.find('Was\x9f')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3:
                        ordinal not in range(128)
    >>> s.find(u'Was\x9f')
    -1

Much Python code that operates on strings will therefore work with
Unicode strings without requiring any changes to the code.  (Input and
output code needs more updating for Unicode; more on this later.)

Another important method is ``.encode([encoding], [errors='strict'])``,
which returns an 8-bit string version of the
Unicode string, encoded in the requested encoding.  The ``errors``
parameter is the same as the parameter of the ``unicode()``
constructor, with one additional possibility; as well as 'strict',
'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which
uses XML's character references.  The following example shows the
different results::

    >>> u = unichr(40960) + u'abcd' + unichr(1972)
    >>> u.encode('utf-8')
    '\xea\x80\x80abcd\xde\xb4'
    >>> u.encode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in
                        position 0: ordinal not in range(128)
    >>> u.encode('ascii', 'ignore')
    'abcd'
    >>> u.encode('ascii', 'replace')
    '?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    '&#40960;abcd&#1972;'

Python's 8-bit strings have a ``.decode([encoding], [errors])`` method
that interprets the string using the given encoding::

    >>> u = unichr(40960) + u'abcd' + unichr(1972)   # Assemble a string
    >>> utf8_version = u.encode('utf-8')             # Encode as UTF-8
    >>> type(utf8_version), utf8_version
    (<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
    >>> u2 = utf8_version.decode('utf-8')            # Decode using UTF-8
    >>> u == u2                                      # The two strings match
    True

The low-level routines for registering and accessing the available
encodings are found in the ``codecs`` module.  However, the encoding
and decoding functions returned by this module are usually more
low-level than is comfortable, so I'm not going to describe the
``codecs`` module here.  If you need to implement a completely new
encoding, you'll need to learn about the ``codecs`` module interfaces,
but implementing encodings is a specialized task that also won't be
covered here.  Consult the Python documentation to learn more about
this module.

The most commonly used part of the ``codecs`` module is the
``codecs.open()`` function which will be discussed in the section
on input and output.


Unicode Literals in Python Source Code
''''''''''''''''''''''''''''''''''''''''''

In Python source code, Unicode literals are written as strings
prefixed with the 'u' or 'U' character: ``u'abcdefghijk'``.  Specific
code points can be written using the ``\u`` escape sequence, which is
followed by four hex digits giving the code point.  The ``\U`` escape
sequence is similar, but expects 8 hex digits, not 4.

Unicode literals can also use the same escape sequences as 8-bit
strings, including ``\x``, but ``\x`` only takes two hex digits so it
can't express an arbitrary code point.  Octal escapes can go up to
U+01ff, which is octal 777.

::

    >>> s = u"a\xac\u1234\u20ac\U00008000"
               ^^^^ two-digit hex escape
                   ^^^^^^ four-digit Unicode escape
                         ^^^^^^^^^^ eight-digit Unicode escape
    >>> for c in s: print ord(c),
    ...
    97 172 4660 8364 32768

Using escape sequences for code points greater than 127 is fine in
small doses, but becomes an annoyance if you're using many accented
characters, as you would in a program with messages in French or some
other accent-using language.  You can also assemble strings using the
``unichr()`` built-in function, but this is even more tedious.

Ideally, you'd want to be able to write literals in your language's
natural encoding.  You could then edit Python source code with your
favorite editor which would display the accented characters naturally,
and have the right characters used at runtime.

Python supports writing Unicode literals in any encoding, but you have
to declare the encoding being used.  This is done by including a
special comment as either the first or second line of the source
file::

    #!/usr/bin/env python
    # -*- coding: latin-1 -*-

    u = u'abcdé'
    print ord(u[-1])

The syntax is inspired by Emacs's notation for specifying variables local to a file.
Emacs supports many different variables, but Python only supports 'coding'.
The ``-*-`` symbols indicate that the comment is special; within them,
you must supply the name ``coding`` and the name of your chosen encoding,
separated by ``':'``.

If you don't include such a comment, the default encoding used will be
ASCII.  Versions of Python before 2.4 were Euro-centric and assumed
Latin-1 as a default encoding for string literals; in Python 2.4,
characters greater than 127 still work but result in a warning.  For
example, the following program has no encoding declaration::

    #!/usr/bin/env python
    u = u'abcdé'
    print ord(u[-1])

When you run it with Python 2.4, it will output the following warning::

    amk:~$ python p263.py
    sys:1: DeprecationWarning: Non-ASCII character '\xe9'
         in file p263.py on line 2, but no encoding declared;
         see http://www.python.org/peps/pep-0263.html for details



Unicode Properties
'''''''''''''''''''

The Unicode specification includes a database of information about
code points.  For each code point that's defined, the information
includes the character's name, its category, the numeric value if
applicable (Unicode has characters representing the Roman numerals and
fractions such as one-third and four-fifths).  There are also
properties related to the code point's use in bidirectional text and
other display-related properties.

The following program displays some information about several
characters, and prints the numeric value of one particular character::

    import unicodedata

    u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)

    for i, c in enumerate(u):
        print i, '%04x' % ord(c), unicodedata.category(c),
        print unicodedata.name(c)

    # Get numeric value of second character
    print unicodedata.numeric(u[1])

When run, this prints::

    0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
    1 0bf2 No TAMIL NUMBER ONE THOUSAND
    2 0f84 Mn TIBETAN MARK HALANTA
    3 1770 Lo TAGBANWA LETTER SA
    4 33af So SQUARE RAD OVER S SQUARED
    1000.0

The category codes are abbreviations describing the nature of the
character.  These are grouped into categories such as "Letter",
"Number", "Punctuation", or "Symbol", which in turn are broken up into
subcategories.  To take the codes from the above output, ``'Ll'``
means 'Letter, lowercase', ``'No'`` means "Number, other", ``'Mn'`` is
"Mark, nonspacing", and ``'So'`` is "Symbol, other".  See
<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values>
for a list of category codes.

References
''''''''''''''

The Unicode and 8-bit string types are described in the Python library
reference at <http://docs.python.org/lib/typesseq.html>.

The documentation for the ``unicodedata`` module is at
<http://docs.python.org/lib/module-unicodedata.html>.

The documentation for the ``codecs`` module is at
<http://docs.python.org/lib/module-codecs.html>.

Marc-André Lemburg gave a presentation at EuroPython 2002
titled "Python and Unicode".  A PDF version of his slides
is available at <http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>,
and is an excellent overview of the design of Python's Unicode features.


Reading and Writing Unicode Data
----------------------------------------

Once you've written some code that works with Unicode data, the next
problem is input/output.  How do you get Unicode strings into your
program, and how do you convert Unicode into a form suitable for
storage or transmission?

It's possible that you may not need to do anything depending on your
input sources and output destinations; you should check whether the
libraries used in your application support Unicode natively.  XML
parsers often return Unicode data, for example.  Many relational
databases also support Unicode-valued columns and can return Unicode
values from an SQL query.

Unicode data is usually converted to a particular encoding before it
gets written to disk or sent over a socket.  It's possible to do all
the work yourself: open a file, read an 8-bit string from it, and
convert the string with ``unicode(str, encoding)``.  However, the
manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode
character can be represented by several bytes.  If you want to read
the file in arbitrary-sized chunks (say, 1K or 4K), you need to write
error-handling code to catch the case where only part of the bytes
encoding a single Unicode character are read at the end of a chunk.
One solution would be to read the entire file into memory and then
perform the decoding, but that prevents you from working with files
that are extremely large; if you need to read a 2Gb file, you need 2Gb
of RAM.  (More, really, since for at least a moment you'd need to have
both the encoded string and its Unicode version in memory.)
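
This partial-character problem is exactly what incremental decoding
solves.  As a sketch, the ``codecs.getincrementaldecoder()`` function
(added in Python 2.5) returns an object that buffers a trailing
partial sequence between chunks:

```python
import codecs

decoder = codecs.getincrementaldecoder('utf-8')()

# U+00E9 is the two bytes 0xC3 0xA9 in UTF-8; split them across chunks.
pieces = [decoder.decode(b'caf\xc3'), decoder.decode(b'\xa9 au lait')]
result = u''.join(pieces)
# The first call returns u'caf' and holds the lone 0xC3 byte back;
# the second call completes the character.
```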
543
544The solution would be to use the low-level decoding interface to catch
545the case of partial coding sequences. The work of implementing this
546has already been done for you: the ``codecs`` module includes a
547version of the ``open()`` function that returns a file-like object
548that assumes the file's contents are in a specified encoding and
549accepts Unicode parameters for methods such as ``.read()`` and
550``.write()``.
551
552The function's parameters are
553``open(filename, mode='rb', encoding=None, errors='strict', buffering=1)``. ``mode`` can be
554``'r'``, ``'w'``, or ``'a'``, just like the corresponding parameter to the
555regular built-in ``open()`` function; add a ``'+'`` to
556update the file. ``buffering`` is similarly
557parallel to the standard function's parameter.
558``encoding`` is a string giving
559the encoding to use; if it's left as ``None``, a regular Python file
560object that accepts 8-bit strings is returned. Otherwise, a wrapper
561object is returned, and data written to or read from the wrapper
562object will be converted as needed. ``errors`` specifies the action
563for encoding errors and can be one of the usual values of 'strict',
564'ignore', and 'replace'.
565
566Reading Unicode from a file is therefore simple::
567
568 import codecs
569 f = codecs.open('unicode.rst', encoding='utf-8')
570 for line in f:
571 print repr(line)
572
573It's also possible to open files in update mode,
574allowing both reading and writing::
575
576 f = codecs.open('test', encoding='utf-8', mode='w+')
577 f.write(u'\u4500 blah blah blah\n')
578 f.seek(0)
579 print repr(f.readline()[:1])
580 f.close()
581
582Unicode character U+FEFF is used as a byte-order mark (BOM),
583and is often written as the first character of a file in order
584to assist with autodetection of the file's byte ordering.
585Some encodings, such as UTF-16, expect a BOM to be present at
586the start of a file; when such an encoding is used,
587the BOM will be automatically written as the first character
588and will be silently dropped when the file is read. There are
589variants of these encodings, such as 'utf-16-le' and 'utf-16-be'
590for little-endian and big-endian encodings, that specify
591one particular byte ordering and don't
592skip the BOM.
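
For example, using the ``codecs`` module's BOM constants to observe
this behaviour:

```python
import codecs

data = u'abc'.encode('utf-16')       # BOM is written automatically
assert data.startswith(codecs.BOM)   # codecs.BOM is the native-order BOM

assert data.decode('utf-16') == u'abc'                    # BOM silently dropped
assert u'abc'.encode('utf-16-le') == b'a\x00b\x00c\x00'   # fixed order, no BOM
```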


Unicode filenames
'''''''''''''''''''''''''

Most of the operating systems in common use today support filenames
that contain arbitrary Unicode characters.  Usually this is
implemented by converting the Unicode string into some encoding that
varies depending on the system.  For example, MacOS X uses UTF-8 while
Windows uses a configurable encoding; on Windows, Python uses the name
"mbcs" to refer to whatever the currently configured encoding is.  On
Unix systems, there will only be a filesystem encoding if you've set
the ``LANG`` or ``LC_CTYPE`` environment variables; if you haven't,
the default encoding is ASCII.

The ``sys.getfilesystemencoding()`` function returns the encoding to
use on your current system, in case you want to do the encoding
manually, but there's not much reason to bother.  When opening a file
for reading or writing, you can usually just provide the Unicode
string as the filename, and it will be automatically converted to the
right encoding for you::

    filename = u'filename\u4500abc'
    f = open(filename, 'w')
    f.write('blah\n')
    f.close()

Functions in the ``os`` module such as ``os.stat()`` will also accept
Unicode filenames.

``os.listdir()``, which returns filenames, raises an issue: should it
return the Unicode version of filenames, or should it return 8-bit
strings containing the encoded versions?  ``os.listdir()`` will do
both, depending on whether you provided the directory path as an 8-bit
string or a Unicode string.  If you pass a Unicode string as the path,
filenames will be decoded using the filesystem's encoding and a list
of Unicode strings will be returned, while passing an 8-bit path will
return the 8-bit versions of the filenames.  For example, assuming the
default filesystem encoding is UTF-8, running the following program::

    fn = u'filename\u4500abc'
    f = open(fn, 'w')
    f.close()

    import os
    print os.listdir('.')
    print os.listdir(u'.')

will produce the following output::

    amk:~$ python t.py
    ['.svn', 'filename\xe4\x94\x80abc', ...]
    [u'.svn', u'filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list
contains the Unicode versions.


Tips for Writing Unicode-aware Programs
''''''''''''''''''''''''''''''''''''''''''''

This section provides some suggestions on writing software that
deals with Unicode.

The most important tip is:

    Software should only work with Unicode strings internally,
    converting to a particular encoding on output.

If you attempt to write processing functions that accept both
Unicode and 8-bit strings, you will find your program vulnerable to
bugs wherever you combine the two different kinds of strings.  Python's
default encoding is ASCII, so whenever a character with a value >127
is in the input data, you'll get a ``UnicodeDecodeError``
because that character can't be handled by the ASCII encoding.
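
Following that rule, a processing function's boundaries look something
like this sketch (``process`` and its transformation are invented for
illustration):

```python
def process(raw_bytes, encoding='utf-8'):
    text = raw_bytes.decode(encoding)    # decode once, at the input boundary
    text = text.upper()                  # all internal work is on Unicode
    return text.encode(encoding)         # encode once, on the way out
```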

It's easy to miss such problems if you only test your software
with data that doesn't contain any
accents; everything will seem to work, but there's actually a bug in your
program waiting for the first user who attempts to use characters >127.
A second tip, therefore, is:

    Include characters >127 and, even better, characters >255 in your
    test data.

When using data coming from a web browser or some other untrusted source,
a common technique is to check for illegal characters in a string
before using the string in a generated command line or storing it in a
database.  If you're doing this, be careful to check
the string once it's in the form that will be used or stored; it's
possible for encodings to be used to disguise characters.  This is especially
true if the input data also specifies the encoding;
many encodings leave the commonly checked-for characters alone,
but Python includes some encodings such as ``'base64'``
that modify every single character.

For example, let's say you have a content management system that takes a
Unicode filename, and you want to disallow paths with a '/' character.
You might write this code::

    def read_file(filename, encoding):
        if '/' in filename:
            raise ValueError("'/' not allowed in filenames")
        unicode_name = filename.decode(encoding)
        f = open(unicode_name, 'r')
        # ... return contents of file ...

However, if an attacker could specify the ``'base64'`` encoding,
they could pass ``'L2V0Yy9wYXNzd2Q='``, which is the base-64
encoded form of the string ``'/etc/passwd'``, to read a
system file.  The above code looks for ``'/'`` characters
in the encoded form and misses the dangerous character
in the resulting decoded form.
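
The fix is to perform the check on the decoded form, once the string
is in the shape that will actually be used; a sketch, with the
hypothetical ``safe_filename`` replacing the check in ``read_file``
above:

```python
def safe_filename(filename, encoding):
    # Decode first; then validate the string that will actually be used.
    unicode_name = filename.decode(encoding)
    if u'/' in unicode_name:
        raise ValueError("'/' not allowed in filenames")
    return unicode_name
```

With the ``'base64'`` attack above, the decoded form contains the
``'/'`` characters, so the check now rejects it.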

References
''''''''''''''

The PDF slides for Marc-André Lemburg's presentation "Writing
Unicode-aware Applications in Python" are available at
<http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
and discuss questions of character encodings as well as how to
internationalize and localize an application.


Revision History and Acknowledgements
------------------------------------------

Thanks to the following people who have noted errors or offered
suggestions on this article: Nicholas Bastin,
Marius Gedminas, Kent Johnson, Ken Krugler,
Marc-André Lemburg, Martin von Löwis.

Version 1.0: posted August 5 2005.

Version 1.01: posted August 7 2005.  Corrects factual and markup
errors; adds several links.

Version 1.02: posted August 16 2005.  Corrects factual errors.


.. comment Additional topic: building Python w/ UCS2 or UCS4 support
.. comment Describe obscure -U switch somewhere?
.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter

.. comment
   Original outline:

   - [ ] Unicode introduction
      - [ ] ASCII
      - [ ] Terms
         - [ ] Character
         - [ ] Code point
      - [ ] Encodings
         - [ ] Common encodings: ASCII, Latin-1, UTF-8
   - [ ] Unicode Python type
      - [ ] Writing unicode literals
         - [ ] Obscurity: -U switch
      - [ ] Built-ins
         - [ ] unichr()
         - [ ] ord()
         - [ ] unicode() constructor
      - [ ] Unicode type
         - [ ] encode(), decode() methods
   - [ ] Unicodedata module for character properties
   - [ ] I/O
      - [ ] Reading/writing Unicode data into files
         - [ ] Byte-order marks
         - [ ] Unicode filenames
   - [ ] Writing Unicode programs
      - [ ] Do everything in Unicode
      - [ ] Declaring source code encodings (PEP 263)
   - [ ] Other issues
      - [ ] Building Python (UCS2, UCS4)
766 - [ ] Building Python (UCS2, UCS4)