.. _unicode-howto:

*****************
  Unicode HOWTO
*****************

:Release: 1.12

This HOWTO discusses Python's support for Unicode, and explains
various problems that people commonly encounter when trying to work
with Unicode.

Introduction to Unicode
=======================

History of Character Codes
--------------------------

In 1968, the American Standard Code for Information Interchange, better known by
its acronym ASCII, was standardized.  ASCII defined numeric codes for various
characters, with the numeric values running from 0 to 127.  For example, the
lowercase letter 'a' is assigned 97 as its code value.

ASCII was an American-developed standard, so it only defined unaccented
characters.  There was an 'e', but no 'é' or 'Í'.  This meant that languages
which required accented characters couldn't be faithfully represented in ASCII.
(Actually the missing accents matter for English, too, which contains words such
as 'naïve' and 'café', and some publications have house styles which require
spellings such as 'coöperate'.)

For a while people just wrote programs that didn't display accents.  I remember
looking at Apple ][ BASIC programs, published in French-language publications in
the mid-1980s, that had lines like these::

   PRINT "FICHIER EST COMPLETE."
   PRINT "CARACTERE NON ACCEPTE."

Those messages should contain accents, and they just look wrong to someone who
can read French.

In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
hold values ranging from 0 to 255.  ASCII codes only went up to 127, so some
machines assigned values between 128 and 255 to accented characters.  Different
machines had different codes, however, which led to problems exchanging files.
Eventually various commonly used sets of values for the 128--255 range emerged.
Some were true standards, defined by the International Standards Organization,
and some were *de facto* conventions that were invented by one company or
another and managed to catch on.

256 characters aren't very many.  For example, you can't fit both the accented
characters used in Western Europe and the Cyrillic alphabet used for Russian
into the 128--255 range because there are more than 128 such characters.

You could write files using different codes (all your Russian files in a coding
system called KOI8, all your French files in a different coding system called
Latin1), but what if you wanted to write a French document that quotes some
Russian text?  In the 1980s people began to want to solve this problem, and the
Unicode standardization effort began.

Unicode started out using 16-bit characters instead of 8-bit characters.  16
bits means you have 2^16 = 65,536 distinct values available, making it possible
to represent many different characters from many different alphabets; an initial
goal was to have Unicode contain the alphabets for every single human language.
It turns out that even 16 bits isn't enough to meet that goal, and the modern
Unicode specification uses a wider range of codes, 0 through 1,114,111
(``0x10FFFF`` in base 16).
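
Both ends of that range can be checked from Python itself; as a small sketch,
:func:`chr` accepts exactly the code points 0 through ``0x10FFFF`` and rejects
anything larger:

```python
# The highest valid Unicode code point is 0x10FFFF (1,114,111).
last = chr(0x10FFFF)       # valid, though currently unassigned
print(ord(last))           # 1114111

# One past the end of the range is rejected by chr().
try:
    chr(0x110000)
    in_range = True
except ValueError:
    in_range = False
print(in_range)            # False
```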

There's a related ISO standard, ISO 10646.  Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with the 1.1
revision of Unicode.

(This discussion of Unicode's history is highly simplified.  I don't think the
average Python programmer needs to worry about the historical details; consult
the Unicode consortium site listed in the References for more information.)


Definitions
-----------

A **character** is the smallest possible component of a text.  'A', 'B', 'C',
etc., are all different characters.  So are 'È' and 'Í'.  Characters are
abstractions, and vary depending on the language or context you're talking
about.  For example, the symbol for ohms (Ω) is usually drawn much like the
capital letter omega (Ω) in the Greek alphabet (they may even be the same in
some fonts), but these are two different characters that have different
meanings.

The Unicode standard describes how characters are represented by **code
points**.  A code point is an integer value, usually denoted in base 16.  In the
standard, a code point is written using the notation ``U+12CA`` to mean the
character with value ``0x12ca`` (4,810 decimal).  The Unicode standard contains
a lot of tables listing characters and their corresponding code points:

.. code-block:: none

   0061    'a'; LATIN SMALL LETTER A
   0062    'b'; LATIN SMALL LETTER B
   0063    'c'; LATIN SMALL LETTER C
   ...
   007B    '{'; LEFT CURLY BRACKET

Strictly, these definitions imply that it's meaningless to say 'this is
character ``U+12CA``'.  ``U+12CA`` is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'.  In
informal contexts, this distinction between code points and characters will
sometimes be forgotten.
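
The mapping between code points and character names can be explored with the
standard :mod:`unicodedata` module; a brief sketch:

```python
import unicodedata

# The code point U+12CA names the character ETHIOPIC SYLLABLE WI.
ch = chr(0x12CA)
print(unicodedata.name(ch))                        # ETHIOPIC SYLLABLE WI

# The reverse lookup, from name to character, also works.
print(unicodedata.lookup('LATIN SMALL LETTER A'))  # a
```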

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**.  The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used.  Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.


Encodings
---------

To summarize the previous section: a Unicode string is a sequence of code
points, which are numbers from 0 through ``0x10FFFF`` (1,114,111 decimal).  This
sequence needs to be represented as a set of bytes (meaning, values
from 0 through 255) in memory.  The rules for translating a Unicode string
into a sequence of bytes are called an **encoding**.

The first encoding you might think of is an array of 32-bit integers.  In this
representation, the string "Python" would look like this:

.. code-block:: none

      P           y           t           h           o           n
   0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
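
As a sketch, the same byte layout can be produced with Python's ``utf-32-le``
codec (four little-endian bytes per code point, no byte-order mark):

```python
# 'utf-32-le' stores each code point as four little-endian bytes,
# matching the table above byte for byte.
data = "Python".encode("utf-32-le")
print(len(data))     # 24 -- four bytes for each of the six characters
print(data[:4])      # b'P\x00\x00\x00'
```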

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space.  In most texts, the majority of the code points
   are less than 127, or less than 255, so a lot of space is occupied by ``0x00``
   bytes.  The above string takes 24 bytes compared to the 6 bytes needed for an
   ASCII representation.  Increased RAM usage doesn't matter too much (desktop
   computers have gigabytes of RAM, and strings aren't usually that large), but
   expanding our usage of disk and network bandwidth by a factor of 4 is
   intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new
   family of wide string functions would need to be used.

4. Many Internet standards are defined in terms of textual data, and can't
   handle content with embedded zero bytes.

Generally people don't use this encoding, instead choosing other
encodings that are more efficient and convenient.  UTF-8 is probably
the most commonly supported encoding; it will be discussed below.

Encodings don't have to handle every possible Unicode character, and most
encodings don't.  The rules for converting a Unicode string into the ASCII
encoding, for example, are simple; for each code point:

1. If the code point is < 128, each byte is the same as the value of the code
   point.

2. If the code point is 128 or greater, the Unicode string can't be represented
   in this encoding.  (Python raises a :exc:`UnicodeEncodeError` exception in this
   case.)
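
Both rules can be seen in action; encoding pure ASCII text succeeds, while any
code point of 128 or greater triggers the exception:

```python
print("abc".encode("ascii"))        # b'abc' -- code points below 128 pass through

# Any code point of 128 or greater triggers UnicodeEncodeError.
try:
    "caf\u00e9".encode("ascii")     # 'é' is code point 233
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)                    # False
```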

Latin-1, also known as ISO-8859-1, is a similar encoding.  Unicode code points
0--255 are identical to the Latin-1 values, so converting to this encoding simply
requires converting code points to byte values; if a code point larger than 255
is encountered, the string can't be encoded into Latin-1.
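
A quick sketch of that identity between Latin-1 byte values and the first 256
code points:

```python
s = "caf\u00e9"                     # 'é' is code point 0xE9 (233)
print(s.encode("latin-1"))          # b'caf\xe9' -- one byte per code point

# A code point above 255 cannot be represented in Latin-1.
try:
    "\u0394".encode("latin-1")      # GREEK CAPITAL LETTER DELTA, U+0394
    ok = True
except UnicodeEncodeError:
    ok = False
print(ok)                           # False
```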

Encodings don't have to be simple one-to-one mappings like Latin-1.  Consider
IBM's EBCDIC, which was used on IBM mainframes.  Letter values weren't in one
block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145
through 153.  If you wanted to use EBCDIC as an encoding, you'd probably use
some sort of lookup table to perform the conversion, but this is largely an
internal detail.

UTF-8 is one of the most commonly used encodings.  UTF stands for "Unicode
Transformation Format", and the '8' means that 8-bit numbers are used in the
encoding.  (There are also UTF-16 and UTF-32 encodings, but they are less
frequently used than UTF-8.)  UTF-8 uses the following rules:

1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or
   four bytes, where each byte of the sequence is between 128 and 255.
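
The variable-length behaviour is easy to observe by encoding characters from
different parts of the Unicode range; a small sketch:

```python
# One-, two-, three-, and four-byte UTF-8 sequences.
for ch in ("a", "\u00e9", "\u20ac", "\U00010348"):
    encoded = ch.encode("utf-8")
    print("U+%06X -> %d byte(s): %r" % (ord(ch), len(encoded), encoded))
```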

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a string of bytes containing no embedded zero
   bytes.  This avoids byte-ordering issues, and means UTF-8 strings can be
   processed by C functions such as ``strcpy()`` and sent through protocols that
   can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be
   represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize.  It's also unlikely that
   random 8-bit data will look like valid UTF-8.



References
----------

The `Unicode Consortium site <http://www.unicode.org>`_ has character charts, a
glossary, and PDF versions of the Unicode specification.  Be prepared for some
difficult reading.  `A chronology <http://www.unicode.org/history/>`_ of the
origin and development of Unicode is also available on the site.

To help understand the standard, Jukka Korpela has written `an introductory
guide <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>`_ to reading the
Unicode character tables.

Another `good introductory article <http://www.joelonsoftware.com/articles/Unicode.html>`_
was written by Joel Spolsky.
If this introduction didn't make things clear to you, you should try reading this
alternate article before continuing.

.. Jason Orendorff XXX http://www.jorendorff.com/articles/unicode/ is broken

Wikipedia entries are often helpful; see the entries for "`character encoding
<http://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
<http://en.wikipedia.org/wiki/UTF-8>`_, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.

The String Type
---------------

Since Python 3.0, the language has featured a :class:`str` type that contains
Unicode characters, meaning any string created using ``"unicode rocks!"``,
``'unicode rocks!'``, or the triple-quoted string syntax is stored as Unicode.

To insert a non-ASCII Unicode character, e.g., any letters with
accents, one can use escape sequences in their string literals as such::

   >>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
   '\u0394'
   >>> "\u0394"                          # Using a 16-bit hex value
   '\u0394'
   >>> "\U00000394"                      # Using a 32-bit hex value
   '\u0394'

In addition, one can create a string using the :func:`~bytes.decode` method of
:class:`bytes`.  This method takes an *encoding* argument, such as ``UTF-8``,
and optionally, an *errors* argument.

The *errors* argument specifies the response when the input string can't be
converted according to the encoding's rules.  Legal values for this argument are
``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
``U+FFFD``, ``REPLACEMENT CHARACTER``), or ``'ignore'`` (just leave the
character out of the Unicode result).
The following examples show the differences::

   >>> b'\x80abc'.decode("utf-8", "strict")  #doctest: +NORMALIZE_WHITESPACE
   Traceback (most recent call last):
       ...
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
     invalid start byte
   >>> b'\x80abc'.decode("utf-8", "replace")
   '\ufffdabc'
   >>> b'\x80abc'.decode("utf-8", "ignore")
   'abc'

(In the ``'replace'`` example, the undecodable byte has been replaced by
``U+FFFD``, the official Unicode replacement character.)

Encodings are specified as strings containing the encoding's name.  Python 3.2
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list.  Some encodings have multiple names; for
example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
the same encoding.
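
The alias table can be inspected through :func:`codecs.lookup`, which resolves
any registered name to its canonical codec; a small sketch:

```python
import codecs

# Each alias resolves to the same canonical codec.
for alias in ("latin-1", "iso_8859_1", "8859"):
    print(alias, "->", codecs.lookup(alias).name)
```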

One-character Unicode strings can also be created with the :func:`chr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point.  The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

   >>> chr(57344)
   '\ue000'
   >>> ord('\ue000')
   57344

Converting to Bytes
-------------------

The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
which returns a :class:`bytes` representation of the Unicode string, encoded in the
requested *encoding*.  The *errors* parameter is the same as the parameter of
the :meth:`~bytes.decode` method, with one additional possibility; as well as
``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case inserts a
question mark instead of the unencodable character), you can also pass
``'xmlcharrefreplace'`` which uses XML's character references.
The following example shows the different results::

   >>> u = chr(40960) + 'abcd' + chr(1972)
   >>> u.encode('utf-8')
   b'\xea\x80\x80abcd\xde\xb4'
   >>> u.encode('ascii')  #doctest: +NORMALIZE_WHITESPACE
   Traceback (most recent call last):
       ...
   UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
     position 0: ordinal not in range(128)
   >>> u.encode('ascii', 'ignore')
   b'abcd'
   >>> u.encode('ascii', 'replace')
   b'?abcd?'
   >>> u.encode('ascii', 'xmlcharrefreplace')
   b'&#40960;abcd&#1972;'

.. XXX mention the surrogate* error handlers

The low-level routines for registering and accessing the available encodings are
found in the :mod:`codecs` module.  However, the encoding and decoding functions
returned by this module are usually more low-level than is comfortable, so I'm
not going to describe the :mod:`codecs` module here.  If you need to implement a
completely new encoding, you'll need to learn about the :mod:`codecs` module
interfaces, but implementing encodings is a specialized task that also won't be
covered here.  Consult the Python documentation to learn more about this module.
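
One piece of the :mod:`codecs` machinery worth knowing about is the incremental
decoder, which buffers a partial multi-byte sequence between calls; a sketch:

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# Feed the three bytes of '€' (U+20AC) in two pieces; the decoder
# holds the incomplete sequence until the final byte arrives.
print(repr(decoder.decode(b"\xe2\x82")))   # ''
print(repr(decoder.decode(b"\xac")))       # '€'
```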


Unicode Literals in Python Source Code
--------------------------------------

In Python source code, specific Unicode code points can be written using the
``\u`` escape sequence, which is followed by four hex digits giving the code
point.  The ``\U`` escape sequence is similar, but expects eight hex digits,
not four::

   >>> s = "a\xac\u1234\u20ac\U00008000"
   ... #     ^^^^ two-digit hex escape
   ... #         ^^^^^^ four-digit Unicode escape
   ... #                     ^^^^^^^^^^ eight-digit Unicode escape
   >>> [ord(c) for c in s]
   [97, 172, 4660, 8364, 32768]

Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language.  You
can also assemble strings using the :func:`chr` built-in function, but this is
even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding.  You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost
any encoding if you declare the encoding being used.  This is done by including
a special comment as either the first or second line of the source file::

   #!/usr/bin/env python
   # -*- coding: latin-1 -*-

   u = 'abcdé'
   print(ord(u[-1]))

The syntax is inspired by Emacs's notation for specifying variables local to a
file.  Emacs supports many different variables, but Python only supports
'coding'.  The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention.  Python looks for
``coding: name`` or ``coding=name`` in the comment.

If you don't include such a comment, the default encoding used will be UTF-8 as
already mentioned.  See also :pep:`263` for more information.


Unicode Properties
------------------

The Unicode specification includes a database of information about code points.
For each defined code point, the information includes the character's
name, its category, the numeric value if applicable (Unicode has characters
representing the Roman numerals and fractions such as one-third and
four-fifths).  There are also properties related to the code point's use in
bidirectional text and other display-related properties.

The following program displays some information about several characters, and
prints the numeric value of one particular character::

   import unicodedata

   u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

   for i, c in enumerate(u):
       print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
       print(unicodedata.name(c))

   # Get numeric value of second character
   print(unicodedata.numeric(u[1]))

When run, this prints:

.. code-block:: none

   0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
   1 0bf2 No TAMIL NUMBER ONE THOUSAND
   2 0f84 Mn TIBETAN MARK HALANTA
   3 1770 Lo TAGBANWA LETTER SA
   4 33af So SQUARE RAD OVER S SQUARED
   1000.0

The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories.  To take the codes
from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other".  See
<http://www.unicode.org/reports/tr44/#General_Category_Values> for a
list of category codes.

References
----------

The :class:`str` type is described in the Python library reference at
:ref:`typesseq`.

The documentation for the :mod:`unicodedata` module.

The documentation for the :mod:`codecs` module.

Marc-André Lemburg gave a presentation at EuroPython 2002 titled "Python and
Unicode".  A PDF version of his slides is available at
<http://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>, and is an
excellent overview of the design of Python's Unicode features (based on Python
2, where the Unicode string type is called ``unicode`` and literals start with
``u``).


Reading and Writing Unicode Data
================================

Once you've written some code that works with Unicode data, the next problem is
input/output.  How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively.  XML parsers often return Unicode
data, for example.  Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket.  It's possible to do all the work
yourself: open a file, read an 8-bit bytes object from it, and convert the bytes
with ``bytes.decode(encoding)``.  However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes.  If you want to read the file in arbitrary-sized
chunks (say, 1k or 4k), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk.  One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2GB file, you need 2GB of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)
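
The failure mode is easy to reproduce by slicing an encoded string at an
unlucky boundary; this sketch assumes a chunk boundary that falls in the middle
of a three-byte character:

```python
data = "caf\u00e9 \u20ac2".encode("utf-8")   # '€' encodes to three bytes
chunk = data[:8]                             # cuts inside the '€' sequence

try:
    chunk.decode("utf-8")
    complete = True
except UnicodeDecodeError:
    complete = False
print(complete)     # False -- the trailing bytes are a partial character
```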

The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences.  The work of implementing this has already been
done for you: the built-in :func:`open` function can return a file-like object
that assumes the file's contents are in a specified encoding and accepts Unicode
parameters for methods such as :meth:`read` and :meth:`write`.  This works through
:func:`open`\'s *encoding* and *errors* parameters which are interpreted just
like those in :meth:`str.encode` and :meth:`bytes.decode`.

Reading Unicode from a file is therefore simple::

   with open('unicode.rst', encoding='utf-8') as f:
       for line in f:
           print(repr(line))

It's also possible to open files in update mode, allowing both reading and
writing::

   with open('test', encoding='utf-8', mode='w+') as f:
       f.write('\u4500 blah blah blah\n')
       f.seek(0)
       print(repr(f.readline()[:1]))

The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering.  Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read.  There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.

In some areas, it is also convention to use a "BOM" at the start of UTF-8
encoded files; the name is misleading since UTF-8 is not byte-order dependent.
The mark simply announces that the file is encoded in UTF-8.  Use the
'utf-8-sig' codec to automatically skip the mark if present for reading such
files.
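
The two codecs differ only in how they treat the three-byte mark; a small
sketch:

```python
with_bom = "abc".encode("utf-8-sig")
print(with_bom)                            # b'\xef\xbb\xbfabc' -- BOM prepended

# Plain utf-8 keeps the mark as a character; utf-8-sig strips it.
print(repr(with_bom.decode("utf-8")))      # '\ufeffabc'
print(repr(with_bom.decode("utf-8-sig")))  # 'abc'
```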


Unicode filenames
-----------------

Most of the operating systems in common use today support filenames that contain
arbitrary Unicode characters.  Usually this is implemented by converting the
Unicode string into some encoding that varies depending on the system.  For
example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on
Windows, Python uses the name "mbcs" to refer to whatever the currently
configured encoding is.  On Unix systems, there will only be a filesystem
encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
you haven't, the default encoding is ASCII.

The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother. When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::

    filename = 'filename\u4500abc'
    with open(filename, 'w') as f:
        f.write('blah\n')
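
If you want to see which encoding that implicit conversion will use, you can
ask the interpreter directly; the exact name varies by platform and
configuration:

```python
import sys

# Reported as a codec name, e.g. 'utf-8' on most modern systems
# or 'mbcs' on Windows.
print(sys.getfilesystemencoding())
```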

Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.

Function :func:`os.listdir`, which returns filenames, raises an issue: should it return
the Unicode version of filenames, or should it return bytes containing
the encoded versions? :func:`os.listdir` will do both, depending on whether you
provided the directory path as bytes or a Unicode string. If you pass a
Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing a byte
path will return the bytes versions of the filenames. For example,
assuming the default filesystem encoding is UTF-8, running the following
program::

    fn = 'filename\u4500abc'
    f = open(fn, 'w')
    f.close()

    import os
    print(os.listdir(b'.'))
    print(os.listdir('.'))

will produce the following output::

    amk:~$ python t.py
    [b'.svn', b'filename\xe4\x94\x80abc', ...]
    ['.svn', 'filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.

Note that on most occasions, the Unicode APIs should be used. The bytes APIs
should only be used on systems where undecodable file names can be present,
i.e. Unix systems.
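
When you do need to convert between the two forms, :func:`os.fsencode` and
:func:`os.fsdecode` (available since Python 3.2) apply the filesystem encoding
with the ``surrogateescape`` error handler, so even undecodable bytes survive
the round trip:

```python
import os

# A bytes filename as it might come back from os.listdir(b'.');
# under UTF-8 these bytes decode to 'filename\u4500abc'.
raw = b'filename\xe4\x94\x80abc'

# str form; any undecodable bytes become lone surrogates instead of raising
name = os.fsdecode(raw)

print(os.fsencode(name) == raw)   # True -- the round trip is lossless
```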


Tips for Writing Unicode-aware Programs
---------------------------------------

This section provides some suggestions on writing software that deals with
Unicode.

The most important tip is:

    Software should only work with Unicode strings internally, decoding the input
    data as soon as possible and encoding the output only at the end.

If you attempt to write processing functions that accept both Unicode and byte
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings. There is no automatic encoding or decoding: if
you do e.g. ``str + bytes``, a :exc:`TypeError` will be raised.
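
Both points can be demonstrated briefly: mixing the types fails immediately
rather than silently producing garbage, and the decode/encode steps belong at
the boundaries of the program:

```python
# Mixing str and bytes raises TypeError rather than guessing an encoding:
try:
    'one ' + b'two'
except TypeError:
    print('mixing str and bytes is an error')

# Decode explicitly at the boundary, then work purely with str:
raw = b'caf\xc3\xa9'              # e.g. bytes read from a socket
text = raw.decode('utf-8')        # decode as early as possible
print(text)                       # café
output = text.upper().encode('utf-8')   # encode only at the very end
print(output)                     # b'CAF\xc3\x89'
```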

When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database. If you're doing
this, be careful to check the decoded string, not the encoded bytes data;
some encodings may have interesting properties, such as not being bijective
or not being fully ASCII-compatible. This is especially true if the input
data also specifies the encoding, since the attacker can then choose a
clever way to hide malicious text in the encoded bytestream.
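
A minimal sketch of this advice (``ALLOWED`` and ``is_safe`` are illustrative
names, not part of any library, and the whitelist is deliberately small):
decode first, then validate the resulting characters, never the raw bytes.

```python
ALLOWED = set('abcdefghijklmnopqrstuvwxyz0123456789_-')

def is_safe(raw, encoding='utf-8'):
    """Decode first, then check the *characters*, never the raw bytes."""
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        return False          # undecodable input is rejected outright
    return all(ch in ALLOWED for ch in text.lower())

print(is_safe(b'report_2023-final'))   # True
print(is_safe(b'rm -rf /'))            # False: space and '/' are not allowed
```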


References
----------

The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
Applications in Python" are available at
<http://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
and discuss questions of character encodings as well as how to internationalize
and localize an application. These slides cover Python 2.x only.
Georg Brandl116aa622007-08-15 14:28:22 +0000596
597
Alexander Belopolsky93a6b132010-11-19 16:09:58 +0000598Acknowledgements
599================
Georg Brandl116aa622007-08-15 14:28:22 +0000600
601Thanks to the following people who have noted errors or offered suggestions on
602this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler,
603Marc-André Lemburg, Martin von Löwis, Chad Whitacre.

.. comment
   Revision History

   Version 1.0: posted August 5 2005.

   Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds
   several links.

   Version 1.02: posted August 16 2005. Corrects factual errors.

   Version 1.1: Feb-Nov 2008. Updates the document with respect to Python 3 changes.

   Version 1.11: posted June 20 2010. Notes that Python 3.x is not covered,
   and that the HOWTO only covers 2.x.

.. comment Describe Python 3.x support (new section? new document?)
.. comment Additional topic: building Python w/ UCS2 or UCS4 support
.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter

.. comment
   Original outline:

   - [ ] Unicode introduction
       - [ ] ASCII
       - [ ] Terms
           - [ ] Character
           - [ ] Code point
       - [ ] Encodings
           - [ ] Common encodings: ASCII, Latin-1, UTF-8
   - [ ] Unicode Python type
       - [ ] Writing unicode literals
           - [ ] Obscurity: -U switch
       - [ ] Built-ins
           - [ ] unichr()
           - [ ] ord()
           - [ ] unicode() constructor
       - [ ] Unicode type
           - [ ] encode(), decode() methods
   - [ ] Unicodedata module for character properties
   - [ ] I/O
       - [ ] Reading/writing Unicode data into files
           - [ ] Byte-order marks
           - [ ] Unicode filenames
   - [ ] Writing Unicode programs
       - [ ] Do everything in Unicode
       - [ ] Declaring source code encodings (PEP 263)
   - [ ] Other issues
       - [ ] Building Python (UCS2, UCS4)