.. _unicode-howto:

*****************
  Unicode HOWTO
*****************

:Release: 1.12

This HOWTO discusses Python's support for Unicode, and explains
various problems that people commonly encounter when trying to work
with Unicode.

Introduction to Unicode
=======================

History of Character Codes
--------------------------

In 1968, the American Standard Code for Information Interchange, better known by
its acronym ASCII, was standardized.  ASCII defined numeric codes for various
characters, with the numeric values running from 0 to 127.  For example, the
lowercase letter 'a' is assigned 97 as its code value.

ASCII was an American-developed standard, so it only defined unaccented
characters.  There was an 'e', but no 'é' or 'Í'.  This meant that languages
which required accented characters couldn't be faithfully represented in ASCII.
(Actually the missing accents matter for English, too, which contains words such
as 'naïve' and 'café', and some publications have house styles which require
spellings such as 'coöperate'.)

For a while people just wrote programs that didn't display accents.
In the mid-1980s an Apple II BASIC program written by a French speaker
might have lines like these::

   PRINT "FICHIER EST COMPLETE."
   PRINT "CARACTERE NON ACCEPTE."

Those messages should contain accents (complété, caractère, accepté),
and they just look wrong to someone who can read French.

In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
hold values ranging from 0 to 255.  ASCII codes only went up to 127, so some
machines assigned values between 128 and 255 to accented characters.  Different
machines had different codes, however, which led to problems exchanging files.
Eventually various commonly used sets of values for the 128--255 range emerged.
Some were true standards, defined by the International Standards Organization,
and some were *de facto* conventions that were invented by one company or
another and managed to catch on.

256 characters aren't very many.  For example, you can't fit both the accented
characters used in Western Europe and the Cyrillic alphabet used for Russian
into the 128--255 range because there are more than 128 such characters.

You could write files using different codes (all your Russian files in a coding
system called KOI8, all your French files in a different coding system called
Latin1), but what if you wanted to write a French document that quotes some
Russian text?  In the 1980s people began to want to solve this problem, and the
Unicode standardization effort began.

Unicode started out using 16-bit characters instead of 8-bit characters.  16
bits means you have 2^16 = 65,536 distinct values available, making it possible
to represent many different characters from many different alphabets; an initial
goal was to have Unicode contain the alphabets for every single human language.
It turns out that even 16 bits isn't enough to meet that goal, and the modern
Unicode specification uses a wider range of codes, 0 through 1,114,111
(``0x10FFFF`` in base 16).

There's a related ISO standard, ISO 10646.  Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with the 1.1
revision of Unicode.

(This discussion of Unicode's history is highly simplified.  The
precise historical details aren't necessary for understanding how to
use Unicode effectively, but if you're curious, consult the Unicode
consortium site listed in the References or
the `Wikipedia entry for Unicode <http://en.wikipedia.org/wiki/Unicode#History>`_
for more information.)


Definitions
-----------

A **character** is the smallest possible component of a text.  'A', 'B', 'C',
etc., are all different characters.  So are 'È' and 'Í'.  Characters are
abstractions, and vary depending on the language or context you're talking
about.  For example, the symbol for ohms (Ω) is usually drawn much like the
capital letter omega (Ω) in the Greek alphabet (they may even be the same in
some fonts), but these are two different characters that have different
meanings.

The Unicode standard describes how characters are represented by **code
points**.  A code point is an integer value, usually denoted in base 16.  In the
standard, a code point is written using the notation ``U+12CA`` to mean the
character with value ``0x12ca`` (4,810 decimal).  The Unicode standard contains
a lot of tables listing characters and their corresponding code points:

.. code-block:: none

   0061    'a'; LATIN SMALL LETTER A
   0062    'b'; LATIN SMALL LETTER B
   0063    'c'; LATIN SMALL LETTER C
   ...
   007B    '{'; LEFT CURLY BRACKET

Strictly, these definitions imply that it's meaningless to say 'this is
character ``U+12CA``'.  ``U+12CA`` is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'.  In
informal contexts, this distinction between code points and characters will
sometimes be forgotten.
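
The correspondence between a code point and its character name can be checked
from the interpreter; a quick sketch using the :mod:`unicodedata` module
(covered in more detail later in this HOWTO)::

   >>> import unicodedata
   >>> unicodedata.name('\u12ca')
   'ETHIOPIC SYLLABLE WI'
   >>> hex(ord('\u12ca'))
   '0x12ca'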

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**.  The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used.  Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.


Encodings
---------

To summarize the previous section: a Unicode string is a sequence of code
points, which are numbers from 0 through ``0x10FFFF`` (1,114,111 decimal).  This
sequence needs to be represented as a set of bytes (meaning, values
from 0 through 255) in memory.  The rules for translating a Unicode string
into a sequence of bytes are called an **encoding**.

The first encoding you might think of is an array of 32-bit integers.  In this
representation, the string "Python" would look like this:

.. code-block:: none

       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space.  In most texts, the majority of the code points
   are less than 127, or less than 255, so a lot of space is occupied by ``0x00``
   bytes.  The above string takes 24 bytes compared to the 6 bytes needed for an
   ASCII representation.  Increased RAM usage doesn't matter too much (desktop
   computers have gigabytes of RAM, and strings aren't usually that large), but
   expanding our usage of disk and network bandwidth by a factor of 4 is
   intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new
   family of wide string functions would need to be used.

4. Many Internet standards are defined in terms of textual data, and can't
   handle content with embedded zero bytes.

Generally people don't use this encoding, instead choosing other
encodings that are more efficient and convenient.  UTF-8 is probably
the most commonly supported encoding; it will be discussed below.

Encodings don't have to handle every possible Unicode character, and most
encodings don't.  The rules for converting a Unicode string into the ASCII
encoding, for example, are simple; for each code point:

1. If the code point is < 128, each byte is the same as the value of the code
   point.

2. If the code point is 128 or greater, the Unicode string can't be represented
   in this encoding.  (Python raises a :exc:`UnicodeEncodeError` exception in this
   case.)

Latin-1, also known as ISO-8859-1, is a similar encoding.  Unicode code points
0--255 are identical to the Latin-1 values, so converting to this encoding simply
requires converting code points to byte values; if a code point larger than 255
is encountered, the string can't be encoded into Latin-1.
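
For example, here is how these two rules play out in Python, using the
:meth:`str.encode` method described later in this HOWTO (a quick sketch; the
exact wording of the exception message may vary between Python versions)::

   >>> 'é'.encode('latin-1')       # U+00E9 is within the 0--255 range
   b'\xe9'
   >>> '€'.encode('latin-1')       # U+20AC is larger than 255
   Traceback (most recent call last):
       ...
   UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)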

Encodings don't have to be simple one-to-one mappings like Latin-1.  Consider
IBM's EBCDIC, which was used on IBM mainframes.  Letter values weren't in one
block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145
through 153.  If you wanted to use EBCDIC as an encoding, you'd probably use
some sort of lookup table to perform the conversion, but this is largely an
internal detail.

UTF-8 is one of the most commonly used encodings.  UTF stands for "Unicode
Transformation Format", and the '8' means that 8-bit numbers are used in the
encoding.  (There are also UTF-16 and UTF-32 encodings, but they are less
frequently used than UTF-8.)  UTF-8 uses the following rules:

1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or
   four bytes, where each byte of the sequence is between 128 and 255.

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a string of bytes containing no embedded zero
   bytes.  This avoids byte-ordering issues, and means UTF-8 strings can be
   processed by C functions such as ``strcpy()`` and sent through protocols that
   can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be
   represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize.  It's also unlikely that
   random 8-bit data will look like valid UTF-8.
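
The variable width is easy to see from the interpreter; in this quick sketch,
the ASCII letter 'a' takes one byte, 'é' two, and '€' three::

   >>> 'a'.encode('utf-8'), 'é'.encode('utf-8'), '€'.encode('utf-8')
   (b'a', b'\xc3\xa9', b'\xe2\x82\xac')
   >>> len('aé€'), len('aé€'.encode('utf-8'))
   (3, 6)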


References
----------

The `Unicode Consortium site <http://www.unicode.org>`_ has character charts, a
glossary, and PDF versions of the Unicode specification.  Be prepared for some
difficult reading.  `A chronology <http://www.unicode.org/history/>`_ of the
origin and development of Unicode is also available on the site.

To help understand the standard, Jukka Korpela has written `an introductory
guide <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>`_ to reading the
Unicode character tables.

Another `good introductory article <http://www.joelonsoftware.com/articles/Unicode.html>`_
was written by Joel Spolsky.
If this introduction didn't make things clear to you, you should try
reading this alternate article before continuing.

Wikipedia entries are often helpful; see the entries for "`character encoding
<http://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
<http://en.wikipedia.org/wiki/UTF-8>`_, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.

The String Type
---------------

Since Python 3.0, the language features a :class:`str` type that contains
Unicode characters, meaning any string created using ``"unicode rocks!"``,
``'unicode rocks!'``, or the triple-quoted string syntax is stored as Unicode.

The default encoding for Python source code is UTF-8, so you can simply
include a Unicode character in a string literal::

   try:
       with open('/tmp/input.txt', 'r') as f:
           ...
   except IOError:
       # 'File not found' error message.
       print("Fichier non trouvé")

You can use a different encoding from UTF-8 by putting a specially-formatted
comment as the first or second line of the source code::

   # -*- coding: <encoding name> -*-

Side note: Python 3 also supports using Unicode characters in identifiers::

   répertoire = "/tmp/records.log"
   with open(répertoire, "w") as f:
       f.write("test\n")

If you can't enter a particular character in your editor or want to
keep the source code ASCII-only for some reason, you can also use
escape sequences in string literals.  (Depending on your system,
you may see the actual capital-delta glyph instead of a \u escape.) ::

   >>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
   '\u0394'
   >>> "\u0394"                          # Using a 16-bit hex value
   '\u0394'
   >>> "\U00000394"                      # Using a 32-bit hex value
   '\u0394'

In addition, one can create a string using the :func:`~bytes.decode` method of
:class:`bytes`.  This method takes an *encoding* argument, such as ``UTF-8``,
and optionally an *errors* argument.

The *errors* argument specifies the response when the input string can't be
converted according to the encoding's rules.  Legal values for this argument are
``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
``U+FFFD``, ``REPLACEMENT CHARACTER``), or ``'ignore'`` (just leave the
character out of the Unicode result).
The following examples show the differences::

   >>> b'\x80abc'.decode("utf-8", "strict")  #doctest: +NORMALIZE_WHITESPACE
   Traceback (most recent call last):
       ...
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
     invalid start byte
   >>> b'\x80abc'.decode("utf-8", "replace")
   '\ufffdabc'
   >>> b'\x80abc'.decode("utf-8", "ignore")
   'abc'

(In the ``'replace'`` example, the replacement character ``U+FFFD`` appears in
the result in its escaped form, ``\ufffd``.)

Encodings are specified as strings containing the encoding's name.  Python 3.2
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list.  Some encodings have multiple names; for
example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
the same encoding.
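
Because these names all select the same codec, encoding with any of them
produces identical bytes; a minimal check::

   >>> 'é'.encode('latin-1') == 'é'.encode('iso_8859_1') == 'é'.encode('8859')
   True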

One-character Unicode strings can also be created with the :func:`chr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point.  The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

   >>> chr(57344)
   '\ue000'
   >>> ord('\ue000')
   57344

Converting to Bytes
-------------------

The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
which returns a :class:`bytes` representation of the Unicode string, encoded in the
requested *encoding*.

The *errors* parameter is the same as the parameter of the
:meth:`~bytes.decode` method but supports a few more possible handlers.  As well as
``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case
inserts a question mark instead of the unencodable character), there is
also ``'xmlcharrefreplace'`` (inserts an XML character reference) and
``'backslashreplace'`` (inserts a ``\uNNNN`` escape sequence).

The following example shows the different results::

   >>> u = chr(40960) + 'abcd' + chr(1972)
   >>> u.encode('utf-8')
   b'\xea\x80\x80abcd\xde\xb4'
   >>> u.encode('ascii')  #doctest: +NORMALIZE_WHITESPACE
   Traceback (most recent call last):
       ...
   UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
     position 0: ordinal not in range(128)
   >>> u.encode('ascii', 'ignore')
   b'abcd'
   >>> u.encode('ascii', 'replace')
   b'?abcd?'
   >>> u.encode('ascii', 'xmlcharrefreplace')
   b'&#40960;abcd&#1972;'
   >>> u.encode('ascii', 'backslashreplace')
   b'\\ua000abcd\\u07b4'

The low-level routines for registering and accessing the available
encodings are found in the :mod:`codecs` module.  Implementing new
encodings also requires understanding the :mod:`codecs` module.
However, the encoding and decoding functions returned by this module
are usually more low-level than is comfortable, and writing new encodings
is a specialized task, so the module won't be covered in this HOWTO.
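
For a taste of that lower-level interface, here is a minimal sketch of looking
up a codec by name; note that the codec's ``encode`` function returns a
``(bytes, length consumed)`` pair rather than plain bytes::

   >>> import codecs
   >>> info = codecs.lookup('utf-8')   # returns a CodecInfo object
   >>> info.name
   'utf-8'
   >>> info.encode('café')
   (b'caf\xc3\xa9', 4)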


Unicode Literals in Python Source Code
--------------------------------------

In Python source code, specific Unicode code points can be written using the
``\u`` escape sequence, which is followed by four hex digits giving the code
point.  The ``\U`` escape sequence is similar, but expects eight hex digits,
not four::

   >>> s = "a\xac\u1234\u20ac\U00008000"
   ... #     ^^^^ two-digit hex escape
   ... #         ^^^^^^ four-digit Unicode escape
   ... #                     ^^^^^^^^^^ eight-digit Unicode escape
   >>> [ord(c) for c in s]
   [97, 172, 4660, 8364, 32768]

Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language.  You
can also assemble strings using the :func:`chr` built-in function, but this is
even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding.  You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost
any encoding if you declare the encoding being used.  This is done by including
a special comment as either the first or second line of the source file::

   #!/usr/bin/env python
   # -*- coding: latin-1 -*-

   u = 'abcdé'
   print(ord(u[-1]))

The syntax is inspired by Emacs's notation for specifying variables local to a
file.  Emacs supports many different variables, but Python only supports
'coding'.  The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention.  Python looks for
``coding: name`` or ``coding=name`` in the comment.

If you don't include such a comment, the default encoding used will be UTF-8 as
already mentioned.  See also :pep:`263` for more information.


Unicode Properties
------------------

The Unicode specification includes a database of information about code points.
For each defined code point, the information includes the character's
name, its category, the numeric value if applicable (Unicode has characters
representing the Roman numerals and fractions such as one-third and
four-fifths).  There are also properties related to the code point's use in
bidirectional text and other display-related properties.

The following program displays some information about several characters, and
prints the numeric value of one particular character::

   import unicodedata

   u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

   for i, c in enumerate(u):
       print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
       print(unicodedata.name(c))

   # Get numeric value of second character
   print(unicodedata.numeric(u[1]))

When run, this prints:

.. code-block:: none

   0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
   1 0bf2 No TAMIL NUMBER ONE THOUSAND
   2 0f84 Mn TIBETAN MARK HALANTA
   3 1770 Lo TAGBANWA LETTER SA
   4 33af So SQUARE RAD OVER S SQUARED
   1000.0

The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories.  To take the codes
from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other".  See
`the General Category Values section of the Unicode Character Database documentation <http://www.unicode.org/reports/tr44/#General_Category_Values>`_ for a
list of category codes.


Unicode Regular Expressions
---------------------------

The regular expressions supported by the :mod:`re` module can be provided
either as bytes or strings.  Some of the special character sequences such as
``\d`` and ``\w`` have different meanings depending on whether
the pattern is supplied as bytes or a string.  For example,
``\d`` will match the characters ``[0-9]`` in bytes but
in strings will match any character that's in the ``'Nd'`` category.

The string in this example has the number 57 written in both Thai and
Arabic numerals::

   import re
   p = re.compile(r'\d+')

   s = "Over \u0e55\u0e57 57 flavours"
   m = p.search(s)
   print(repr(m.group()))

When executed, ``\d+`` will match the Thai numerals and print them
out.  If you supply the :const:`re.ASCII` flag to
:func:`~re.compile`, ``\d+`` will match the substring "57" instead.
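
Continuing the snippet above, a brief sketch of the :const:`re.ASCII` variant::

   p_ascii = re.compile(r'\d+', re.ASCII)
   print(repr(p_ascii.search(s).group()))   # prints '57'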

Similarly, ``\w`` matches a wide variety of Unicode characters but
only ``[a-zA-Z0-9_]`` in bytes or if :const:`re.ASCII` is supplied,
and ``\s`` will match either Unicode whitespace characters or
``[ \t\n\r\f\v]``.


References
----------

.. comment should these be mentioned earlier, e.g. at the start of the "introduction to Unicode" first section?

Some good alternative discussions of Python's Unicode support are:

* `Processing Text Files in Python 3 <http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html>`_, by Nick Coghlan.
* `Pragmatic Unicode <http://nedbatchelder.com/text/unipain.html>`_, a PyCon 2012 presentation by Ned Batchelder.

The :class:`str` type is described in the Python library reference at
:ref:`textseq`.

The documentation for the :mod:`unicodedata` module.

The documentation for the :mod:`codecs` module.

Marc-André Lemburg gave `a presentation titled "Python and Unicode" (PDF slides) <http://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>`_ at
EuroPython 2002.  The slides are an excellent overview of the design
of Python 2's Unicode features (where the Unicode string type is
called ``unicode`` and literals start with ``u``).


Reading and Writing Unicode Data
================================

Once you've written some code that works with Unicode data, the next problem is
input/output.  How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively.  XML parsers often return Unicode
data, for example.  Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket.  It's possible to do all the work
yourself: open a file, read an 8-bit bytes object from it, and convert the bytes
with ``bytes.decode(encoding)``.  However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes.  If you want to read the file in arbitrary-sized
chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk.  One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2 GiB file, you need 2 GiB of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)
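
If you did need to handle the chunking yourself, the incremental decoders in
the :mod:`codecs` module buffer a partial multi-byte sequence until the rest of
it arrives; a minimal sketch::

   >>> import codecs
   >>> dec = codecs.getincrementaldecoder('utf-8')()
   >>> dec.decode(b'caf\xc3')   # last byte is only half of the 'é' sequence
   'caf'
   >>> dec.decode(b'\xa9')      # the remaining byte completes the character
   'é'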

The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences.  The work of implementing this has already been
done for you: the built-in :func:`open` function can return a file-like object
that assumes the file's contents are in a specified encoding and accepts Unicode
parameters for methods such as :meth:`read` and :meth:`write`.  This works through
:func:`open`\'s *encoding* and *errors* parameters which are interpreted just
like those in :meth:`str.encode` and :meth:`bytes.decode`.

Reading Unicode from a file is therefore simple::

   with open('unicode.rst', encoding='utf-8') as f:
       for line in f:
           print(repr(line))

It's also possible to open files in update mode, allowing both reading and
writing::

   with open('test', encoding='utf-8', mode='w+') as f:
       f.write('\u4500 blah blah blah\n')
       f.seek(0)
       print(repr(f.readline()[:1]))

The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering.  Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read.  There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.

In some areas, it is also convention to use a "BOM" at the start of UTF-8
encoded files; the name is misleading since UTF-8 is not byte-order dependent.
The mark simply announces that the file is encoded in UTF-8.  Use the
'utf-8-sig' codec to automatically skip the mark if present for reading such
files.
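
A quick sketch of the difference (the UTF-8 signature is the three bytes
``EF BB BF``)::

   >>> 'abc'.encode('utf-8-sig')                # the signature is prepended
   b'\xef\xbb\xbfabc'
   >>> b'\xef\xbb\xbfabc'.decode('utf-8-sig')   # and silently skipped
   'abc'
   >>> b'\xef\xbb\xbfabc'.decode('utf-8')       # plain utf-8 keeps it as U+FEFF
   '\ufeffabc'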


Unicode filenames
-----------------

Most of the operating systems in common use today support filenames that contain
arbitrary Unicode characters.  Usually this is implemented by converting the
Unicode string into some encoding that varies depending on the system.  For
example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on
Windows, Python uses the name "mbcs" to refer to whatever the currently
configured encoding is.  On Unix systems, there will only be a filesystem
encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
you haven't, the default encoding is UTF-8.

The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother.  When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::

   filename = 'filename\u4500abc'
   with open(filename, 'w') as f:
       f.write('blah\n')

Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.

The :func:`os.listdir` function returns filenames, which raises an issue: should it return
the Unicode version of filenames, or should it return bytes containing
the encoded versions?  :func:`os.listdir` will do both, depending on whether you
provided the directory path as bytes or a Unicode string.  If you pass a
Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing a byte
path will return the filenames as bytes.  For example,
assuming the default filesystem encoding is UTF-8, running the following
program::

   fn = 'filename\u4500abc'
   f = open(fn, 'w')
   f.close()

   import os
   print(os.listdir(b'.'))
   print(os.listdir('.'))

will produce the following output::

   amk:~$ python t.py
   [b'filename\xe4\x94\x80abc', ...]
   ['filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.

Note that on most occasions, the Unicode APIs should be used.  The bytes APIs
should only be used on systems where undecodable file names can be present,
i.e. Unix systems.


Tips for Writing Unicode-aware Programs
---------------------------------------

This section provides some suggestions on writing software that deals with
Unicode.

The most important tip is:

   Software should only work with Unicode strings internally, decoding the input
   data as soon as possible and encoding the output only at the end.

If you attempt to write processing functions that accept both Unicode and byte
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings.  There is no automatic encoding or decoding: if
you do e.g. ``str + bytes``, a :exc:`TypeError` will be raised.
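
For example (a quick interactive check)::

   >>> # 'caf' + b'\xc3\xa9' would raise TypeError; decode the bytes first:
   >>> 'caf' + b'\xc3\xa9'.decode('utf-8')
   'café'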

When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database.  If you're doing
this, be careful to check the decoded string, not the encoded bytes data;
some encodings may have interesting properties, such as not being bijective
or not being fully ASCII-compatible.  This is especially true if the input
data also specifies the encoding, since the attacker can then choose a
clever way to hide malicious text in the encoded bytestream.


Converting Between File Encodings
'''''''''''''''''''''''''''''''''

The :class:`~codecs.StreamRecoder` class can transparently convert between
encodings, taking a stream that returns data in encoding #1
and behaving like a stream returning data in encoding #2.

For example, if you have an input file *f* that's in Latin-1, you
can wrap it with a :class:`~codecs.StreamRecoder` to return bytes encoded in UTF-8::

   new_f = codecs.StreamRecoder(f,
       # en/decoder: used by read() to encode its results and
       # by write() to decode its input.
       codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

       # reader/writer: used to read and write to the stream.
       codecs.getreader('latin-1'), codecs.getwriter('latin-1') )


Files in an Unknown Encoding
''''''''''''''''''''''''''''

What can you do if you need to make a change to a file, but don't know
the file's encoding?  If you know the encoding is ASCII-compatible and
only want to examine or modify the ASCII parts, you can open the file
with the ``surrogateescape`` error handler::

   with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
       data = f.read()

   # make changes to the string 'data'

   with open(fname + '.new', 'w',
             encoding="ascii", errors="surrogateescape") as f:
       f.write(data)

The ``surrogateescape`` error handler will decode any non-ASCII bytes
as code points in a special range running from U+DC80 to U+DCFF.  These
code points will then be turned back into the same bytes when the
``surrogateescape`` error handler is used when encoding the data and
writing it back out.
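
A brief sketch of the round trip at the bytes level::

   >>> raw = b'abc\x80def'                    # 0x80 is not valid ASCII
   >>> text = raw.decode('ascii', 'surrogateescape')
   >>> text
   'abc\udc80def'
   >>> text.encode('ascii', 'surrogateescape') == raw
   True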


References
----------

One section of `Mastering Python 3 Input/Output <http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o>`_, a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.

The `PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware Applications in Python" <http://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>`_
discuss questions of character encodings as well as how to internationalize
and localize an application.  These slides cover Python 2.x only.

`The Guts of Unicode in Python <http://pyvideo.org/video/1768/the-guts-of-unicode-in-python>`_ is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode representation in Python 3.3.


Acknowledgements
================

The initial draft of this document was written by Andrew Kuchling.
It has since been revised further by Alexander Belopolsky, Georg Brandl,
Andrew Kuchling, and Ezio Melotti.

Thanks to the following people who have noted errors or offered
suggestions on this article: Éric Araujo, Nicholas Bastin, Nick
Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André
Lemburg, Martin von Löwis, Terry J. Reedy, Chad Whitacre.