.. _unicode-howto:

*****************
  Unicode HOWTO
*****************

:Release: 1.12

This HOWTO discusses Python support for Unicode, and explains
various problems that people commonly encounter when trying to work
with Unicode.

Introduction to Unicode
=======================

History of Character Codes
--------------------------

In 1968, the American Standard Code for Information Interchange, better known by
its acronym ASCII, was standardized. ASCII defined numeric codes for various
characters, with the numeric values running from 0 to 127. For example, the
lowercase letter 'a' is assigned 97 as its code value.

ASCII was an American-developed standard, so it only defined unaccented
characters. There was an 'e', but no 'é' or 'Í'. This meant that languages
which required accented characters couldn't be faithfully represented in ASCII.
(Actually the missing accents matter for English, too, which contains words such
as 'naïve' and 'café', and some publications have house styles which require
spellings such as 'coöperate'.)

For a while people just wrote programs that didn't display accents.
In the mid-1980s an Apple II BASIC program written by a French speaker
might have lines like these:

.. code-block:: basic

   PRINT "MISE A JOUR TERMINEE"
   PRINT "PARAMETRES ENREGISTRES"

Those messages should contain accents (terminée, paramètres, enregistrés) and
they just look wrong to someone who can read French.

43In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
44hold values ranging from 0 to 255. ASCII codes only went up to 127, so some
45machines assigned values between 128 and 255 to accented characters. Different
46machines had different codes, however, which led to problems exchanging files.
Alexander Belopolsky93a6b132010-11-19 16:09:58 +000047Eventually various commonly used sets of values for the 128--255 range emerged.
Jesse Gonzalez6fde7702017-04-27 00:12:17 -050048Some were true standards, defined by the International Organization for
49Standardization, and some were *de facto* conventions that were invented by one
50company or another and managed to catch on.
Georg Brandl116aa622007-08-15 14:28:22 +000051
256 characters aren't very many. For example, you can't fit both the accented
characters used in Western Europe and the Cyrillic alphabet used for Russian
into the 128--255 range because there are more than 128 such characters.

You could write files using different codes (all your Russian files in a coding
system called KOI8, all your French files in a different coding system called
Latin1), but what if you wanted to write a French document that quotes some
Russian text? In the 1980s people began to want to solve this problem, and the
Unicode standardization effort began.

Unicode started out using 16-bit characters instead of 8-bit characters. 16
bits means you have 2^16 = 65,536 distinct values available, making it possible
to represent many different characters from many different alphabets; an initial
goal was to have Unicode contain the alphabets for every single human language.
It turns out that even 16 bits isn't enough to meet that goal, and the modern
Unicode specification uses a wider range of codes, 0 through 1,114,111
(``0x10FFFF`` in base 16).

There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with the 1.1
revision of Unicode.

(This discussion of Unicode's history is highly simplified. The
precise historical details aren't necessary for understanding how to
use Unicode effectively, but if you're curious, consult the Unicode
consortium site listed in the References or
the `Wikipedia entry for Unicode <https://en.wikipedia.org/wiki/Unicode#History>`_
for more information.)


Definitions
-----------

A **character** is the smallest possible component of a text. 'A', 'B', 'C',
etc., are all different characters. So are 'È' and 'Í'. Characters are
abstractions, and vary depending on the language or context you're talking
about. For example, the symbol for ohms (Ω) is usually drawn much like the
capital letter omega (Ω) in the Greek alphabet (they may even be the same in
some fonts), but these are two different characters that have different
meanings.

The Unicode standard describes how characters are represented by **code
points**. A code point is an integer value, usually denoted in base 16. In the
standard, a code point is written using the notation ``U+12CA`` to mean the
character with value ``0x12ca`` (4,810 decimal). The Unicode standard contains
a lot of tables listing characters and their corresponding code points:

.. code-block:: none

   0061    'a'; LATIN SMALL LETTER A
   0062    'b'; LATIN SMALL LETTER B
   0063    'c'; LATIN SMALL LETTER C
   ...
   007B    '{'; LEFT CURLY BRACKET

Strictly, these definitions imply that it's meaningless to say 'this is
character ``U+12CA``'. ``U+12CA`` is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In
informal contexts, this distinction between code points and characters will
sometimes be forgotten.

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**. The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used. Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.


Encodings
---------

To summarize the previous section: a Unicode string is a sequence of code
points, which are numbers from 0 through ``0x10FFFF`` (1,114,111 decimal). This
sequence needs to be represented as a set of bytes (meaning, values
from 0 through 255) in memory. The rules for translating a Unicode string
into a sequence of bytes are called an **encoding**.

The first encoding you might think of is an array of 32-bit integers. In this
representation, the string "Python" would look like this:

.. code-block:: none

       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space. In most texts, the majority of the code points
   are less than 127, or less than 255, so a lot of space is occupied by ``0x00``
   bytes. The above string takes 24 bytes compared to the 6 bytes needed for an
   ASCII representation. Increased RAM usage doesn't matter too much (desktop
   computers have gigabytes of RAM, and strings aren't usually that large), but
   expanding our usage of disk and network bandwidth by a factor of 4 is
   intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new
   family of wide string functions would need to be used.

4. Many Internet standards are defined in terms of textual data, and can't
   handle content with embedded zero bytes.

Generally people don't use this encoding, instead choosing other
encodings that are more efficient and convenient. UTF-8 is probably
the most commonly supported encoding; it will be discussed below.

Encodings don't have to handle every possible Unicode character, and most
encodings don't. The rules for converting a Unicode string into the ASCII
encoding, for example, are simple; for each code point:

1. If the code point is < 128, each byte is the same as the value of the code
   point.

2. If the code point is 128 or greater, the Unicode string can't be represented
   in this encoding. (Python raises a :exc:`UnicodeEncodeError` exception in this
   case.)

Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points
0--255 are identical to the Latin-1 values, so converting to this encoding simply
requires converting code points to byte values; if a code point larger than 255
is encountered, the string can't be encoded into Latin-1.
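
For example, a quick doctest-style sketch of the rules above (the exact
wording of the error message may vary between Python versions)::

    >>> 'abc'.encode('ascii')        # every code point is < 128
    b'abc'
    >>> 'café'.encode('latin-1')     # 'é' is U+00E9, which fits in 0--255
    b'caf\xe9'
    >>> 'café'.encode('ascii')       # ...but it is >= 128, so ASCII fails
    Traceback (most recent call last):
        ...
    UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in
      position 3: ordinal not in range(128)
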

Encodings don't have to be simple one-to-one mappings like Latin-1. Consider
IBM's EBCDIC, which was used on IBM mainframes. Letter values weren't in one
block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145
through 153. If you wanted to use EBCDIC as an encoding, you'd probably use
some sort of lookup table to perform the conversion, but this is largely an
internal detail.

UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode
Transformation Format", and the '8' means that 8-bit numbers are used in the
encoding. (There are also UTF-16 and UTF-32 encodings, but they are less
frequently used than UTF-8.) UTF-8 uses the following rules:

1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or
   four bytes, where each byte of the sequence is between 128 and 255.

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes containing no embedded zero
   bytes. This avoids byte-ordering issues, and means UTF-8 strings can be
   processed by C functions such as ``strcpy()`` and sent through protocols that
   can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be
   represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize. It's also unlikely that
   random 8-bit data will look like valid UTF-8.
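
To make the variable-width rule concrete, here is a small illustrative
sketch showing characters that need one, two, three, and four bytes in
UTF-8::

    >>> 'a'.encode('utf-8')            # U+0061: one byte
    b'a'
    >>> 'é'.encode('utf-8')            # U+00E9: two bytes
    b'\xc3\xa9'
    >>> '\u20ac'.encode('utf-8')       # U+20AC: three bytes
    b'\xe2\x82\xac'
    >>> '\U0001f600'.encode('utf-8')   # U+1F600: four bytes
    b'\xf0\x9f\x98\x80'
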


References
----------

The `Unicode Consortium site <http://www.unicode.org>`_ has character charts, a
glossary, and PDF versions of the Unicode specification. Be prepared for some
difficult reading. `A chronology <http://www.unicode.org/history/>`_ of the
origin and development of Unicode is also available on the site.

To help understand the standard, Jukka Korpela has written `an introductory
guide <http://jkorpela.fi/unicode/guide.html>`_ to reading the
Unicode character tables.

Another `good introductory article <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/>`_
was written by Joel Spolsky.
If this introduction didn't make things clear to you, you should try
reading this alternate article before continuing.

Wikipedia entries are often helpful; see the entries for "`character encoding
<https://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
<https://en.wikipedia.org/wiki/UTF-8>`_, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.

The String Type
---------------

Since Python 3.0, the language features a :class:`str` type that contains Unicode
characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
rocks!'``, or the triple-quoted string syntax is stored as Unicode.

The default encoding for Python source code is UTF-8, so you can simply
include a Unicode character in a string literal::

    try:
        with open('/tmp/input.txt', 'r') as f:
            ...
    except OSError:
        # 'File not found' error message.
        print("Fichier non trouvé")

You can use a different encoding from UTF-8 by putting a specially-formatted
comment as the first or second line of the source code::

    # -*- coding: <encoding name> -*-

Side note: Python 3 also supports using Unicode characters in identifiers::

    répertoire = "/tmp/records.log"
    with open(répertoire, "w") as f:
        f.write("test\n")

If you can't enter a particular character in your editor or want to
keep the source code ASCII-only for some reason, you can also use
escape sequences in string literals. (Depending on your system,
you may see the actual capital-delta glyph instead of a \u escape.) ::

    >>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
    '\u0394'
    >>> "\u0394"                          # Using a 16-bit hex value
    '\u0394'
    >>> "\U00000394"                      # Using a 32-bit hex value
    '\u0394'

In addition, one can create a string using the :func:`~bytes.decode` method of
:class:`bytes`. This method takes an *encoding* argument, such as ``UTF-8``,
and optionally an *errors* argument.

The *errors* argument specifies the response when the input string can't be
converted according to the encoding's rules. Legal values for this argument are
``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
``U+FFFD``, ``REPLACEMENT CHARACTER``), ``'ignore'`` (just leave the
character out of the Unicode result), or ``'backslashreplace'`` (inserts a
``\xNN`` escape sequence).
The following examples show the differences::

    >>> b'\x80abc'.decode("utf-8", "strict")  #doctest: +NORMALIZE_WHITESPACE
    Traceback (most recent call last):
        ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
      invalid start byte
    >>> b'\x80abc'.decode("utf-8", "replace")
    '\ufffdabc'
    >>> b'\x80abc'.decode("utf-8", "backslashreplace")
    '\\x80abc'
    >>> b'\x80abc'.decode("utf-8", "ignore")
    'abc'

Encodings are specified as strings containing the encoding's name. Python 3.2
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list. Some encodings have multiple names; for
example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
the same encoding.
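
As an illustrative sketch, you can check that two names really are aliases
by decoding the same bytes with each, or by asking the :mod:`codecs` module
for the canonical name::

    >>> b'caf\xe9'.decode('latin-1') == b'caf\xe9'.decode('iso_8859_1')
    True
    >>> import codecs
    >>> codecs.lookup('latin-1').name
    'iso8859-1'
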

One-character Unicode strings can also be created with the :func:`chr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point. The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

    >>> chr(57344)
    '\ue000'
    >>> ord('\ue000')
    57344

Converting to Bytes
-------------------

The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
which returns a :class:`bytes` representation of the Unicode string, encoded in the
requested *encoding*.

The *errors* parameter is the same as the parameter of the
:meth:`~bytes.decode` method but supports a few more possible handlers. As well as
``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case
inserts a question mark instead of the unencodable character), there is
also ``'xmlcharrefreplace'`` (inserts an XML character reference),
``'backslashreplace'`` (inserts a ``\uNNNN`` escape sequence) and
``'namereplace'`` (inserts a ``\N{...}`` escape sequence).

The following example shows the different results::

    >>> u = chr(40960) + 'abcd' + chr(1972)
    >>> u.encode('utf-8')
    b'\xea\x80\x80abcd\xde\xb4'
    >>> u.encode('ascii')  #doctest: +NORMALIZE_WHITESPACE
    Traceback (most recent call last):
        ...
    UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
      position 0: ordinal not in range(128)
    >>> u.encode('ascii', 'ignore')
    b'abcd'
    >>> u.encode('ascii', 'replace')
    b'?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    b'&#40960;abcd&#1972;'
    >>> u.encode('ascii', 'backslashreplace')
    b'\\ua000abcd\\u07b4'
    >>> u.encode('ascii', 'namereplace')
    b'\\N{YI SYLLABLE IT}abcd\\u07b4'

The low-level routines for registering and accessing the available
encodings are found in the :mod:`codecs` module. Implementing new
encodings also requires understanding the :mod:`codecs` module.
However, the encoding and decoding functions returned by this module
are usually more low-level than is comfortable, and writing new encodings
is a specialized task, so the module won't be covered in this HOWTO.


Unicode Literals in Python Source Code
--------------------------------------

In Python source code, specific Unicode code points can be written using the
``\u`` escape sequence, which is followed by four hex digits giving the code
point. The ``\U`` escape sequence is similar, but expects eight hex digits,
not four::

    >>> s = "a\xac\u1234\u20ac\U00008000"
    ... #     ^^^^ two-digit hex escape
    ... #         ^^^^^^ four-digit Unicode escape
    ... #                     ^^^^^^^^^^ eight-digit Unicode escape
    >>> [ord(c) for c in s]
    [97, 172, 4660, 8364, 32768]

Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language. You
can also assemble strings using the :func:`chr` built-in function, but this is
even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding. You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost
any encoding if you declare the encoding being used. This is done by including
a special comment as either the first or second line of the source file::

    #!/usr/bin/env python
    # -*- coding: latin-1 -*-

    u = 'abcdé'
    print(ord(u[-1]))

The syntax is inspired by Emacs's notation for specifying variables local to a
file. Emacs supports many different variables, but Python only supports
'coding'. The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention. Python looks for
``coding: name`` or ``coding=name`` in the comment.

If you don't include such a comment, the default encoding used will be UTF-8 as
already mentioned. See also :pep:`263` for more information.


Unicode Properties
------------------

The Unicode specification includes a database of information about code points.
For each defined code point, the information includes the character's
name, its category, the numeric value if applicable (Unicode has characters
representing the Roman numerals and fractions such as one-third and
four-fifths). There are also properties related to the code point's use in
bidirectional text and other display-related properties.

The following program displays some information about several characters, and
prints the numeric value of one particular character::

    import unicodedata

    u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

    for i, c in enumerate(u):
        print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
        print(unicodedata.name(c))

    # Get numeric value of second character
    print(unicodedata.numeric(u[1]))

When run, this prints:

.. code-block:: none

   0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
   1 0bf2 No TAMIL NUMBER ONE THOUSAND
   2 0f84 Mn TIBETAN MARK HALANTA
   3 1770 Lo TAGBANWA LETTER SA
   4 33af So SQUARE RAD OVER S SQUARED
   1000.0

The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories. To take the codes
from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other". See
`the General Category Values section of the Unicode Character Database documentation <http://www.unicode.org/reports/tr44/#General_Category_Values>`_ for a
list of category codes.


Unicode Regular Expressions
---------------------------

The regular expressions supported by the :mod:`re` module can be provided
either as bytes or strings. Some of the special character sequences such as
``\d`` and ``\w`` have different meanings depending on whether
the pattern is supplied as bytes or a string. For example,
``\d`` will match the characters ``[0-9]`` in bytes but
in strings will match any character that's in the ``'Nd'`` category.

The string in this example has the number 57 written in both Thai and
Arabic numerals::

   import re
   p = re.compile(r'\d+')

   s = "Over \u0e55\u0e57 57 flavours"
   m = p.search(s)
   print(repr(m.group()))

When executed, ``\d+`` will match the Thai numerals and print them
out. If you supply the :const:`re.ASCII` flag to
:func:`~re.compile`, ``\d+`` will match the substring "57" instead.

Similarly, ``\w`` matches a wide variety of Unicode characters but
only ``[a-zA-Z0-9_]`` in bytes or if :const:`re.ASCII` is supplied,
and ``\s`` will match either Unicode whitespace characters or
``[ \t\n\r\f\v]``.
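
A rough interactive sketch of the effect of :const:`re.ASCII` on ``\d``
(your terminal may display the Thai digits as escape sequences)::

    >>> import re
    >>> s = "Over \u0e55\u0e57 57 flavours"
    >>> re.findall(r'\d+', s)            # Thai digits are in category 'Nd'
    ['๕๗', '57']
    >>> re.findall(r'\d+', s, re.ASCII)  # only [0-9]
    ['57']
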


References
----------

.. comment should these be mentioned earlier, e.g. at the start of the "introduction to Unicode" first section?

Some good alternative discussions of Python's Unicode support are:

* `Processing Text Files in Python 3 <http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html>`_, by Nick Coghlan.
* `Pragmatic Unicode <https://nedbatchelder.com/text/unipain.html>`_, a PyCon 2012 presentation by Ned Batchelder.

The :class:`str` type is described in the Python library reference at
:ref:`textseq`.

The documentation for the :mod:`unicodedata` module.

The documentation for the :mod:`codecs` module.

Marc-André Lemburg gave `a presentation titled "Python and Unicode" (PDF slides)
<https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>`_ at
EuroPython 2002. The slides are an excellent overview of the design of Python
2's Unicode features (where the Unicode string type is called ``unicode`` and
literals start with ``u``).


Reading and Writing Unicode Data
================================

Once you've written some code that works with Unicode data, the next problem is
input/output. How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively. XML parsers often return Unicode
data, for example. Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket. It's possible to do all the work
yourself: open a file, read an 8-bit bytes object from it, and convert the bytes
with ``bytes.decode(encoding)``. However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes. If you want to read the file in arbitrary-sized
chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk. One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2 GiB file, you need 2 GiB of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)
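
The :mod:`codecs` module's incremental decoders illustrate both the problem
and the low-level fix; this is only a sketch of machinery you rarely need to
use directly::

    import codecs

    data = 'é'.encode('utf-8')        # b'\xc3\xa9': one character, two bytes
    # Decoding just the first byte would raise UnicodeDecodeError, because
    # the multi-byte sequence is incomplete.  An incremental decoder buffers
    # the partial sequence between calls instead:
    decoder = codecs.getincrementaldecoder('utf-8')()
    print(repr(decoder.decode(data[:1])))   # '' -- nothing complete yet
    print(repr(decoder.decode(data[1:])))   # 'é'
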

The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences. The work of implementing this has already been
done for you: the built-in :func:`open` function can return a file-like object
that assumes the file's contents are in a specified encoding and accepts Unicode
parameters for methods such as :meth:`~io.TextIOBase.read` and
:meth:`~io.TextIOBase.write`. This works through :func:`open`\'s *encoding* and
*errors* parameters which are interpreted just like those in :meth:`str.encode`
and :meth:`bytes.decode`.

Reading Unicode from a file is therefore simple::

    with open('unicode.txt', encoding='utf-8') as f:
        for line in f:
            print(repr(line))

It's also possible to open files in update mode, allowing both reading and
writing::

    with open('test', encoding='utf-8', mode='w+') as f:
        f.write('\u4500 blah blah blah\n')
        f.seek(0)
        print(repr(f.readline()[:1]))

The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read. There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.
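
A quick sketch of the difference; the plain 'utf-16' codec writes a BOM and
uses your machine's native byte order (little-endian output is shown here)::

    >>> '\ua000'.encode('utf-16')      # BOM first, then the character
    b'\xff\xfe\x00\xa0'
    >>> '\ua000'.encode('utf-16-le')   # explicit little-endian, no BOM
    b'\x00\xa0'
    >>> '\ua000'.encode('utf-16-be')   # explicit big-endian, no BOM
    b'\xa0\x00'
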

In some areas, it is also convention to use a "BOM" at the start of UTF-8
encoded files; the name is misleading since UTF-8 is not byte-order dependent.
The mark simply announces that the file is encoded in UTF-8. Use the
'utf-8-sig' codec to automatically skip the mark if present for reading such
files.
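
A short sketch of how the 'utf-8-sig' codec behaves::

    >>> 'abc'.encode('utf-8-sig')               # the mark is added on encoding
    b'\xef\xbb\xbfabc'
    >>> b'\xef\xbb\xbfabc'.decode('utf-8-sig')  # ...and skipped on decoding
    'abc'
    >>> b'abc'.decode('utf-8-sig')              # a missing mark is fine too
    'abc'
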


Unicode filenames
-----------------

Most of the operating systems in common use today support filenames that contain
arbitrary Unicode characters. Usually this is implemented by converting the
Unicode string into some encoding that varies depending on the system. For
example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on
Windows, Python uses the name "mbcs" to refer to whatever the currently
configured encoding is. On Unix systems, there will only be a filesystem
encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
you haven't, the default encoding is UTF-8.

The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother. When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::

    filename = 'filename\u4500abc'
    with open(filename, 'w') as f:
        f.write('blah\n')

Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.

The :func:`os.listdir` function returns filenames, which raises an issue: should it return
the Unicode version of filenames, or should it return bytes containing
the encoded versions? :func:`os.listdir` will do both, depending on whether you
provided the directory path as bytes or a Unicode string. If you pass a
Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing a byte
path will return the filenames as bytes. For example,
assuming the default filesystem encoding is UTF-8, running the following
program::

   fn = 'filename\u4500abc'
   f = open(fn, 'w')
   f.close()

   import os
   print(os.listdir(b'.'))
   print(os.listdir('.'))

will produce the following output:

.. code-block:: shell-session

   amk:~$ python t.py
   [b'filename\xe4\x94\x80abc', ...]
   ['filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.

Note that on most occasions, the Unicode APIs should be used. The bytes APIs
should only be used on systems where undecodable file names can be present,
i.e. Unix systems.


Tips for Writing Unicode-aware Programs
---------------------------------------

This section provides some suggestions on writing software that deals with
Unicode.

The most important tip is:

    Software should only work with Unicode strings internally, decoding the input
    data as soon as possible and encoding the output only at the end.

If you attempt to write processing functions that accept both Unicode and byte
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings. There is no automatic encoding or decoding: if
you do e.g. ``str + bytes``, a :exc:`TypeError` will be raised.
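
For example (the exact error message varies across Python versions)::

    >>> 'Unicode' + b' bytes'
    Traceback (most recent call last):
        ...
    TypeError: can only concatenate str (not "bytes") to str
    >>> 'Unicode' + b' bytes'.decode('ascii')   # decode first, then combine
    'Unicode bytes'
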

When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database. If you're doing
this, be careful to check the decoded string, not the encoded bytes data;
some encodings may have interesting properties, such as not being bijective
or not being fully ASCII-compatible. This is especially true if the input
data also specifies the encoding, since the attacker can then choose a
clever way to hide malicious text in the encoded bytestream.


Converting Between File Encodings
'''''''''''''''''''''''''''''''''

The :class:`~codecs.StreamRecoder` class can transparently convert between
encodings, taking a stream that returns data in encoding #1
and behaving like a stream returning data in encoding #2.

For example, if you have an input file *f* that's in Latin-1, you
can wrap it with a :class:`~codecs.StreamRecoder` to return bytes encoded in
UTF-8::

    new_f = codecs.StreamRecoder(f,
        # en/decoder: used by read() to encode its results and
        # by write() to decode its input.
        codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

        # reader/writer: used to read and write to the stream.
        codecs.getreader('latin-1'), codecs.getwriter('latin-1') )


Files in an Unknown Encoding
''''''''''''''''''''''''''''

What can you do if you need to make a change to a file, but don't know
the file's encoding? If you know the encoding is ASCII-compatible and
only want to examine or modify the ASCII parts, you can open the file
with the ``surrogateescape`` error handler::

   with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
       data = f.read()

   # make changes to the string 'data'

   with open(fname + '.new', 'w',
             encoding="ascii", errors="surrogateescape") as f:
       f.write(data)

The ``surrogateescape`` error handler will decode any non-ASCII bytes
as code points in a special range running from U+DC80 to
U+DCFF. These code points will then turn back into the
same bytes when the ``surrogateescape`` error handler is used to
encode the data and write it back out.
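
A small sketch of the same round trip at the level of :meth:`bytes.decode`
and :meth:`str.encode`::

    >>> raw = b'Andr\xe9'                      # not valid ASCII
    >>> text = raw.decode('ascii', 'surrogateescape')
    >>> text                                   # the 0xE9 byte became U+DCE9
    'Andr\udce9'
    >>> text.encode('ascii', 'surrogateescape') == raw
    True
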


References
----------

One section of `Mastering Python 3 Input/Output
<http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o>`_,
a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.

The `PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
Applications in Python"
<https://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>`_
discuss questions of character encodings as well as how to internationalize
and localize an application. These slides cover Python 2.x only.

`The Guts of Unicode in Python
<http://pyvideo.org/video/1768/the-guts-of-unicode-in-python>`_
is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode
representation in Python 3.3.


Acknowledgements
================

The initial draft of this document was written by Andrew Kuchling.
It has since been revised further by Alexander Belopolsky, Georg Brandl,
Andrew Kuchling, and Ezio Melotti.

Thanks to the following people who have noted errors or offered
suggestions on this article: Éric Araujo, Nicholas Bastin, Nick
Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André
Lemburg, Martin von Löwis, Terry J. Reedy, Chad Whitacre.
733Lemburg, Martin von Löwis, Terry J. Reedy, Chad Whitacre.