.. _unicode-howto:

*****************
  Unicode HOWTO
*****************

:Release: 1.12

This HOWTO discusses Python's support for the Unicode specification
for representing textual data, and explains various problems that
people commonly encounter when trying to work with Unicode.


Introduction to Unicode
=======================

Definitions
-----------

Today's programs need to be able to handle a wide variety of
characters. Applications are often internationalized to display
messages and output in a variety of user-selectable languages; the
same program might need to output an error message in English, French,
Japanese, Hebrew, or Russian. Web content can be written in any of
these languages and can also include a variety of emoji symbols.
Python's string type uses the Unicode Standard for representing
characters, which lets Python programs work with all these different
possible characters.

Unicode (https://www.unicode.org/) is a specification that aims to
list every character used by human languages and give each character
its own unique code. The Unicode specifications are continually
revised and updated to add new languages and symbols.

A **character** is the smallest possible component of a text. 'A', 'B', 'C',
etc., are all different characters. So are 'È' and 'Í'. Characters vary
depending on the language or context you're talking
about. For example, there's a character for "Roman Numeral One", 'Ⅰ', that's
separate from the uppercase letter 'I'. They'll usually look the same,
but these are two different characters that have different meanings.

The Unicode standard describes how characters are represented by
**code points**. A code point value is an integer in the range 0 to
0x10FFFF (about 1.1 million values, the
`actual number assigned <https://www.unicode.org/versions/latest/#Summary>`_
is less than that). In the standard and in this document, a code point is written
using the notation ``U+265E`` to mean the character with value
``0x265e`` (9,822 in decimal).
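
As a small illustration of this notation, the sketch below converts between a
character, its integer code point, and the ``U+XXXX`` form. The helper name
``code_point_label`` is made up for this example; only the built-ins
:func:`ord` and :func:`chr` come from Python itself:

```python
def code_point_label(char):
    """Return the U+XXXX notation for a one-character string."""
    # By convention the hex value is padded to at least four digits.
    return f"U+{ord(char):04X}"


knight = chr(0x265E)               # build the character from its code point
print(code_point_label(knight))    # U+265E
print(ord(knight))                 # 9822
print(code_point_label("a"))       # U+0061
```

Code points above ``U+FFFF`` simply use more digits, so the same helper
returns ``U+1F600`` for the grinning face emoji.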

The Unicode standard contains a lot of tables listing characters and
their corresponding code points:

.. code-block:: none

   0061    'a'; LATIN SMALL LETTER A
   0062    'b'; LATIN SMALL LETTER B
   0063    'c'; LATIN SMALL LETTER C
   ...
   007B    '{'; LEFT CURLY BRACKET
   ...
   2167    'Ⅷ'; ROMAN NUMERAL EIGHT
   2168    'Ⅸ'; ROMAN NUMERAL NINE
   ...
   265E    '♞'; BLACK CHESS KNIGHT
   265F    '♟'; BLACK CHESS PAWN
   ...
   1F600   '😀'; GRINNING FACE
   1F609   '😉'; WINKING FACE
   ...

Strictly, these definitions imply that it's meaningless to say 'this is
character ``U+265E``'. ``U+265E`` is a code point, which represents some particular
character; in this case, it represents the character 'BLACK CHESS KNIGHT',
'♞'. In
informal contexts, this distinction between code points and characters will
sometimes be forgotten.

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**. The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used. Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.


Encodings
---------

To summarize the previous section: a Unicode string is a sequence of
code points, which are numbers from 0 through ``0x10FFFF`` (1,114,111
decimal). This sequence of code points needs to be represented in
memory as a set of **code units**, and **code units** are then mapped
to 8-bit bytes. The rules for translating a Unicode string into a
sequence of bytes are called a **character encoding**, or just
an **encoding**.

The first encoding you might think of is using 32-bit integers as the
code unit, and then using the CPU's representation of 32-bit integers.
In this representation, the string "Python" might look like this:

.. code-block:: none

      P           y           t           h           o           n
   0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space. In most texts, the majority of the code points
   are less than 127, or less than 255, so a lot of space is occupied by ``0x00``
   bytes. The above string takes 24 bytes compared to the 6 bytes needed for an
   ASCII representation. Increased RAM usage doesn't matter too much (desktop
   computers have gigabytes of RAM, and strings aren't usually that large), but
   expanding our usage of disk and network bandwidth by a factor of 4 is
   intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new
   family of wide string functions would need to be used.

Therefore this encoding isn't used very much, and people instead choose other
encodings that are more efficient and convenient, such as UTF-8.
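
The first two problems can be observed directly from Python. The following is
a minimal sketch that uses the standard ``utf-32-le`` and ``utf-32-be`` codecs
as stand-ins for "the CPU's representation" on little-endian and big-endian
machines:

```python
text = "Python"

little = text.encode("utf-32-le")   # least significant byte first
big = text.encode("utf-32-be")      # most significant byte first

print(len(text.encode("ascii")))    # 6
print(len(little))                  # 24
print(little[:4].hex())             # 50000000
print(big[:4].hex())                # 00000050
print(little.count(0))              # 18 of the 24 bytes are zero
```

The same six characters produce two different byte sequences depending on the
byte order, and three quarters of the bytes carry no information.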

UTF-8 is one of the most commonly used encodings, and Python often
defaults to using it. UTF stands for "Unicode Transformation Format",
and the '8' means that 8-bit values are used in the encoding. (There
are also UTF-16 and UTF-32 encodings, but they are less frequently
used than UTF-8.) UTF-8 uses the following rules:

1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or
   four bytes, where each byte of the sequence is between 128 and 255.
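
The two rules can be checked against Python's own UTF-8 codec. This is only a
sketch; the sample characters are arbitrary choices, one for each possible
encoded length:

```python
samples = ["a", "é", "€", "😀"]   # U+0061, U+00E9, U+20AC, U+1F600

for char in samples:
    encoded = char.encode("utf-8")
    if ord(char) < 128:
        # Rule 1: a single byte whose value equals the code point.
        assert encoded == bytes([ord(char)])
    else:
        # Rule 2: two to four bytes, each in the range 128..255.
        assert 2 <= len(encoded) <= 4
        assert all(128 <= byte <= 255 for byte in encoded)
    print(f"U+{ord(char):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```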

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes that contains embedded
   zero bytes only where they represent the null character (U+0000). This means
   that UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent
   through protocols that can't handle zero bytes for anything other than
   end-of-string markers.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be
   represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize. It's also unlikely that
   random 8-bit data will look like valid UTF-8.
6. UTF-8 is a byte-oriented encoding. The encoding specifies that each
   character is represented by a specific sequence of one or more bytes. This
   avoids the byte-ordering issues that can occur with integer- and word-oriented
   encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending
   on the hardware on which the string was encoded.
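
A few of these properties are easy to demonstrate. The sketch below relies on
one detail the list doesn't spell out: in UTF-8, every byte after the first
one of a multi-byte sequence has the bit pattern ``10xxxxxx``. The helper
``is_continuation`` is a hypothetical name introduced for this example:

```python
text = "naïve café ♞"
data = text.encode("utf-8")

# Property 2: no zero bytes unless the text itself contains U+0000.
assert 0 not in data

# Property 3: pure ASCII text is byte-for-byte identical in both encodings.
assert "plain text".encode("ascii") == "plain text".encode("utf-8")


def is_continuation(byte):
    """True for bytes of the form 0b10xxxxxx, which never start a character."""
    return byte & 0b1100_0000 == 0b1000_0000


# Property 5: drop the first byte of the 3-byte knight sequence, then skip
# the orphaned continuation bytes to find the start of the next code point.
damaged = "♞ok".encode("utf-8")[1:]
start = 0
while start < len(damaged) and is_continuation(damaged[start]):
    start += 1
recovered = damaged[start:].decode("utf-8")
print(recovered)  # ok
```

Real decoders handle this case through their error handlers; the loop here
only shows why resynchronization is possible at all.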


References
----------

The `Unicode Consortium site <https://www.unicode.org>`_ has character charts, a
glossary, and PDF versions of the Unicode specification. Be prepared for some
difficult reading. `A chronology <https://www.unicode.org/history/>`_ of the
origin and development of Unicode is also available on the site.

On the Computerphile YouTube channel, Tom Scott briefly
`discusses the history of Unicode and UTF-8 <https://www.youtube.com/watch?v=MijmeoH9LT4>`_
(9 minutes 36 seconds).

To help understand the standard, Jukka Korpela has written `an introductory
guide <http://jkorpela.fi/unicode/guide.html>`_ to reading the
Unicode character tables.

Another `good introductory article <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/>`_
was written by Joel Spolsky.
If this introduction didn't make things clear to you, you should try
reading this alternate article before continuing.

Wikipedia entries are often helpful; see the entries for "`character encoding
<https://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
<https://en.wikipedia.org/wiki/UTF-8>`_, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.

The String Type
---------------

Since Python 3.0, the language's :class:`str` type contains Unicode
characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
rocks!'``, or the triple-quoted string syntax is stored as Unicode.

The default encoding for Python source code is UTF-8, so you can simply
include a Unicode character in a string literal::

   try:
       with open('/tmp/input.txt', 'r') as f:
           ...
   except OSError:
       # 'File not found' error message.
       print("Fichier non trouvé")

Side note: Python 3 also supports using Unicode characters in identifiers::

   répertoire = "/tmp/records.log"
   with open(répertoire, "w") as f:
       f.write("test\n")

If you can't enter a particular character in your editor or want to
keep the source code ASCII-only for some reason, you can also use
escape sequences in string literals. (Depending on your system,
you may see the actual capital-delta glyph instead of a \u escape.) ::

   >>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
   '\u0394'
   >>> "\u0394"                          # Using a 16-bit hex value
   '\u0394'
   >>> "\U00000394"                      # Using a 32-bit hex value
   '\u0394'

In addition, one can create a string using the :meth:`~bytes.decode` method of
:class:`bytes`. This method takes an *encoding* argument, such as ``UTF-8``,
and optionally an *errors* argument.

The *errors* argument specifies the response when the input bytes can't be
converted according to the encoding's rules. Legal values for this argument are
``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
``U+FFFD``, ``REPLACEMENT CHARACTER``), ``'ignore'`` (just leave the
character out of the Unicode result), or ``'backslashreplace'`` (inserts a
``\xNN`` escape sequence).
The following examples show the differences::

    >>> b'\x80abc'.decode("utf-8", "strict")  #doctest: +NORMALIZE_WHITESPACE
    Traceback (most recent call last):
        ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
      invalid start byte
    >>> b'\x80abc'.decode("utf-8", "replace")
    '\ufffdabc'
    >>> b'\x80abc'.decode("utf-8", "backslashreplace")
    '\\x80abc'
    >>> b'\x80abc'.decode("utf-8", "ignore")
    'abc'

Encodings are specified as strings containing the encoding's name. Python
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list. Some encodings have multiple names; for
example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
the same encoding.
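
One way to confirm that several names refer to the same encoding is to look
them up with :func:`codecs.lookup` and compare the canonical name reported on
the returned object. The following is a brief sketch, not the only way to
resolve aliases:

```python
import codecs

aliases = ["latin-1", "iso_8859_1", "8859"]

# Every alias resolves to a codec that reports the same canonical name.
canonical = {codecs.lookup(alias).name for alias in aliases}
print(canonical)

# Consequently, each alias decodes the byte 0xE9 to the same character.
decoded = {b"\xe9".decode(alias) for alias in aliases}
print(decoded)  # {'é'}
```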

One-character Unicode strings can also be created with the :func:`chr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point. The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

    >>> chr(57344)
    '\ue000'
    >>> ord('\ue000')
    57344

Converting to Bytes
-------------------

The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
which returns a :class:`bytes` representation of the Unicode string, encoded in the
requested *encoding*.

The *errors* parameter is the same as the parameter of the
:meth:`~bytes.decode` method but supports a few more possible handlers. As well as
``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case
inserts a question mark instead of the unencodable character), there is
also ``'xmlcharrefreplace'`` (inserts an XML character reference),
``'backslashreplace'`` (inserts a ``\uNNNN`` escape sequence) and
``'namereplace'`` (inserts a ``\N{...}`` escape sequence).

The following example shows the different results::

    >>> u = chr(40960) + 'abcd' + chr(1972)
    >>> u.encode('utf-8')
    b'\xea\x80\x80abcd\xde\xb4'
    >>> u.encode('ascii')  #doctest: +NORMALIZE_WHITESPACE
    Traceback (most recent call last):
        ...
    UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
      position 0: ordinal not in range(128)
    >>> u.encode('ascii', 'ignore')
    b'abcd'
    >>> u.encode('ascii', 'replace')
    b'?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    b'&#40960;abcd&#1972;'
    >>> u.encode('ascii', 'backslashreplace')
    b'\\ua000abcd\\u07b4'
    >>> u.encode('ascii', 'namereplace')
    b'\\N{YI SYLLABLE IT}abcd\\u07b4'

The low-level routines for registering and accessing the available
encodings are found in the :mod:`codecs` module. Implementing new
encodings also requires understanding the :mod:`codecs` module.
However, the encoding and decoding functions returned by this module
are usually more low-level than is comfortable, and writing new encodings
is a specialized task, so the module won't be covered in this HOWTO.


Unicode Literals in Python Source Code
--------------------------------------

In Python source code, specific Unicode code points can be written using the
``\u`` escape sequence, which is followed by four hex digits giving the code
point. The ``\U`` escape sequence is similar, but expects eight hex digits,
not four::

   >>> s = "a\xac\u1234\u20ac\U00008000"
   ... #     ^^^^ two-digit hex escape
   ... #         ^^^^^^ four-digit Unicode escape
   ... #                     ^^^^^^^^^^ eight-digit Unicode escape
   >>> [ord(c) for c in s]
   [97, 172, 4660, 8364, 32768]

Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language. You
can also assemble strings using the :func:`chr` built-in function, but this is
even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding. You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost
any encoding if you declare the encoding being used. This is done by including
a special comment as either the first or second line of the source file::

   #!/usr/bin/env python
   # -*- coding: latin-1 -*-

   u = 'abcdé'
   print(ord(u[-1]))

The syntax is inspired by Emacs's notation for specifying variables local to a
file. Emacs supports many different variables, but Python only supports
'coding'. The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention. Python looks for
``coding: name`` or ``coding=name`` in the comment.

If you don't include such a comment, the default encoding used will be UTF-8 as
already mentioned. See also :pep:`263` for more information.
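
To see which encoding Python would pick for a given source file, the standard
library's :func:`tokenize.detect_encoding` applies the same detection to a
byte stream. The snippet below is a sketch using in-memory sources instead of
real files; note that the function reports a normalized spelling of the
declared name:

```python
import io
import tokenize

source = (
    b"#!/usr/bin/env python\n"
    b"# -*- coding: latin-1 -*-\n"
    b"u = 'abcd\xe9'\n"
)

# detect_encoding() reads at most two lines looking for the declaration.
encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)  # iso-8859-1

# Without a declaration, the default applies.
default, _ = tokenize.detect_encoding(io.BytesIO(b"u = 'abcd'\n").readline)
print(default)   # utf-8

# The declared encoding is what turns the raw byte 0xE9 into 'é'.
print(source.decode(encoding).splitlines()[-1])  # u = 'abcdé'
```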


Unicode Properties
------------------

The Unicode specification includes a database of information about
code points. For each defined code point, the information includes
the character's name, its category, and the numeric value if applicable
(for characters representing numeric concepts such as the Roman
numerals, fractions such as one-third and four-fifths, etc.). There
are also display-related properties, such as how to use the code point
in bidirectional text.

The following program displays some information about several characters, and
prints the numeric value of one particular character::

   import unicodedata

   u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

   for i, c in enumerate(u):
       print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
       print(unicodedata.name(c))

   # Get numeric value of second character
   print(unicodedata.numeric(u[1]))

When run, this prints:

.. code-block:: none

   0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
   1 0bf2 No TAMIL NUMBER ONE THOUSAND
   2 0f84 Mn TIBETAN MARK HALANTA
   3 1770 Lo TAGBANWA LETTER SA
   4 33af So SQUARE RAD OVER S SQUARED
   1000.0

The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories. To take the codes
from the above output, ``'Ll'`` means "Letter, lowercase", ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other". See
`the General Category Values section of the Unicode Character Database documentation <https://www.unicode.org/reports/tr44/#General_Category_Values>`_ for a
list of category codes.
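
Because the first letter of each two-letter code identifies the major
category, grouping characters by it takes only a few lines. In this sketch the
``MAJOR_CLASSES`` table and the ``major_class`` helper are names invented for
the example; the seven class names follow the General Category values defined
by Unicode:

```python
import unicodedata

# First letter of the two-letter category code -> major class name.
MAJOR_CLASSES = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}


def major_class(char):
    """Map a character to the name of its major category."""
    return MAJOR_CLASSES[unicodedata.category(char)[0]]


for char in "A1 !+":
    print(repr(char), unicodedata.category(char), major_class(char))
```

For these five ASCII characters the categories are ``Lu``, ``Nd``, ``Zs``,
``Po``, and ``Sm``.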


Comparing Strings
-----------------

Unicode adds some complication to comparing strings, because the same
set of characters can be represented by different sequences of code
points. For example, a letter like 'ê' can be represented as a single
code point U+00EA, or as U+0065 U+0302, which is the code point for
'e' followed by a code point for 'COMBINING CIRCUMFLEX ACCENT'. These
will produce the same output when printed, but one is a string of
length 1 and the other is of length 2.

One tool for a case-insensitive comparison is the
:meth:`~str.casefold` string method that converts a string to a
case-insensitive form following an algorithm described by the Unicode
Standard. This algorithm has special handling for characters such as
the German letter 'ß' (code point U+00DF), which becomes the pair of
lowercase letters 'ss'.

::

   >>> street = 'Gürzenichstraße'
   >>> street.casefold()
   'gürzenichstrasse'

A second tool is the :mod:`unicodedata` module's
:func:`~unicodedata.normalize` function that converts strings to one
of several normal forms, in which equivalent sequences (such as a
letter followed by a combining character, and the corresponding single
precomposed character) are mapped to one consistent representation.
:func:`~unicodedata.normalize` can
be used to perform string comparisons that won't falsely report
inequality if two strings use combining characters differently:

::

   import unicodedata

   def compare_strs(s1, s2):
       def NFD(s):
           return unicodedata.normalize('NFD', s)

       return NFD(s1) == NFD(s2)

   single_char = 'ê'
   multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
   print('length of first string=', len(single_char))
   print('length of second string=', len(multiple_chars))
   print(compare_strs(single_char, multiple_chars))

When run, this outputs:

.. code-block:: shell-session

   $ python3 compare-strs.py
   length of first string= 1
   length of second string= 2
   True

The first argument to the :func:`~unicodedata.normalize` function is a
string giving the desired normalization form, which can be one of
'NFC', 'NFKC', 'NFD', and 'NFKD'.
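
To give a rough feel for how the forms differ, the sketch below composes a
decomposed 'ê' with NFC and then contrasts NFC with NFKC on a ligature. The
sample characters are arbitrary choices made for this illustration:

```python
import unicodedata

decomposed = "e\u0302"              # 'e' + COMBINING CIRCUMFLEX ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))       # 2 1
print(composed == "\u00ea")                 # True

# The "K" forms additionally replace compatibility characters.
ligature = "\ufb01"                 # LATIN SMALL LIGATURE FI
print(unicodedata.normalize("NFC", ligature) == ligature)   # True
print(unicodedata.normalize("NFKC", ligature))              # fi
```

NFD and NFKD apply the same decompositions but leave the result decomposed
instead of recombining it.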

The Unicode Standard also specifies how to do caseless comparisons::

   import unicodedata

   def compare_caseless(s1, s2):
       def NFD(s):
           return unicodedata.normalize('NFD', s)

       return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())

   # Example usage
   single_char = 'ê'
   multiple_chars = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'

   print(compare_caseless(single_char, multiple_chars))

This will print ``True``. (Why is :func:`NFD` invoked twice? Because
there are a few characters that make :meth:`casefold` return a
non-normalized string, so the result needs to be normalized again. See
section 3.13 of the Unicode Standard for a discussion and an example.)


Unicode Regular Expressions
---------------------------

The regular expressions supported by the :mod:`re` module can be provided
either as bytes or strings. Some of the special character sequences such as
``\d`` and ``\w`` have different meanings depending on whether
the pattern is supplied as bytes or a string. For example,
``\d`` will match the characters ``[0-9]`` in bytes but
in strings will match any character that's in the ``'Nd'`` category.

The string in this example has the number 57 written in both Thai and
Arabic numerals::

   import re
   p = re.compile(r'\d+')

   s = "Over \u0e55\u0e57 57 flavours"
   m = p.search(s)
   print(repr(m.group()))

When executed, ``\d+`` will match the Thai numerals and print them
out. If you supply the :const:`re.ASCII` flag to
:func:`~re.compile`, ``\d+`` will match the substring "57" instead.

Similarly, ``\w`` matches a wide variety of Unicode characters but
only ``[a-zA-Z0-9_]`` in bytes or if :const:`re.ASCII` is supplied,
and ``\s`` will match either Unicode whitespace characters or
``[ \t\n\r\f\v]``.
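
The behaviour of ``\w`` and ``\s`` can be compared side by side. This is a
minimal sketch; the sample text and the EM SPACE separator are arbitrary
choices for the illustration:

```python
import re

text = "naïve café_1"

# str pattern: \w matches Unicode word characters.
print(re.findall(r"\w+", text))             # ['naïve', 'café_1']
# re.ASCII restricts \w to [a-zA-Z0-9_].
print(re.findall(r"\w+", text, re.ASCII))   # ['na', 've', 'caf', '_1']
# bytes pattern: always ASCII-only, so the UTF-8 bytes of 'ï' and 'é' split words.
print(re.findall(rb"\w+", text.encode("utf-8")))

spaced = "a\u2003b"                         # EM SPACE between the letters
print(re.split(r"\s+", spaced))             # ['a', 'b']
print(re.split(r"\s+", spaced, flags=re.ASCII) == [spaced])   # True
```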
511
512
Georg Brandl116aa622007-08-15 14:28:22 +0000513References
514----------
515
Andrew Kuchling2151fc62013-06-20 09:29:09 -0400516.. comment should these be mentioned earlier, e.g. at the start of the "introduction to Unicode" first section?
517
518Some good alternative discussions of Python's Unicode support are:
519
520* `Processing Text Files in Python 3 <http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html>`_, by Nick Coghlan.
Sanyam Khurana1b4587a2017-12-06 22:09:33 +0530521* `Pragmatic Unicode <https://nedbatchelder.com/text/unipain.html>`_, a PyCon 2012 presentation by Ned Batchelder.
Andrew Kuchling2151fc62013-06-20 09:29:09 -0400522
Ezio Melotti410eee52013-01-20 12:16:03 +0200523The :class:`str` type is described in the Python library reference at
Ezio Melottia6229e62012-10-12 10:59:14 +0300524:ref:`textseq`.
Georg Brandl116aa622007-08-15 14:28:22 +0000525
526The documentation for the :mod:`unicodedata` module.
527
528The documentation for the :mod:`codecs` module.
529
Georg Brandl9bdcb3b2014-10-29 09:37:43 +0100530Marc-André Lemburg gave `a presentation titled "Python and Unicode" (PDF slides)
531<https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>`_ at
532EuroPython 2002. The slides are an excellent overview of the design of Python
5332's Unicode features (where the Unicode string type is called ``unicode`` and
534literals start with ``u``).
Georg Brandl116aa622007-08-15 14:28:22 +0000535
536
Reading and Writing Unicode Data
================================

Once you've written some code that works with Unicode data, the next problem is
input/output. How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively. XML parsers often return Unicode
data, for example. Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket. It's possible to do all the work
yourself: open a file, read an 8-bit bytes object from it, and convert the bytes
with ``bytes.decode(encoding)``. However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes. If you want to read the file in arbitrary-sized
chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk. One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2 GiB file, you need 2 GiB of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)

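To make the chunking problem concrete, here is a minimal sketch. The sample
string and the 4-byte chunk size are arbitrary choices made for illustration.
Decoding a chunk on its own fails when the chunk boundary falls inside a
multi-byte sequence, whereas an incremental decoder from the :mod:`codecs`
module holds back the incomplete bytes until the rest arrives::

   import codecs

   data = 'caf\u00e9 \u20ac5'.encode('utf-8')    # 10 bytes for 7 characters
   chunks = [data[i:i + 4] for i in range(0, len(data), 4)]

   # Naive approach: the first chunk ends in the middle of the two-byte
   # sequence for U+00E9, so decoding it in isolation raises an error.
   try:
       chunks[0].decode('utf-8')
       naive_failed = False
   except UnicodeDecodeError:
       naive_failed = True

   # Incremental approach: the decoder buffers a trailing partial
   # sequence and completes it when the next chunk is supplied.
   decoder = codecs.getincrementaldecoder('utf-8')()
   pieces = [decoder.decode(chunk) for chunk in chunks]
   pieces.append(decoder.decode(b'', final=True))   # flush; errors if bytes remain
   text = ''.join(pieces)

   print(naive_failed)
   print(repr(text))

In everyday code you rarely need to drive the decoder by hand like this.
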
The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences. The work of implementing this has already been
done for you: the built-in :func:`open` function can return a file-like object
that assumes the file's contents are in a specified encoding and accepts Unicode
parameters for methods such as :meth:`~io.TextIOBase.read` and
:meth:`~io.TextIOBase.write`. This works through :func:`open`\'s *encoding* and
*errors* parameters, which are interpreted just like those in :meth:`str.encode`
and :meth:`bytes.decode`.

Reading Unicode from a file is therefore simple::

   with open('unicode.txt', encoding='utf-8') as f:
       for line in f:
           print(repr(line))

It's also possible to open files in update mode, allowing both reading and
writing::

   with open('test', encoding='utf-8', mode='w+') as f:
       f.write('\u4500 blah blah blah\n')
       f.seek(0)
       print(repr(f.readline()[:1]))

The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read. There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.

In some areas, it is also convention to use a "BOM" at the start of UTF-8
encoded files; the name is misleading since UTF-8 is not byte-order dependent.
The mark simply announces that the file is encoded in UTF-8. For reading such
files, use the 'utf-8-sig' codec to automatically skip the mark if present.


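The BOM behaviour described above can be observed directly on in-memory data.
The following is a small sketch using a made-up sample string; the same codecs
apply when their names are passed as the *encoding* argument of :func:`open`::

   import codecs

   # 'utf-16' writes a BOM in the platform's native byte order and
   # consumes it again when decoding.
   utf16 = 'abc'.encode('utf-16')
   has_bom = utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
   roundtrip = utf16.decode('utf-16')

   # The explicit-endianness variant writes no BOM at all.
   utf16_le = 'abc'.encode('utf-16-le')

   # 'utf-8-sig' writes the UTF-8 signature and skips it when reading;
   # plain 'utf-8' leaves U+FEFF in the decoded text.
   sig = 'abc'.encode('utf-8-sig')
   with_mark = sig.decode('utf-8')
   without_mark = sig.decode('utf-8-sig')

   # Input without the signature is also accepted by 'utf-8-sig'.
   plain = b'abc'.decode('utf-8-sig')

   print(has_bom, repr(with_mark), repr(without_mark))
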
Unicode filenames
-----------------

Most of the operating systems in common use today support filenames
that contain arbitrary Unicode characters. Usually this is
implemented by converting the Unicode string into some encoding that
varies depending on the system. Today Python is converging on using
UTF-8: Python on macOS has used UTF-8 for several versions, and Python
3.6 switched to using UTF-8 on Windows as well. On Unix systems,
there will only be a :term:`filesystem encoding <filesystem encoding and error
handler>` if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
you haven't, the default encoding is again UTF-8.

The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother. When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::

   filename = 'filename\u4500abc'
   with open(filename, 'w') as f:
       f.write('blah\n')

Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.

The :func:`os.listdir` function returns filenames, which raises an issue: should it return
the Unicode version of filenames, or should it return bytes containing
the encoded versions? :func:`os.listdir` can do both, depending on whether you
provided the directory path as bytes or a Unicode string. If you pass a
Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing a byte
path will return the filenames as bytes. For example,
assuming the default :term:`filesystem encoding <filesystem encoding and error
handler>` is UTF-8, running the following program::

   fn = 'filename\u4500abc'
   f = open(fn, 'w')
   f.close()

   import os
   print(os.listdir(b'.'))
   print(os.listdir('.'))

will produce the following output:

.. code-block:: shell-session

   $ python listdir-test.py
   [b'filename\xe4\x94\x80abc', ...]
   ['filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.

Note that on most occasions, you can just stick with using
Unicode with these APIs. The bytes APIs should only be used on
systems where undecodable file names can be present; that's
pretty much only Unix systems now.


Tips for Writing Unicode-aware Programs
---------------------------------------

This section provides some suggestions on writing software that deals with
Unicode.

The most important tip is:

  Software should only work with Unicode strings internally, decoding the input
  data as soon as possible and encoding the output only at the end.

If you attempt to write processing functions that accept both Unicode and byte
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings. There is no automatic encoding or decoding: if
you do e.g. ``str + bytes``, a :exc:`TypeError` will be raised.

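As a minimal sketch of this advice, the following decodes once at the input
boundary, does all processing on :class:`str`, and encodes once on the way out.
The sample data and the choice of UTF-8 are assumptions made for illustration;
real input should be decoded with whatever encoding it actually uses::

   raw = 'na\u00efve caf\u00e9'.encode('utf-8')   # bytes arriving from outside

   # Mixing the two types is an error rather than an implicit conversion.
   try:
       'prefix: ' + raw
       mixed_ok = True
   except TypeError:
       mixed_ok = False

   text = raw.decode('utf-8')        # decode as early as possible
   result = text.upper()             # process Unicode strings only
   output = result.encode('utf-8')   # encode only at the very end

   print(mixed_ok, repr(result))
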
When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database. If you're doing
this, be careful to check the decoded string, not the encoded bytes data;
some encodings may have interesting properties, such as not being bijective
or not being fully ASCII-compatible. This is especially true if the input
data also specifies the encoding, since the attacker can then choose a
clever way to hide malicious text in the encoded bytestream.


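The following sketch illustrates why a check on the encoded bytes can be
fooled. It assumes a hypothetical filter that forbids angle brackets and an
attacker who is able to declare the input as UTF-7, an encoding that can spell
those characters using only other ASCII bytes::

   payload = b'+ADw-script+AD4-'   # attacker-supplied bytes, declared as UTF-7
   FORBIDDEN = '<>'                # hypothetical characters the filter rejects

   # Checking the encoded bytes finds nothing suspicious ...
   bytes_look_clean = not any(ch.encode('ascii') in payload for ch in FORBIDDEN)

   # ... but the decoded text contains exactly what the filter forbids.
   decoded = payload.decode('utf-7')
   text_is_clean = not any(ch in decoded for ch in FORBIDDEN)

   print(bytes_look_clean, text_is_clean, repr(decoded))

Running the check on ``decoded`` rather than on ``payload`` catches the hidden
markup.
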
Converting Between File Encodings
'''''''''''''''''''''''''''''''''

The :class:`~codecs.StreamRecoder` class can transparently convert between
encodings, taking a stream that returns data in encoding #1
and behaving like a stream returning data in encoding #2.

For example, if you have an input file *f* that's in Latin-1, you
can wrap it with a :class:`~codecs.StreamRecoder` to return bytes encoded in
UTF-8::

   import codecs

   # f is assumed to be a file object opened in binary mode.
   new_f = codecs.StreamRecoder(f,
       # en/decoder: used by read() to encode its results and
       # by write() to decode its input.
       codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

       # reader/writer: used to read and write to the stream.
       codecs.getreader('latin-1'), codecs.getwriter('latin-1'))


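To see a :class:`~codecs.StreamRecoder` work in both directions, here is a
small self-contained sketch that uses in-memory :class:`io.BytesIO` streams in
place of real files. The helper name ``latin1_to_utf8`` and the sample text are
made up for this illustration::

   import codecs
   import io

   def latin1_to_utf8(stream):
       """Wrap a Latin-1 byte stream so that callers see UTF-8 bytes."""
       return codecs.StreamRecoder(
           stream,
           codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),
           codecs.getreader('latin-1'), codecs.getwriter('latin-1'))

   # Reading: Latin-1 bytes in the stream come back as UTF-8 bytes.
   source = io.BytesIO('caf\u00e9'.encode('latin-1'))
   utf8_bytes = latin1_to_utf8(source).read()

   # Writing: UTF-8 bytes passed to write() are stored as Latin-1.
   sink = io.BytesIO()
   latin1_to_utf8(sink).write('caf\u00e9'.encode('utf-8'))
   latin1_bytes = sink.getvalue()

   print(utf8_bytes, latin1_bytes)
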
Files in an Unknown Encoding
''''''''''''''''''''''''''''

What can you do if you need to make a change to a file, but don't know
the file's encoding? If you know the encoding is ASCII-compatible and
only want to examine or modify the ASCII parts, you can open the file
with the ``surrogateescape`` error handler::

   with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
       data = f.read()

   # make changes to the string 'data'

   with open(fname + '.new', 'w',
             encoding="ascii", errors="surrogateescape") as f:
       f.write(data)

The ``surrogateescape`` error handler will decode any non-ASCII bytes
as code points in a special range running from U+DC80 to
U+DCFF. These code points will then turn back into the
same bytes when the ``surrogateescape`` error handler is used to
encode the data and write it back out.


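The round trip can be checked on a short byte string without touching the
filesystem. In this sketch the sample bytes are invented for illustration; the
byte ``0xE9`` stands in for text in some unknown ASCII-compatible encoding::

   raw = b'caf\xe9 = coffee'

   # The non-ASCII byte 0xE9 is decoded to the lone surrogate U+DCE9.
   text = raw.decode('ascii', errors='surrogateescape')

   # Only the ASCII part is edited; the escaped byte is left alone.
   edited = text.replace('coffee', 'tea')

   # Encoding with the same handler restores the original byte.
   out = edited.encode('ascii', errors='surrogateescape')

   print(repr(text), out)
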
References
----------

One section of `Mastering Python 3 Input/Output
<http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o>`_,
a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.

The `PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
Applications in Python"
<https://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>`_
discuss questions of character encodings as well as how to internationalize
and localize an application. These slides cover Python 2.x only.

`The Guts of Unicode in Python
<http://pyvideo.org/video/1768/the-guts-of-unicode-in-python>`_
is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode
representation in Python 3.3.


Acknowledgements
================

The initial draft of this document was written by Andrew Kuchling.
It has since been revised further by Alexander Belopolsky, Georg Brandl,
Andrew Kuchling, and Ezio Melotti.

Thanks to the following people who have noted errors or offered
suggestions on this article: Éric Araujo, Nicholas Bastin, Nick
Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André
Lemburg, Martin von Löwis, Terry J. Reedy, Serhiy Storchaka,
Eryk Sun, Chad Whitacre, Graham Wideman.