.. _unicode-howto:

*****************
  Unicode HOWTO
*****************

:Release: 1.12

This HOWTO discusses Python's support for the Unicode specification
for representing textual data, and explains various problems that
people commonly encounter when trying to work with Unicode.


Introduction to Unicode
=======================

Definitions
-----------

Today's programs need to be able to handle a wide variety of
characters.  Applications are often internationalized to display
messages and output in a variety of user-selectable languages; the
same program might need to output an error message in English, French,
Japanese, Hebrew, or Russian.  Web content can be written in any of
these languages and can also include a variety of emoji symbols.
Python's string type uses the Unicode Standard for representing
characters, which lets Python programs work with all these different
possible characters.

Unicode (https://www.unicode.org/) is a specification that aims to
list every character used by human languages and give each character
its own unique code.  The Unicode specifications are continually
revised and updated to add new languages and symbols.

A **character** is the smallest possible component of a text.  'A', 'B', 'C',
etc., are all different characters.  So are 'È' and 'Í'.  Characters vary
depending on the language or context you're talking
about.  For example, there's a character for "Roman Numeral One", 'Ⅰ', that's
separate from the uppercase letter 'I'.  They'll usually look the same,
but these are two different characters that have different meanings.

The Unicode standard describes how characters are represented by
**code points**.  A code point value is an integer in the range 0 to
0x10FFFF (about 1.1 million values, with some 110 thousand assigned so
far).  In the standard and in this document, a code point is written
using the notation ``U+265E`` to mean the character with value
``0x265e`` (9,822 in decimal).

The Unicode standard contains a lot of tables listing characters and
their corresponding code points:

.. code-block:: none

   0061 'a'; LATIN SMALL LETTER A
   0062 'b'; LATIN SMALL LETTER B
   0063 'c'; LATIN SMALL LETTER C
   ...
   007B '{'; LEFT CURLY BRACKET
   ...
   2167 'Ⅷ'; ROMAN NUMERAL EIGHT
   2168 'Ⅸ'; ROMAN NUMERAL NINE
   ...
   265E '♞'; BLACK CHESS KNIGHT
   265F '♟'; BLACK CHESS PAWN
   ...
   1F600 '😀'; GRINNING FACE
   1F609 '😉'; WINKING FACE
   ...

Strictly, these definitions imply that it's meaningless to say 'this is
character ``U+265E``'.  ``U+265E`` is a code point, which represents some
particular character; in this case, it represents the character
'BLACK CHESS KNIGHT', '♞'.  In informal contexts, this distinction between
code points and characters will sometimes be forgotten.

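You can check this correspondence yourself from the interpreter: the built-in :func:`ord` and :func:`chr` functions convert between one-character strings and code point integers.  A quick sketch, using characters from the table above:

```python
# ord() maps a one-character string to its code point; chr() is the inverse.
knight = '\N{BLACK CHESS KNIGHT}'

print(hex(ord(knight)))     # the code point of '♞'
print(chr(0x265E))          # and back to the character

# 'I' and 'Ⅰ' look alike but are distinct code points.
print(ord('I'), ord('\N{ROMAN NUMERAL ONE}'))
```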
A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**.  The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used.  Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.


Encodings
---------

To summarize the previous section: a Unicode string is a sequence of
code points, which are numbers from 0 through ``0x10FFFF`` (1,114,111
decimal).  This sequence of code points needs to be represented in
memory as a set of **code units**, and **code units** are then mapped
to 8-bit bytes.  The rules for translating a Unicode string into a
sequence of bytes are called a **character encoding**, or just
an **encoding**.

The first encoding you might think of is using 32-bit integers as the
code unit, and then using the CPU's representation of 32-bit integers.
In this representation, the string "Python" might look like this:

.. code-block:: none

      P           y           t           h           o           n
   0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space.  In most texts, the majority of the code points
   are less than 127, or less than 255, so a lot of space is occupied by ``0x00``
   bytes.  The above string takes 24 bytes compared to the 6 bytes needed for an
   ASCII representation.  Increased RAM usage doesn't matter too much (desktop
   computers have gigabytes of RAM, and strings aren't usually that large), but
   expanding our usage of disk and network bandwidth by a factor of 4 is
   intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new
   family of wide string functions would need to be used.

Therefore this encoding isn't used very much, and people instead choose other
encodings that are more efficient and convenient, such as UTF-8.

UTF-8 is one of the most commonly used encodings, and Python often
defaults to using it.  UTF stands for "Unicode Transformation Format",
and the '8' means that 8-bit values are used in the encoding.  (There
are also UTF-16 and UTF-32 encodings, but they are less frequently
used than UTF-8.)  UTF-8 uses the following rules:

1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or
   four bytes, where each byte of the sequence is between 128 and 255.

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes containing no embedded zero
   bytes.  This avoids byte-ordering issues, and means UTF-8 strings can be
   processed by C functions such as ``strcpy()`` and sent through protocols that
   can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be
   represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize.  It's also unlikely that
   random 8-bit data will look like valid UTF-8.
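
These properties are easy to observe with :meth:`str.encode`; a small sketch showing the variable-length behaviour and the ASCII compatibility:

```python
# ASCII characters encode to one byte each, so ASCII text is valid UTF-8.
print('abc'.encode('utf-8'))      # b'abc'

# Higher code points take two, three, or four bytes.
for ch in ('é', '€', '😀'):
    print(ch, len(ch.encode('utf-8')), 'bytes')
```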


References
----------

The `Unicode Consortium site <http://www.unicode.org>`_ has character charts, a
glossary, and PDF versions of the Unicode specification.  Be prepared for some
difficult reading.  `A chronology <http://www.unicode.org/history/>`_ of the
origin and development of Unicode is also available on the site.

On the Computerphile YouTube channel, Tom Scott briefly
`discusses the history of Unicode and UTF-8
<https://www.youtube.com/watch?v=MijmeoH9LT4>`_ (9 minutes 36 seconds).

To help understand the standard, Jukka Korpela has written `an introductory
guide <http://jkorpela.fi/unicode/guide.html>`_ to reading the
Unicode character tables.

Another `good introductory article <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/>`_
was written by Joel Spolsky.
If this introduction didn't make things clear to you, you should try
reading this alternate article before continuing.

Wikipedia entries are often helpful; see the entries for "`character encoding
<https://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
<https://en.wikipedia.org/wiki/UTF-8>`_, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.

The String Type
---------------

Since Python 3.0, the language's :class:`str` type contains Unicode
characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
rocks!'``, or the triple-quoted string syntax is stored as Unicode.

The default encoding for Python source code is UTF-8, so you can simply
include a Unicode character in a string literal::

   try:
       with open('/tmp/input.txt', 'r') as f:
           ...
   except OSError:
       # 'File not found' error message.
       print("Fichier non trouvé")

Side note: Python 3 also supports using Unicode characters in identifiers::

   répertoire = "/tmp/records.log"
   with open(répertoire, "w") as f:
       f.write("test\n")

If you can't enter a particular character in your editor or want to
keep the source code ASCII-only for some reason, you can also use
escape sequences in string literals.  (Depending on your system,
you may see the actual capital-delta glyph instead of a \u escape.) ::

   >>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
   '\u0394'
   >>> "\u0394"                          # Using a 16-bit hex value
   '\u0394'
   >>> "\U00000394"                      # Using a 32-bit hex value
   '\u0394'

In addition, one can create a string using the :func:`~bytes.decode` method of
:class:`bytes`.  This method takes an *encoding* argument, such as ``UTF-8``,
and optionally an *errors* argument.

The *errors* argument specifies the response when the input string can't be
converted according to the encoding's rules.  Legal values for this argument are
``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
``U+FFFD``, ``REPLACEMENT CHARACTER``), ``'ignore'`` (just leave the
character out of the Unicode result), or ``'backslashreplace'`` (inserts a
``\xNN`` escape sequence).
The following examples show the differences::

   >>> b'\x80abc'.decode("utf-8", "strict")  #doctest: +NORMALIZE_WHITESPACE
   Traceback (most recent call last):
       ...
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
     invalid start byte
   >>> b'\x80abc'.decode("utf-8", "replace")
   '\ufffdabc'
   >>> b'\x80abc'.decode("utf-8", "backslashreplace")
   '\\x80abc'
   >>> b'\x80abc'.decode("utf-8", "ignore")
   'abc'

Encodings are specified as strings containing the encoding's name.  Python
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list.  Some encodings have multiple names; for
example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
the same encoding.

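The alias handling can be seen with the :mod:`codecs` module: :func:`codecs.lookup` resolves any of an encoding's names to the same underlying codec.  A small sketch:

```python
import codecs

# All three names resolve to the same codec.
for name in ('latin-1', 'iso_8859_1', '8859'):
    print(name, '->', codecs.lookup(name).name)

# Decoding gives identical results under any alias.
print(b'caf\xe9'.decode('latin-1') == b'caf\xe9'.decode('8859'))  # True
```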
One-character Unicode strings can also be created with the :func:`chr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point.  The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

   >>> chr(57344)
   '\ue000'
   >>> ord('\ue000')
   57344

Converting to Bytes
-------------------

The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
which returns a :class:`bytes` representation of the Unicode string, encoded in the
requested *encoding*.

The *errors* parameter is the same as the parameter of the
:meth:`~bytes.decode` method but supports a few more possible handlers.  As well as
``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case
inserts a question mark instead of the unencodable character), there is
also ``'xmlcharrefreplace'`` (inserts an XML character reference),
``'backslashreplace'`` (inserts a ``\uNNNN`` escape sequence) and
``'namereplace'`` (inserts a ``\N{...}`` escape sequence).

The following example shows the different results::

   >>> u = chr(40960) + 'abcd' + chr(1972)
   >>> u.encode('utf-8')
   b'\xea\x80\x80abcd\xde\xb4'
   >>> u.encode('ascii')  #doctest: +NORMALIZE_WHITESPACE
   Traceback (most recent call last):
       ...
   UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
     position 0: ordinal not in range(128)
   >>> u.encode('ascii', 'ignore')
   b'abcd'
   >>> u.encode('ascii', 'replace')
   b'?abcd?'
   >>> u.encode('ascii', 'xmlcharrefreplace')
   b'&#40960;abcd&#1972;'
   >>> u.encode('ascii', 'backslashreplace')
   b'\\ua000abcd\\u07b4'
   >>> u.encode('ascii', 'namereplace')
   b'\\N{YI SYLLABLE IT}abcd\\u07b4'

The low-level routines for registering and accessing the available
encodings are found in the :mod:`codecs` module.  Implementing new
encodings also requires understanding the :mod:`codecs` module.
However, the encoding and decoding functions returned by this module
are usually more low-level than is comfortable, and writing new encodings
is a specialized task, so the module won't be covered in this HOWTO.


Unicode Literals in Python Source Code
--------------------------------------

In Python source code, specific Unicode code points can be written using the
``\u`` escape sequence, which is followed by four hex digits giving the code
point.  The ``\U`` escape sequence is similar, but expects eight hex digits,
not four::

   >>> s = "a\xac\u1234\u20ac\U00008000"
   ... #     ^^^^ two-digit hex escape
   ... #         ^^^^^^ four-digit Unicode escape
   ... #               ^^^^^^^^^^ eight-digit Unicode escape
   >>> [ord(c) for c in s]
   [97, 172, 4660, 8364, 32768]

Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language.  You
can also assemble strings using the :func:`chr` built-in function, but this is
even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding.  You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost
any encoding if you declare the encoding being used.  This is done by including
a special comment as either the first or second line of the source file::

   #!/usr/bin/env python
   # -*- coding: latin-1 -*-

   u = 'abcdé'
   print(ord(u[-1]))

The syntax is inspired by Emacs's notation for specifying variables local to a
file.  Emacs supports many different variables, but Python only supports
'coding'.  The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention.  Python looks for
``coding: name`` or ``coding=name`` in the comment.

If you don't include such a comment, the default encoding used will be UTF-8 as
already mentioned.  See also :pep:`263` for more information.


Unicode Properties
------------------

The Unicode specification includes a database of information about
code points.  For each defined code point, the information includes
the character's name, its category, the numeric value if applicable
(for characters representing numeric concepts such as the Roman
numerals, fractions such as one-third and four-fifths, etc.).  There
are also display-related properties, such as how to use the code point
in bidirectional text.

The following program displays some information about several characters, and
prints the numeric value of one particular character::

   import unicodedata

   u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

   for i, c in enumerate(u):
       print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
       print(unicodedata.name(c))

   # Get numeric value of second character
   print(unicodedata.numeric(u[1]))

When run, this prints:

.. code-block:: none

   0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
   1 0bf2 No TAMIL NUMBER ONE THOUSAND
   2 0f84 Mn TIBETAN MARK HALANTA
   3 1770 Lo TAGBANWA LETTER SA
   4 33af So SQUARE RAD OVER S SQUARED
   1000.0

The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories.  To take the codes
from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other".  See
`the General Category Values section of the Unicode Character Database documentation <http://www.unicode.org/reports/tr44/#General_Category_Values>`_ for a
list of category codes.


Comparing Strings
-----------------

Unicode adds some complication to comparing strings, because the same
set of characters can be represented by different sequences of code
points.  For example, a letter like 'ê' can be represented as a single
code point U+00EA, or as U+0065 U+0302, which is the code point for
'e' followed by a code point for 'COMBINING CIRCUMFLEX ACCENT'.  These
will produce the same output when printed, but one is a string of
length 1 and the other is of length 2.

One tool for a case-insensitive comparison is the
:meth:`~str.casefold` string method that converts a string to a
case-insensitive form following an algorithm described by the Unicode
Standard.  This algorithm has special handling for characters such as
the German letter 'ß' (code point U+00DF), which becomes the pair of
lowercase letters 'ss'.

::

   >>> street = 'Gürzenichstraße'
   >>> street.casefold()
   'gürzenichstrasse'

A second tool is the :mod:`unicodedata` module's
:func:`~unicodedata.normalize` function that converts strings to one
of several normal forms, where letters followed by a combining
character are replaced with single characters.  :func:`normalize` can
be used to perform string comparisons that won't falsely report
inequality if two strings use combining characters differently:

::

   import unicodedata

   def compare_strs(s1, s2):
       def NFD(s):
           return unicodedata.normalize('NFD', s)

       return NFD(s1) == NFD(s2)

   single_char = 'ê'
   multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
   print('length of first string=', len(single_char))
   print('length of second string=', len(multiple_chars))
   print(compare_strs(single_char, multiple_chars))

When run, this outputs:

.. code-block:: shell-session

   $ python3 compare-strs.py
   length of first string= 1
   length of second string= 2
   True

The first argument to the :func:`~unicodedata.normalize` function is a
string giving the desired normalization form, which can be one of
'NFC', 'NFKC', 'NFD', and 'NFKD'.

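As a quick sketch of how the composed and decomposed forms differ: 'NFC' composes a letter plus combining accent into a single code point where one exists, while 'NFD' decomposes it:

```python
import unicodedata

s = 'ê'                                  # single code point, U+00EA
nfd = unicodedata.normalize('NFD', s)    # 'e' + combining circumflex
nfc = unicodedata.normalize('NFC', nfd)  # recomposed back to U+00EA

print(len(s), len(nfd), len(nfc))  # 1 2 1
```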
The Unicode Standard also specifies how to do caseless comparisons::

   import unicodedata

   def compare_caseless(s1, s2):
       def NFD(s):
           return unicodedata.normalize('NFD', s)

       return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())

   # Example usage
   single_char = 'ê'
   multiple_chars = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'

   print(compare_caseless(single_char, multiple_chars))

This will print ``True``.  (Why is :func:`NFD` invoked twice?  Because
there are a few characters that make :meth:`casefold` return a
non-normalized string, so the result needs to be normalized again.  See
section 3.13 of the Unicode Standard for a discussion and an example.)


Unicode Regular Expressions
---------------------------

The regular expressions supported by the :mod:`re` module can be provided
either as bytes or strings.  Some of the special character sequences such as
``\d`` and ``\w`` have different meanings depending on whether
the pattern is supplied as bytes or a string.  For example,
``\d`` will match the characters ``[0-9]`` in bytes but
in strings will match any character that's in the ``'Nd'`` category.

The string in this example has the number 57 written in both Thai and
Arabic numerals::

   import re
   p = re.compile(r'\d+')

   s = "Over \u0e55\u0e57 57 flavours"
   m = p.search(s)
   print(repr(m.group()))

When executed, ``\d+`` will match the Thai numerals and print them
out.  If you supply the :const:`re.ASCII` flag to
:func:`~re.compile`, ``\d+`` will match the substring "57" instead.

Similarly, ``\w`` matches a wide variety of Unicode characters but
only ``[a-zA-Z0-9_]`` in bytes or if :const:`re.ASCII` is supplied,
and ``\s`` will match either Unicode whitespace characters or
``[ \t\n\r\f\v]``.
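
The effect on ``\w`` can be seen in a small sketch:

```python
import re

# By default, \w matches Unicode word characters, including 'é'...
print(re.fullmatch(r'\w+', 'café') is not None)            # True

# ...but with re.ASCII it only matches [a-zA-Z0-9_].
print(re.fullmatch(r'\w+', 'café', re.ASCII) is not None)  # False
```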


References
----------

.. comment should these be mentioned earlier, e.g. at the start of the "introduction to Unicode" first section?

Some good alternative discussions of Python's Unicode support are:

* `Processing Text Files in Python 3 <http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html>`_, by Nick Coghlan.
* `Pragmatic Unicode <https://nedbatchelder.com/text/unipain.html>`_, a PyCon 2012 presentation by Ned Batchelder.

The :class:`str` type is described in the Python library reference at
:ref:`textseq`.

The documentation for the :mod:`unicodedata` module.

The documentation for the :mod:`codecs` module.

Marc-André Lemburg gave `a presentation titled "Python and Unicode" (PDF slides)
<https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>`_ at
EuroPython 2002.  The slides are an excellent overview of the design of Python
2's Unicode features (where the Unicode string type is called ``unicode`` and
literals start with ``u``).

Reading and Writing Unicode Data
================================

Once you've written some code that works with Unicode data, the next problem is
input/output.  How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively.  XML parsers often return Unicode
data, for example.  Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket.  It's possible to do all the work
yourself: open a file, read an 8-bit bytes object from it, and convert the bytes
with ``bytes.decode(encoding)``.  However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes.  If you want to read the file in arbitrary-sized
chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk.  One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2 GiB file, you need 2 GiB of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)
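
The chunk-boundary problem can be demonstrated with the :mod:`codecs` module's incremental decoders, which buffer a trailing partial sequence until the next chunk arrives.  A sketch with an artificial split of the three-byte UTF-8 encoding of '€':

```python
import codecs

decoder = codecs.getincrementaldecoder('utf-8')()

# b'\xe2\x82\xac' is '€' in UTF-8; feed it in two "chunks".
print(repr(decoder.decode(b'\xe2\x82')))  # '' -- incomplete, buffered
print(repr(decoder.decode(b'\xac')))      # '€' -- sequence completed
```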
558
559The solution would be to use the low-level decoding interface to catch the case
560of partial coding sequences. The work of implementing this has already been
Georg Brandl0c074222008-11-22 10:26:59 +0000561done for you: the built-in :func:`open` function can return a file-like object
562that assumes the file's contents are in a specified encoding and accepts Unicode
Serhiy Storchakabfdcd432013-10-13 23:09:14 +0300563parameters for methods such as :meth:`~io.TextIOBase.read` and
Georg Brandl325a1c22013-10-27 09:16:01 +0100564:meth:`~io.TextIOBase.write`. This works through :func:`open`\'s *encoding* and
Serhiy Storchakabfdcd432013-10-13 23:09:14 +0300565*errors* parameters which are interpreted just like those in :meth:`str.encode`
566and :meth:`bytes.decode`.
Georg Brandl116aa622007-08-15 14:28:22 +0000567
568Reading Unicode from a file is therefore simple::
569
Georg Brandle47e1842013-10-06 13:07:10 +0200570 with open('unicode.txt', encoding='utf-8') as f:
Alexander Belopolsky93a6b132010-11-19 16:09:58 +0000571 for line in f:
572 print(repr(line))
Georg Brandl116aa622007-08-15 14:28:22 +0000573
574It's also possible to open files in update mode, allowing both reading and
575writing::
576
Alexander Belopolsky93a6b132010-11-19 16:09:58 +0000577 with open('test', encoding='utf-8', mode='w+') as f:
578 f.write('\u4500 blah blah blah\n')
579 f.seek(0)
580 print(repr(f.readline()[:1]))
Georg Brandl116aa622007-08-15 14:28:22 +0000581
The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read. There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.

In some areas, it is also convention to use a "BOM" at the start of UTF-8
encoded files; the name is misleading since UTF-8 is not byte-order dependent.
The mark simply announces that the file is encoded in UTF-8. For reading such
files, use the 'utf-8-sig' codec to automatically skip the mark if present.
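
The effect of the mark can be observed directly by encoding and decoding in
memory, without a file; a short sketch using the codecs mentioned above::

    data = 'hello'.encode('utf-8-sig')
    print(data)  # b'\xef\xbb\xbfhello' -- the mark is prepended on encoding

    # 'utf-8-sig' silently strips the mark when decoding, while plain
    # 'utf-8' keeps it as the character U+FEFF.
    print(data.decode('utf-8-sig'))     # hello
    print(ascii(data.decode('utf-8')))  # '\ufeffhello'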


Unicode filenames
-----------------

Most of the operating systems in common use today support filenames
that contain arbitrary Unicode characters. Usually this is
implemented by converting the Unicode string into some encoding that
varies depending on the system. Today Python is converging on using
UTF-8: Python on MacOS has used UTF-8 for several versions, and Python
3.6 switched to using UTF-8 on Windows as well. On Unix systems,
there will only be a filesystem encoding if you've set the ``LANG`` or
``LC_CTYPE`` environment variables; if you haven't, the default
encoding is again UTF-8.

The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother. When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::

    filename = 'filename\u4500abc'
    with open(filename, 'w') as f:
        f.write('blah\n')

Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.

The :func:`os.listdir` function returns filenames, which raises an issue: should it return
the Unicode version of filenames, or should it return bytes containing
the encoded versions? :func:`os.listdir` can do both, depending on whether you
provided the directory path as bytes or a Unicode string. If you pass a
Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing a byte
path will return the filenames as bytes. For example,
assuming the default filesystem encoding is UTF-8, running the following
program::

    fn = 'filename\u4500abc'
    f = open(fn, 'w')
    f.close()

    import os
    print(os.listdir(b'.'))
    print(os.listdir('.'))

will produce the following output:

.. code-block:: shell-session

    $ python listdir-test.py
    [b'filename\xe4\x94\x80abc', ...]
    ['filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.

Note that on most occasions, you can just stick with using
Unicode with these APIs. The bytes APIs should only be used on
systems where undecodable file names can be present; that's
pretty much only Unix systems now.
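
If you do need to convert between the two forms yourself, the
:func:`os.fsencode` and :func:`os.fsdecode` functions apply the filesystem
encoding and error handler for you, so even unusual names round-trip
losslessly; a small sketch::

    import os

    name = 'filename\u4500abc'
    encoded = os.fsencode(name)   # bytes form, e.g. UTF-8 on most systems
    # fsdecode() reverses the conversion exactly.
    assert os.fsdecode(encoded) == name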


Tips for Writing Unicode-aware Programs
---------------------------------------

This section provides some suggestions on writing software that deals with
Unicode.

The most important tip is:

    Software should only work with Unicode strings internally, decoding the input
    data as soon as possible and encoding the output only at the end.

If you attempt to write processing functions that accept both Unicode and byte
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings. There is no automatic encoding or decoding: if
you do e.g. ``str + bytes``, a :exc:`TypeError` will be raised.
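
A short sketch of this decode-early pattern, with a hypothetical chunk of
bytes standing in for data read from a socket::

    raw = b'caf\xc3\xa9'        # bytes arriving from the outside world

    # Mixing the types is an error rather than a silent guess:
    try:
        'menu: ' + raw
    except TypeError:
        pass  # no implicit conversion between str and bytes

    text = raw.decode('utf-8')  # decode at the boundary...
    combined = 'menu: ' + text  # ...then work purely with str internally
    print(combined)  # menu: café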

When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database. If you're doing
this, be careful to check the decoded string, not the encoded bytes data;
some encodings may have interesting properties, such as not being bijective
or not being fully ASCII-compatible. This is especially true if the input
data also specifies the encoding, since the attacker can then choose a
clever way to hide malicious text in the encoded bytestream.
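
As an illustrative sketch (the allowed character set here is an assumption,
not a recommendation), such a check should run on the decoded text::

    def is_safe_name(raw, encoding):
        try:
            text = raw.decode(encoding)  # decode first...
        except UnicodeDecodeError:
            return False
        # ...then validate the characters the program will actually use.
        return all(c.isalnum() or c in '-_.' for c in text)

    print(is_safe_name(b'report-2024.txt', 'utf-8'))  # True
    print(is_safe_name(b'evil; rm -rf /', 'utf-8'))   # False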


Converting Between File Encodings
'''''''''''''''''''''''''''''''''

The :class:`~codecs.StreamRecoder` class can transparently convert between
encodings, taking a stream that returns data in encoding #1
and behaving like a stream returning data in encoding #2.

For example, if you have an input file *f* that's in Latin-1, you
can wrap it with a :class:`~codecs.StreamRecoder` to return bytes encoded in
UTF-8::

    new_f = codecs.StreamRecoder(f,
        # en/decoder: used by read() to encode its results and
        # by write() to decode its input.
        codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

        # reader/writer: used to read and write to the stream.
        codecs.getreader('latin-1'), codecs.getwriter('latin-1') )


Files in an Unknown Encoding
''''''''''''''''''''''''''''

What can you do if you need to make a change to a file, but don't know
the file's encoding? If you know the encoding is ASCII-compatible and
only want to examine or modify the ASCII parts, you can open the file
with the ``surrogateescape`` error handler::

    with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
        data = f.read()

    # make changes to the string 'data'

    with open(fname + '.new', 'w',
              encoding="ascii", errors="surrogateescape") as f:
        f.write(data)

The ``surrogateescape`` error handler will decode any non-ASCII bytes
as code points in a special range running from U+DC80 to
U+DCFF. These code points will then turn back into the
same bytes when the ``surrogateescape`` error handler is used to
encode the data and write it back out.
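
The same round-trip can be demonstrated in memory, without a file::

    raw = b'abc\x80def'  # \x80 is not valid ASCII

    text = raw.decode('ascii', errors='surrogateescape')
    print(ascii(text))  # 'abc\udc80def' -- the bad byte became U+DC80

    # Encoding with the same handler restores the original bytes exactly.
    assert text.encode('ascii', errors='surrogateescape') == raw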


References
----------

One section of `Mastering Python 3 Input/Output
<http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o>`_,
a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.

The `PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
Applications in Python"
<https://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>`_
discuss questions of character encodings as well as how to internationalize
and localize an application. These slides cover Python 2.x only.

`The Guts of Unicode in Python
<http://pyvideo.org/video/1768/the-guts-of-unicode-in-python>`_
is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode
representation in Python 3.3.


Acknowledgements
================

The initial draft of this document was written by Andrew Kuchling.
It has since been revised further by Alexander Belopolsky, Georg Brandl,
Andrew Kuchling, and Ezio Melotti.

Thanks to the following people who have noted errors or offered
suggestions on this article: Éric Araujo, Nicholas Bastin, Nick
Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André
Lemburg, Martin von Löwis, Terry J. Reedy, Serhiy Storchaka,
Eryk Sun, Chad Whitacre, Graham Wideman.