.. _unicode-howto:

*****************
  Unicode HOWTO
*****************

:Release: 1.12

This HOWTO discusses Python's support for the Unicode specification
for representing textual data, and explains various problems that
people commonly encounter when trying to work with Unicode.


Introduction to Unicode
=======================

Definitions
-----------

Today's programs need to be able to handle a wide variety of
characters. Applications are often internationalized to display
messages and output in a variety of user-selectable languages; the
same program might need to output an error message in English, French,
Japanese, Hebrew, or Russian. Web content can be written in any of
these languages and can also include a variety of emoji symbols.
Python's string type uses the Unicode Standard for representing
characters, which lets Python programs work with all these different
possible characters.

Unicode (https://www.unicode.org/) is a specification that aims to
list every character used by human languages and give each character
its own unique code. The Unicode specifications are continually
revised and updated to add new languages and symbols.

A **character** is the smallest possible component of a text. 'A', 'B', 'C',
etc., are all different characters. So are 'È' and 'Í'. Characters vary
depending on the language or context you're talking
about. For example, there's a character for "Roman Numeral One", 'Ⅰ', that's
separate from the uppercase letter 'I'. They'll usually look the same,
but these are two different characters that have different meanings.

The Unicode standard describes how characters are represented by
**code points**. A code point value is an integer in the range 0 to
0x10FFFF (about 1.1 million values, with some 110 thousand assigned so
far). In the standard and in this document, a code point is written
using the notation ``U+265E`` to mean the character with value
``0x265e`` (9,822 in decimal).

The Unicode standard contains a lot of tables listing characters and
their corresponding code points:

.. code-block:: none

   0061    'a'; LATIN SMALL LETTER A
   0062    'b'; LATIN SMALL LETTER B
   0063    'c'; LATIN SMALL LETTER C
   ...
   007B    '{'; LEFT CURLY BRACKET
   ...
   2167    'Ⅷ'; ROMAN NUMERAL EIGHT
   2168    'Ⅸ'; ROMAN NUMERAL NINE
   ...
   265E    '♞'; BLACK CHESS KNIGHT
   265F    '♟'; BLACK CHESS PAWN
   ...
   1F600   '😀'; GRINNING FACE
   1F609   '😉'; WINKING FACE
   ...

Strictly, these definitions imply that it's meaningless to say 'this is
character ``U+265E``'. ``U+265E`` is a code point, which represents some
particular character; in this case, it represents the character
'BLACK CHESS KNIGHT', '♞'. In informal contexts, this distinction between
code points and characters will sometimes be forgotten.

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**. The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used. Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.


Encodings
---------

To summarize the previous section: a Unicode string is a sequence of
code points, which are numbers from 0 through ``0x10FFFF`` (1,114,111
decimal). This sequence of code points needs to be represented in
memory as a set of **code units**, and **code units** are then mapped
to 8-bit bytes. The rules for translating a Unicode string into a
sequence of bytes are called a **character encoding**, or just
an **encoding**.

The first encoding you might think of is using 32-bit integers as the
code unit, and then using the CPU's representation of 32-bit integers.
In this representation, the string "Python" might look like this:

.. code-block:: none

      P           y           t           h           o           n
   0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space. In most texts, the majority of the code points
   are less than 127, or less than 255, so a lot of space is occupied by ``0x00``
   bytes. The above string takes 24 bytes compared to the 6 bytes needed for an
   ASCII representation. Increased RAM usage doesn't matter too much (desktop
   computers have gigabytes of RAM, and strings aren't usually that large), but
   expanding our usage of disk and network bandwidth by a factor of 4 is
   intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new
   family of wide string functions would need to be used.

Therefore this encoding isn't used very much, and people instead choose other
encodings that are more efficient and convenient, such as UTF-8.

UTF-8 is one of the most commonly used encodings, and Python often
defaults to using it. UTF stands for "Unicode Transformation Format",
and the '8' means that 8-bit values are used in the encoding. (There
are also UTF-16 and UTF-32 encodings, but they are less frequently
used than UTF-8.) UTF-8 uses the following rules:

1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or
   four bytes, where each byte of the sequence is between 128 and 255.
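
To see these rules in action, you can encode a few characters yourself. This
is only an illustrative interpreter session, but the byte sequences shown are
the standard UTF-8 encodings of these code points::

   >>> 'a'.encode('utf-8')         # code point below 128: one byte
   b'a'
   >>> 'é'.encode('utf-8')         # U+00E9: two bytes
   b'\xc3\xa9'
   >>> '♞'.encode('utf-8')         # U+265E: three bytes
   b'\xe2\x99\x9e'
   >>> '😀'.encode('utf-8')        # U+1F600: four bytes
   b'\xf0\x9f\x98\x80'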

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes that contains embedded
   zero bytes only where they represent the null character (U+0000). This means
   that UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent
   through protocols that can't handle zero bytes for anything other than
   end-of-string markers.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be
   represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize. It's also unlikely that
   random 8-bit data will look like valid UTF-8.
6. UTF-8 is a byte oriented encoding. The encoding specifies that each
   character is represented by a specific sequence of one or more bytes. This
   avoids the byte-ordering issues that can occur with integer and word oriented
   encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending
   on the hardware on which the string was encoded.

References
----------

The `Unicode Consortium site <http://www.unicode.org>`_ has character charts, a
glossary, and PDF versions of the Unicode specification. Be prepared for some
difficult reading. `A chronology <http://www.unicode.org/history/>`_ of the
origin and development of Unicode is also available on the site.

On the Computerphile Youtube channel, Tom Scott briefly
`discusses the history of Unicode and UTF-8 <https://www.youtube.com/watch?v=MijmeoH9LT4>`_
(9 minutes 36 seconds).

To help understand the standard, Jukka Korpela has written `an introductory
guide <http://jkorpela.fi/unicode/guide.html>`_ to reading the
Unicode character tables.

Another `good introductory article <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/>`_
was written by Joel Spolsky.
If this introduction didn't make things clear to you, you should try
reading this alternate article before continuing.

Wikipedia entries are often helpful; see the entries for "`character encoding
<https://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
<https://en.wikipedia.org/wiki/UTF-8>`_, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.

The String Type
---------------

Since Python 3.0, the language's :class:`str` type contains Unicode
characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
rocks!'``, or the triple-quoted string syntax is stored as Unicode.

The default encoding for Python source code is UTF-8, so you can simply
include a Unicode character in a string literal::

   try:
       with open('/tmp/input.txt', 'r') as f:
           ...
   except OSError:
       # 'File not found' error message.
       print("Fichier non trouvé")

Side note: Python 3 also supports using Unicode characters in identifiers::

   répertoire = "/tmp/records.log"
   with open(répertoire, "w") as f:
       f.write("test\n")

If you can't enter a particular character in your editor or want to
keep the source code ASCII-only for some reason, you can also use
escape sequences in string literals. (Depending on your system,
you may see the actual capital-delta glyph instead of a \u escape.) ::

   >>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
   '\u0394'
   >>> "\u0394"                          # Using a 16-bit hex value
   '\u0394'
   >>> "\U00000394"                      # Using a 32-bit hex value
   '\u0394'

In addition, one can create a string using the :func:`~bytes.decode` method of
:class:`bytes`. This method takes an *encoding* argument, such as ``UTF-8``,
and optionally an *errors* argument.

The *errors* argument specifies the response when the input string can't be
converted according to the encoding's rules. Legal values for this argument are
``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
``U+FFFD``, ``REPLACEMENT CHARACTER``), ``'ignore'`` (just leave the
character out of the Unicode result), or ``'backslashreplace'`` (inserts a
``\xNN`` escape sequence).
The following examples show the differences::

   >>> b'\x80abc'.decode("utf-8", "strict")  #doctest: +NORMALIZE_WHITESPACE
   Traceback (most recent call last):
       ...
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
     invalid start byte
   >>> b'\x80abc'.decode("utf-8", "replace")
   '\ufffdabc'
   >>> b'\x80abc'.decode("utf-8", "backslashreplace")
   '\\x80abc'
   >>> b'\x80abc'.decode("utf-8", "ignore")
   'abc'

Encodings are specified as strings containing the encoding's name. Python
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list. Some encodings have multiple names; for
example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
the same encoding.
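
For instance, decoding the same byte under two of those names gives the same
result (a small illustrative session; the byte ``0xE9`` is the Latin-1
encoding of 'é')::

   >>> b'\xe9'.decode('latin-1')
   'é'
   >>> b'\xe9'.decode('iso_8859_1')
   'é'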

One-character Unicode strings can also be created with the :func:`chr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point. The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

   >>> chr(57344)
   '\ue000'
   >>> ord('\ue000')
   57344

Converting to Bytes
-------------------

The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
which returns a :class:`bytes` representation of the Unicode string, encoded in the
requested *encoding*.

The *errors* parameter is the same as the parameter of the
:meth:`~bytes.decode` method but supports a few more possible handlers. As well as
``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case
inserts a question mark instead of the unencodable character), there is
also ``'xmlcharrefreplace'`` (inserts an XML character reference),
``backslashreplace`` (inserts a ``\uNNNN`` escape sequence) and
``namereplace`` (inserts a ``\N{...}`` escape sequence).

The following example shows the different results::

   >>> u = chr(40960) + 'abcd' + chr(1972)
   >>> u.encode('utf-8')
   b'\xea\x80\x80abcd\xde\xb4'
   >>> u.encode('ascii')  #doctest: +NORMALIZE_WHITESPACE
   Traceback (most recent call last):
       ...
   UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
     position 0: ordinal not in range(128)
   >>> u.encode('ascii', 'ignore')
   b'abcd'
   >>> u.encode('ascii', 'replace')
   b'?abcd?'
   >>> u.encode('ascii', 'xmlcharrefreplace')
   b'&#40960;abcd&#1972;'
   >>> u.encode('ascii', 'backslashreplace')
   b'\\ua000abcd\\u07b4'
   >>> u.encode('ascii', 'namereplace')
   b'\\N{YI SYLLABLE IT}abcd\\u07b4'

The low-level routines for registering and accessing the available
encodings are found in the :mod:`codecs` module. Implementing new
encodings also requires understanding the :mod:`codecs` module.
However, the encoding and decoding functions returned by this module
are usually more low-level than is comfortable, and writing new encodings
is a specialized task, so the module won't be covered in this HOWTO.


Unicode Literals in Python Source Code
--------------------------------------

In Python source code, specific Unicode code points can be written using the
``\u`` escape sequence, which is followed by four hex digits giving the code
point. The ``\U`` escape sequence is similar, but expects eight hex digits,
not four::

   >>> s = "a\xac\u1234\u20ac\U00008000"
   ... #     ^^^^ two-digit hex escape
   ... #         ^^^^^^ four-digit Unicode escape
   ... #                     ^^^^^^^^^^ eight-digit Unicode escape
   >>> [ord(c) for c in s]
   [97, 172, 4660, 8364, 32768]

Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language. You
can also assemble strings using the :func:`chr` built-in function, but this is
even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding. You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost
any encoding if you declare the encoding being used. This is done by including
a special comment as either the first or second line of the source file::

   #!/usr/bin/env python
   # -*- coding: latin-1 -*-

   u = 'abcdé'
   print(ord(u[-1]))

The syntax is inspired by Emacs's notation for specifying variables local to a
file. Emacs supports many different variables, but Python only supports
'coding'. The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention. Python looks for
``coding: name`` or ``coding=name`` in the comment.

If you don't include such a comment, the default encoding used will be UTF-8 as
already mentioned. See also :pep:`263` for more information.


Unicode Properties
------------------

The Unicode specification includes a database of information about
code points. For each defined code point, the information includes
the character's name, its category, the numeric value if applicable
(for characters representing numeric concepts such as the Roman
numerals, fractions such as one-third and four-fifths, etc.). There
are also display-related properties, such as how to use the code point
in bidirectional text.

The following program displays some information about several characters, and
prints the numeric value of one particular character::

   import unicodedata

   u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

   for i, c in enumerate(u):
       print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
       print(unicodedata.name(c))

   # Get numeric value of second character
   print(unicodedata.numeric(u[1]))

When run, this prints:

.. code-block:: none

   0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
   1 0bf2 No TAMIL NUMBER ONE THOUSAND
   2 0f84 Mn TIBETAN MARK HALANTA
   3 1770 Lo TAGBANWA LETTER SA
   4 33af So SQUARE RAD OVER S SQUARED
   1000.0

The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories. To take the codes
from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other". See
`the General Category Values section of the Unicode Character Database documentation <http://www.unicode.org/reports/tr44/#General_Category_Values>`_ for a
list of category codes.


Comparing Strings
-----------------

Unicode adds some complication to comparing strings, because the same
set of characters can be represented by different sequences of code
points. For example, a letter like 'ê' can be represented as a single
code point U+00EA, or as U+0065 U+0302, which is the code point for
'e' followed by a code point for 'COMBINING CIRCUMFLEX ACCENT'. These
will produce the same output when printed, but one is a string of
length 1 and the other is of length 2.

One tool for a case-insensitive comparison is the
:meth:`~str.casefold` string method that converts a string to a
case-insensitive form following an algorithm described by the Unicode
Standard. This algorithm has special handling for characters such as
the German letter 'ß' (code point U+00DF), which becomes the pair of
lowercase letters 'ss'.

::

   >>> street = 'Gürzenichstraße'
   >>> street.casefold()
   'gürzenichstrasse'

A second tool is the :mod:`unicodedata` module's
:func:`~unicodedata.normalize` function that converts strings to one
of several normal forms, where letters followed by a combining
character are replaced with single characters. :func:`normalize` can
be used to perform string comparisons that won't falsely report
inequality if two strings use combining characters differently:

::

   import unicodedata

   def compare_strs(s1, s2):
       def NFD(s):
           return unicodedata.normalize('NFD', s)

       return NFD(s1) == NFD(s2)

   single_char = 'ê'
   multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
   print('length of first string=', len(single_char))
   print('length of second string=', len(multiple_chars))
   print(compare_strs(single_char, multiple_chars))

When run, this outputs:

.. code-block:: shell-session

   $ python3 compare-strs.py
   length of first string= 1
   length of second string= 2
   True

The first argument to the :func:`~unicodedata.normalize` function is a
string giving the desired normalization form, which can be one of
'NFC', 'NFKC', 'NFD', and 'NFKD'.
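
As a brief sketch of what the composed and decomposed forms do (reusing the
'ê' example from above), 'NFC' combines the two-code-point sequence into a
single character, while 'NFD' splits that character apart again::

   >>> import unicodedata
   >>> decomposed = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
   >>> composed = unicodedata.normalize('NFC', decomposed)
   >>> len(decomposed), len(composed)
   (2, 1)
   >>> composed == '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}'
   True
   >>> len(unicodedata.normalize('NFD', composed))
   2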

The Unicode Standard also specifies how to do caseless comparisons::

   import unicodedata

   def compare_caseless(s1, s2):
       def NFD(s):
           return unicodedata.normalize('NFD', s)

       return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())

   # Example usage
   single_char = 'ê'
   multiple_chars = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'

   print(compare_caseless(single_char, multiple_chars))

This will print ``True``. (Why is :func:`NFD` invoked twice? Because
there are a few characters that make :meth:`casefold` return a
non-normalized string, so the result needs to be normalized again. See
section 3.13 of the Unicode Standard for a discussion and an example.)


Unicode Regular Expressions
---------------------------

The regular expressions supported by the :mod:`re` module can be provided
either as bytes or strings. Some of the special character sequences such as
``\d`` and ``\w`` have different meanings depending on whether
the pattern is supplied as bytes or a string. For example,
``\d`` will match the characters ``[0-9]`` in bytes but
in strings will match any character that's in the ``'Nd'`` category.

The string in this example has the number 57 written in both Thai and
Arabic numerals::

   import re
   p = re.compile(r'\d+')

   s = "Over \u0e55\u0e57 57 flavours"
   m = p.search(s)
   print(repr(m.group()))

When executed, ``\d+`` will match the Thai numerals and print them
out. If you supply the :const:`re.ASCII` flag to
:func:`~re.compile`, ``\d+`` will match the substring "57" instead.

Similarly, ``\w`` matches a wide variety of Unicode characters but
only ``[a-zA-Z0-9_]`` in bytes or if :const:`re.ASCII` is supplied,
and ``\s`` will match either Unicode whitespace characters or
``[ \t\n\r\f\v]``.
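
For instance, the flag changes which substring the pattern above finds (an
illustrative interpreter session)::

   >>> import re
   >>> s = "Over \u0e55\u0e57 57 flavours"
   >>> re.search(r'\d+', s).group()            # Thai digits match first
   '๕๗'
   >>> re.search(r'\d+', s, re.ASCII).group()  # only [0-9] with re.ASCII
   '57'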


References
----------

.. comment should these be mentioned earlier, e.g. at the start of the "introduction to Unicode" first section?

Some good alternative discussions of Python's Unicode support are:

* `Processing Text Files in Python 3 <http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html>`_, by Nick Coghlan.
* `Pragmatic Unicode <https://nedbatchelder.com/text/unipain.html>`_, a PyCon 2012 presentation by Ned Batchelder.

The :class:`str` type is described in the Python library reference at
:ref:`textseq`.

The documentation for the :mod:`unicodedata` module.

The documentation for the :mod:`codecs` module.

Marc-André Lemburg gave `a presentation titled "Python and Unicode" (PDF slides)
<https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>`_ at
EuroPython 2002. The slides are an excellent overview of the design of Python
2's Unicode features (where the Unicode string type is called ``unicode`` and
literals start with ``u``).


Reading and Writing Unicode Data
================================

Once you've written some code that works with Unicode data, the next problem is
input/output. How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively. XML parsers often return Unicode
data, for example. Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket. It's possible to do all the work
yourself: open a file, read an 8-bit bytes object from it, and convert the bytes
with ``bytes.decode(encoding)``. However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes. If you want to read the file in arbitrary-sized
chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk. One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2 GiB file, you need 2 GiB of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)
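
As a small sketch of the failure mode, slicing a UTF-8 byte string at an
arbitrary point can cut a multi-byte character in half, and strict decoding
of the fragment then fails::

   >>> data = 'é'.encode('utf-8')    # two bytes for one character
   >>> data
   b'\xc3\xa9'
   >>> data[:1].decode('utf-8')      # a chunk boundary in the middle
   Traceback (most recent call last):
       ...
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data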

The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences. The work of implementing this has already been
done for you: the built-in :func:`open` function can return a file-like object
that assumes the file's contents are in a specified encoding and accepts Unicode
parameters for methods such as :meth:`~io.TextIOBase.read` and
:meth:`~io.TextIOBase.write`. This works through :func:`open`\'s *encoding* and
*errors* parameters which are interpreted just like those in :meth:`str.encode`
and :meth:`bytes.decode`.

Reading Unicode from a file is therefore simple::

   with open('unicode.txt', encoding='utf-8') as f:
       for line in f:
           print(repr(line))

It's also possible to open files in update mode, allowing both reading and
writing::

   with open('test', encoding='utf-8', mode='w+') as f:
       f.write('\u4500 blah blah blah\n')
       f.seek(0)
       print(repr(f.readline()[:1]))

The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read. There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.

In some areas, it is also convention to use a "BOM" at the start of UTF-8
encoded files; the name is misleading since UTF-8 is not byte-order dependent.
The mark simply announces that the file is encoded in UTF-8. For reading such
files, use the 'utf-8-sig' codec to automatically skip the mark if present.
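
A short sketch of the difference (the bytes ``EF BB BF`` are the UTF-8
encoding of ``U+FEFF``)::

   >>> b'\xef\xbb\xbfHello'.decode('utf-8')
   '\ufeffHello'
   >>> b'\xef\xbb\xbfHello'.decode('utf-8-sig')
   'Hello'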


Unicode filenames
-----------------

Most of the operating systems in common use today support filenames
that contain arbitrary Unicode characters. Usually this is
implemented by converting the Unicode string into some encoding that
varies depending on the system. Today Python is converging on using
UTF-8: Python on MacOS has used UTF-8 for several versions, and Python
3.6 switched to using UTF-8 on Windows as well. On Unix systems,
there will only be a filesystem encoding if you've set the ``LANG`` or
``LC_CTYPE`` environment variables; if you haven't, the default
encoding is again UTF-8.

The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother. When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::

   filename = 'filename\u4500abc'
   with open(filename, 'w') as f:
       f.write('blah\n')

Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.

The :func:`os.listdir` function returns filenames, which raises an issue: should it return
the Unicode version of filenames, or should it return bytes containing
the encoded versions? :func:`os.listdir` can do both, depending on whether you
provided the directory path as bytes or a Unicode string. If you pass a
Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing a byte
path will return the filenames as bytes. For example,
assuming the default filesystem encoding is UTF-8, running the following
program::

   fn = 'filename\u4500abc'
   f = open(fn, 'w')
   f.close()

   import os
   print(os.listdir(b'.'))
   print(os.listdir('.'))

will produce the following output:

.. code-block:: shell-session

   $ python listdir-test.py
   [b'filename\xe4\x94\x80abc', ...]
   ['filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.

Note that on most occasions, you can just stick with using
Unicode with these APIs. The bytes APIs should only be used on
systems where undecodable file names can be present; that's
pretty much only Unix systems now.


Tips for Writing Unicode-aware Programs
---------------------------------------

This section provides some suggestions on writing software that deals with
Unicode.

The most important tip is:

   Software should only work with Unicode strings internally, decoding the input
   data as soon as possible and encoding the output only at the end.

If you attempt to write processing functions that accept both Unicode and byte
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings. There is no automatic encoding or decoding: if
you do e.g. ``str + bytes``, a :exc:`TypeError` will be raised.
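
Here is a minimal sketch of that structure, assuming UTF-8 input and output;
the function name and its trivial processing are purely illustrative::

   def handle_request(raw_bytes):
       text = raw_bytes.decode('utf-8')   # decode as soon as the bytes arrive
       reply = text.strip().upper()       # do all internal work on str objects
       return reply.encode('utf-8')       # encode only at the output boundary

   print(handle_request('héllo\n'.encode('utf-8')))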

When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database. If you're doing
this, be careful to check the decoded string, not the encoded bytes data;
some encodings may have interesting properties, such as not being bijective
or not being fully ASCII-compatible. This is especially true if the input
data also specifies the encoding, since the attacker can then choose a
clever way to hide malicious text in the encoded bytestream.


Converting Between File Encodings
'''''''''''''''''''''''''''''''''

The :class:`~codecs.StreamRecoder` class can transparently convert between
encodings, taking a stream that returns data in encoding #1
and behaving like a stream returning data in encoding #2.

For example, if you have an input file *f* that's in Latin-1, you
can wrap it with a :class:`~codecs.StreamRecoder` to return bytes encoded in
UTF-8::

   new_f = codecs.StreamRecoder(f,
       # en/decoder: used by read() to encode its results and
       # by write() to decode its input.
       codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

       # reader/writer: used to read and write to the stream.
       codecs.getreader('latin-1'), codecs.getwriter('latin-1') )


Files in an Unknown Encoding
''''''''''''''''''''''''''''

What can you do if you need to make a change to a file, but don't know
the file's encoding? If you know the encoding is ASCII-compatible and
only want to examine or modify the ASCII parts, you can open the file
with the ``surrogateescape`` error handler::

   with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
       data = f.read()

   # make changes to the string 'data'

   with open(fname + '.new', 'w',
             encoding="ascii", errors="surrogateescape") as f:
       f.write(data)

The ``surrogateescape`` error handler will decode any non-ASCII bytes
as code points in a special range running from U+DC80 to
U+DCFF. These code points will then turn back into the
same bytes when the ``surrogateescape`` error handler is used to
encode the data and write it back out.
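
A tiny sketch of that round trip (the byte ``0xE9`` is just an arbitrary
non-ASCII byte)::

   >>> raw = b'caf\xe9 menu'                      # not valid ASCII
   >>> text = raw.decode('ascii', 'surrogateescape')
   >>> text
   'caf\udce9 menu'
   >>> text.encode('ascii', 'surrogateescape') == raw
   True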


References
----------

One section of `Mastering Python 3 Input/Output
<http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o>`_,
a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.

The `PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
Applications in Python"
<https://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>`_
discuss questions of character encodings as well as how to internationalize
and localize an application. These slides cover Python 2.x only.

`The Guts of Unicode in Python
<http://pyvideo.org/video/1768/the-guts-of-unicode-in-python>`_
is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode
representation in Python 3.3.


Acknowledgements
================

The initial draft of this document was written by Andrew Kuchling.
It has since been revised further by Alexander Belopolsky, Georg Brandl,
Andrew Kuchling, and Ezio Melotti.

Thanks to the following people who have noted errors or offered
suggestions on this article: Éric Araujo, Nicholas Bastin, Nick
Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André
Lemburg, Martin von Löwis, Terry J. Reedy, Serhiy Storchaka,
Eryk Sun, Chad Whitacre, Graham Wideman.
763Eryk Sun, Chad Whitacre, Graham Wideman.