.. _unicode-howto:

*****************
  Unicode HOWTO
*****************

:Release: 1.12

This HOWTO discusses Python's support for the Unicode specification
for representing textual data, and explains various problems that
people commonly encounter when trying to work with Unicode.


Introduction to Unicode
=======================

Definitions
-----------

Today's programs need to be able to handle a wide variety of
characters. Applications are often internationalized to display
messages and output in a variety of user-selectable languages; the
same program might need to output an error message in English, French,
Japanese, Hebrew, or Russian. Web content can be written in any of
these languages and can also include a variety of emoji symbols.
Python's string type uses the Unicode Standard for representing
characters, which lets Python programs work with all these different
possible characters.

Unicode (https://www.unicode.org/) is a specification that aims to
list every character used by human languages and give each character
its own unique code. The Unicode specifications are continually
revised and updated to add new languages and symbols.

A character is the smallest possible component of a text. 'A', 'B', 'C',
etc., are all different characters. So are 'È' and 'Í'. Characters vary
depending on the language or context you're talking
about. For example, there's a character for "Roman Numeral One", 'Ⅰ', that's
separate from the uppercase letter 'I'. They'll usually look the same,
but these are two different characters that have different meanings.

The Unicode standard describes how characters are represented by
code points. A code point value is an integer in the range 0 to
0x10FFFF (about 1.1 million values; the
`actual number assigned <https://www.unicode.org/versions/latest/#Summary>`_
is less than that). In the standard and in this document, a code point is written
using the notation ``U+265E`` to mean the character with value
``0x265e`` (9,822 in decimal).

The Unicode standard contains a lot of tables listing characters and
their corresponding code points:

.. code-block:: none

   0061    'a'; LATIN SMALL LETTER A
   0062    'b'; LATIN SMALL LETTER B
   0063    'c'; LATIN SMALL LETTER C
   ...
   007B    '{'; LEFT CURLY BRACKET
   ...
   2167    'Ⅷ'; ROMAN NUMERAL EIGHT
   2168    'Ⅸ'; ROMAN NUMERAL NINE
   ...
   265E    '♞'; BLACK CHESS KNIGHT
   265F    '♟'; BLACK CHESS PAWN
   ...
   1F600   '😀'; GRINNING FACE
   1F609   '😉'; WINKING FACE
   ...

Strictly, these definitions imply that it's meaningless to say 'this is
character ``U+265E``'. ``U+265E`` is a code point, which represents some particular
character; in this case, it represents the character 'BLACK CHESS KNIGHT',
'♞'. In informal contexts, this distinction between code points and
characters will sometimes be forgotten.
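
The relationship between the ``U+265E`` notation, the integer value, and the
character name can be checked from Python. The following is a minimal sketch
using only the built-in :func:`chr` and :func:`ord` functions and the
:mod:`unicodedata` module; the variable names are illustrative.

```python
import unicodedata

code_point = 0x265E                # the integer behind the notation U+265E
knight = chr(code_point)           # one-character string for that code point
notation = f"U+{ord(knight):04X}"  # back from the character to the notation

print(code_point)                  # 9822
print(notation)                    # U+265E
print(unicodedata.name(knight))    # BLACK CHESS KNIGHT
```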

A character is represented on a screen or on paper by a set of graphical
elements that's called a glyph. The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used. Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.


Encodings
---------

To summarize the previous section: a Unicode string is a sequence of
code points, which are numbers from 0 through ``0x10FFFF`` (1,114,111
decimal). This sequence of code points needs to be represented in
memory as a sequence of code units, and code units are then mapped
to 8-bit bytes. The rules for translating a Unicode string into a
sequence of bytes are called a character encoding, or just
an encoding.

The first encoding you might think of is using 32-bit integers as the
code unit, and then using the CPU's representation of 32-bit integers.
In this representation, the string "Python" might look like this:

.. code-block:: none

      P           y           t           h           o           n
   0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space. In most texts, the majority of the code points
   are less than 128, or less than 256, so a lot of space is occupied by ``0x00``
   bytes. The above string takes 24 bytes compared to the 6 bytes needed for an
   ASCII representation. Increased RAM usage doesn't matter too much (desktop
   computers have gigabytes of RAM, and strings aren't usually that large), but
   expanding our usage of disk and network bandwidth by a factor of 4 is
   intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new
   family of wide string functions would need to be used.

Therefore this encoding isn't used very much, and people instead choose other
encodings that are more efficient and convenient, such as UTF-8.
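
The portability and size problems can be observed directly. The sketch below
assumes that Python's ``utf-32-le`` and ``utf-32-be`` codecs are reasonable
stand-ins for "the CPU's representation" on little-endian and big-endian
machines respectively; the variable names are illustrative.

```python
text = "Python"

little = text.encode("utf-32-le")   # least significant byte first, no BOM
big = text.encode("utf-32-be")      # most significant byte first, no BOM

print(len(text.encode("ascii")))    # 6
print(len(little))                  # 24
print(little.hex(" ", 4))           # one group of four bytes per code point
print(big.hex(" ", 4))              # same code points, bytes in reverse order
print(little.count(0))              # 18 of the 24 bytes are 0x00
```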

UTF-8 is one of the most commonly used encodings, and Python often
defaults to using it. UTF stands for "Unicode Transformation Format",
and the '8' means that 8-bit values are used in the encoding. (There
are also UTF-16 and UTF-32 encodings, but they are less frequently
used than UTF-8.) UTF-8 uses the following rules:

1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or
   four bytes, where each byte of the sequence is between 128 and 255.

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes that contains embedded
   zero bytes only where they represent the null character (U+0000). This means
   that UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent
   through protocols that can't handle zero bytes for anything other than
   end-of-string markers.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be
   represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize. It's also unlikely that
   random 8-bit data will look like valid UTF-8.
6. UTF-8 is a byte-oriented encoding. The encoding specifies that each
   character is represented by a specific sequence of one or more bytes. This
   avoids the byte-ordering issues that can occur with integer- and word-oriented
   encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending
   on the hardware on which the string was encoded.
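
The two encoding rules and a few of the properties above can be illustrated
with :meth:`str.encode`. This is only a sketch; the four sample characters are
an arbitrary choice meant to cover the one-, two-, three-, and four-byte cases.

```python
samples = ["A", "é", "€", "😀"]   # U+0041, U+00E9, U+20AC, U+1F600

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)                     # [1, 2, 3, 4]

# Rule 1: ASCII text is unchanged, byte for byte, when encoded as UTF-8.
ascii_compatible = "Python".encode("utf-8") == "Python".encode("ascii")

# Rule 2: every byte of a multi-byte sequence is in the range 128 to 255.
high_bytes_only = all(b >= 128 for ch in samples[1:] for b in ch.encode("utf-8"))

print(ascii_compatible, high_bytes_only)   # True True
```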


References
----------

The `Unicode Consortium site <https://www.unicode.org>`_ has character charts, a
glossary, and PDF versions of the Unicode specification. Be prepared for some
difficult reading. `A chronology <https://www.unicode.org/history/>`_ of the
origin and development of Unicode is also available on the site.

On the Computerphile YouTube channel, Tom Scott briefly
`discusses the history of Unicode and UTF-8 <https://www.youtube.com/watch?v=MijmeoH9LT4>`_
(9 minutes 36 seconds).

To help understand the standard, Jukka Korpela has written `an introductory
guide <http://jkorpela.fi/unicode/guide.html>`_ to reading the
Unicode character tables.

Another `good introductory article <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/>`_
was written by Joel Spolsky.
If this introduction didn't make things clear to you, you should try
reading this alternative article before continuing.

Wikipedia entries are often helpful; see the entries for "`character encoding
<https://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
<https://en.wikipedia.org/wiki/UTF-8>`_, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.

The String Type
---------------

Since Python 3.0, the language's :class:`str` type contains Unicode
characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
rocks!'``, or the triple-quoted string syntax is stored as Unicode.

The default encoding for Python source code is UTF-8, so you can simply
include a Unicode character in a string literal::

   try:
       with open('/tmp/input.txt', 'r') as f:
           ...
   except OSError:
       # 'File not found' error message.
       print("Fichier non trouvé")

Side note: Python 3 also supports using Unicode characters in identifiers::

   répertoire = "/tmp/records.log"
   with open(répertoire, "w") as f:
       f.write("test\n")

If you can't enter a particular character in your editor or want to
keep the source code ASCII-only for some reason, you can also use
escape sequences in string literals. (Depending on your system,
you may see the actual capital-delta glyph instead of a \u escape.) ::

   >>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
   '\u0394'
   >>> "\u0394"                          # Using a 16-bit hex value
   '\u0394'
   >>> "\U00000394"                      # Using a 32-bit hex value
   '\u0394'

In addition, one can create a string using the :meth:`~bytes.decode` method of
:class:`bytes`. This method takes an encoding argument, such as ``UTF-8``,
and optionally an errors argument.

The errors argument specifies the response when the input bytes can't be
converted according to the encoding's rules. Legal values for this argument are
``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
``U+FFFD``, ``REPLACEMENT CHARACTER``), ``'ignore'`` (just leave the
character out of the Unicode result), or ``'backslashreplace'`` (inserts a
``\xNN`` escape sequence).
The following examples show the differences::

   >>> b'\x80abc'.decode("utf-8", "strict")  #doctest: +NORMALIZE_WHITESPACE
   Traceback (most recent call last):
       ...
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
     invalid start byte
   >>> b'\x80abc'.decode("utf-8", "replace")
   '\ufffdabc'
   >>> b'\x80abc'.decode("utf-8", "backslashreplace")
   '\\x80abc'
   >>> b'\x80abc'.decode("utf-8", "ignore")
   'abc'

Encodings are specified as strings containing the encoding's name. Python
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list. Some encodings have multiple names; for
example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
the same encoding.
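
One way to confirm that several names refer to the same encoding is to look
each of them up with :func:`codecs.lookup` and compare the canonical name that
is reported. This is a small sketch rather than a recommended idiom; the
``aliases`` list simply repeats the three names mentioned above.

```python
import codecs

aliases = ["latin-1", "iso_8859_1", "8859"]

# codecs.lookup() returns a CodecInfo object; its .name attribute is the
# codec's canonical name, whichever alias was used to find it.
canonical = {alias: codecs.lookup(alias).name for alias in aliases}
print(canonical)

# All three spellings decode the byte 0xE9 to the same character.
decoded = {b"\xe9".decode(alias) for alias in aliases}
print(decoded)   # {'é'}
```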

One-character Unicode strings can also be created with the :func:`chr`
built-in function, which takes an integer and returns a Unicode string of length 1
that contains the corresponding code point. The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

   >>> chr(57344)
   '\ue000'
   >>> ord('\ue000')
   57344

Converting to Bytes
-------------------

The counterpart of :meth:`bytes.decode` is :meth:`str.encode`,
which returns a :class:`bytes` representation of the Unicode string, encoded in the
requested encoding.

The errors parameter is the same as the parameter of the
:meth:`~bytes.decode` method but supports a few more possible handlers. As well as
``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case
inserts a question mark instead of the unencodable character), there is
also ``'xmlcharrefreplace'`` (inserts an XML character reference),
``'backslashreplace'`` (inserts a ``\uNNNN`` escape sequence) and
``'namereplace'`` (inserts a ``\N{...}`` escape sequence).

The following example shows the different results::

   >>> u = chr(40960) + 'abcd' + chr(1972)
   >>> u.encode('utf-8')
   b'\xea\x80\x80abcd\xde\xb4'
   >>> u.encode('ascii')  #doctest: +NORMALIZE_WHITESPACE
   Traceback (most recent call last):
       ...
   UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
     position 0: ordinal not in range(128)
   >>> u.encode('ascii', 'ignore')
   b'abcd'
   >>> u.encode('ascii', 'replace')
   b'?abcd?'
   >>> u.encode('ascii', 'xmlcharrefreplace')
   b'&#40960;abcd&#1972;'
   >>> u.encode('ascii', 'backslashreplace')
   b'\\ua000abcd\\u07b4'
   >>> u.encode('ascii', 'namereplace')
   b'\\N{YI SYLLABLE IT}abcd\\u07b4'

The low-level routines for registering and accessing the available
encodings are found in the :mod:`codecs` module. Implementing new
encodings also requires understanding the :mod:`codecs` module.
However, the encoding and decoding functions returned by this module
are usually more low-level than is comfortable, and writing new encodings
is a specialized task, so the module won't be covered in this HOWTO.


Unicode Literals in Python Source Code
--------------------------------------

In Python source code, specific Unicode code points can be written using the
``\u`` escape sequence, which is followed by four hex digits giving the code
point. The ``\U`` escape sequence is similar, but expects eight hex digits,
not four::

   >>> s = "a\xac\u1234\u20ac\U00008000"
   ... #     ^^^^ two-digit hex escape
   ... #         ^^^^^^ four-digit Unicode escape
   ... #                     ^^^^^^^^^^ eight-digit Unicode escape
   >>> [ord(c) for c in s]
   [97, 172, 4660, 8364, 32768]

Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language. You
can also assemble strings using the :func:`chr` built-in function, but this is
even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding. You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost
any encoding if you declare the encoding being used. This is done by including
a special comment as either the first or second line of the source file::

   #!/usr/bin/env python
   # -*- coding: latin-1 -*-

   u = 'abcdé'
   print(ord(u[-1]))

The syntax is inspired by Emacs's notation for specifying variables local to a
file. Emacs supports many different variables, but Python only supports
'coding'. The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention. Python looks for
``coding: name`` or ``coding=name`` in the comment.

If you don't include such a comment, the default encoding used will be UTF-8 as
already mentioned. See also :pep:`263` for more information.


Unicode Properties
------------------

The Unicode specification includes a database of information about
code points. For each defined code point, the information includes
the character's name, its category, the numeric value if applicable
(for characters representing numeric concepts such as the Roman
numerals, fractions such as one-third and four-fifths, etc.). There
are also display-related properties, such as how to use the code point
in bidirectional text.

The following program displays some information about several characters, and
prints the numeric value of one particular character::

   import unicodedata

   u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

   for i, c in enumerate(u):
       print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
       print(unicodedata.name(c))

   # Get numeric value of second character
   print(unicodedata.numeric(u[1]))

When run, this prints:

.. code-block:: none

   0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
   1 0bf2 No TAMIL NUMBER ONE THOUSAND
   2 0f84 Mn TIBETAN MARK HALANTA
   3 1770 Lo TAGBANWA LETTER SA
   4 33af So SQUARE RAD OVER S SQUARED
   1000.0

The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories. To take the codes
from the above output, ``'Ll'`` means "Letter, lowercase", ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", ``'Lo'`` is "Letter, other",
and ``'So'`` is "Symbol, other". See
`the General Category Values section of the Unicode Character Database documentation <https://www.unicode.org/reports/tr44/#General_Category_Values>`_ for a
list of category codes.


Comparing Strings
-----------------

Unicode adds some complication to comparing strings, because the same
set of characters can be represented by different sequences of code
points. For example, a letter like 'ê' can be represented as a single
code point U+00EA, or as U+0065 U+0302, which is the code point for
'e' followed by a code point for 'COMBINING CIRCUMFLEX ACCENT'. These
will produce the same output when printed, but one is a string of
length 1 and the other is of length 2.

One tool for a case-insensitive comparison is the
:meth:`~str.casefold` string method that converts a string to a
case-insensitive form following an algorithm described by the Unicode
Standard. This algorithm has special handling for characters such as
the German letter 'ß' (code point U+00DF), which becomes the pair of
lowercase letters 'ss'.

::

   >>> street = 'Gürzenichstraße'
   >>> street.casefold()
   'gürzenichstrasse'
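
The difference between :meth:`~str.casefold` and :meth:`~str.lower` matters
precisely for characters like 'ß'. A minimal sketch, with two spellings of the
same German word chosen for illustration:

```python
upper = "STRASSE"
lower = "straße"

same_with_lower = upper.lower() == lower.lower()
same_with_casefold = upper.casefold() == lower.casefold()

print(same_with_lower)      # False: lower() leaves 'ß' unchanged
print(same_with_casefold)   # True: casefold() turns 'ß' into 'ss'
```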

A second tool is the :mod:`unicodedata` module's
:func:`~unicodedata.normalize` function that converts strings to one
of several normal forms, in which equivalent sequences (such as a letter
followed by a combining character and the corresponding single precomposed
character) are converted to one consistent representation.
:func:`~unicodedata.normalize` can
be used to perform string comparisons that won't falsely report
inequality if two strings use combining characters differently:

::

   import unicodedata

   def compare_strs(s1, s2):
       def NFD(s):
           return unicodedata.normalize('NFD', s)

       return NFD(s1) == NFD(s2)

   single_char = 'ê'
   multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
   print('length of first string=', len(single_char))
   print('length of second string=', len(multiple_chars))
   print(compare_strs(single_char, multiple_chars))

When run, this outputs:

.. code-block:: shell-session

   $ python3 compare-strs.py
   length of first string= 1
   length of second string= 2
   True

The first argument to the :func:`~unicodedata.normalize` function is a
string giving the desired normalization form, which can be one of
'NFC', 'NFKC', 'NFD', and 'NFKD'.
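
How the forms differ is easiest to see side by side. The sketch below is
illustrative only: it contrasts the composed and decomposed forms of 'ê', and
uses the 'ﬁ' ligature (U+FB01) as an example of a character that only the
compatibility forms ('NFKC', 'NFKD') rewrite.

```python
import unicodedata

composed = "\u00ea"       # 'ê' as a single code point
decomposed = "e\u0302"    # 'e' followed by COMBINING CIRCUMFLEX ACCENT
ligature = "\ufb01"       # LATIN SMALL LIGATURE FI

nfc = unicodedata.normalize("NFC", decomposed)   # composes into one code point
nfd = unicodedata.normalize("NFD", composed)     # decomposes into two
print(len(nfc), len(nfd))                        # 1 2

# Canonical forms leave the ligature alone; compatibility forms expand it.
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
print(nfc_ligature == ligature)                  # True
print(nfkc_ligature)                             # fi
```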
				460
				461	The Unicode Standard also specifies how to do caseless comparisons::
				462
    import unicodedata

    def compare_caseless(s1, s2):
        def NFD(s):
            return unicodedata.normalize('NFD', s)

        return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())

    # Example usage
    single_char = 'ê'
    multiple_chars = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'

    print(compare_caseless(single_char, multiple_chars))
				476
				477	This will print ``True``. (Why is :func:`NFD` invoked twice? Because
				478	there are a few characters that make :meth:`casefold` return a
				479	non-normalized string, so the result needs to be normalized again. See
				480	section 3.13 of the Unicode Standard for a discussion and an example.)
				481
				482
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	483	Unicode Regular Expressions
				484	---------------------------
				485
				486	The regular expressions supported by the :mod:`re` module can be provided
				487	either as bytes or strings. Some of the special character sequences such as
				488	``\d`` and ``\w`` have different meanings depending on whether
				489	the pattern is supplied as bytes or a string. For example,
				490	``\d`` will match the characters ``[0-9]`` in bytes but
				491	in strings will match any character that's in the ``'Nd'`` category.
				492
				493	The string in this example has the number 57 written in both Thai and
				494	Arabic numerals::
				495
				496	import re
Cheryl Sabella	6677142	2018-02-02 16:16:27 -0500	[diff] [blame]	497	p = re.compile(r'\d+')
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	498
				499	s = "Over \u0e55\u0e57 57 flavours"
				500	m = p.search(s)
				501	print(repr(m.group()))
				502
				503	When executed, ``\d+`` will match the Thai numerals and print them
				504	out. If you supply the :const:`re.ASCII` flag to
				505	:func:`~re.compile`, ``\d+`` will match the substring "57" instead.
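
The behaviours described above can be compared side by side. This is a minimal
sketch; the variable names are illustrative, and the bytes pattern is included
only to show that it behaves like the :const:`re.ASCII` case.

```python
import re

s = "Over \u0e55\u0e57 57 flavours"

unicode_digits = re.compile(r'\d+')            # str pattern, Unicode matching
ascii_digits = re.compile(r'\d+', re.ASCII)    # str pattern, ASCII-only matching
bytes_digits = re.compile(rb'\d+')             # bytes pattern, always ASCII-only

thai = unicode_digits.search(s).group()
plain = ascii_digits.search(s).group()
from_bytes = bytes_digits.search(s.encode('utf-8')).group()

print(ascii(thai))     # the two Thai digits
print(plain)           # the ASCII digits
print(from_bytes)      # the ASCII digits, as bytes
```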
				506
				507	Similarly, ``\w`` matches a wide variety of Unicode characters but
				508	only ``[a-zA-Z0-9_]`` in bytes or if :const:`re.ASCII` is supplied,
				509	and ``\s`` will match either Unicode whitespace characters or
				510	``[ \t\n\r\f\v]``.
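
A short sketch of the same difference for ``\w`` and ``\s`` follows. The sample
strings are illustrative; U+00A0 is the no-break space, which counts as
whitespace for string patterns but not when :const:`re.ASCII` is given.

```python
import re

words = 'na\u00efve \u4500'

unicode_words = re.findall(r'\w+', words)
ascii_words = re.findall(r'\w+', words, re.ASCII)
print(ascii(unicode_words))
print(ascii(ascii_words))

spaced = 'a\u00a0b'
unicode_split = re.split(r'\s', spaced)
ascii_split = re.split(r'\s', spaced, flags=re.ASCII)
print(ascii(unicode_split))
print(ascii(ascii_split))
```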
				511
				512
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	513	References
				514	----------
				515
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	516	.. comment should these be mentioned earlier, e.g. at the start of the "introduction to Unicode" first section?
				517
				518	Some good alternative discussions of Python's Unicode support are:
				519
				520	* `Processing Text Files in Python 3 <http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html>`_, by Nick Coghlan.
Sanyam Khurana	1b4587a	2017-12-06 22:09:33 +0530	[diff] [blame]	521	* `Pragmatic Unicode <https://nedbatchelder.com/text/unipain.html>`_, a PyCon 2012 presentation by Ned Batchelder.
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	522
Ezio Melotti	410eee5	2013-01-20 12:16:03 +0200	[diff] [blame]	523	The :class:`str` type is described in the Python library reference at
Ezio Melotti	a6229e6	2012-10-12 10:59:14 +0300	[diff] [blame]	524	:ref:`textseq`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	525
				526	The documentation for the :mod:`unicodedata` module.
				527
				528	The documentation for the :mod:`codecs` module.
				529
Georg Brandl	9bdcb3b	2014-10-29 09:37:43 +0100	[diff] [blame]	530	Marc-André Lemburg gave `a presentation titled "Python and Unicode" (PDF slides)
				531	<https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>`_ at
				532	EuroPython 2002. The slides are an excellent overview of the design of Python
				533	2's Unicode features (where the Unicode string type is called ``unicode`` and
				534	literals start with ``u``).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	535
				536
				537	Reading and Writing Unicode Data
				538	================================
				539
				540	Once you've written some code that works with Unicode data, the next problem is
				541	input/output. How do you get Unicode strings into your program, and how do you
				542	convert Unicode into a form suitable for storage or transmission?
				543
				544	It's possible that you may not need to do anything depending on your input
				545	sources and output destinations; you should check whether the libraries used in
				546	your application support Unicode natively. XML parsers often return Unicode
				547	data, for example. Many relational databases also support Unicode-valued
				548	columns and can return Unicode values from an SQL query.
				549
				550	Unicode data is usually converted to a particular encoding before it gets
				551	written to disk or sent over a socket. It's possible to do all the work
Georg Brandl	3d596fa	2013-10-29 08:16:56 +0100	[diff] [blame]	552	yourself: open a file, read an 8-bit bytes object from it, and convert the bytes
Ezio Melotti	410eee5	2013-01-20 12:16:03 +0200	[diff] [blame]	553	with ``bytes.decode(encoding)``. However, the manual approach is not recommended.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	554
				555	One problem is the multi-byte nature of encodings; one Unicode character can be
				556	represented by several bytes. If you want to read the file in arbitrary-sized
Serhiy Storchaka	f8def28	2013-02-16 17:29:56 +0200	[diff] [blame]	557	chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	558	where only part of the bytes encoding a single Unicode character are read at the
				559	end of a chunk. One solution would be to read the entire file into memory and
				560	then perform the decoding, but that prevents you from working with files that
Serhiy Storchaka	f8def28	2013-02-16 17:29:56 +0200	[diff] [blame]	561	are extremely large; if you need to read a 2 GiB file, you need 2 GiB of RAM.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	562	(More, really, since for at least a moment you'd need to have both the encoded
				563	string and its Unicode version in memory.)
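
The partial-character problem is easy to reproduce. In the sketch below, a
UTF-8 byte string is deliberately cut in the middle of a two-byte character;
the chunk boundary is chosen by hand for illustration. Decoding the first chunk
on its own fails, whereas an incremental decoder obtained from
:func:`codecs.getincrementaldecoder` buffers the incomplete sequence until the
rest arrives.

```python
import codecs

data = 'na\u00efve caf\u00e9'.encode('utf-8')   # 'ï' and 'é' take two bytes each

# Split in the middle of the two-byte sequence for 'ï'.
first, second = data[:3], data[3:]

try:
    first.decode('utf-8')
    chunk_failed = False
except UnicodeDecodeError as exc:
    chunk_failed = True
    print('decoding one chunk on its own fails:', exc.reason)

# An incremental decoder keeps the dangling byte until the next call.
decoder = codecs.getincrementaldecoder('utf-8')()
text = decoder.decode(first) + decoder.decode(second, final=True)
print(text == 'na\u00efve caf\u00e9')
```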
				564
The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences.  The work of implementing this has already been
done for you: the built-in :func:`open` function can return a file-like object
that assumes the file's contents are in a specified encoding, so that methods
such as :meth:`~io.TextIOBase.read` return Unicode strings and
:meth:`~io.TextIOBase.write` accepts them.  This works through :func:`open`\'s
*encoding* and *errors* parameters, which are interpreted just like those in
:meth:`str.encode` and :meth:`bytes.decode`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	573
				574	Reading Unicode from a file is therefore simple::
				575
    with open('unicode.txt', encoding='utf-8') as f:
        for line in f:
            print(repr(line))
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	579
				580	It's also possible to open files in update mode, allowing both reading and
				581	writing::
				582
    with open('test', encoding='utf-8', mode='w+') as f:
        f.write('\u4500 blah blah blah\n')
        f.seek(0)
        print(repr(f.readline()[:1]))
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	587
Ezio Melotti	410eee5	2013-01-20 12:16:03 +0200	[diff] [blame]	588	The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	589	written as the first character of a file in order to assist with autodetection
				590	of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
				591	present at the start of a file; when such an encoding is used, the BOM will be
				592	automatically written as the first character and will be silently dropped when
				593	the file is read. There are variants of these encodings, such as 'utf-16-le'
				594	and 'utf-16-be' for little-endian and big-endian encodings, that specify one
				595	particular byte ordering and don't skip the BOM.
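
The difference between the generic codec and the byte-order-specific variants
can be seen without touching a file. This is a minimal sketch using
:meth:`str.encode` and :meth:`bytes.decode`; the variable names are
illustrative.

```python
import codecs

text = 'hi'

generic = text.encode('utf-16')      # BOM is written automatically
little = text.encode('utf-16-le')    # no BOM is written

starts_with_bom = generic[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(starts_with_bom)
print(little)

# The generic codec drops the BOM when decoding ...
decoded_generic = generic.decode('utf-16')
# ... while the explicit variant keeps it as the character U+FEFF.
decoded_explicit = (codecs.BOM_UTF16_LE + little).decode('utf-16-le')

print(ascii(decoded_generic))
print(ascii(decoded_explicit))
```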
				596
Georg Brandl	0c07422	2008-11-22 10:26:59 +0000	[diff] [blame]	597	In some areas, it is also convention to use a "BOM" at the start of UTF-8
				598	encoded files; the name is misleading since UTF-8 is not byte-order dependent.
Andrew Kuchling	97c288d	2019-03-03 23:10:28 -0500	[diff] [blame]	599	The mark simply announces that the file is encoded in UTF-8. For reading such
				600	files, use the 'utf-8-sig' codec to automatically skip the mark if present.
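
The effect of the 'utf-8-sig' codec can be sketched on an in-memory byte
string. The sample data is illustrative; note that the codec also accepts
input that has no mark at all.

```python
import codecs

marked = codecs.BOM_UTF8 + 'hello'.encode('utf-8')
unmarked = 'hello'.encode('utf-8')

plain_decode = marked.decode('utf-8')        # mark survives as U+FEFF
sig_decode = marked.decode('utf-8-sig')      # mark is skipped
sig_decode_unmarked = unmarked.decode('utf-8-sig')

print(ascii(plain_decode))
print(ascii(sig_decode))
print(ascii(sig_decode_unmarked))

# When encoding, 'utf-8-sig' writes the mark.
print('hello'.encode('utf-8-sig'))
```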
Georg Brandl	0c07422	2008-11-22 10:26:59 +0000	[diff] [blame]	601
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	602
				603	Unicode filenames
				604	-----------------
				605
Most of the operating systems in common use today support filenames
that contain arbitrary Unicode characters.  Usually this is
implemented by converting the Unicode string into some encoding that
varies depending on the system.  Today Python is converging on using
UTF-8: Python on macOS has used UTF-8 for several versions, and Python
3.6 switched to using UTF-8 on Windows as well.  On Unix systems,
there will only be a :term:`filesystem encoding <filesystem encoding and error
handler>` if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
you haven't, the default encoding is again UTF-8.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	615
				616	The :func:`sys.getfilesystemencoding` function returns the encoding to use on
				617	your current system, in case you want to do the encoding manually, but there's
				618	not much reason to bother. When opening a file for reading or writing, you can
				619	usually just provide the Unicode string as the filename, and it will be
				620	automatically converted to the right encoding for you::
				621
    filename = 'filename\u4500abc'
    with open(filename, 'w') as f:
        f.write('blah\n')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	625
				626	Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
				627	filenames.
				628
Andrew Kuchling	97c288d	2019-03-03 23:10:28 -0500	[diff] [blame]	629	The :func:`os.listdir` function returns filenames, which raises an issue: should it return
Ezio Melotti	410eee5	2013-01-20 12:16:03 +0200	[diff] [blame]	630	the Unicode version of filenames, or should it return bytes containing
Andrew Kuchling	97c288d	2019-03-03 23:10:28 -0500	[diff] [blame]	631	the encoded versions? :func:`os.listdir` can do both, depending on whether you
Ezio Melotti	410eee5	2013-01-20 12:16:03 +0200	[diff] [blame]	632	provided the directory path as bytes or a Unicode string. If you pass a
Georg Brandl	0c07422	2008-11-22 10:26:59 +0000	[diff] [blame]	633	Unicode string as the path, filenames will be decoded using the filesystem's
				634	encoding and a list of Unicode strings will be returned, while passing a byte
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	635	path will return the filenames as bytes. For example,
Victor Stinner	4b9aad4	2020-11-02 16:49:54 +0100	[diff] [blame]	636	assuming the default :term:`filesystem encoding <filesystem encoding and error
				637	handler>` is UTF-8, running the following program::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	638
    import os

    fn = 'filename\u4500abc'
    with open(fn, 'w') as f:
        pass  # just create an empty file with a non-ASCII name

    print(os.listdir(b'.'))
    print(os.listdir('.'))
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	646
Martin Panter	1050d2d	2016-07-26 11:18:21 +0200	[diff] [blame]	647	will produce the following output:
				648
				649	.. code-block:: shell-session
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	650
Andrew Kuchling	97c288d	2019-03-03 23:10:28 -0500	[diff] [blame]	651	$ python listdir-test.py
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	652	[b'filename\xe4\x94\x80abc', ...]
				653	['filename\u4500abc', ...]
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	654
				655	The first list contains UTF-8-encoded filenames, and the second list contains
				656	the Unicode versions.
				657
Note that on most occasions, you can just stick with using
Unicode with these APIs.  The bytes APIs should only be used on
systems where undecodable file names can be present; that's
pretty much only Unix systems now.
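
As a rough sketch of what an undecodable name looks like in practice, the
example below converts a byte string that is not valid UTF-8 with
:func:`os.fsdecode` and back with :func:`os.fsencode`. It assumes a Unix
system; the byte string is a made-up filename, and the exact decoded form
depends on the filesystem encoding in effect.

```python
import os

# A made-up filename in Latin-1 bytes; b'\xe9' is not valid UTF-8 on its own.
raw_name = b'caf\xe9.txt'

as_str = os.fsdecode(raw_name)
print(ascii(as_str))

# Converting back with the filesystem encoding restores the original bytes.
round_trip = os.fsencode(as_str)
print(round_trip == raw_name)
```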
Georg Brandl	0c07422	2008-11-22 10:26:59 +0000	[diff] [blame]	662
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	663
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	664	Tips for Writing Unicode-aware Programs
				665	---------------------------------------
				666
				667	This section provides some suggestions on writing software that deals with
				668	Unicode.
				669
				670	The most important tip is:
				671
Ezio Melotti	410eee5	2013-01-20 12:16:03 +0200	[diff] [blame]	672	Software should only work with Unicode strings internally, decoding the input
				673	data as soon as possible and encoding the output only at the end.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	674
Georg Brandl	0c07422	2008-11-22 10:26:59 +0000	[diff] [blame]	675	If you attempt to write processing functions that accept both Unicode and byte
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	676	strings, you will find your program vulnerable to bugs wherever you combine the
Ezio Melotti	410eee5	2013-01-20 12:16:03 +0200	[diff] [blame]	677	two different kinds of strings. There is no automatic encoding or decoding: if
				678	you do e.g. ``str + bytes``, a :exc:`TypeError` will be raised.
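
A minimal sketch of that failure and of the explicit conversion that avoids it
(the sample values are illustrative):

```python
text = 'abc'
raw = b'def'

try:
    text + raw
    mixing_failed = False
except TypeError as exc:
    mixing_failed = True
    print('mixing str and bytes fails:', exc)

# Decode explicitly at the boundary, then work with str only.
combined = text + raw.decode('ascii')
print(combined)
```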
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	679
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	680	When using data coming from a web browser or some other untrusted source, a
				681	common technique is to check for illegal characters in a string before using the
				682	string in a generated command line or storing it in a database. If you're doing
Antoine Pitrou	534e253	2011-12-05 01:21:46 +0100	[diff] [blame]	683	this, be careful to check the decoded string, not the encoded bytes data;
				684	some encodings may have interesting properties, such as not being bijective
				685	or not being fully ASCII-compatible. This is especially true if the input
				686	data also specifies the encoding, since the attacker can then choose a
				687	clever way to hide malicious text in the encoded bytestream.
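
The point about checking the decoded string can be sketched with UTF-7, an
encoding that is not fully ASCII-compatible. The payload below and the idea
that the sender names the encoding are hypothetical; the example only shows
that a check on the raw bytes finds nothing while the decoded text contains the
character being screened for.

```python
# Hypothetical untrusted input that declares its own encoding as UTF-7.
payload = b'+ADw-script+AD4-'
declared_encoding = 'utf-7'

# A check on the encoded bytes sees no angle bracket ...
found_in_bytes = b'<' in payload

# ... but the decoded string does contain one.
decoded = payload.decode(declared_encoding)
found_in_text = '<' in decoded

print(found_in_bytes, found_in_text, decoded)
```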
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	688
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	689
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	690	Converting Between File Encodings
				691	'''''''''''''''''''''''''''''''''
				692
				693	The :class:`~codecs.StreamRecoder` class can transparently convert between
				694	encodings, taking a stream that returns data in encoding #1
				695	and behaving like a stream returning data in encoding #2.
				696
				697	For example, if you have an input file f that's in Latin-1, you
Serhiy Storchaka	bfdcd43	2013-10-13 23:09:14 +0300	[diff] [blame]	698	can wrap it with a :class:`~codecs.StreamRecoder` to return bytes encoded in
				699	UTF-8::
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	700
    import codecs

    # f is assumed to be a file object opened in binary mode
    new_f = codecs.StreamRecoder(f,
        # en/decoder: used by read() to encode its results and
        # by write() to decode its input.
        codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

        # reader/writer: used to read and write to the stream.
        codecs.getreader('latin-1'), codecs.getwriter('latin-1') )
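
For a version that can be run as-is, the sketch below uses an in-memory
:class:`io.BytesIO` object as a stand-in for the Latin-1 input file; the sample
text is illustrative. It also shows :func:`codecs.EncodedFile`, which builds the
same kind of wrapper from two codec names.

```python
import codecs
import io

# Stand-in for a binary file whose contents are Latin-1 encoded.
latin1_stream = io.BytesIO('caf\u00e9 au lait\n'.encode('latin-1'))

recoder = codecs.StreamRecoder(
    latin1_stream,
    codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),
    codecs.getreader('latin-1'), codecs.getwriter('latin-1'),
)
utf8_bytes = recoder.read()
print(utf8_bytes)

# The same conversion via the convenience function:
# data is exposed as UTF-8, the underlying file holds Latin-1.
latin1_stream.seek(0)
same_bytes = codecs.EncodedFile(latin1_stream, 'utf-8', 'latin-1').read()
print(same_bytes == utf8_bytes)
```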
				708
				709
				710	Files in an Unknown Encoding
				711	''''''''''''''''''''''''''''
				712
				713	What can you do if you need to make a change to a file, but don't know
				714	the file's encoding? If you know the encoding is ASCII-compatible and
				715	only want to examine or modify the ASCII parts, you can open the file
				716	with the ``surrogateescape`` error handler::
				717
    with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
        data = f.read()

    # make changes to the string 'data'

    with open(fname + '.new', 'w',
              encoding="ascii", errors="surrogateescape") as f:
        f.write(data)
				726
				727	The ``surrogateescape`` error handler will decode any non-ASCII bytes
Andrew Kuchling	97c288d	2019-03-03 23:10:28 -0500	[diff] [blame]	728	as code points in a special range running from U+DC80 to
				729	U+DCFF. These code points will then turn back into the
				730	same bytes when the ``surrogateescape`` error handler is used to
				731	encode the data and write it back out.
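
The round trip can be sketched on an in-memory byte string instead of a file.
The sample bytes are illustrative: the non-ASCII byte ``0xE9`` becomes the code
point U+DCE9 after decoding, survives an edit to the ASCII part, and turns back
into the same byte when encoded.

```python
raw = b'name = caf\xe9\n'   # contains one byte outside the ASCII range

data = raw.decode('ascii', errors='surrogateescape')
print(ascii(data))

# Modify only the ASCII part of the text.
data = data.replace('name', 'title')

out = data.encode('ascii', errors='surrogateescape')
print(out)
```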
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	732
				733
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	734	References
				735	----------
				736
Georg Brandl	9bdcb3b	2014-10-29 09:37:43 +0100	[diff] [blame]	737	One section of `Mastering Python 3 Input/Output
				738	<http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o>`_,
				739	a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	740
Georg Brandl	9bdcb3b	2014-10-29 09:37:43 +0100	[diff] [blame]	741	The `PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
				742	Applications in Python"
				743	<https://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>`_
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	744	discuss questions of character encodings as well as how to internationalize
Alexander Belopolsky	93a6b13	2010-11-19 16:09:58 +0000	[diff] [blame]	745	and localize an application. These slides cover Python 2.x only.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	746
Georg Brandl	9bdcb3b	2014-10-29 09:37:43 +0100	[diff] [blame]	747	`The Guts of Unicode in Python
				748	<http://pyvideo.org/video/1768/the-guts-of-unicode-in-python>`_
				749	is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode
				750	representation in Python 3.3.
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	751
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	752
Alexander Belopolsky	93a6b13	2010-11-19 16:09:58 +0000	[diff] [blame]	753	Acknowledgements
				754	================
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	755
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	756	The initial draft of this document was written by Andrew Kuchling.
				757	It has since been revised further by Alexander Belopolsky, Georg Brandl,
				758	Andrew Kuchling, and Ezio Melotti.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	759
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	760	Thanks to the following people who have noted errors or offered
				761	suggestions on this article: Éric Araujo, Nicholas Bastin, Nick
				762	Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André
Andrew Kuchling	97c288d	2019-03-03 23:10:28 -0500	[diff] [blame]	763	Lemburg, Martin von Löwis, Terry J. Reedy, Serhiy Storchaka,
				764	Eryk Sun, Chad Whitacre, Graham Wideman.