.. _unicode-howto:

*****************
  Unicode HOWTO
*****************

:Release: 1.12

This HOWTO discusses Python's support for the Unicode specification
for representing textual data, and explains various problems that
people commonly encounter when trying to work with Unicode.


Introduction to Unicode
=======================

Definitions
-----------

Today's programs need to be able to handle a wide variety of
characters. Applications are often internationalized to display
messages and output in a variety of user-selectable languages; the
same program might need to output an error message in English, French,
Japanese, Hebrew, or Russian. Web content can be written in any of
these languages and can also include a variety of emoji symbols.
Python's string type uses the Unicode Standard for representing
characters, which lets Python programs work with all these different
possible characters.

Unicode (https://www.unicode.org/) is a specification that aims to
list every character used by human languages and give each character
its own unique code. The Unicode specifications are continually
revised and updated to add new languages and symbols.

A **character** is the smallest possible component of a text. 'A', 'B', 'C',
etc., are all different characters. So are 'È' and 'Í'. Characters vary
depending on the language or context you're talking
about. For example, there's a character for "Roman Numeral One", 'Ⅰ', that's
separate from the uppercase letter 'I'. They'll usually look the same,
but these are two different characters that have different meanings.

The Unicode standard describes how characters are represented by
**code points**. A code point value is an integer in the range 0 to
0x10FFFF (about 1.1 million values, with some 110 thousand assigned so
far). In the standard and in this document, a code point is written
using the notation ``U+265E`` to mean the character with value
``0x265e`` (9,822 in decimal).

The Unicode standard contains a lot of tables listing characters and
their corresponding code points:

.. code-block:: none

   0061    'a'; LATIN SMALL LETTER A
   0062    'b'; LATIN SMALL LETTER B
   0063    'c'; LATIN SMALL LETTER C
   ...
   007B    '{'; LEFT CURLY BRACKET
   ...
   2167    'Ⅷ'; ROMAN NUMERAL EIGHT
   2168    'Ⅸ'; ROMAN NUMERAL NINE
   ...
   265E    '♞'; BLACK CHESS KNIGHT
   265F    '♟'; BLACK CHESS PAWN
   ...
   1F600   '😀'; GRINNING FACE
   1F609   '😉'; WINKING FACE
   ...

Strictly, these definitions imply that it's meaningless to say 'this is
character ``U+265E``'. ``U+265E`` is a code point, which represents some
particular character; in this case, it represents the character
'BLACK CHESS KNIGHT', '♞'. In informal contexts, this distinction between
code points and characters will sometimes be forgotten.

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**. The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used. Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.


Encodings
---------

To summarize the previous section: a Unicode string is a sequence of
code points, which are numbers from 0 through ``0x10FFFF`` (1,114,111
decimal). This sequence of code points needs to be represented in
memory as a set of **code units**, and **code units** are then mapped
to 8-bit bytes. The rules for translating a Unicode string into a
sequence of bytes are called a **character encoding**, or just
an **encoding**.

The first encoding you might think of is using 32-bit integers as the
code unit, and then using the CPU's representation of 32-bit integers.
In this representation, the string "Python" might look like this:

.. code-block:: none

      P           y           t           h           o           n
   0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space. In most texts, the majority of the code points
   are less than 127, or less than 255, so a lot of space is occupied by ``0x00``
   bytes. The above string takes 24 bytes compared to the 6 bytes needed for an
   ASCII representation. Increased RAM usage doesn't matter too much (desktop
   computers have gigabytes of RAM, and strings aren't usually that large), but
   expanding our usage of disk and network bandwidth by a factor of 4 is
   intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new
   family of wide string functions would need to be used.

Therefore this encoding isn't used very much, and people instead choose other
encodings that are more efficient and convenient, such as UTF-8.

UTF-8 is one of the most commonly used encodings, and Python often
defaults to using it. UTF stands for "Unicode Transformation Format",
and the '8' means that 8-bit values are used in the encoding. (There
are also UTF-16 and UTF-32 encodings, but they are less frequently
used than UTF-8.) UTF-8 uses the following rules:

1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or
   four bytes, where each byte of the sequence is between 128 and 255.
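
To see these rules in action, you can encode a few characters yourself. This
is only an illustrative interpreter session, but the byte sequences shown are
the standard UTF-8 encodings of these code points::

   >>> 'a'.encode('utf-8')         # code point below 128: one byte
   b'a'
   >>> 'é'.encode('utf-8')         # U+00E9: two bytes
   b'\xc3\xa9'
   >>> '♞'.encode('utf-8')         # U+265E: three bytes
   b'\xe2\x99\x9e'
   >>> '😀'.encode('utf-8')        # U+1F600: four bytes
   b'\xf0\x9f\x98\x80'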

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes that contains embedded
   zero bytes only where they represent the null character (U+0000). This means
   that UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent
   through protocols that can't handle zero bytes for anything other than
   end-of-string markers.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be
   represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize. It's also unlikely that
   random 8-bit data will look like valid UTF-8.
6. UTF-8 is a byte oriented encoding. The encoding specifies that each
   character is represented by a specific sequence of one or more bytes. This
   avoids the byte-ordering issues that can occur with integer and word oriented
   encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending
   on the hardware on which the string was encoded.

References
----------

The `Unicode Consortium site <http://www.unicode.org>`_ has character charts, a
glossary, and PDF versions of the Unicode specification. Be prepared for some
difficult reading. `A chronology <http://www.unicode.org/history/>`_ of the
origin and development of Unicode is also available on the site.

On the Computerphile Youtube channel, Tom Scott briefly
`discusses the history of Unicode and UTF-8 <https://www.youtube.com/watch?v=MijmeoH9LT4>`_
(9 minutes 36 seconds).

To help understand the standard, Jukka Korpela has written `an introductory
guide <http://jkorpela.fi/unicode/guide.html>`_ to reading the
Unicode character tables.

Another `good introductory article <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/>`_
was written by Joel Spolsky.
If this introduction didn't make things clear to you, you should try
reading this alternate article before continuing.

Wikipedia entries are often helpful; see the entries for "`character encoding
<https://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
<https://en.wikipedia.org/wiki/UTF-8>`_, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.

The String Type
---------------

Since Python 3.0, the language's :class:`str` type contains Unicode
characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
rocks!'``, or the triple-quoted string syntax is stored as Unicode.

The default encoding for Python source code is UTF-8, so you can simply
include a Unicode character in a string literal::

   try:
       with open('/tmp/input.txt', 'r') as f:
           ...
   except OSError:
       # 'File not found' error message.
       print("Fichier non trouvé")

Side note: Python 3 also supports using Unicode characters in identifiers::

   répertoire = "/tmp/records.log"
   with open(répertoire, "w") as f:
       f.write("test\n")

If you can't enter a particular character in your editor or want to
keep the source code ASCII-only for some reason, you can also use
escape sequences in string literals. (Depending on your system,
you may see the actual capital-delta glyph instead of a \u escape.) ::

   >>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
   '\u0394'
   >>> "\u0394"                          # Using a 16-bit hex value
   '\u0394'
   >>> "\U00000394"                      # Using a 32-bit hex value
   '\u0394'

In addition, one can create a string using the :func:`~bytes.decode` method of
:class:`bytes`. This method takes an *encoding* argument, such as ``UTF-8``,
and optionally an *errors* argument.

The *errors* argument specifies the response when the input string can't be
converted according to the encoding's rules. Legal values for this argument are
``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
``U+FFFD``, ``REPLACEMENT CHARACTER``), ``'ignore'`` (just leave the
character out of the Unicode result), or ``'backslashreplace'`` (inserts a
``\xNN`` escape sequence).
The following examples show the differences::

   >>> b'\x80abc'.decode("utf-8", "strict")  #doctest: +NORMALIZE_WHITESPACE
   Traceback (most recent call last):
       ...
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
     invalid start byte
   >>> b'\x80abc'.decode("utf-8", "replace")
   '\ufffdabc'
   >>> b'\x80abc'.decode("utf-8", "backslashreplace")
   '\\x80abc'
   >>> b'\x80abc'.decode("utf-8", "ignore")
   'abc'

Encodings are specified as strings containing the encoding's name. Python
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list. Some encodings have multiple names; for
example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
the same encoding.
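
For instance, decoding the same byte under two of those names gives the same
result (a small illustrative session; the byte ``0xE9`` is the Latin-1
encoding of 'é')::

   >>> b'\xe9'.decode('latin-1')
   'é'
   >>> b'\xe9'.decode('iso_8859_1')
   'é'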

One-character Unicode strings can also be created with the :func:`chr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point. The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

   >>> chr(57344)
   '\ue000'
   >>> ord('\ue000')
   57344

Converting to Bytes
-------------------

The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
which returns a :class:`bytes` representation of the Unicode string, encoded in the
requested *encoding*.

The *errors* parameter is the same as the parameter of the
:meth:`~bytes.decode` method but supports a few more possible handlers. As well as
``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case
inserts a question mark instead of the unencodable character), there is
also ``'xmlcharrefreplace'`` (inserts an XML character reference),
``backslashreplace`` (inserts a ``\uNNNN`` escape sequence) and
``namereplace`` (inserts a ``\N{...}`` escape sequence).

The following example shows the different results::

   >>> u = chr(40960) + 'abcd' + chr(1972)
   >>> u.encode('utf-8')
   b'\xea\x80\x80abcd\xde\xb4'
   >>> u.encode('ascii')  #doctest: +NORMALIZE_WHITESPACE
   Traceback (most recent call last):
       ...
   UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
     position 0: ordinal not in range(128)
   >>> u.encode('ascii', 'ignore')
   b'abcd'
   >>> u.encode('ascii', 'replace')
   b'?abcd?'
   >>> u.encode('ascii', 'xmlcharrefreplace')
   b'&#40960;abcd&#1972;'
   >>> u.encode('ascii', 'backslashreplace')
   b'\\ua000abcd\\u07b4'
   >>> u.encode('ascii', 'namereplace')
   b'\\N{YI SYLLABLE IT}abcd\\u07b4'

The low-level routines for registering and accessing the available
encodings are found in the :mod:`codecs` module. Implementing new
encodings also requires understanding the :mod:`codecs` module.
However, the encoding and decoding functions returned by this module
are usually more low-level than is comfortable, and writing new encodings
is a specialized task, so the module won't be covered in this HOWTO.


Unicode Literals in Python Source Code
--------------------------------------

In Python source code, specific Unicode code points can be written using the
``\u`` escape sequence, which is followed by four hex digits giving the code
point. The ``\U`` escape sequence is similar, but expects eight hex digits,
not four::

   >>> s = "a\xac\u1234\u20ac\U00008000"
   ... #     ^^^^ two-digit hex escape
   ... #         ^^^^^^ four-digit Unicode escape
   ... #                     ^^^^^^^^^^ eight-digit Unicode escape
   >>> [ord(c) for c in s]
   [97, 172, 4660, 8364, 32768]

Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language. You
can also assemble strings using the :func:`chr` built-in function, but this is
even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding. You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost
any encoding if you declare the encoding being used. This is done by including
a special comment as either the first or second line of the source file::

   #!/usr/bin/env python
   # -*- coding: latin-1 -*-

   u = 'abcdé'
   print(ord(u[-1]))

The syntax is inspired by Emacs's notation for specifying variables local to a
file. Emacs supports many different variables, but Python only supports
'coding'. The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention. Python looks for
``coding: name`` or ``coding=name`` in the comment.

If you don't include such a comment, the default encoding used will be UTF-8 as
already mentioned. See also :pep:`263` for more information.


Unicode Properties
------------------

The Unicode specification includes a database of information about
code points. For each defined code point, the information includes
the character's name, its category, the numeric value if applicable
(for characters representing numeric concepts such as the Roman
numerals, fractions such as one-third and four-fifths, etc.). There
are also display-related properties, such as how to use the code point
in bidirectional text.

The following program displays some information about several characters, and
prints the numeric value of one particular character::

   import unicodedata

   u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

   for i, c in enumerate(u):
       print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
       print(unicodedata.name(c))

   # Get numeric value of second character
   print(unicodedata.numeric(u[1]))

When run, this prints:

.. code-block:: none

   0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
   1 0bf2 No TAMIL NUMBER ONE THOUSAND
   2 0f84 Mn TIBETAN MARK HALANTA
   3 1770 Lo TAGBANWA LETTER SA
   4 33af So SQUARE RAD OVER S SQUARED
   1000.0

The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories. To take the codes
from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other". See
`the General Category Values section of the Unicode Character Database documentation <http://www.unicode.org/reports/tr44/#General_Category_Values>`_ for a
list of category codes.


Comparing Strings
-----------------

Unicode adds some complication to comparing strings, because the same
set of characters can be represented by different sequences of code
points. For example, a letter like 'ê' can be represented as a single
code point U+00EA, or as U+0065 U+0302, which is the code point for
'e' followed by a code point for 'COMBINING CIRCUMFLEX ACCENT'. These
will produce the same output when printed, but one is a string of
length 1 and the other is of length 2.

One tool for a case-insensitive comparison is the
:meth:`~str.casefold` string method that converts a string to a
case-insensitive form following an algorithm described by the Unicode
Standard. This algorithm has special handling for characters such as
the German letter 'ß' (code point U+00DF), which becomes the pair of
lowercase letters 'ss'.

::

   >>> street = 'Gürzenichstraße'
   >>> street.casefold()
   'gürzenichstrasse'

A second tool is the :mod:`unicodedata` module's
:func:`~unicodedata.normalize` function that converts strings to one
of several normal forms, where letters followed by a combining
character are replaced with single characters. :func:`normalize` can
be used to perform string comparisons that won't falsely report
inequality if two strings use combining characters differently:

::

   import unicodedata

   def compare_strs(s1, s2):
       def NFD(s):
           return unicodedata.normalize('NFD', s)

       return NFD(s1) == NFD(s2)

   single_char = 'ê'
   multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
   print('length of first string=', len(single_char))
   print('length of second string=', len(multiple_chars))
   print(compare_strs(single_char, multiple_chars))

When run, this outputs:

.. code-block:: shell-session

   $ python3 compare-strs.py
   length of first string= 1
   length of second string= 2
   True

The first argument to the :func:`~unicodedata.normalize` function is a
string giving the desired normalization form, which can be one of
'NFC', 'NFKC', 'NFD', and 'NFKD'.
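
As a brief sketch of what the composed and decomposed forms do (reusing the
'ê' example from above), 'NFC' combines the two-code-point sequence into a
single character, while 'NFD' splits that character apart again::

   >>> import unicodedata
   >>> decomposed = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
   >>> composed = unicodedata.normalize('NFC', decomposed)
   >>> len(decomposed), len(composed)
   (2, 1)
   >>> composed == '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}'
   True
   >>> len(unicodedata.normalize('NFD', composed))
   2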

The Unicode Standard also specifies how to do caseless comparisons::

   import unicodedata

   def compare_caseless(s1, s2):
       def NFD(s):
           return unicodedata.normalize('NFD', s)

       return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())

   # Example usage
   single_char = 'ê'
   multiple_chars = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'

   print(compare_caseless(single_char, multiple_chars))

This will print ``True``. (Why is :func:`NFD` invoked twice? Because
there are a few characters that make :meth:`casefold` return a
non-normalized string, so the result needs to be normalized again. See
section 3.13 of the Unicode Standard for a discussion and an example.)


Unicode Regular Expressions
---------------------------

The regular expressions supported by the :mod:`re` module can be provided
either as bytes or strings. Some of the special character sequences such as
``\d`` and ``\w`` have different meanings depending on whether
the pattern is supplied as bytes or a string. For example,
``\d`` will match the characters ``[0-9]`` in bytes but
in strings will match any character that's in the ``'Nd'`` category.

The string in this example has the number 57 written in both Thai and
Arabic numerals::

   import re
   p = re.compile(r'\d+')

   s = "Over \u0e55\u0e57 57 flavours"
   m = p.search(s)
   print(repr(m.group()))

When executed, ``\d+`` will match the Thai numerals and print them
out. If you supply the :const:`re.ASCII` flag to
:func:`~re.compile`, ``\d+`` will match the substring "57" instead.

Similarly, ``\w`` matches a wide variety of Unicode characters but
only ``[a-zA-Z0-9_]`` in bytes or if :const:`re.ASCII` is supplied,
and ``\s`` will match either Unicode whitespace characters or
``[ \t\n\r\f\v]``.
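
For instance, the flag changes which substring the pattern above finds (an
illustrative interpreter session)::

   >>> import re
   >>> s = "Over \u0e55\u0e57 57 flavours"
   >>> re.search(r'\d+', s).group()            # Thai digits match first
   '๕๗'
   >>> re.search(r'\d+', s, re.ASCII).group()  # only [0-9] with re.ASCII
   '57'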


References
----------

.. comment should these be mentioned earlier, e.g. at the start of the "introduction to Unicode" first section?

Some good alternative discussions of Python's Unicode support are:

* `Processing Text Files in Python 3 <http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html>`_, by Nick Coghlan.
* `Pragmatic Unicode <https://nedbatchelder.com/text/unipain.html>`_, a PyCon 2012 presentation by Ned Batchelder.

The :class:`str` type is described in the Python library reference at
:ref:`textseq`.

The documentation for the :mod:`unicodedata` module.

The documentation for the :mod:`codecs` module.

Marc-André Lemburg gave `a presentation titled "Python and Unicode" (PDF slides)
<https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>`_ at
EuroPython 2002. The slides are an excellent overview of the design of Python
2's Unicode features (where the Unicode string type is called ``unicode`` and
literals start with ``u``).


Reading and Writing Unicode Data
================================

Once you've written some code that works with Unicode data, the next problem is
input/output. How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively. XML parsers often return Unicode
data, for example. Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket. It's possible to do all the work
yourself: open a file, read an 8-bit bytes object from it, and convert the bytes
with ``bytes.decode(encoding)``. However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes. If you want to read the file in arbitrary-sized
chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk. One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2 GiB file, you need 2 GiB of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)
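
As a small sketch of the failure mode, slicing a UTF-8 byte string at an
arbitrary point can cut a multi-byte character in half, and strict decoding
of the fragment then fails::

   >>> data = 'é'.encode('utf-8')    # two bytes for one character
   >>> data
   b'\xc3\xa9'
   >>> data[:1].decode('utf-8')      # a chunk boundary in the middle
   Traceback (most recent call last):
       ...
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data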

The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences. The work of implementing this has already been
done for you: the built-in :func:`open` function can return a file-like object
that assumes the file's contents are in a specified encoding and accepts Unicode
parameters for methods such as :meth:`~io.TextIOBase.read` and
:meth:`~io.TextIOBase.write`. This works through :func:`open`\'s *encoding* and
*errors* parameters which are interpreted just like those in :meth:`str.encode`
and :meth:`bytes.decode`.

Reading Unicode from a file is therefore simple::

   with open('unicode.txt', encoding='utf-8') as f:
       for line in f:
           print(repr(line))

It's also possible to open files in update mode, allowing both reading and
writing::

   with open('test', encoding='utf-8', mode='w+') as f:
       f.write('\u4500 blah blah blah\n')
       f.seek(0)
       print(repr(f.readline()[:1]))

The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read. There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.

In some areas, it is also convention to use a "BOM" at the start of UTF-8
encoded files; the name is misleading since UTF-8 is not byte-order dependent.
The mark simply announces that the file is encoded in UTF-8. For reading such
files, use the 'utf-8-sig' codec to automatically skip the mark if present.
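
A short sketch of the difference (the bytes ``EF BB BF`` are the UTF-8
encoding of ``U+FEFF``)::

   >>> b'\xef\xbb\xbfHello'.decode('utf-8')
   '\ufeffHello'
   >>> b'\xef\xbb\xbfHello'.decode('utf-8-sig')
   'Hello'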


Unicode filenames
-----------------

Most of the operating systems in common use today support filenames
that contain arbitrary Unicode characters. Usually this is
implemented by converting the Unicode string into some encoding that
varies depending on the system. Today Python is converging on using
UTF-8: Python on MacOS has used UTF-8 for several versions, and Python
3.6 switched to using UTF-8 on Windows as well. On Unix systems,
there will only be a filesystem encoding if you've set the ``LANG`` or
``LC_CTYPE`` environment variables; if you haven't, the default
encoding is again UTF-8.

The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother. When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::

   filename = 'filename\u4500abc'
   with open(filename, 'w') as f:
       f.write('blah\n')

Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.

The :func:`os.listdir` function returns filenames, which raises an issue: should it return
the Unicode version of filenames, or should it return bytes containing
the encoded versions? :func:`os.listdir` can do both, depending on whether you
provided the directory path as bytes or a Unicode string. If you pass a
Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing a byte
path will return the filenames as bytes. For example,
assuming the default filesystem encoding is UTF-8, running the following
program::

   fn = 'filename\u4500abc'
   f = open(fn, 'w')
   f.close()

   import os
   print(os.listdir(b'.'))
   print(os.listdir('.'))

will produce the following output:

.. code-block:: shell-session

   $ python listdir-test.py
   [b'filename\xe4\x94\x80abc', ...]
   ['filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.

Note that on most occasions, you can just stick with using
Unicode with these APIs. The bytes APIs should only be used on
systems where undecodable file names can be present; that's
pretty much only Unix systems now.


Tips for Writing Unicode-aware Programs
---------------------------------------

This section provides some suggestions on writing software that deals with
Unicode.

The most important tip is:

   Software should only work with Unicode strings internally, decoding the input
   data as soon as possible and encoding the output only at the end.

If you attempt to write processing functions that accept both Unicode and byte
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings. There is no automatic encoding or decoding: if
you do e.g. ``str + bytes``, a :exc:`TypeError` will be raised.
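
Here is a minimal sketch of that structure, assuming UTF-8 input and output;
the function name and its trivial processing are purely illustrative::

   def handle_request(raw_bytes):
       text = raw_bytes.decode('utf-8')   # decode as soon as the bytes arrive
       reply = text.strip().upper()       # do all internal work on str objects
       return reply.encode('utf-8')       # encode only at the output boundary

   print(handle_request('héllo\n'.encode('utf-8')))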

When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database. If you're doing
this, be careful to check the decoded string, not the encoded bytes data;
some encodings may have interesting properties, such as not being bijective
or not being fully ASCII-compatible. This is especially true if the input
data also specifies the encoding, since the attacker can then choose a
clever way to hide malicious text in the encoded bytestream.


Converting Between File Encodings
'''''''''''''''''''''''''''''''''

The :class:`~codecs.StreamRecoder` class can transparently convert between
encodings, taking a stream that returns data in encoding #1
and behaving like a stream returning data in encoding #2.

For example, if you have an input file *f* that's in Latin-1, you
can wrap it with a :class:`~codecs.StreamRecoder` to return bytes encoded in
UTF-8::

   new_f = codecs.StreamRecoder(f,
       # en/decoder: used by read() to encode its results and
       # by write() to decode its input.
       codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

       # reader/writer: used to read and write to the stream.
       codecs.getreader('latin-1'), codecs.getwriter('latin-1') )


Files in an Unknown Encoding
''''''''''''''''''''''''''''

What can you do if you need to make a change to a file, but don't know
the file's encoding? If you know the encoding is ASCII-compatible and
only want to examine or modify the ASCII parts, you can open the file
with the ``surrogateescape`` error handler::

   with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
       data = f.read()

   # make changes to the string 'data'

   with open(fname + '.new', 'w',
             encoding="ascii", errors="surrogateescape") as f:
       f.write(data)

The ``surrogateescape`` error handler will decode any non-ASCII bytes
as code points in a special range running from U+DC80 to
U+DCFF. These code points will then turn back into the
same bytes when the ``surrogateescape`` error handler is used to
encode the data and write it back out.
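
A tiny sketch of that round trip (the byte ``0xE9`` is just an arbitrary
non-ASCII byte)::

   >>> raw = b'caf\xe9 menu'                      # not valid ASCII
   >>> text = raw.decode('ascii', 'surrogateescape')
   >>> text
   'caf\udce9 menu'
   >>> text.encode('ascii', 'surrogateescape') == raw
   True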


References
----------

One section of `Mastering Python 3 Input/Output
<http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o>`_,
a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.

The `PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
Applications in Python"
<https://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>`_
discuss questions of character encodings as well as how to internationalize
and localize an application. These slides cover Python 2.x only.

`The Guts of Unicode in Python
<http://pyvideo.org/video/1768/the-guts-of-unicode-in-python>`_
is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode
representation in Python 3.3.


Acknowledgements
================

The initial draft of this document was written by Andrew Kuchling.
It has since been revised further by Alexander Belopolsky, Georg Brandl,
Andrew Kuchling, and Ezio Melotti.

Thanks to the following people who have noted errors or offered
suggestions on this article: Éric Araujo, Nicholas Bastin, Nick
Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André
Lemburg, Martin von Löwis, Terry J. Reedy, Serhiy Storchaka,
Eryk Sun, Chad Whitacre, Graham Wideman.
763Eryk Sun, Chad Whitacre, Graham Wideman.