.. _unicode-howto:

*****************
  Unicode HOWTO
*****************

:Release: 1.12

This HOWTO discusses Python support for Unicode, and explains
various problems that people commonly encounter when trying to work
with Unicode.

Introduction to Unicode
=======================

History of Character Codes
--------------------------

In 1968, the American Standard Code for Information Interchange, better known by
its acronym ASCII, was standardized. ASCII defined numeric codes for various
characters, with the numeric values running from 0 to 127. For example, the
lowercase letter 'a' is assigned 97 as its code value.

ASCII was an American-developed standard, so it only defined unaccented
characters. There was an 'e', but no 'é' or 'Í'. This meant that languages
which required accented characters couldn't be faithfully represented in ASCII.
(Actually the missing accents matter for English, too, which contains words such
as 'naïve' and 'café', and some publications have house styles which require
spellings such as 'coöperate'.)

For a while people just wrote programs that didn't display accents.
In the mid-1980s an Apple II BASIC program written by a French speaker
might have lines like these:

.. code-block:: basic

   PRINT "MISE A JOUR TERMINEE"
   PRINT "PARAMETRES ENREGISTRES"

Those messages should contain accents (terminée, paramètres, enregistrés) and
they just look wrong to someone who can read French.

43In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
44hold values ranging from 0 to 255. ASCII codes only went up to 127, so some
45machines assigned values between 128 and 255 to accented characters. Different
46machines had different codes, however, which led to problems exchanging files.
Alexander Belopolsky93a6b132010-11-19 16:09:58 +000047Eventually various commonly used sets of values for the 128--255 range emerged.
Jesse Gonzalez6fde7702017-04-27 00:12:17 -050048Some were true standards, defined by the International Organization for
49Standardization, and some were *de facto* conventions that were invented by one
50company or another and managed to catch on.
Georg Brandl116aa622007-08-15 14:28:22 +000051
256 characters aren't very many. For example, you can't fit both the accented
characters used in Western Europe and the Cyrillic alphabet used for Russian
into the 128--255 range because there are more than 128 such characters.

You could write files using different codes (all your Russian files in a coding
system called KOI8, all your French files in a different coding system called
Latin1), but what if you wanted to write a French document that quotes some
Russian text? In the 1980s people began to want to solve this problem, and the
Unicode standardization effort began.

Unicode started out using 16-bit characters instead of 8-bit characters. 16
bits means you have 2^16 = 65,536 distinct values available, making it possible
to represent many different characters from many different alphabets; an initial
goal was to have Unicode contain the alphabets for every single human language.
It turns out that even 16 bits isn't enough to meet that goal, and the modern
Unicode specification uses a wider range of codes, 0 through 1,114,111
(``0x10FFFF`` in base 16).

There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with the 1.1
revision of Unicode.

(This discussion of Unicode's history is highly simplified. The
precise historical details aren't necessary for understanding how to
use Unicode effectively, but if you're curious, consult the Unicode
consortium site listed in the References or
the `Wikipedia entry for Unicode <https://en.wikipedia.org/wiki/Unicode#History>`_
for more information.)


Definitions
-----------

A **character** is the smallest possible component of a text. 'A', 'B', 'C',
etc., are all different characters. So are 'È' and 'Í'. Characters are
abstractions, and vary depending on the language or context you're talking
about. For example, the symbol for ohms (Ω) is usually drawn much like the
capital letter omega (Ω) in the Greek alphabet (they may even be the same in
some fonts), but these are two different characters that have different
meanings.

The Unicode standard describes how characters are represented by **code
points**. A code point is an integer value, usually denoted in base 16. In the
standard, a code point is written using the notation ``U+12CA`` to mean the
character with value ``0x12ca`` (4,810 decimal). The Unicode standard contains
a lot of tables listing characters and their corresponding code points:

.. code-block:: none

   0061    'a'; LATIN SMALL LETTER A
   0062    'b'; LATIN SMALL LETTER B
   0063    'c'; LATIN SMALL LETTER C
   ...
   007B    '{'; LEFT CURLY BRACKET

Strictly, these definitions imply that it's meaningless to say 'this is
character ``U+12CA``'. ``U+12CA`` is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In
informal contexts, this distinction between code points and characters will
sometimes be forgotten.

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**. The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used. Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.


Encodings
---------

To summarize the previous section: a Unicode string is a sequence of code
points, which are numbers from 0 through ``0x10FFFF`` (1,114,111 decimal). This
sequence needs to be represented as a set of bytes (meaning, values
from 0 through 255) in memory. The rules for translating a Unicode string
into a sequence of bytes are called an **encoding**.

The first encoding you might think of is an array of 32-bit integers. In this
representation, the string "Python" would look like this:

.. code-block:: none

       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space. In most texts, the majority of the code points
   are less than 127, or less than 255, so a lot of space is occupied by ``0x00``
   bytes. The above string takes 24 bytes compared to the 6 bytes needed for an
   ASCII representation. Increased RAM usage doesn't matter too much (desktop
   computers have gigabytes of RAM, and strings aren't usually that large), but
   expanding our usage of disk and network bandwidth by a factor of 4 is
   intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new
   family of wide string functions would need to be used.

4. Many Internet standards are defined in terms of textual data, and can't
   handle content with embedded zero bytes.

Generally people don't use this encoding, instead choosing other
encodings that are more efficient and convenient. UTF-8 is probably
the most commonly supported encoding; it will be discussed below.

Encodings don't have to handle every possible Unicode character, and most
encodings don't. The rules for converting a Unicode string into the ASCII
encoding, for example, are simple; for each code point:

1. If the code point is < 128, each byte is the same as the value of the code
   point.

2. If the code point is 128 or greater, the Unicode string can't be represented
   in this encoding. (Python raises a :exc:`UnicodeEncodeError` exception in this
   case.)

Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points
0--255 are identical to the Latin-1 values, so converting to this encoding simply
requires converting code points to byte values; if a code point larger than 255
is encountered, the string can't be encoded into Latin-1.
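
For example, a quick doctest-style sketch of the rules above (the exact
wording of the error message may vary between Python versions)::

    >>> 'abc'.encode('ascii')        # every code point is < 128
    b'abc'
    >>> 'café'.encode('latin-1')     # 'é' is U+00E9, which fits in 0--255
    b'caf\xe9'
    >>> 'café'.encode('ascii')       # ...but it is >= 128, so ASCII fails
    Traceback (most recent call last):
        ...
    UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in
      position 3: ordinal not in range(128)
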

Encodings don't have to be simple one-to-one mappings like Latin-1. Consider
IBM's EBCDIC, which was used on IBM mainframes. Letter values weren't in one
block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145
through 153. If you wanted to use EBCDIC as an encoding, you'd probably use
some sort of lookup table to perform the conversion, but this is largely an
internal detail.

UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode
Transformation Format", and the '8' means that 8-bit numbers are used in the
encoding. (There are also UTF-16 and UTF-32 encodings, but they are less
frequently used than UTF-8.) UTF-8 uses the following rules:

1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or
   four bytes, where each byte of the sequence is between 128 and 255.

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes containing no embedded zero
   bytes. This avoids byte-ordering issues, and means UTF-8 strings can be
   processed by C functions such as ``strcpy()`` and sent through protocols that
   can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be
   represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize. It's also unlikely that
   random 8-bit data will look like valid UTF-8.
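
To make the variable-width rule concrete, here is a small illustrative
sketch showing characters that need one, two, three, and four bytes in
UTF-8::

    >>> 'a'.encode('utf-8')            # U+0061: one byte
    b'a'
    >>> 'é'.encode('utf-8')            # U+00E9: two bytes
    b'\xc3\xa9'
    >>> '\u20ac'.encode('utf-8')       # U+20AC: three bytes
    b'\xe2\x82\xac'
    >>> '\U0001f600'.encode('utf-8')   # U+1F600: four bytes
    b'\xf0\x9f\x98\x80'
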


References
----------

The `Unicode Consortium site <http://www.unicode.org>`_ has character charts, a
glossary, and PDF versions of the Unicode specification. Be prepared for some
difficult reading. `A chronology <http://www.unicode.org/history/>`_ of the
origin and development of Unicode is also available on the site.

To help understand the standard, Jukka Korpela has written `an introductory
guide <http://jkorpela.fi/unicode/guide.html>`_ to reading the
Unicode character tables.

Another `good introductory article <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/>`_
was written by Joel Spolsky.
If this introduction didn't make things clear to you, you should try
reading this alternate article before continuing.

Wikipedia entries are often helpful; see the entries for "`character encoding
<https://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
<https://en.wikipedia.org/wiki/UTF-8>`_, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.

The String Type
---------------

Since Python 3.0, the language features a :class:`str` type that contains Unicode
characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
rocks!'``, or the triple-quoted string syntax is stored as Unicode.

The default encoding for Python source code is UTF-8, so you can simply
include a Unicode character in a string literal::

    try:
        with open('/tmp/input.txt', 'r') as f:
            ...
    except OSError:
        # 'File not found' error message.
        print("Fichier non trouvé")

You can use a different encoding from UTF-8 by putting a specially-formatted
comment as the first or second line of the source code::

    # -*- coding: <encoding name> -*-

Side note: Python 3 also supports using Unicode characters in identifiers::

    répertoire = "/tmp/records.log"
    with open(répertoire, "w") as f:
        f.write("test\n")

If you can't enter a particular character in your editor or want to
keep the source code ASCII-only for some reason, you can also use
escape sequences in string literals. (Depending on your system,
you may see the actual capital-delta glyph instead of a \u escape.) ::

    >>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
    '\u0394'
    >>> "\u0394"                          # Using a 16-bit hex value
    '\u0394'
    >>> "\U00000394"                      # Using a 32-bit hex value
    '\u0394'

In addition, one can create a string using the :func:`~bytes.decode` method of
:class:`bytes`. This method takes an *encoding* argument, such as ``UTF-8``,
and optionally an *errors* argument.

The *errors* argument specifies the response when the input string can't be
converted according to the encoding's rules. Legal values for this argument are
``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
``U+FFFD``, ``REPLACEMENT CHARACTER``), ``'ignore'`` (just leave the
character out of the Unicode result), or ``'backslashreplace'`` (inserts a
``\xNN`` escape sequence).
The following examples show the differences::

    >>> b'\x80abc'.decode("utf-8", "strict")  #doctest: +NORMALIZE_WHITESPACE
    Traceback (most recent call last):
        ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
      invalid start byte
    >>> b'\x80abc'.decode("utf-8", "replace")
    '\ufffdabc'
    >>> b'\x80abc'.decode("utf-8", "backslashreplace")
    '\\x80abc'
    >>> b'\x80abc'.decode("utf-8", "ignore")
    'abc'

Encodings are specified as strings containing the encoding's name. Python 3.2
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list. Some encodings have multiple names; for
example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
the same encoding.
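
As an illustrative sketch, you can check that two names really are aliases
by decoding the same bytes with each, or by asking the :mod:`codecs` module
for the canonical name::

    >>> b'caf\xe9'.decode('latin-1') == b'caf\xe9'.decode('iso_8859_1')
    True
    >>> import codecs
    >>> codecs.lookup('latin-1').name
    'iso8859-1'
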

One-character Unicode strings can also be created with the :func:`chr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point. The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

    >>> chr(57344)
    '\ue000'
    >>> ord('\ue000')
    57344

Converting to Bytes
-------------------

The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
which returns a :class:`bytes` representation of the Unicode string, encoded in the
requested *encoding*.

The *errors* parameter is the same as the parameter of the
:meth:`~bytes.decode` method but supports a few more possible handlers. As well as
``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case
inserts a question mark instead of the unencodable character), there is
also ``'xmlcharrefreplace'`` (inserts an XML character reference),
``'backslashreplace'`` (inserts a ``\uNNNN`` escape sequence) and
``'namereplace'`` (inserts a ``\N{...}`` escape sequence).

The following example shows the different results::

    >>> u = chr(40960) + 'abcd' + chr(1972)
    >>> u.encode('utf-8')
    b'\xea\x80\x80abcd\xde\xb4'
    >>> u.encode('ascii')  #doctest: +NORMALIZE_WHITESPACE
    Traceback (most recent call last):
        ...
    UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
      position 0: ordinal not in range(128)
    >>> u.encode('ascii', 'ignore')
    b'abcd'
    >>> u.encode('ascii', 'replace')
    b'?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    b'&#40960;abcd&#1972;'
    >>> u.encode('ascii', 'backslashreplace')
    b'\\ua000abcd\\u07b4'
    >>> u.encode('ascii', 'namereplace')
    b'\\N{YI SYLLABLE IT}abcd\\u07b4'

The low-level routines for registering and accessing the available
encodings are found in the :mod:`codecs` module. Implementing new
encodings also requires understanding the :mod:`codecs` module.
However, the encoding and decoding functions returned by this module
are usually more low-level than is comfortable, and writing new encodings
is a specialized task, so the module won't be covered in this HOWTO.


Unicode Literals in Python Source Code
--------------------------------------

In Python source code, specific Unicode code points can be written using the
``\u`` escape sequence, which is followed by four hex digits giving the code
point. The ``\U`` escape sequence is similar, but expects eight hex digits,
not four::

    >>> s = "a\xac\u1234\u20ac\U00008000"
    ... #     ^^^^ two-digit hex escape
    ... #         ^^^^^^ four-digit Unicode escape
    ... #                     ^^^^^^^^^^ eight-digit Unicode escape
    >>> [ord(c) for c in s]
    [97, 172, 4660, 8364, 32768]

Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language. You
can also assemble strings using the :func:`chr` built-in function, but this is
even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding. You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost
any encoding if you declare the encoding being used. This is done by including
a special comment as either the first or second line of the source file::

    #!/usr/bin/env python
    # -*- coding: latin-1 -*-

    u = 'abcdé'
    print(ord(u[-1]))

The syntax is inspired by Emacs's notation for specifying variables local to a
file. Emacs supports many different variables, but Python only supports
'coding'. The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention. Python looks for
``coding: name`` or ``coding=name`` in the comment.

If you don't include such a comment, the default encoding used will be UTF-8 as
already mentioned. See also :pep:`263` for more information.


Unicode Properties
------------------

The Unicode specification includes a database of information about code points.
For each defined code point, the information includes the character's
name, its category, the numeric value if applicable (Unicode has characters
representing the Roman numerals and fractions such as one-third and
four-fifths). There are also properties related to the code point's use in
bidirectional text and other display-related properties.

The following program displays some information about several characters, and
prints the numeric value of one particular character::

    import unicodedata

    u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

    for i, c in enumerate(u):
        print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
        print(unicodedata.name(c))

    # Get numeric value of second character
    print(unicodedata.numeric(u[1]))

When run, this prints:

.. code-block:: none

   0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
   1 0bf2 No TAMIL NUMBER ONE THOUSAND
   2 0f84 Mn TIBETAN MARK HALANTA
   3 1770 Lo TAGBANWA LETTER SA
   4 33af So SQUARE RAD OVER S SQUARED
   1000.0

The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories. To take the codes
from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other". See
`the General Category Values section of the Unicode Character Database documentation <http://www.unicode.org/reports/tr44/#General_Category_Values>`_ for a
list of category codes.


Unicode Regular Expressions
---------------------------

The regular expressions supported by the :mod:`re` module can be provided
either as bytes or strings. Some of the special character sequences such as
``\d`` and ``\w`` have different meanings depending on whether
the pattern is supplied as bytes or a string. For example,
``\d`` will match the characters ``[0-9]`` in bytes but
in strings will match any character that's in the ``'Nd'`` category.

The string in this example has the number 57 written in both Thai and
Arabic numerals::

   import re
   p = re.compile(r'\d+')

   s = "Over \u0e55\u0e57 57 flavours"
   m = p.search(s)
   print(repr(m.group()))

When executed, ``\d+`` will match the Thai numerals and print them
out. If you supply the :const:`re.ASCII` flag to
:func:`~re.compile`, ``\d+`` will match the substring "57" instead.

Similarly, ``\w`` matches a wide variety of Unicode characters but
only ``[a-zA-Z0-9_]`` in bytes or if :const:`re.ASCII` is supplied,
and ``\s`` will match either Unicode whitespace characters or
``[ \t\n\r\f\v]``.
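
A rough interactive sketch of the effect of :const:`re.ASCII` on ``\d``
(your terminal may display the Thai digits as escape sequences)::

    >>> import re
    >>> s = "Over \u0e55\u0e57 57 flavours"
    >>> re.findall(r'\d+', s)            # Thai digits are in category 'Nd'
    ['๕๗', '57']
    >>> re.findall(r'\d+', s, re.ASCII)  # only [0-9]
    ['57']
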


References
----------

.. comment should these be mentioned earlier, e.g. at the start of the "introduction to Unicode" first section?

Some good alternative discussions of Python's Unicode support are:

* `Processing Text Files in Python 3 <http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html>`_, by Nick Coghlan.
* `Pragmatic Unicode <https://nedbatchelder.com/text/unipain.html>`_, a PyCon 2012 presentation by Ned Batchelder.

The :class:`str` type is described in the Python library reference at
:ref:`textseq`.

The documentation for the :mod:`unicodedata` module.

The documentation for the :mod:`codecs` module.

Marc-André Lemburg gave `a presentation titled "Python and Unicode" (PDF slides)
<https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>`_ at
EuroPython 2002. The slides are an excellent overview of the design of Python
2's Unicode features (where the Unicode string type is called ``unicode`` and
literals start with ``u``).


Reading and Writing Unicode Data
================================

Once you've written some code that works with Unicode data, the next problem is
input/output. How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively. XML parsers often return Unicode
data, for example. Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket. It's possible to do all the work
yourself: open a file, read an 8-bit bytes object from it, and convert the bytes
with ``bytes.decode(encoding)``. However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes. If you want to read the file in arbitrary-sized
chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk. One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2 GiB file, you need 2 GiB of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)
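
The :mod:`codecs` module's incremental decoders illustrate both the problem
and the low-level fix; this is only a sketch of machinery you rarely need to
use directly::

    import codecs

    data = 'é'.encode('utf-8')        # b'\xc3\xa9': one character, two bytes
    # Decoding just the first byte would raise UnicodeDecodeError, because
    # the multi-byte sequence is incomplete.  An incremental decoder buffers
    # the partial sequence between calls instead:
    decoder = codecs.getincrementaldecoder('utf-8')()
    print(repr(decoder.decode(data[:1])))   # '' -- nothing complete yet
    print(repr(decoder.decode(data[1:])))   # 'é'
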

The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences. The work of implementing this has already been
done for you: the built-in :func:`open` function can return a file-like object
that assumes the file's contents are in a specified encoding and accepts Unicode
parameters for methods such as :meth:`~io.TextIOBase.read` and
:meth:`~io.TextIOBase.write`. This works through :func:`open`\'s *encoding* and
*errors* parameters which are interpreted just like those in :meth:`str.encode`
and :meth:`bytes.decode`.

Reading Unicode from a file is therefore simple::

    with open('unicode.txt', encoding='utf-8') as f:
        for line in f:
            print(repr(line))

It's also possible to open files in update mode, allowing both reading and
writing::

    with open('test', encoding='utf-8', mode='w+') as f:
        f.write('\u4500 blah blah blah\n')
        f.seek(0)
        print(repr(f.readline()[:1]))

The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read. There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.
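
A quick sketch of the difference; the plain 'utf-16' codec writes a BOM and
uses your machine's native byte order (little-endian output is shown here)::

    >>> '\ua000'.encode('utf-16')      # BOM first, then the character
    b'\xff\xfe\x00\xa0'
    >>> '\ua000'.encode('utf-16-le')   # explicit little-endian, no BOM
    b'\x00\xa0'
    >>> '\ua000'.encode('utf-16-be')   # explicit big-endian, no BOM
    b'\xa0\x00'
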

In some areas, it is also convention to use a "BOM" at the start of UTF-8
encoded files; the name is misleading since UTF-8 is not byte-order dependent.
The mark simply announces that the file is encoded in UTF-8. Use the
'utf-8-sig' codec to automatically skip the mark if present for reading such
files.
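
A short sketch of how the 'utf-8-sig' codec behaves::

    >>> 'abc'.encode('utf-8-sig')               # the mark is added on encoding
    b'\xef\xbb\xbfabc'
    >>> b'\xef\xbb\xbfabc'.decode('utf-8-sig')  # ...and skipped on decoding
    'abc'
    >>> b'abc'.decode('utf-8-sig')              # a missing mark is fine too
    'abc'
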


Unicode filenames
-----------------

Most of the operating systems in common use today support filenames that contain
arbitrary Unicode characters. Usually this is implemented by converting the
Unicode string into some encoding that varies depending on the system. For
example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on
Windows, Python uses the name "mbcs" to refer to whatever the currently
configured encoding is. On Unix systems, there will only be a filesystem
encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
you haven't, the default encoding is UTF-8.

The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother. When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::

    filename = 'filename\u4500abc'
    with open(filename, 'w') as f:
        f.write('blah\n')

Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.

The :func:`os.listdir` function returns filenames, which raises an issue: should it return
the Unicode version of filenames, or should it return bytes containing
the encoded versions? :func:`os.listdir` will do both, depending on whether you
provided the directory path as bytes or a Unicode string. If you pass a
Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing a byte
path will return the filenames as bytes. For example,
assuming the default filesystem encoding is UTF-8, running the following
program::

   fn = 'filename\u4500abc'
   f = open(fn, 'w')
   f.close()

   import os
   print(os.listdir(b'.'))
   print(os.listdir('.'))

will produce the following output:

.. code-block:: shell-session

   amk:~$ python t.py
   [b'filename\xe4\x94\x80abc', ...]
   ['filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.

Note that on most occasions, the Unicode APIs should be used. The bytes APIs
should only be used on systems where undecodable file names can be present,
i.e. Unix systems.


Tips for Writing Unicode-aware Programs
---------------------------------------

This section provides some suggestions on writing software that deals with
Unicode.

The most important tip is:

    Software should only work with Unicode strings internally, decoding the input
    data as soon as possible and encoding the output only at the end.

If you attempt to write processing functions that accept both Unicode and byte
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings. There is no automatic encoding or decoding: if
you do e.g. ``str + bytes``, a :exc:`TypeError` will be raised.
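
For example (the exact error message varies across Python versions)::

    >>> 'Unicode' + b' bytes'
    Traceback (most recent call last):
        ...
    TypeError: can only concatenate str (not "bytes") to str
    >>> 'Unicode' + b' bytes'.decode('ascii')   # decode first, then combine
    'Unicode bytes'
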

When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database. If you're doing
this, be careful to check the decoded string, not the encoded bytes data;
some encodings may have interesting properties, such as not being bijective
or not being fully ASCII-compatible. This is especially true if the input
data also specifies the encoding, since the attacker can then choose a
clever way to hide malicious text in the encoded bytestream.


Converting Between File Encodings
'''''''''''''''''''''''''''''''''

The :class:`~codecs.StreamRecoder` class can transparently convert between
encodings, taking a stream that returns data in encoding #1
and behaving like a stream returning data in encoding #2.

For example, if you have an input file *f* that's in Latin-1, you
can wrap it with a :class:`~codecs.StreamRecoder` to return bytes encoded in
UTF-8::

    new_f = codecs.StreamRecoder(f,
        # en/decoder: used by read() to encode its results and
        # by write() to decode its input.
        codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

        # reader/writer: used to read and write to the stream.
        codecs.getreader('latin-1'), codecs.getwriter('latin-1') )


Files in an Unknown Encoding
''''''''''''''''''''''''''''

What can you do if you need to make a change to a file, but don't know
the file's encoding? If you know the encoding is ASCII-compatible and
only want to examine or modify the ASCII parts, you can open the file
with the ``surrogateescape`` error handler::

   with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
       data = f.read()

   # make changes to the string 'data'

   with open(fname + '.new', 'w',
             encoding="ascii", errors="surrogateescape") as f:
       f.write(data)

The ``surrogateescape`` error handler will decode any non-ASCII bytes
as code points in a special range running from U+DC80 to
U+DCFF. These code points will then turn back into the
same bytes when the ``surrogateescape`` error handler is used to
encode the data and write it back out.
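
A small sketch of the same round trip at the level of :meth:`bytes.decode`
and :meth:`str.encode`::

    >>> raw = b'Andr\xe9'                      # not valid ASCII
    >>> text = raw.decode('ascii', 'surrogateescape')
    >>> text                                   # the 0xE9 byte became U+DCE9
    'Andr\udce9'
    >>> text.encode('ascii', 'surrogateescape') == raw
    True
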


References
----------

One section of `Mastering Python 3 Input/Output
<http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o>`_,
a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.

The `PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
Applications in Python"
<https://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>`_
discuss questions of character encodings as well as how to internationalize
and localize an application. These slides cover Python 2.x only.

`The Guts of Unicode in Python
<http://pyvideo.org/video/1768/the-guts-of-unicode-in-python>`_
is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode
representation in Python 3.3.


Acknowledgements
================

The initial draft of this document was written by Andrew Kuchling.
It has since been revised further by Alexander Belopolsky, Georg Brandl,
Andrew Kuchling, and Ezio Melotti.

Thanks to the following people who have noted errors or offered
suggestions on this article: Éric Araujo, Nicholas Bastin, Nick
Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André
Lemburg, Martin von Löwis, Terry J. Reedy, Chad Whitacre.
733Lemburg, Martin von Löwis, Terry J. Reedy, Chad Whitacre.