.. _unicode-howto:

*****************
  Unicode HOWTO
*****************

:Release: 1.12

This HOWTO discusses Python's support for Unicode, and explains
various problems that people commonly encounter when trying to work
with Unicode.

Introduction to Unicode
=======================

History of Character Codes
--------------------------

In 1968, the American Standard Code for Information Interchange, better known by
its acronym ASCII, was standardized.  ASCII defined numeric codes for various
characters, with the numeric values running from 0 to 127.  For example, the
lowercase letter 'a' is assigned 97 as its code value.

ASCII was an American-developed standard, so it only defined unaccented
characters.  There was an 'e', but no 'é' or 'Í'.  This meant that languages
which required accented characters couldn't be faithfully represented in ASCII.
(Actually the missing accents matter for English, too, which contains words such
as 'naïve' and 'café', and some publications have house styles which require
spellings such as 'coöperate'.)

For a while people just wrote programs that didn't display accents.
In the mid-1980s an Apple II BASIC program written by a French speaker
might have lines like these::

   PRINT "FICHIER EST COMPLETE."
   PRINT "CARACTERE NON ACCEPTE."

Those messages should contain accents (complété, caractère, accepté),
and they just look wrong to someone who can read French.

In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
hold values ranging from 0 to 255.  ASCII codes only went up to 127, so some
machines assigned values between 128 and 255 to accented characters.  Different
machines had different codes, however, which led to problems exchanging files.
Eventually various commonly used sets of values for the 128--255 range emerged.
Some were true standards, defined by the International Standards Organization,
and some were *de facto* conventions that were invented by one company or
another and managed to catch on.

256 characters aren't very many.  For example, you can't fit both the accented
characters used in Western Europe and the Cyrillic alphabet used for Russian
into the 128--255 range because there are more than 128 such characters.

You could write files using different codes (all your Russian files in a coding
system called KOI8, all your French files in a different coding system called
Latin1), but what if you wanted to write a French document that quotes some
Russian text?  In the 1980s people began to want to solve this problem, and the
Unicode standardization effort began.

Unicode started out using 16-bit characters instead of 8-bit characters.  16
bits means you have 2^16 = 65,536 distinct values available, making it possible
to represent many different characters from many different alphabets; an initial
goal was to have Unicode contain the alphabets for every single human language.
It turns out that even 16 bits isn't enough to meet that goal, and the modern
Unicode specification uses a wider range of codes, 0 through 1,114,111
(``0x10FFFF`` in base 16).

There's a related ISO standard, ISO 10646.  Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with the 1.1
revision of Unicode.

(This discussion of Unicode's history is highly simplified.  The
precise historical details aren't necessary for understanding how to
use Unicode effectively, but if you're curious, consult the Unicode
consortium site listed in the References or
the `Wikipedia entry for Unicode <http://en.wikipedia.org/wiki/Unicode#History>`_
for more information.)


Definitions
-----------

A **character** is the smallest possible component of a text.  'A', 'B', 'C',
etc., are all different characters.  So are 'È' and 'Í'.  Characters are
abstractions, and vary depending on the language or context you're talking
about.  For example, the symbol for ohms (Ω) is usually drawn much like the
capital letter omega (Ω) in the Greek alphabet (they may even be the same in
some fonts), but these are two different characters that have different
meanings.

The Unicode standard describes how characters are represented by **code
points**.  A code point is an integer value, usually denoted in base 16.  In the
standard, a code point is written using the notation ``U+12CA`` to mean the
character with value ``0x12ca`` (4,810 decimal).  The Unicode standard contains
a lot of tables listing characters and their corresponding code points:

.. code-block:: none

   0061    'a'; LATIN SMALL LETTER A
   0062    'b'; LATIN SMALL LETTER B
   0063    'c'; LATIN SMALL LETTER C
   ...
   007B    '{'; LEFT CURLY BRACKET

Strictly, these definitions imply that it's meaningless to say 'this is
character ``U+12CA``'.  ``U+12CA`` is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'.  In
informal contexts, this distinction between code points and characters will
sometimes be forgotten.
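
The correspondence between a code point and its character name can be checked
from the interpreter; a quick sketch using the :mod:`unicodedata` module
(covered in more detail later in this HOWTO)::

   >>> import unicodedata
   >>> unicodedata.name('\u12ca')
   'ETHIOPIC SYLLABLE WI'
   >>> hex(ord('\u12ca'))
   '0x12ca'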

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**.  The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used.  Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.


Encodings
---------

To summarize the previous section: a Unicode string is a sequence of code
points, which are numbers from 0 through ``0x10FFFF`` (1,114,111 decimal).  This
sequence needs to be represented as a set of bytes (meaning, values
from 0 through 255) in memory.  The rules for translating a Unicode string
into a sequence of bytes are called an **encoding**.

The first encoding you might think of is an array of 32-bit integers.  In this
representation, the string "Python" would look like this:

.. code-block:: none

       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space.  In most texts, the majority of the code points
   are less than 127, or less than 255, so a lot of space is occupied by ``0x00``
   bytes.  The above string takes 24 bytes compared to the 6 bytes needed for an
   ASCII representation.  Increased RAM usage doesn't matter too much (desktop
   computers have gigabytes of RAM, and strings aren't usually that large), but
   expanding our usage of disk and network bandwidth by a factor of 4 is
   intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new
   family of wide string functions would need to be used.

4. Many Internet standards are defined in terms of textual data, and can't
   handle content with embedded zero bytes.

Generally people don't use this encoding, instead choosing other
encodings that are more efficient and convenient.  UTF-8 is probably
the most commonly supported encoding; it will be discussed below.

Encodings don't have to handle every possible Unicode character, and most
encodings don't.  The rules for converting a Unicode string into the ASCII
encoding, for example, are simple; for each code point:

1. If the code point is < 128, each byte is the same as the value of the code
   point.

2. If the code point is 128 or greater, the Unicode string can't be represented
   in this encoding.  (Python raises a :exc:`UnicodeEncodeError` exception in this
   case.)

Latin-1, also known as ISO-8859-1, is a similar encoding.  Unicode code points
0--255 are identical to the Latin-1 values, so converting to this encoding simply
requires converting code points to byte values; if a code point larger than 255
is encountered, the string can't be encoded into Latin-1.
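
For example, here is how these two rules play out in Python, using the
:meth:`str.encode` method described later in this HOWTO (a quick sketch; the
exact wording of the exception message may vary between Python versions)::

   >>> 'é'.encode('latin-1')       # U+00E9 is within the 0--255 range
   b'\xe9'
   >>> '€'.encode('latin-1')       # U+20AC is larger than 255
   Traceback (most recent call last):
       ...
   UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)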

Encodings don't have to be simple one-to-one mappings like Latin-1.  Consider
IBM's EBCDIC, which was used on IBM mainframes.  Letter values weren't in one
block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145
through 153.  If you wanted to use EBCDIC as an encoding, you'd probably use
some sort of lookup table to perform the conversion, but this is largely an
internal detail.

UTF-8 is one of the most commonly used encodings.  UTF stands for "Unicode
Transformation Format", and the '8' means that 8-bit numbers are used in the
encoding.  (There are also UTF-16 and UTF-32 encodings, but they are less
frequently used than UTF-8.)  UTF-8 uses the following rules:

1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or
   four bytes, where each byte of the sequence is between 128 and 255.

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a string of bytes containing no embedded zero
   bytes.  This avoids byte-ordering issues, and means UTF-8 strings can be
   processed by C functions such as ``strcpy()`` and sent through protocols that
   can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be
   represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize.  It's also unlikely that
   random 8-bit data will look like valid UTF-8.
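
The variable width is easy to see from the interpreter; in this quick sketch,
the ASCII letter 'a' takes one byte, 'é' two, and '€' three::

   >>> 'a'.encode('utf-8'), 'é'.encode('utf-8'), '€'.encode('utf-8')
   (b'a', b'\xc3\xa9', b'\xe2\x82\xac')
   >>> len('aé€'), len('aé€'.encode('utf-8'))
   (3, 6)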


References
----------

The `Unicode Consortium site <http://www.unicode.org>`_ has character charts, a
glossary, and PDF versions of the Unicode specification.  Be prepared for some
difficult reading.  `A chronology <http://www.unicode.org/history/>`_ of the
origin and development of Unicode is also available on the site.

To help understand the standard, Jukka Korpela has written `an introductory
guide <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>`_ to reading the
Unicode character tables.

Another `good introductory article <http://www.joelonsoftware.com/articles/Unicode.html>`_
was written by Joel Spolsky.
If this introduction didn't make things clear to you, you should try
reading this alternate article before continuing.

Wikipedia entries are often helpful; see the entries for "`character encoding
<http://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
<http://en.wikipedia.org/wiki/UTF-8>`_, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.

The String Type
---------------

Since Python 3.0, the language features a :class:`str` type that contains
Unicode characters, meaning any string created using ``"unicode rocks!"``,
``'unicode rocks!'``, or the triple-quoted string syntax is stored as Unicode.

The default encoding for Python source code is UTF-8, so you can simply
include a Unicode character in a string literal::

   try:
       with open('/tmp/input.txt', 'r') as f:
           ...
   except IOError:
       # 'File not found' error message.
       print("Fichier non trouvé")

You can use a different encoding from UTF-8 by putting a specially-formatted
comment as the first or second line of the source code::

   # -*- coding: <encoding name> -*-

Side note: Python 3 also supports using Unicode characters in identifiers::

   répertoire = "/tmp/records.log"
   with open(répertoire, "w") as f:
       f.write("test\n")

If you can't enter a particular character in your editor or want to
keep the source code ASCII-only for some reason, you can also use
escape sequences in string literals.  (Depending on your system,
you may see the actual capital-delta glyph instead of a \u escape.) ::

   >>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
   '\u0394'
   >>> "\u0394"                          # Using a 16-bit hex value
   '\u0394'
   >>> "\U00000394"                      # Using a 32-bit hex value
   '\u0394'

In addition, one can create a string using the :func:`~bytes.decode` method of
:class:`bytes`.  This method takes an *encoding* argument, such as ``UTF-8``,
and optionally an *errors* argument.

The *errors* argument specifies the response when the input string can't be
converted according to the encoding's rules.  Legal values for this argument are
``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
``U+FFFD``, ``REPLACEMENT CHARACTER``), or ``'ignore'`` (just leave the
character out of the Unicode result).
The following examples show the differences::

   >>> b'\x80abc'.decode("utf-8", "strict")  #doctest: +NORMALIZE_WHITESPACE
   Traceback (most recent call last):
       ...
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
     invalid start byte
   >>> b'\x80abc'.decode("utf-8", "replace")
   '\ufffdabc'
   >>> b'\x80abc'.decode("utf-8", "ignore")
   'abc'

(In the ``'replace'`` example, the replacement character ``U+FFFD`` appears in
the result in its escaped form, ``\ufffd``.)

Encodings are specified as strings containing the encoding's name.  Python 3.2
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list.  Some encodings have multiple names; for
example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
the same encoding.
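
Because these names all select the same codec, encoding with any of them
produces identical bytes; a minimal check::

   >>> 'é'.encode('latin-1') == 'é'.encode('iso_8859_1') == 'é'.encode('8859')
   True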

One-character Unicode strings can also be created with the :func:`chr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point.  The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

   >>> chr(57344)
   '\ue000'
   >>> ord('\ue000')
   57344

Converting to Bytes
-------------------

The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
which returns a :class:`bytes` representation of the Unicode string, encoded in the
requested *encoding*.

The *errors* parameter is the same as the parameter of the
:meth:`~bytes.decode` method but supports a few more possible handlers.  As well as
``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case
inserts a question mark instead of the unencodable character), there is
also ``'xmlcharrefreplace'`` (inserts an XML character reference) and
``'backslashreplace'`` (inserts a ``\uNNNN`` escape sequence).

The following example shows the different results::

   >>> u = chr(40960) + 'abcd' + chr(1972)
   >>> u.encode('utf-8')
   b'\xea\x80\x80abcd\xde\xb4'
   >>> u.encode('ascii')  #doctest: +NORMALIZE_WHITESPACE
   Traceback (most recent call last):
       ...
   UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
     position 0: ordinal not in range(128)
   >>> u.encode('ascii', 'ignore')
   b'abcd'
   >>> u.encode('ascii', 'replace')
   b'?abcd?'
   >>> u.encode('ascii', 'xmlcharrefreplace')
   b'&#40960;abcd&#1972;'
   >>> u.encode('ascii', 'backslashreplace')
   b'\\ua000abcd\\u07b4'

The low-level routines for registering and accessing the available
encodings are found in the :mod:`codecs` module.  Implementing new
encodings also requires understanding the :mod:`codecs` module.
However, the encoding and decoding functions returned by this module
are usually more low-level than is comfortable, and writing new encodings
is a specialized task, so the module won't be covered in this HOWTO.
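
For a taste of that lower-level interface, here is a minimal sketch of looking
up a codec by name; note that the codec's ``encode`` function returns a
``(bytes, length consumed)`` pair rather than plain bytes::

   >>> import codecs
   >>> info = codecs.lookup('utf-8')   # returns a CodecInfo object
   >>> info.name
   'utf-8'
   >>> info.encode('café')
   (b'caf\xc3\xa9', 4)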


Unicode Literals in Python Source Code
--------------------------------------

In Python source code, specific Unicode code points can be written using the
``\u`` escape sequence, which is followed by four hex digits giving the code
point.  The ``\U`` escape sequence is similar, but expects eight hex digits,
not four::

   >>> s = "a\xac\u1234\u20ac\U00008000"
   ... #     ^^^^ two-digit hex escape
   ... #         ^^^^^^ four-digit Unicode escape
   ... #                     ^^^^^^^^^^ eight-digit Unicode escape
   >>> [ord(c) for c in s]
   [97, 172, 4660, 8364, 32768]

Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language.  You
can also assemble strings using the :func:`chr` built-in function, but this is
even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding.  You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost
any encoding if you declare the encoding being used.  This is done by including
a special comment as either the first or second line of the source file::

   #!/usr/bin/env python
   # -*- coding: latin-1 -*-

   u = 'abcdé'
   print(ord(u[-1]))

The syntax is inspired by Emacs's notation for specifying variables local to a
file.  Emacs supports many different variables, but Python only supports
'coding'.  The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention.  Python looks for
``coding: name`` or ``coding=name`` in the comment.

If you don't include such a comment, the default encoding used will be UTF-8 as
already mentioned.  See also :pep:`263` for more information.


Unicode Properties
------------------

The Unicode specification includes a database of information about code points.
For each defined code point, the information includes the character's
name, its category, the numeric value if applicable (Unicode has characters
representing the Roman numerals and fractions such as one-third and
four-fifths).  There are also properties related to the code point's use in
bidirectional text and other display-related properties.

The following program displays some information about several characters, and
prints the numeric value of one particular character::

   import unicodedata

   u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

   for i, c in enumerate(u):
       print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
       print(unicodedata.name(c))

   # Get numeric value of second character
   print(unicodedata.numeric(u[1]))

When run, this prints:

.. code-block:: none

   0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
   1 0bf2 No TAMIL NUMBER ONE THOUSAND
   2 0f84 Mn TIBETAN MARK HALANTA
   3 1770 Lo TAGBANWA LETTER SA
   4 33af So SQUARE RAD OVER S SQUARED
   1000.0

The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories.  To take the codes
from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other".  See
`the General Category Values section of the Unicode Character Database documentation <http://www.unicode.org/reports/tr44/#General_Category_Values>`_ for a
list of category codes.


Unicode Regular Expressions
---------------------------

The regular expressions supported by the :mod:`re` module can be provided
either as bytes or strings.  Some of the special character sequences such as
``\d`` and ``\w`` have different meanings depending on whether
the pattern is supplied as bytes or a string.  For example,
``\d`` will match the characters ``[0-9]`` in bytes but
in strings will match any character that's in the ``'Nd'`` category.

The string in this example has the number 57 written in both Thai and
Arabic numerals::

   import re
   p = re.compile(r'\d+')

   s = "Over \u0e55\u0e57 57 flavours"
   m = p.search(s)
   print(repr(m.group()))

When executed, ``\d+`` will match the Thai numerals and print them
out.  If you supply the :const:`re.ASCII` flag to
:func:`~re.compile`, ``\d+`` will match the substring "57" instead.
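
Continuing the snippet above, a brief sketch of the :const:`re.ASCII` variant::

   p_ascii = re.compile(r'\d+', re.ASCII)
   print(repr(p_ascii.search(s).group()))   # prints '57'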

Similarly, ``\w`` matches a wide variety of Unicode characters but
only ``[a-zA-Z0-9_]`` in bytes or if :const:`re.ASCII` is supplied,
and ``\s`` will match either Unicode whitespace characters or
``[ \t\n\r\f\v]``.


References
----------

.. comment should these be mentioned earlier, e.g. at the start of the "introduction to Unicode" first section?

Some good alternative discussions of Python's Unicode support are:

* `Processing Text Files in Python 3 <http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html>`_, by Nick Coghlan.
* `Pragmatic Unicode <http://nedbatchelder.com/text/unipain.html>`_, a PyCon 2012 presentation by Ned Batchelder.

The :class:`str` type is described in the Python library reference at
:ref:`textseq`.

The documentation for the :mod:`unicodedata` module.

The documentation for the :mod:`codecs` module.

Marc-André Lemburg gave `a presentation titled "Python and Unicode" (PDF slides) <http://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>`_ at
EuroPython 2002.  The slides are an excellent overview of the design
of Python 2's Unicode features (where the Unicode string type is
called ``unicode`` and literals start with ``u``).


Reading and Writing Unicode Data
================================

Once you've written some code that works with Unicode data, the next problem is
input/output.  How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively.  XML parsers often return Unicode
data, for example.  Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket.  It's possible to do all the work
yourself: open a file, read an 8-bit bytes object from it, and convert the bytes
with ``bytes.decode(encoding)``.  However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes.  If you want to read the file in arbitrary-sized
chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk.  One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2 GiB file, you need 2 GiB of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)
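
If you did need to handle the chunking yourself, the incremental decoders in
the :mod:`codecs` module buffer a partial multi-byte sequence until the rest of
it arrives; a minimal sketch::

   >>> import codecs
   >>> dec = codecs.getincrementaldecoder('utf-8')()
   >>> dec.decode(b'caf\xc3')   # last byte is only half of the 'é' sequence
   'caf'
   >>> dec.decode(b'\xa9')      # the remaining byte completes the character
   'é'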

The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences.  The work of implementing this has already been
done for you: the built-in :func:`open` function can return a file-like object
that assumes the file's contents are in a specified encoding and accepts Unicode
parameters for methods such as :meth:`read` and :meth:`write`.  This works through
:func:`open`\'s *encoding* and *errors* parameters which are interpreted just
like those in :meth:`str.encode` and :meth:`bytes.decode`.

Reading Unicode from a file is therefore simple::

   with open('unicode.rst', encoding='utf-8') as f:
       for line in f:
           print(repr(line))

It's also possible to open files in update mode, allowing both reading and
writing::

   with open('test', encoding='utf-8', mode='w+') as f:
       f.write('\u4500 blah blah blah\n')
       f.seek(0)
       print(repr(f.readline()[:1]))

The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering.  Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read.  There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.

In some areas, it is also convention to use a "BOM" at the start of UTF-8
encoded files; the name is misleading since UTF-8 is not byte-order dependent.
The mark simply announces that the file is encoded in UTF-8.  Use the
'utf-8-sig' codec to automatically skip the mark if present for reading such
files.
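
A quick sketch of the difference (the UTF-8 signature is the three bytes
``EF BB BF``)::

   >>> 'abc'.encode('utf-8-sig')                # the signature is prepended
   b'\xef\xbb\xbfabc'
   >>> b'\xef\xbb\xbfabc'.decode('utf-8-sig')   # and silently skipped
   'abc'
   >>> b'\xef\xbb\xbfabc'.decode('utf-8')       # plain utf-8 keeps it as U+FEFF
   '\ufeffabc'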


Unicode filenames
-----------------

Most of the operating systems in common use today support filenames that contain
arbitrary Unicode characters.  Usually this is implemented by converting the
Unicode string into some encoding that varies depending on the system.  For
example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on
Windows, Python uses the name "mbcs" to refer to whatever the currently
configured encoding is.  On Unix systems, there will only be a filesystem
encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
you haven't, the default encoding is UTF-8.

The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother.  When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::

   filename = 'filename\u4500abc'
   with open(filename, 'w') as f:
       f.write('blah\n')

Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.

The :func:`os.listdir` function returns filenames, which raises an issue: should it return
the Unicode version of filenames, or should it return bytes containing
the encoded versions?  :func:`os.listdir` will do both, depending on whether you
provided the directory path as bytes or a Unicode string.  If you pass a
Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing a byte
path will return the filenames as bytes.  For example,
assuming the default filesystem encoding is UTF-8, running the following
program::

   fn = 'filename\u4500abc'
   f = open(fn, 'w')
   f.close()

   import os
   print(os.listdir(b'.'))
   print(os.listdir('.'))

will produce the following output::

   amk:~$ python t.py
   [b'filename\xe4\x94\x80abc', ...]
   ['filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.

Note that on most occasions, the Unicode APIs should be used.  The bytes APIs
should only be used on systems where undecodable file names can be present,
i.e. Unix systems.


Tips for Writing Unicode-aware Programs
---------------------------------------

This section provides some suggestions on writing software that deals with
Unicode.

The most important tip is:

   Software should only work with Unicode strings internally, decoding the input
   data as soon as possible and encoding the output only at the end.

If you attempt to write processing functions that accept both Unicode and byte
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings.  There is no automatic encoding or decoding: if
you do e.g. ``str + bytes``, a :exc:`TypeError` will be raised.
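
For example (a quick interactive check)::

   >>> # 'caf' + b'\xc3\xa9' would raise TypeError; decode the bytes first:
   >>> 'caf' + b'\xc3\xa9'.decode('utf-8')
   'café'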

When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database.  If you're doing
this, be careful to check the decoded string, not the encoded bytes data;
some encodings may have interesting properties, such as not being bijective
or not being fully ASCII-compatible.  This is especially true if the input
data also specifies the encoding, since the attacker can then choose a
clever way to hide malicious text in the encoded bytestream.


Converting Between File Encodings
'''''''''''''''''''''''''''''''''

The :class:`~codecs.StreamRecoder` class can transparently convert between
encodings, taking a stream that returns data in encoding #1
and behaving like a stream returning data in encoding #2.

For example, if you have an input file *f* that's in Latin-1, you
can wrap it with a :class:`~codecs.StreamRecoder` to return bytes encoded in UTF-8::

   new_f = codecs.StreamRecoder(f,
       # en/decoder: used by read() to encode its results and
       # by write() to decode its input.
       codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

       # reader/writer: used to read and write to the stream.
       codecs.getreader('latin-1'), codecs.getwriter('latin-1') )


Files in an Unknown Encoding
''''''''''''''''''''''''''''

What can you do if you need to make a change to a file, but don't know
the file's encoding?  If you know the encoding is ASCII-compatible and
only want to examine or modify the ASCII parts, you can open the file
with the ``surrogateescape`` error handler::

   with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
       data = f.read()

   # make changes to the string 'data'

   with open(fname + '.new', 'w',
             encoding="ascii", errors="surrogateescape") as f:
       f.write(data)

The ``surrogateescape`` error handler will decode any non-ASCII bytes
as code points in a special range running from U+DC80 to U+DCFF.  These
code points will then be turned back into the same bytes when the
``surrogateescape`` error handler is used when encoding the data and
writing it back out.
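
A brief sketch of the round trip at the bytes level::

   >>> raw = b'abc\x80def'                    # 0x80 is not valid ASCII
   >>> text = raw.decode('ascii', 'surrogateescape')
   >>> text
   'abc\udc80def'
   >>> text.encode('ascii', 'surrogateescape') == raw
   True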


References
----------

One section of `Mastering Python 3 Input/Output <http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o>`_, a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.

The `PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware Applications in Python" <http://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>`_
discuss questions of character encodings as well as how to internationalize
and localize an application.  These slides cover Python 2.x only.

`The Guts of Unicode in Python <http://pyvideo.org/video/1768/the-guts-of-unicode-in-python>`_ is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode representation in Python 3.3.


Acknowledgements
================

The initial draft of this document was written by Andrew Kuchling.
It has since been revised further by Alexander Belopolsky, Georg Brandl,
Andrew Kuchling, and Ezio Melotti.

Thanks to the following people who have noted errors or offered
suggestions on this article: Éric Araujo, Nicholas Bastin, Nick
Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André
Lemburg, Martin von Löwis, Terry J. Reedy, Chad Whitacre.