.. _unicode-howto:

*****************
  Unicode HOWTO
*****************

:Release: 1.12

This HOWTO discusses Python's support for the Unicode specification
for representing textual data, and explains various problems that
people commonly encounter when trying to work with Unicode.


Introduction to Unicode
=======================

Definitions
-----------

Today's programs need to be able to handle a wide variety of
characters. Applications are often internationalized to display
messages and output in a variety of user-selectable languages; the
same program might need to output an error message in English, French,
Japanese, Hebrew, or Russian. Web content can be written in any of
these languages and can also include a variety of emoji symbols.
Python's string type uses the Unicode Standard for representing
characters, which lets Python programs work with all these different
possible characters.

Unicode (https://www.unicode.org/) is a specification that aims to
list every character used by human languages and give each character
its own unique code. The Unicode specifications are continually
revised and updated to add new languages and symbols.

A character is the smallest possible component of a text. 'A', 'B', 'C',
etc., are all different characters. So are 'È' and 'Í'. Characters vary
depending on the language or context you're talking
about. For example, there's a character for "Roman Numeral One", 'Ⅰ', that's
separate from the uppercase letter 'I'. They'll usually look the same,
but these are two different characters that have different meanings.
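
As a small illustration of the Roman numeral example, the sketch below
compares the two characters and looks up their official names with the
standard :mod:`unicodedata` module. The variable names are chosen here
for illustration only.

```python
import unicodedata

roman_one = "\u2160"   # the character 'Ⅰ'
latin_i = "I"          # the ordinary uppercase letter

# The two characters may look alike, but they are not equal.
print(roman_one == latin_i)          # False
print(unicodedata.name(roman_one))   # ROMAN NUMERAL ONE
print(unicodedata.name(latin_i))     # LATIN CAPITAL LETTER I
```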

The Unicode standard describes how characters are represented by
code points. A code point value is an integer in the range 0 to
0x10FFFF (about 1.1 million values, with some 110 thousand assigned so
far). In the standard and in this document, a code point is written
using the notation ``U+265E`` to mean the character with value
``0x265e`` (9,822 in decimal).
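
The relationship between the ``U+`` notation and plain integers can be
checked interactively. This is a minimal sketch using only the built-in
:func:`chr` and :func:`ord` functions; the variable names are
illustrative.

```python
code_point = 0x265E          # the integer behind the notation U+265E
knight = chr(code_point)     # the one-character string for that code point

print(code_point)                 # 9822
print(f"U+{ord(knight):04X}")     # U+265E

# Code points run from 0 through 0x10FFFF inclusive.
total_values = 0x10FFFF + 1
print(total_values)               # 1114112
```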

The Unicode standard contains a lot of tables listing characters and
their corresponding code points:

.. code-block:: none

   0061    'a'; LATIN SMALL LETTER A
   0062    'b'; LATIN SMALL LETTER B
   0063    'c'; LATIN SMALL LETTER C
   ...
   007B    '{'; LEFT CURLY BRACKET
   ...
   2167    'Ⅷ'; ROMAN NUMERAL EIGHT
   2168    'Ⅸ'; ROMAN NUMERAL NINE
   ...
   265E    '♞'; BLACK CHESS KNIGHT
   265F    '♟'; BLACK CHESS PAWN
   ...
   1F600   '😀'; GRINNING FACE
   1F609   '😉'; WINKING FACE
   ...

Strictly, these definitions imply that it's meaningless to say 'this is
character ``U+265E``'. ``U+265E`` is a code point, which represents some
particular character; in this case, it represents the character
'BLACK CHESS KNIGHT', '♞'. In informal contexts, this distinction between
code points and characters will sometimes be forgotten.

A character is represented on a screen or on paper by a set of graphical
elements that's called a glyph. The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used. Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.


Encodings
---------

To summarize the previous section: a Unicode string is a sequence of
code points, which are numbers from 0 through ``0x10FFFF`` (1,114,111
decimal). This sequence of code points needs to be represented in
memory as a sequence of code units, and code units are then mapped
to 8-bit bytes. The rules for translating a Unicode string into a
sequence of bytes are called a character encoding, or just
an encoding.

The first encoding you might think of is using 32-bit integers as the
code unit, and then using the CPU's representation of 32-bit integers.
In this representation, the string "Python" might look like this:

.. code-block:: none

      P           y           t           h           o           n
   0x 50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
      0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space. In most texts, the majority of the code points
   are less than 128, or less than 256, so a lot of space is occupied by ``0x00``
   bytes. The above string takes 24 bytes compared to the 6 bytes needed for an
   ASCII representation. Increased RAM usage doesn't matter too much (desktop
   computers have gigabytes of RAM, and strings aren't usually that large), but
   expanding our usage of disk and network bandwidth by a factor of 4 is
   intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new
   family of wide string functions would need to be used.

Therefore this encoding isn't used very much, and people instead choose other
encodings that are more efficient and convenient, such as UTF-8.
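
The first two problems can be observed directly. The sketch below uses
Python's ``utf-32-le`` and ``utf-32-be`` codecs as stand-ins for "the
CPU's representation of 32-bit integers" on little-endian and big-endian
machines; that mapping is an assumption made for illustration.

```python
text = "Python"

utf32_le = text.encode("utf-32-le")   # little-endian, no byte order mark
utf32_be = text.encode("utf-32-be")   # big-endian, no byte order mark
ascii_bytes = text.encode("ascii")

# Four bytes per code point versus one byte per character.
print(len(utf32_le), len(ascii_bytes))   # 24 6

# The same code point 'P' (0x50) is laid out differently per byte order.
print(utf32_le[:4].hex(" "))             # 50 00 00 00
print(utf32_be[:4].hex(" "))             # 00 00 00 50
```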

UTF-8 is one of the most commonly used encodings, and Python often
defaults to using it. UTF stands for "Unicode Transformation Format",
and the '8' means that 8-bit values are used in the encoding. (There
are also UTF-16 and UTF-32 encodings, but they are less frequently
used than UTF-8.) UTF-8 uses the following rules:

1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or
   four bytes, where each byte of the sequence is between 128 and 255.
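
The two rules above can be checked with :meth:`str.encode`. This is a
small sketch; the sample characters are chosen here only to cover the
one-, two-, three-, and four-byte cases.

```python
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]   # a, é, €, 😀

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)   # [1, 2, 3, 4]

# Every byte of a multi-byte sequence is in the range 128..255.
euro_bytes = "\u20ac".encode("utf-8")
high_bytes_only = all(128 <= b <= 255 for b in euro_bytes)
print(high_bytes_only)   # True
```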

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a sequence of bytes that contains zero
   bytes only where they represent the null character (U+0000). This means
   UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent
   through protocols that can't handle zero bytes. Because UTF-8 is a
   byte-oriented encoding, it also avoids byte-ordering issues.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be
   represented with one or two bytes.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize. It's also unlikely that
   random 8-bit data will look like valid UTF-8.
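
Several of these properties are easy to observe from Python. The
following is a rough sketch rather than a proof; ``is_valid_utf8`` is a
hypothetical helper defined here for illustration, not a standard
library function.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Property 3: ASCII text is encoded identically in ASCII and UTF-8.
ascii_text = "Unicode rocks!"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Property 2: no zero bytes appear unless the text contains U+0000.
encoded = "na\u00efve caf\u00e9".encode("utf-8")
has_zero_byte = 0 in encoded

# Property 5: after losing the first byte of a two-byte sequence, a
# decoder can still pick up the next character.
damaged = "\u00e9\u20ac".encode("utf-8")[1:]
recovered = damaged.decode("utf-8", errors="replace")

print(same_bytes, has_zero_byte)         # True False
print(recovered == "\ufffd\u20ac")       # True
print(is_valid_utf8(b"\xff\xfe\x80"))    # False
```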


References
----------

The `Unicode Consortium site <http://www.unicode.org>`_ has character charts, a
glossary, and PDF versions of the Unicode specification. Be prepared for some
difficult reading. `A chronology <http://www.unicode.org/history/>`_ of the
origin and development of Unicode is also available on the site.

On the Computerphile YouTube channel, Tom Scott briefly
`discusses the history of Unicode and UTF-8 <https://www.youtube.com/watch?v=MijmeoH9LT4>`_
(9 minutes 36 seconds).

To help understand the standard, Jukka Korpela has written `an introductory
guide <http://jkorpela.fi/unicode/guide.html>`_ to reading the
Unicode character tables.

Another `good introductory article <https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/>`_
was written by Joel Spolsky.
If this introduction didn't make things clear to you, you should try
reading this alternate article before continuing.

Wikipedia entries are often helpful; see the entries for "`character encoding
<https://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
<https://en.wikipedia.org/wiki/UTF-8>`_, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.

The String Type
---------------

Since Python 3.0, the language's :class:`str` type contains Unicode
characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
rocks!'``, or the triple-quoted string syntax is stored as Unicode.

The default encoding for Python source code is UTF-8, so you can simply
include a Unicode character in a string literal::

   try:
       with open('/tmp/input.txt', 'r') as f:
           ...
   except OSError:
       # 'File not found' error message.
       print("Fichier non trouvé")

Side note: Python 3 also supports using Unicode characters in identifiers::

   répertoire = "/tmp/records.log"
   with open(répertoire, "w") as f:
       f.write("test\n")

If you can't enter a particular character in your editor or want to
keep the source code ASCII-only for some reason, you can also use
escape sequences in string literals. (Depending on your system,
you may see the actual capital-delta glyph instead of a \u escape.) ::

   >>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
   '\u0394'
   >>> "\u0394"                          # Using a 16-bit hex value
   '\u0394'
   >>> "\U00000394"                      # Using a 32-bit hex value
   '\u0394'

In addition, one can create a string using the :meth:`~bytes.decode` method of
:class:`bytes`. This method takes an encoding argument, such as ``UTF-8``,
and optionally an errors argument.

The errors argument specifies the response when the input bytes can't be
converted according to the encoding's rules. Legal values for this argument are
``'strict'`` (raise a :exc:`UnicodeDecodeError` exception), ``'replace'`` (use
``U+FFFD``, ``REPLACEMENT CHARACTER``), ``'ignore'`` (just leave the
character out of the Unicode result), or ``'backslashreplace'`` (inserts a
``\xNN`` escape sequence).
The following examples show the differences::

   >>> b'\x80abc'.decode("utf-8", "strict")  #doctest: +NORMALIZE_WHITESPACE
   Traceback (most recent call last):
       ...
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
     invalid start byte
   >>> b'\x80abc'.decode("utf-8", "replace")
   '\ufffdabc'
   >>> b'\x80abc'.decode("utf-8", "backslashreplace")
   '\\x80abc'
   >>> b'\x80abc'.decode("utf-8", "ignore")
   'abc'

Encodings are specified as strings containing the encoding's name. Python
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list. Some encodings have multiple names; for
example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
the same encoding.
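
One way to confirm that several names refer to the same encoding is to
look each of them up with :func:`codecs.lookup` and compare the
resulting canonical names. This is a minimal sketch; the exact spelling
of the canonical name is an implementation detail, so the code only
checks that all three lookups agree.

```python
import codecs

aliases = ["latin-1", "iso_8859_1", "8859"]

# Each alias should resolve to the same registered codec.
canonical = {codecs.lookup(alias).name for alias in aliases}
print(len(canonical))   # 1

# All three names therefore produce the same bytes for 'é'.
encoded = [("\u00e9").encode(alias) for alias in aliases]
print(encoded)          # [b'\xe9', b'\xe9', b'\xe9']
```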

One-character Unicode strings can also be created with the :func:`chr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point. The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

   >>> chr(57344)
   '\ue000'
   >>> ord('\ue000')
   57344

Converting to Bytes
-------------------

The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
which returns a :class:`bytes` representation of the Unicode string, encoded in the
requested encoding.

The errors parameter is the same as the parameter of the
:meth:`~bytes.decode` method but supports a few more possible handlers. As well as
``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case
inserts a question mark instead of the unencodable character), there is
also ``'xmlcharrefreplace'`` (inserts an XML character reference),
``'backslashreplace'`` (inserts a ``\uNNNN`` escape sequence) and
``'namereplace'`` (inserts a ``\N{...}`` escape sequence).

The following example shows the different results::

   >>> u = chr(40960) + 'abcd' + chr(1972)
   >>> u.encode('utf-8')
   b'\xea\x80\x80abcd\xde\xb4'
   >>> u.encode('ascii')  #doctest: +NORMALIZE_WHITESPACE
   Traceback (most recent call last):
       ...
   UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
     position 0: ordinal not in range(128)
   >>> u.encode('ascii', 'ignore')
   b'abcd'
   >>> u.encode('ascii', 'replace')
   b'?abcd?'
   >>> u.encode('ascii', 'xmlcharrefreplace')
   b'&#40960;abcd&#1972;'
   >>> u.encode('ascii', 'backslashreplace')
   b'\\ua000abcd\\u07b4'
   >>> u.encode('ascii', 'namereplace')
   b'\\N{YI SYLLABLE IT}abcd\\u07b4'

The low-level routines for registering and accessing the available
encodings are found in the :mod:`codecs` module. Implementing new
encodings also requires understanding the :mod:`codecs` module.
However, the encoding and decoding functions returned by this module
are usually more low-level than is comfortable, and writing new encodings
is a specialized task, so the module won't be covered in this HOWTO.


Unicode Literals in Python Source Code
--------------------------------------

In Python source code, specific Unicode code points can be written using the
``\u`` escape sequence, which is followed by four hex digits giving the code
point. The ``\U`` escape sequence is similar, but expects eight hex digits,
not four::

   >>> s = "a\xac\u1234\u20ac\U00008000"
   ... #     ^^^^ two-digit hex escape
   ... #         ^^^^^^ four-digit Unicode escape
   ... #                     ^^^^^^^^^^ eight-digit Unicode escape
   >>> [ord(c) for c in s]
   [97, 172, 4660, 8364, 32768]

Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language. You
can also assemble strings using the :func:`chr` built-in function, but this is
even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding. You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing source code in UTF-8 by default, but you can use almost
any encoding if you declare the encoding being used. This is done by including
a special comment as either the first or second line of the source file::

   #!/usr/bin/env python
   # -*- coding: latin-1 -*-

   u = 'abcdé'
   print(ord(u[-1]))

The syntax is inspired by Emacs's notation for specifying variables local to a
file. Emacs supports many different variables, but Python only supports
'coding'. The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention. Python looks for
``coding: name`` or ``coding=name`` in the comment.

If you don't include such a comment, the default encoding used will be UTF-8 as
already mentioned. See also :pep:`263` for more information.
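
To see the declaration take effect without creating a file, the sketch
below feeds Latin-1 encoded source bytes to
:func:`tokenize.detect_encoding` and to the built-in :func:`compile`.
The in-memory ``source`` bytes and the ``<example>`` file name are
stand-ins invented for this illustration.

```python
import io
import tokenize

# Source bytes as they would appear on disk in a Latin-1 encoded file:
# the final character of the literal is the single byte 0xE9.
source = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nu = 'abcd\xe9'\n"

# detect_encoding() reads at most two lines looking for the declaration
# and returns a normalized encoding name.
encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)

# compile() honors the same declaration when given bytes.
namespace = {}
exec(compile(source, "<example>", "exec"), namespace)
print(ord(namespace["u"][-1]))   # 233

# Without a declaration, the detected encoding is UTF-8.
default_encoding, _ = tokenize.detect_encoding(io.BytesIO(b"x = 1\n").readline)
print(default_encoding)          # utf-8
```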


Unicode Properties
------------------

The Unicode specification includes a database of information about
code points. For each defined code point, the information includes
the character's name, its category, the numeric value if applicable
(for characters representing numeric concepts such as the Roman
numerals, fractions such as one-third and four-fifths, etc.). There
are also display-related properties, such as how to use the code point
in bidirectional text.

The following program displays some information about several characters, and
prints the numeric value of one particular character::

   import unicodedata

   u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

   for i, c in enumerate(u):
       print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
       print(unicodedata.name(c))

   # Get numeric value of second character
   print(unicodedata.numeric(u[1]))

When run, this prints:

.. code-block:: none

   0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
   1 0bf2 No TAMIL NUMBER ONE THOUSAND
   2 0f84 Mn TIBETAN MARK HALANTA
   3 1770 Lo TAGBANWA LETTER SA
   4 33af So SQUARE RAD OVER S SQUARED
   1000.0

The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories. To take the codes
from the above output, ``'Ll'`` means "Letter, lowercase", ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other". See
`the General Category Values section of the Unicode Character Database documentation <http://www.unicode.org/reports/tr44/#General_Category_Values>`_ for a
list of category codes.
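
Category codes are convenient for classifying the characters of a
string. The sketch below groups characters by the value returned from
:func:`unicodedata.category`; the sample text and the ``by_category``
name are illustrative choices.

```python
import unicodedata
from collections import defaultdict

# 'A', 'b', '3', space, 'é', '$', and a combining circumflex accent
text = "Ab3 \u00e9$\u0302"

by_category = defaultdict(list)
for ch in text:
    by_category[unicodedata.category(ch)].append(ch)

for code in sorted(by_category):
    print(code, [f"U+{ord(ch):04X}" for ch in by_category[code]])
```

The first letter of each code gives the major category, so a test such
as ``unicodedata.category(ch).startswith('L')`` selects letters of any
subcategory.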


Comparing Strings
-----------------

Unicode adds some complication to comparing strings, because the same
set of characters can be represented by different sequences of code
points. For example, a letter like 'ê' can be represented as a single
code point U+00EA, or as U+0065 U+0302, which is the code point for
'e' followed by a code point for 'COMBINING CIRCUMFLEX ACCENT'. These
will produce the same output when printed, but one is a string of
length 1 and the other is of length 2.

One tool for a case-insensitive comparison is the
:meth:`~str.casefold` string method that converts a string to a
case-insensitive form following an algorithm described by the Unicode
Standard. This algorithm has special handling for characters such as
the German letter 'ß' (code point U+00DF), which becomes the pair of
lowercase letters 'ss'.

::

   >>> street = 'Gürzenichstraße'
   >>> street.casefold()
   'gürzenichstrasse'
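
The difference between :meth:`~str.casefold` and :meth:`~str.lower`
shows up when two spellings of the same word are compared. This is a
small sketch; the sample words are chosen here for illustration.

```python
a = "Straße"
b = "STRASSE"

# lower() keeps 'ß' as it is, so the two strings still differ.
lower_equal = a.lower() == b.lower()

# casefold() maps 'ß' to 'ss', so the comparison succeeds.
casefold_equal = a.casefold() == b.casefold()

print(lower_equal)      # False
print(casefold_equal)   # True
```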

A second tool is the :mod:`unicodedata` module's
:func:`~unicodedata.normalize` function that converts strings to one
of several normal forms. In the composed forms, letters followed by a
combining character are replaced with single characters where possible;
in the decomposed forms, such single characters are split into a letter
followed by combining characters. :func:`normalize` can
be used to perform string comparisons that won't falsely report
inequality if two strings use combining characters differently:

::

   import unicodedata

   def compare_strs(s1, s2):
       def NFD(s):
           return unicodedata.normalize('NFD', s)

       return NFD(s1) == NFD(s2)

   single_char = 'ê'
   multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
   print('length of first string=', len(single_char))
   print('length of second string=', len(multiple_chars))
   print(compare_strs(single_char, multiple_chars))

When run, this outputs:

.. code-block:: shell-session

   $ python3 compare-strs.py
   length of first string= 1
   length of second string= 2
   True

The first argument to the :func:`~unicodedata.normalize` function is a
string giving the desired normalization form, which can be one of
'NFC', 'NFKC', 'NFD', and 'NFKD'.
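
To give a feel for how the forms differ, the sketch below applies them
to the 'ê' example and to the Roman numeral character mentioned in the
introduction. The variable names are illustrative.

```python
import unicodedata

composed = "\u00ea"       # 'ê' as a single code point
decomposed = "e\u0302"    # 'e' followed by COMBINING CIRCUMFLEX ACCENT

# NFC composes, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)   # True True

# The 'K' forms additionally replace compatibility characters, such as
# ROMAN NUMERAL ONE, with their ordinary equivalents.
roman_one = "\u2160"
print(unicodedata.normalize("NFC", roman_one) == roman_one)   # True
print(unicodedata.normalize("NFKC", roman_one))               # I
```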

The Unicode Standard also specifies how to do caseless comparisons::

   import unicodedata

   def compare_caseless(s1, s2):
       def NFD(s):
           return unicodedata.normalize('NFD', s)

       return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())

   # Example usage
   single_char = 'ê'
   multiple_chars = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'

   print(compare_caseless(single_char, multiple_chars))

This will print ``True``. (Why is :func:`NFD` invoked twice? Because
there are a few characters that make :meth:`casefold` return a
non-normalized string, so the result needs to be normalized again. See
section 3.13 of the Unicode Standard for a discussion and an example.)


Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	477	Unicode Regular Expressions
				478	---------------------------
				479
				480	The regular expressions supported by the :mod:`re` module can be provided
				481	either as bytes or strings. Some of the special character sequences such as
				482	``\d`` and ``\w`` have different meanings depending on whether
				483	the pattern is supplied as bytes or a string. For example,
				484	``\d`` will match the characters ``[0-9]`` in bytes but
				485	in strings will match any character that's in the ``'Nd'`` category.
				486
				487	The string in this example has the number 57 written in both Thai and
				488	Arabic numerals::
				489
				490	import re
Cheryl Sabella	6677142	2018-02-02 16:16:27 -0500	[diff] [blame]	491	p = re.compile(r'\d+')
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	492
				493	s = "Over \u0e55\u0e57 57 flavours"
				494	m = p.search(s)
				495	print(repr(m.group()))
				496
				497	When executed, ``\d+`` will match the Thai numerals and print them
				498	out. If you supply the :const:`re.ASCII` flag to
				499	:func:`~re.compile`, ``\d+`` will match the substring "57" instead.
				500
				501	Similarly, ``\w`` matches a wide variety of Unicode characters but
				502	only ``[a-zA-Z0-9_]`` in bytes or if :const:`re.ASCII` is supplied,
				503	and ``\s`` will match either Unicode whitespace characters or
				504	``[ \t\n\r\f\v]``.
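
The differences described above can be checked with a short sketch. It
reuses the string from the earlier example; the second sample string and the
variable names are made up for illustration. :func:`ascii` is used when
printing so that the output does not depend on the terminal's encoding.

```python
import re
import unicodedata

s = "Over \u0e55\u0e57 57 flavours"

# The Thai digits belong to the 'Nd' category, so a str pattern matches them.
print(unicodedata.category('\u0e55'))
unicode_digits = re.compile(r'\d+').search(s).group()

# With re.ASCII, or with a bytes pattern, only [0-9] counts as a digit.
ascii_digits = re.compile(r'\d+', re.ASCII).search(s).group()
bytes_digits = re.compile(rb'\d+').search(s.encode('utf-8')).group()
print(ascii(unicode_digits), ascii_digits, bytes_digits)

# \w behaves the same way: accented letters only match in Unicode mode.
words = 'na\u00efve caf\u00e9'
unicode_words = re.findall(r'\w+', words)
ascii_words = re.findall(r'\w+', words, re.ASCII)
print(ascii(unicode_words), ascii_words)
```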
				505
				506
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	507	References
				508	----------
				509
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	510	.. comment should these be mentioned earlier, e.g. at the start of the "introduction to Unicode" first section?
				511
				512	Some good alternative discussions of Python's Unicode support are:
				513
				514	* `Processing Text Files in Python 3 <http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html>`_, by Nick Coghlan.
Sanyam Khurana	1b4587a	2017-12-06 22:09:33 +0530	[diff] [blame]	515	* `Pragmatic Unicode <https://nedbatchelder.com/text/unipain.html>`_, a PyCon 2012 presentation by Ned Batchelder.
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	516
Ezio Melotti	410eee5	2013-01-20 12:16:03 +0200	[diff] [blame]	517	The :class:`str` type is described in the Python library reference at
Ezio Melotti	a6229e6	2012-10-12 10:59:14 +0300	[diff] [blame]	518	:ref:`textseq`.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	519
				520	The documentation for the :mod:`unicodedata` module.
				521
				522	The documentation for the :mod:`codecs` module.
				523
Georg Brandl	9bdcb3b	2014-10-29 09:37:43 +0100	[diff] [blame]	524	Marc-André Lemburg gave `a presentation titled "Python and Unicode" (PDF slides)
				525	<https://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>`_ at
				526	EuroPython 2002. The slides are an excellent overview of the design of Python
				527	2's Unicode features (where the Unicode string type is called ``unicode`` and
				528	literals start with ``u``).
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	529
				530
				531	Reading and Writing Unicode Data
				532	================================
				533
				534	Once you've written some code that works with Unicode data, the next problem is
				535	input/output. How do you get Unicode strings into your program, and how do you
				536	convert Unicode into a form suitable for storage or transmission?
				537
				538	It's possible that you may not need to do anything depending on your input
				539	sources and output destinations; you should check whether the libraries used in
				540	your application support Unicode natively. XML parsers often return Unicode
				541	data, for example. Many relational databases also support Unicode-valued
				542	columns and can return Unicode values from an SQL query.
				543
				544	Unicode data is usually converted to a particular encoding before it gets
				545	written to disk or sent over a socket. It's possible to do all the work
Georg Brandl	3d596fa	2013-10-29 08:16:56 +0100	[diff] [blame]	546	yourself: open a file, read an 8-bit bytes object from it, and convert the bytes
Ezio Melotti	410eee5	2013-01-20 12:16:03 +0200	[diff] [blame]	547	with ``bytes.decode(encoding)``. However, the manual approach is not recommended.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	548
				549	One problem is the multi-byte nature of encodings; one Unicode character can be
				550	represented by several bytes. If you want to read the file in arbitrary-sized
Serhiy Storchaka	f8def28	2013-02-16 17:29:56 +0200	[diff] [blame]	551	chunks (say, 1024 or 4096 bytes), you need to write error-handling code to catch the case
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	552	where only part of the bytes encoding a single Unicode character are read at the
				553	end of a chunk. One solution would be to read the entire file into memory and
				554	then perform the decoding, but that prevents you from working with files that
Serhiy Storchaka	f8def28	2013-02-16 17:29:56 +0200	[diff] [blame]	555	are extremely large; if you need to read a 2 GiB file, you need 2 GiB of RAM.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	556	(More, really, since for at least a moment you'd need to have both the encoded
				557	string and its Unicode version in memory.)
				558
				559	The solution would be to use the low-level decoding interface to catch the case
				560	of partial coding sequences. The work of implementing this has already been
Georg Brandl	0c07422	2008-11-22 10:26:59 +0000	[diff] [blame]	561	done for you: the built-in :func:`open` function can return a file-like object
				562	that assumes the file's contents are in a specified encoding and accepts Unicode
Serhiy Storchaka	bfdcd43	2013-10-13 23:09:14 +0300	[diff] [blame]	563	parameters for methods such as :meth:`~io.TextIOBase.read` and
:meth:`~io.TextIOBase.write`. This works through :func:`open`\'s *encoding* and
*errors* parameters, which are interpreted just like those in :meth:`str.encode`
and :meth:`bytes.decode`.
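
To see what the low-level interface does, here is a minimal sketch using an
incremental decoder from the :mod:`codecs` module. The sample data and the
split point are made up for illustration; in practice :func:`open` does this
work for you. The decoder buffers an incomplete sequence at the end of one
chunk and completes it when the next chunk arrives.

```python
import codecs

# Sample data; the chunk boundary falls inside the two-byte sequence for 'é'.
data = 'caf\u00e9 \u4500'.encode('utf-8')
chunks = [data[:4], data[4:]]

# Decoding the first chunk on its own fails because it ends mid-character.
try:
    chunks[0].decode('utf-8')
except UnicodeDecodeError as exc:
    print('naive decode failed:', exc.reason)

# An incremental decoder keeps the partial sequence until it is complete.
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b'', final=True))  # flush; errors if bytes remain
text = ''.join(pieces)
print(ascii(text))
```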
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	567
				568	Reading Unicode from a file is therefore simple::
				569
    with open('unicode.txt', encoding='utf-8') as f:
        for line in f:
            print(repr(line))
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	573
				574	It's also possible to open files in update mode, allowing both reading and
				575	writing::
				576
    with open('test', encoding='utf-8', mode='w+') as f:
        f.write('\u4500 blah blah blah\n')
        f.seek(0)
        print(repr(f.readline()[:1]))
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	581
Ezio Melotti	410eee5	2013-01-20 12:16:03 +0200	[diff] [blame]	582	The Unicode character ``U+FEFF`` is used as a byte-order mark (BOM), and is often
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	583	written as the first character of a file in order to assist with autodetection
				584	of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
				585	present at the start of a file; when such an encoding is used, the BOM will be
				586	automatically written as the first character and will be silently dropped when
				587	the file is read. There are variants of these encodings, such as 'utf-16-le'
				588	and 'utf-16-be' for little-endian and big-endian encodings, that specify one
				589	particular byte ordering and don't skip the BOM.
				590
Georg Brandl	0c07422	2008-11-22 10:26:59 +0000	[diff] [blame]	591	In some areas, it is also convention to use a "BOM" at the start of UTF-8
				592	encoded files; the name is misleading since UTF-8 is not byte-order dependent.
Andrew Kuchling	97c288d	2019-03-03 23:10:28 -0500	[diff] [blame^]	593	The mark simply announces that the file is encoded in UTF-8. For reading such
				594	files, use the 'utf-8-sig' codec to automatically skip the mark if present.
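
The BOM handling of these codecs can be observed directly on in-memory data.
This is a small sketch with a made-up sample string; the byte order chosen by
the plain ``'utf-16'`` codec depends on the platform, so the code only checks
that some BOM is present.

```python
import codecs

text = 'abc'

# 'utf-16' writes a BOM when encoding and drops it again when decoding.
utf16 = text.encode('utf-16')
print(utf16.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)))
print(utf16.decode('utf-16') == text)

# 'utf-16-le' writes no BOM and does not skip one found in the input.
little_endian = text.encode('utf-16-le')
kept_bom = (codecs.BOM_UTF16_LE + little_endian).decode('utf-16-le')
print(ascii(kept_bom))

# 'utf-8-sig' skips a leading UTF-8 "BOM"; plain 'utf-8' keeps it as U+FEFF.
with_sig = codecs.BOM_UTF8 + text.encode('utf-8')
print(ascii(with_sig.decode('utf-8-sig')), ascii(with_sig.decode('utf-8')))
```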
Georg Brandl	0c07422	2008-11-22 10:26:59 +0000	[diff] [blame]	595
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	596
				597	Unicode filenames
				598	-----------------
				599
Andrew Kuchling	97c288d	2019-03-03 23:10:28 -0500	[diff] [blame^]	600	Most of the operating systems in common use today support filenames
				601	that contain arbitrary Unicode characters. Usually this is
				602	implemented by converting the Unicode string into some encoding that
				603	varies depending on the system. Today Python is converging on using
UTF-8: Python on macOS has used UTF-8 for several versions, and Python
				605	3.6 switched to using UTF-8 on Windows as well. On Unix systems,
				606	there will only be a filesystem encoding if you've set the ``LANG`` or
				607	``LC_CTYPE`` environment variables; if you haven't, the default
				608	encoding is again UTF-8.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	609
				610	The :func:`sys.getfilesystemencoding` function returns the encoding to use on
				611	your current system, in case you want to do the encoding manually, but there's
				612	not much reason to bother. When opening a file for reading or writing, you can
				613	usually just provide the Unicode string as the filename, and it will be
				614	automatically converted to the right encoding for you::
				615
    filename = 'filename\u4500abc'
    with open(filename, 'w') as f:
        f.write('blah\n')
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	619
				620	Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
				621	filenames.
				622
Andrew Kuchling	97c288d	2019-03-03 23:10:28 -0500	[diff] [blame^]	623	The :func:`os.listdir` function returns filenames, which raises an issue: should it return
Ezio Melotti	410eee5	2013-01-20 12:16:03 +0200	[diff] [blame]	624	the Unicode version of filenames, or should it return bytes containing
Andrew Kuchling	97c288d	2019-03-03 23:10:28 -0500	[diff] [blame^]	625	the encoded versions? :func:`os.listdir` can do both, depending on whether you
Ezio Melotti	410eee5	2013-01-20 12:16:03 +0200	[diff] [blame]	626	provided the directory path as bytes or a Unicode string. If you pass a
Georg Brandl	0c07422	2008-11-22 10:26:59 +0000	[diff] [blame]	627	Unicode string as the path, filenames will be decoded using the filesystem's
				628	encoding and a list of Unicode strings will be returned, while passing a byte
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	629	path will return the filenames as bytes. For example,
Georg Brandl	0c07422	2008-11-22 10:26:59 +0000	[diff] [blame]	630	assuming the default filesystem encoding is UTF-8, running the following
				631	program::
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	632
Georg Brandl	a1c6a1c	2009-01-03 21:26:05 +0000	[diff] [blame]	633	fn = 'filename\u4500abc'
				634	f = open(fn, 'w')
				635	f.close()
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	636
Georg Brandl	a1c6a1c	2009-01-03 21:26:05 +0000	[diff] [blame]	637	import os
				638	print(os.listdir(b'.'))
				639	print(os.listdir('.'))
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	640
Martin Panter	1050d2d	2016-07-26 11:18:21 +0200	[diff] [blame]	641	will produce the following output:
				642
				643	.. code-block:: shell-session
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	644
Andrew Kuchling	97c288d	2019-03-03 23:10:28 -0500	[diff] [blame^]	645	$ python listdir-test.py
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	646	[b'filename\xe4\x94\x80abc', ...]
				647	['filename\u4500abc', ...]
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	648
				649	The first list contains UTF-8-encoded filenames, and the second list contains
				650	the Unicode versions.
				651
Note that on most occasions, you can just stick with using
Unicode with these APIs. The bytes APIs should only be used on
systems where undecodable file names can be present; that's
pretty much only Unix systems now.
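
The following sketch shows how an undecodable file name can still pass
through the Unicode APIs, assuming a Unix system (on Windows the filesystem
error handler is different and this example may raise an error). The byte
string is a hypothetical file name that is not valid UTF-8;
:func:`os.fsdecode` and :func:`os.fsencode` convert between the two
representations using the filesystem encoding and error handler.

```python
import os
import sys

print(sys.getfilesystemencoding())

# Hypothetical file name containing a byte that is not valid UTF-8.
raw_name = b'caf\xe9.txt'

# On Unix, undecodable bytes are escaped instead of raising an error...
name = os.fsdecode(raw_name)
print(ascii(name))

# ...and encoding the resulting string restores the original bytes.
round_tripped = os.fsencode(name)
print(round_tripped == raw_name)
```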
Georg Brandl	0c07422	2008-11-22 10:26:59 +0000	[diff] [blame]	656
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	657
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	658	Tips for Writing Unicode-aware Programs
				659	---------------------------------------
				660
				661	This section provides some suggestions on writing software that deals with
				662	Unicode.
				663
				664	The most important tip is:
				665
Ezio Melotti	410eee5	2013-01-20 12:16:03 +0200	[diff] [blame]	666	Software should only work with Unicode strings internally, decoding the input
				667	data as soon as possible and encoding the output only at the end.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	668
Georg Brandl	0c07422	2008-11-22 10:26:59 +0000	[diff] [blame]	669	If you attempt to write processing functions that accept both Unicode and byte
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	670	strings, you will find your program vulnerable to bugs wherever you combine the
Ezio Melotti	410eee5	2013-01-20 12:16:03 +0200	[diff] [blame]	671	two different kinds of strings. There is no automatic encoding or decoding: if
				672	you do e.g. ``str + bytes``, a :exc:`TypeError` will be raised.
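
A minimal sketch of this tip follows. The function ``shout`` and its sample
input are hypothetical: bytes are decoded on the way in, all processing is
done on :class:`str`, and the result is encoded only on the way out. The last
part confirms that mixing the two types raises :exc:`TypeError`.

```python
def shout(raw: bytes, encoding: str = 'utf-8') -> bytes:
    """Hypothetical helper: decode early, work on str, encode late."""
    text = raw.decode(encoding)   # decode the input as soon as possible
    text = text.upper()           # all processing happens on str
    return text.encode(encoding)  # encode the output only at the end

result = shout('caf\u00e9'.encode('utf-8'))
print(result)

# There is no implicit conversion between str and bytes.
try:
    'abc' + b'def'
except TypeError as exc:
    mixing_error = exc
    print('TypeError:', exc)
```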
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	673
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	674	When using data coming from a web browser or some other untrusted source, a
				675	common technique is to check for illegal characters in a string before using the
				676	string in a generated command line or storing it in a database. If you're doing
Antoine Pitrou	534e253	2011-12-05 01:21:46 +0100	[diff] [blame]	677	this, be careful to check the decoded string, not the encoded bytes data;
				678	some encodings may have interesting properties, such as not being bijective
				679	or not being fully ASCII-compatible. This is especially true if the input
				680	data also specifies the encoding, since the attacker can then choose a
				681	clever way to hide malicious text in the encoded bytestream.
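
As one illustration of why the check belongs on the decoded string, consider
UTF-7, which can spell an ASCII character as a base64 sequence. The input
below is made up for this sketch, and it assumes the attacker also controls
the declared encoding. A search for ``/`` in the raw bytes finds nothing,
while the decoded text contains it.

```python
# Hypothetical untrusted input that declares itself to be UTF-7.
raw = b'+AC8-etc+AC8-passwd'

# Checking the encoded bytes misses the slash entirely.
print(b'/' in raw)

# Checking after decoding sees the text the program will actually use.
decoded = raw.decode('utf-7')
print('/' in decoded, decoded)
```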
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	682
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	683
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	684	Converting Between File Encodings
				685	'''''''''''''''''''''''''''''''''
				686
				687	The :class:`~codecs.StreamRecoder` class can transparently convert between
				688	encodings, taking a stream that returns data in encoding #1
				689	and behaving like a stream returning data in encoding #2.
				690
For example, if you have an input file ``f`` that's in Latin-1 and was opened
in binary mode, you can wrap it with a :class:`~codecs.StreamRecoder` to
return bytes encoded in UTF-8::
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	694
    import codecs

    new_f = codecs.StreamRecoder(f,
        # en/decoder: used by read() to encode its results and
        # by write() to decode its input.
        codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),

        # reader/writer: used to read and write to the stream.
        codecs.getreader('latin-1'), codecs.getwriter('latin-1'))
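
To try the recoder without a real file, the sketch below substitutes an
in-memory :class:`io.BytesIO` stream for ``f``. The sample contents are made
up for illustration. Reading returns UTF-8 bytes, and writing accepts UTF-8
bytes that are stored in the underlying stream as Latin-1.

```python
import codecs
import io

# Stand-in for a binary file whose contents are encoded in Latin-1.
f = io.BytesIO('caf\u00e9\n'.encode('latin-1'))

new_f = codecs.StreamRecoder(
    f,
    codecs.getencoder('utf-8'), codecs.getdecoder('utf-8'),
    codecs.getreader('latin-1'), codecs.getwriter('latin-1'))

# read() decodes the Latin-1 stream and hands back UTF-8 bytes.
data = new_f.read()
print(data)

# write() takes UTF-8 bytes and stores them in the stream as Latin-1.
new_f.write('\u00fc'.encode('utf-8'))
print(f.getvalue())
```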
				702
				703
				704	Files in an Unknown Encoding
				705	''''''''''''''''''''''''''''
				706
				707	What can you do if you need to make a change to a file, but don't know
				708	the file's encoding? If you know the encoding is ASCII-compatible and
				709	only want to examine or modify the ASCII parts, you can open the file
				710	with the ``surrogateescape`` error handler::
				711
    with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
        data = f.read()

    # make changes to the string 'data'

    with open(fname + '.new', 'w',
              encoding="ascii", errors="surrogateescape") as f:
        f.write(data)
				720
				721	The ``surrogateescape`` error handler will decode any non-ASCII bytes
Andrew Kuchling	97c288d	2019-03-03 23:10:28 -0500	[diff] [blame^]	722	as code points in a special range running from U+DC80 to
				723	U+DCFF. These code points will then turn back into the
				724	same bytes when the ``surrogateescape`` error handler is used to
				725	encode the data and write it back out.
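
The round trip can be checked on in-memory data. This is a small sketch with
made-up bytes: the two non-ASCII bytes become code points in the U+DC80 to
U+DCFF range, the ASCII part is edited as ordinary text, and encoding with
the same error handler restores the original bytes.

```python
# Sample bytes in an unknown, ASCII-compatible encoding.
raw = b'name=\xe9\xff\n'

# Non-ASCII bytes are decoded to lone surrogates instead of raising an error.
text = raw.decode('ascii', errors='surrogateescape')
print(ascii(text))

# Only the ASCII part is modified.
edited = text.replace('name', 'user')

# Encoding with the same handler turns the surrogates back into the bytes.
out = edited.encode('ascii', errors='surrogateescape')
print(out)
```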
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	726
				727
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	728	References
				729	----------
				730
Georg Brandl	9bdcb3b	2014-10-29 09:37:43 +0100	[diff] [blame]	731	One section of `Mastering Python 3 Input/Output
				732	<http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o>`_,
				733	a PyCon 2010 talk by David Beazley, discusses text processing and binary data handling.
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	734
Georg Brandl	9bdcb3b	2014-10-29 09:37:43 +0100	[diff] [blame]	735	The `PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
				736	Applications in Python"
				737	<https://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>`_
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	738	discuss questions of character encodings as well as how to internationalize
Alexander Belopolsky	93a6b13	2010-11-19 16:09:58 +0000	[diff] [blame]	739	and localize an application. These slides cover Python 2.x only.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	740
Georg Brandl	9bdcb3b	2014-10-29 09:37:43 +0100	[diff] [blame]	741	`The Guts of Unicode in Python
				742	<http://pyvideo.org/video/1768/the-guts-of-unicode-in-python>`_
				743	is a PyCon 2013 talk by Benjamin Peterson that discusses the internal Unicode
				744	representation in Python 3.3.
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	745
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	746
Alexander Belopolsky	93a6b13	2010-11-19 16:09:58 +0000	[diff] [blame]	747	Acknowledgements
				748	================
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	749
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	750	The initial draft of this document was written by Andrew Kuchling.
				751	It has since been revised further by Alexander Belopolsky, Georg Brandl,
				752	Andrew Kuchling, and Ezio Melotti.
Georg Brandl	116aa62	2007-08-15 14:28:22 +0000	[diff] [blame]	753
Andrew Kuchling	2151fc6	2013-06-20 09:29:09 -0400	[diff] [blame]	754	Thanks to the following people who have noted errors or offered
				755	suggestions on this article: Éric Araujo, Nicholas Bastin, Nick
				756	Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André
Andrew Kuchling	97c288d	2019-03-03 23:10:28 -0500	[diff] [blame^]	757	Lemburg, Martin von Löwis, Terry J. Reedy, Serhiy Storchaka,
Eryk Sun, Chad Whitacre, and Graham Wideman.