Unicode HOWTO
================

**Version 1.02**

This HOWTO discusses Python's support for Unicode, and explains various
problems that people commonly encounter when trying to work with Unicode.

Introduction to Unicode
------------------------------

History of Character Codes
''''''''''''''''''''''''''''''

In 1968, the American Standard Code for Information Interchange,
better known by its acronym ASCII, was standardized. ASCII defined
numeric codes for various characters, with the numeric values running from 0 to
127. For example, the lowercase letter 'a' is assigned 97 as its code
value.

ASCII was an American-developed standard, so it only defined
unaccented characters. There was an 'e', but no 'é' or 'Í'. This
meant that languages which required accented characters couldn't be
faithfully represented in ASCII. (Actually the missing accents matter
for English, too, which contains words such as 'naïve' and 'café', and some
publications have house styles which require spellings such as
'coöperate'.)

For a while people just wrote programs that didn't display accents. I
remember looking at Apple ][ BASIC programs, published in French-language
publications in the mid-1980s, that had lines like these::

    PRINT "FICHIER EST COMPLETE."
    PRINT "CARACTERE NON ACCEPTE."

Those messages should contain accents, and they just look wrong to
someone who can read French.

In the 1980s, almost all personal computers were 8-bit, meaning that
bytes could hold values ranging from 0 to 255. ASCII codes only went
up to 127, so some machines assigned values between 128 and 255 to
accented characters. Different machines had different codes, however,
which led to problems exchanging files. Eventually various commonly
used sets of values for the 128-255 range emerged. Some were true
standards, defined by the International Organization for Standardization
(ISO), and some were **de facto** conventions that were invented by one
company or another and managed to catch on.

128 extra characters aren't very many. For example, you can't fit
both the accented characters used in Western Europe and the Cyrillic
alphabet used for Russian into the 128-255 range because there are more than
128 such characters.

You could write files using different codes (all your Russian
files in a coding system called KOI8, all your French files in
a different coding system called Latin1), but what if you wanted
to write a French document that quotes some Russian text? In the
1980s people began to want to solve this problem, and the Unicode
standardization effort began.

Unicode started out using 16-bit characters instead of 8-bit characters. 16
bits means you have 2^16 = 65,536 distinct values available, making it
possible to represent many different characters from many different
alphabets; an initial goal was to have Unicode contain the alphabets for
every single human language. It turns out that even 16 bits isn't enough to
meet that goal, and the modern Unicode specification uses a wider range of
codes, 0-1,114,111 (0x10ffff in base-16).

There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with
the 1.1 revision of Unicode.

(This discussion of Unicode's history is highly simplified. I don't
think the average Python programmer needs to worry about the
historical details; consult the Unicode consortium site listed in the
References for more information.)


Definitions
''''''''''''''''''''''''

A **character** is the smallest possible component of a text. 'A',
'B', 'C', etc., are all different characters. So are 'È' and
'Í'. Characters are abstractions, and vary depending on the
language or context you're talking about. For example, the symbol for
ohms (Ω) is usually drawn much like the capital letter
omega (Ω) in the Greek alphabet (they may even be the same in
some fonts), but these are two different characters that have
different meanings.

The Unicode standard describes how characters are represented by
**code points**. A code point is an integer value, usually denoted in
base 16. In the standard, a code point is written using the notation
U+12ca to mean the character with value 0x12ca (4810 decimal). The
Unicode standard contains a lot of tables listing characters and their
corresponding code points::

    0061 'a'; LATIN SMALL LETTER A
    0062 'b'; LATIN SMALL LETTER B
    0063 'c'; LATIN SMALL LETTER C
    ...
    007B '{'; LEFT CURLY BRACKET

Strictly, these definitions imply that it's meaningless to say 'this is
character U+12ca'. U+12ca is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'.
In informal contexts, this distinction between code points and characters will
sometimes be forgotten.

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**. The glyph for an uppercase A, for
example, is two diagonal strokes and a horizontal stroke, though the exact
details will depend on the font being used. Most Python code doesn't need
to worry about glyphs; figuring out the correct glyph to display is
generally the job of a GUI toolkit or a terminal's font renderer.


Encodings
'''''''''

To summarize the previous section:
a Unicode string is a sequence of code points, which are
numbers from 0 to 0x10ffff. This sequence needs to be represented as
a set of bytes (meaning, values from 0-255) in memory. The rules for
translating a Unicode string into a sequence of bytes are called an
**encoding**.

The first encoding you might think of is an array of 32-bit integers.
In this representation, the string "Python" would look like this::

       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes
   differently.

2. It's very wasteful of space. In most texts, the majority of the code
   points are less than 127, or less than 255, so a lot of space is occupied
   by zero bytes. The above string takes 24 bytes compared to the 6
   bytes needed for an ASCII representation. Increased RAM usage doesn't
   matter too much (desktop computers have megabytes of RAM, and strings
   aren't usually that large), but expanding our usage of disk and
   network bandwidth by a factor of 4 is intolerable.

3. It's not compatible with existing C functions such as ``strlen()``,
   so a new family of wide string functions would need to be used.

4. Many Internet standards are defined in terms of textual data, and
   can't handle content with embedded zero bytes.

Generally people don't use this encoding, choosing other encodings
that are more efficient and convenient.

Encodings don't have to handle every possible Unicode character, and
most encodings don't. For example, Python's default encoding is the
'ascii' encoding. The rules for converting a Unicode string into the
ASCII encoding are simple; for each code point:

1. If the code point is <128, the corresponding byte value is the same as
   the code point.

2. If the code point is 128 or greater, the Unicode string can't
   be represented in this encoding. (Python raises a
   ``UnicodeEncodeError`` exception in this case.)

Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode
code points 0-255 are identical to the Latin-1 values, so converting
to this encoding simply requires converting code points to byte
values; if a code point larger than 255 is encountered, the string
can't be encoded into Latin-1.
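
To see these two sets of rules in action, here's a quick sketch using Python's
``unicode`` type and its ``.encode()`` method, both covered in detail later in
this HOWTO::

    >>> u = u'caf\xe9'          # U+00E9, LATIN SMALL LETTER E WITH ACUTE
    >>> u.encode('latin-1')     # every code point is < 256, so this works
    'caf\xe9'
    >>> u.encode('ascii')       # 0xe9 is 128 or greater, so ASCII fails
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
                        position 3: ordinal not in range(128)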

Encodings don't have to be simple one-to-one mappings like Latin-1.
Consider IBM's EBCDIC, which was used on IBM mainframes. Letter
values weren't in one block: 'a' through 'i' had values from 129 to
137, but 'j' through 'r' were 145 through 153. If you wanted to use
EBCDIC as an encoding, you'd probably use some sort of lookup table to
perform the conversion, but this is largely an internal detail.

UTF-8 is one of the most commonly used encodings. UTF stands for
"Unicode Transformation Format", and the '8' means that 8-bit numbers
are used in the encoding. (There's also a UTF-16 encoding, but it's
less frequently used than UTF-8.) UTF-8 uses the following rules:

1. If the code point is <128, it's represented by the corresponding byte value.
2. If the code point is between 128 and 0x7ff, it's turned into two byte values
   between 128 and 255.
3. Code points >0x7ff are turned into three- or four-byte sequences, where
   each byte of the sequence is between 128 and 255.

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a string of bytes containing no embedded
   zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can
   be processed by C functions such as ``strcpy()`` and sent through protocols
   that can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters are
   turned into one or two bytes, and values less than 128 occupy only a
   single byte.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize. It's also unlikely that
   random 8-bit data will look like valid UTF-8.
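
Jumping ahead to Python syntax that's introduced later in this HOWTO, here's a
small sketch showing how a few individual code points come out when encoded as
UTF-8, matching rules 1-3 above::

    >>> len(u'a'.encode('utf-8'))        # code point 97 < 128: one byte
    1
    >>> len(u'\xe9'.encode('utf-8'))     # U+00E9 is between 128 and 0x7ff: two bytes
    2
    >>> len(u'\u20ac'.encode('utf-8'))   # U+20AC, the Euro sign, is > 0x7ff: three bytes
    3
    >>> u'\u20ac'.encode('utf-8')        # each byte of the sequence is >= 128
    '\xe2\x82\xac'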



References
''''''''''''''

The Unicode Consortium site at <http://www.unicode.org> has character
charts, a glossary, and PDF versions of the Unicode specification. Be
prepared for some difficult reading.
<http://www.unicode.org/history/> is a chronology of the origin and
development of Unicode.

To help understand the standard, Jukka Korpela has written an
introductory guide to reading the Unicode character tables,
available at <http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.

Roman Czyborra wrote another explanation of Unicode's basic principles;
it's at <http://czyborra.com/unicode/characters.html>.
Czyborra has written a number of other Unicode-related documents,
available from <http://www.czyborra.com>.

Two other good introductory articles were written by Joel Spolsky
<http://www.joelonsoftware.com/articles/Unicode.html> and Jason
Orendorff <http://www.jorendorff.com/articles/unicode/>. If this
introduction didn't make things clear to you, you should try reading
one of these alternate articles before continuing.

Wikipedia entries are often helpful; see the entries for "character
encoding" <http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
<http://en.wikipedia.org/wiki/UTF-8>, for example.


Python's Unicode Support
------------------------

Now that you've learned the rudiments of Unicode, we can look at
Python's Unicode features.


The Unicode Type
'''''''''''''''''''

Unicode strings are expressed as instances of the ``unicode`` type,
one of Python's repertoire of built-in types. It derives from an
abstract type called ``basestring``, which is also an ancestor of the
``str`` type; you can therefore check if a value is a string type with
``isinstance(value, basestring)``. Under the hood, Python represents
Unicode strings as sequences of either 16- or 32-bit integers, depending
on how the Python interpreter was compiled.
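
You can check which kind of build you're running by looking at
``sys.maxunicode``, which is 65535 on a 16-bit ("narrow") build and 1114111 on
a 32-bit ("wide") build; the value shown in this sketch is what a wide build
reports::

    >>> import sys
    >>> sys.maxunicode
    1114111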

The ``unicode()`` constructor has the signature ``unicode(string[, encoding, errors])``.
All of its arguments should be 8-bit strings. The first argument is converted
to Unicode using the specified encoding; if you leave off the ``encoding`` argument,
the ASCII encoding is used for the conversion, so characters greater than 127 will
be treated as errors::

    >>> unicode('abcdef')
    u'abcdef'
    >>> s = unicode('abcdef')
    >>> type(s)
    <type 'unicode'>
    >>> unicode('abcdef' + chr(255))
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
                        ordinal not in range(128)

The ``errors`` argument specifies the response when the input string can't be
converted according to the encoding's rules. Legal values for this argument
are 'strict' (raise a ``UnicodeDecodeError`` exception),
'replace' (add U+FFFD, 'REPLACEMENT CHARACTER'),
or 'ignore' (just leave the character out of the Unicode result).
The following examples show the differences::

    >>> unicode('\x80abc', errors='strict')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
                        ordinal not in range(128)
    >>> unicode('\x80abc', errors='replace')
    u'\ufffdabc'
    >>> unicode('\x80abc', errors='ignore')
    u'abc'

Encodings are specified as strings containing the encoding's name.
Python 2.4 comes with roughly 100 different encodings; see the Python
Library Reference at
<http://docs.python.org/lib/standard-encodings.html> for a list. Some
encodings have multiple names; for example, 'latin-1', 'iso_8859_1'
and '8859' are all synonyms for the same encoding.
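
Here's a quick sketch showing that such synonyms really do select the same
codec; it relies on the alias table shipped with Python 2.4::

    >>> unicode('\xe9', 'latin-1') == unicode('\xe9', 'iso_8859_1') == unicode('\xe9', '8859')
    True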

One-character Unicode strings can also be created with the
``unichr()`` built-in function, which takes integers and returns a
Unicode string of length 1 that contains the corresponding code point.
The reverse operation is the built-in ``ord()`` function that takes a
one-character Unicode string and returns the code point value::

    >>> unichr(40960)
    u'\ua000'
    >>> ord(u'\ua000')
    40960

Instances of the ``unicode`` type have many of the same methods as
the 8-bit string type for operations such as searching and formatting::

    >>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
    >>> s.count('e')
    5
    >>> s.find('feather')
    9
    >>> s.find('bird')
    -1
    >>> s.replace('feather', 'sand')
    u'Was ever sand so lightly blown to and fro as this multitude?'
    >>> s.upper()
    u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'

Note that the arguments to these methods can be Unicode strings or 8-bit strings.
8-bit strings will be converted to Unicode before carrying out the operation;
Python's default ASCII encoding will be used, so characters greater than 127
will cause an exception::

    >>> s.find('Was\x9f')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3:
                        ordinal not in range(128)
    >>> s.find(u'Was\x9f')
    -1

Much Python code that operates on strings will therefore work with
Unicode strings without requiring any changes to the code. (Input and
output code needs more updating for Unicode; more on this later.)

Another important method is ``.encode([encoding], [errors='strict'])``,
which returns an 8-bit string version of the
Unicode string, encoded in the requested encoding. The ``errors``
parameter is the same as the parameter of the ``unicode()``
constructor, with one additional possibility; as well as 'strict',
'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which
uses XML's character references. The following example shows the
different results::

    >>> u = unichr(40960) + u'abcd' + unichr(1972)
    >>> u.encode('utf-8')
    '\xea\x80\x80abcd\xde\xb4'
    >>> u.encode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
                        position 0: ordinal not in range(128)
    >>> u.encode('ascii', 'ignore')
    'abcd'
    >>> u.encode('ascii', 'replace')
    '?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    '&#40960;abcd&#1972;'

Python's 8-bit strings have a ``.decode([encoding], [errors])`` method
that interprets the string using the given encoding::

    >>> u = unichr(40960) + u'abcd' + unichr(1972)   # Assemble a string
    >>> utf8_version = u.encode('utf-8')             # Encode as UTF-8
    >>> type(utf8_version), utf8_version
    (<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
    >>> u2 = utf8_version.decode('utf-8')            # Decode using UTF-8
    >>> u == u2                                      # The two strings match
    True

The low-level routines for registering and accessing the available
encodings are found in the ``codecs`` module. However, the encoding
and decoding functions returned by this module are usually more
low-level than is comfortable, so I'm not going to describe the
``codecs`` module here. If you need to implement a completely new
encoding, you'll need to learn about the ``codecs`` module interfaces,
but implementing encodings is a specialized task that also won't be
covered here. Consult the Python documentation to learn more about
this module.

The most commonly used part of the ``codecs`` module is the
``codecs.open()`` function which will be discussed in the section
on input and output.


Unicode Literals in Python Source Code
''''''''''''''''''''''''''''''''''''''''''

In Python source code, Unicode literals are written as strings
prefixed with the 'u' or 'U' character: ``u'abcdefghijk'``. Specific
code points can be written using the ``\u`` escape sequence, which is
followed by four hex digits giving the code point. The ``\U`` escape
sequence is similar, but expects 8 hex digits, not 4.

Unicode literals can also use the same escape sequences as 8-bit
strings, including ``\x``, but ``\x`` only takes two hex digits so it
can't express an arbitrary code point. Octal escapes can go up to
U+01ff, which is octal 777.

::

    >>> s = u"a\xac\u1234\u20ac\U00008000"
               ^^^^ two-digit hex escape
                   ^^^^^^ four-digit Unicode escape
                               ^^^^^^^^^^ eight-digit Unicode escape
    >>> for c in s: print ord(c),
    ...
    97 172 4660 8364 32768

Using escape sequences for code points greater than 127 is fine in
small doses, but becomes an annoyance if you're using many accented
characters, as you would in a program with messages in French or some
other accent-using language. You can also assemble strings using the
``unichr()`` built-in function, but this is even more tedious.

Ideally, you'd want to be able to write literals in your language's
natural encoding. You could then edit Python source code with your
favorite editor which would display the accented characters naturally,
and have the right characters used at runtime.

Python supports writing Unicode literals in any encoding, but you have
to declare the encoding being used. This is done by including a
special comment as either the first or second line of the source
file::

    #!/usr/bin/env python
    # -*- coding: latin-1 -*-

    u = u'abcdé'
    print ord(u[-1])

The syntax is inspired by Emacs's notation for specifying variables local to a file.
Emacs supports many different variables, but Python only supports 'coding'.
The ``-*-`` symbols indicate that the comment is special; within them,
you must supply the name ``coding`` and the name of your chosen encoding,
separated by ``':'``.

If you don't include such a comment, the default encoding used will be
ASCII. Versions of Python before 2.4 were Euro-centric and assumed
Latin-1 as a default encoding for string literals; in Python 2.4,
characters greater than 127 still work but result in a warning. For
example, the following program has no encoding declaration::

    #!/usr/bin/env python
    u = u'abcdé'
    print ord(u[-1])

When you run it with Python 2.4, it will output the following warning::

    amk:~$ python p263.py
    sys:1: DeprecationWarning: Non-ASCII character '\xe9'
           in file p263.py on line 2, but no encoding declared;
           see http://www.python.org/peps/pep-0263.html for details


Unicode Properties
'''''''''''''''''''

The Unicode specification includes a database of information about
code points. For each code point that's defined, the information
includes the character's name, its category, and its numeric value if
applicable (Unicode has characters representing the Roman numerals and
fractions such as one-third and four-fifths). There are also
properties related to the code point's use in bidirectional text and
other display-related properties.

The following program displays some information about several
characters, and prints the numeric value of one particular character::

    import unicodedata

    u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)

    for i, c in enumerate(u):
        print i, '%04x' % ord(c), unicodedata.category(c),
        print unicodedata.name(c)

    # Get numeric value of second character
    print unicodedata.numeric(u[1])

When run, this prints::

    0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
    1 0bf2 No TAMIL NUMBER ONE THOUSAND
    2 0f84 Mn TIBETAN MARK HALANTA
    3 1770 Lo TAGBANWA LETTER SA
    4 33af So SQUARE RAD OVER S SQUARED
    1000.0

The category codes are abbreviations describing the nature of the
character. These are grouped into categories such as "Letter",
"Number", "Punctuation", or "Symbol", which in turn are broken up into
subcategories. To take the codes from the above output, ``'Ll'``
means 'Letter, lowercase', ``'No'`` means "Number, other", ``'Mn'`` is
"Mark, nonspacing", and ``'So'`` is "Symbol, other". See
<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values>
for a list of category codes.
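
You can also query the category of a single character directly; for example,
a short interactive check of a few characters looks like this::

    >>> import unicodedata
    >>> unicodedata.category(u'a')    # 'Letter, lowercase'
    'Ll'
    >>> unicodedata.category(u'9')    # 'Number, decimal digit'
    'Nd'
    >>> unicodedata.category(u'$')    # 'Symbol, currency'
    'Sc'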

References
''''''''''''''

The Unicode and 8-bit string types are described in the Python library
reference at <http://docs.python.org/lib/typesseq.html>.

The documentation for the ``unicodedata`` module is at
<http://docs.python.org/lib/module-unicodedata.html>.

The documentation for the ``codecs`` module is at
<http://docs.python.org/lib/module-codecs.html>.

Marc-André Lemburg gave a presentation at EuroPython 2002
titled "Python and Unicode". A PDF version of his slides
is available at <http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf>,
and is an excellent overview of the design of Python's Unicode features.


Reading and Writing Unicode Data
----------------------------------------

Once you've written some code that works with Unicode data, the next
problem is input/output. How do you get Unicode strings into your
program, and how do you convert Unicode into a form suitable for
storage or transmission?

You may not need to do anything, depending on your input sources and
output destinations; you should check whether the libraries used in
your application support Unicode natively. XML parsers often return
Unicode data, for example. Many relational databases also support
Unicode-valued columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it
gets written to disk or sent over a socket. It's possible to do all
the work yourself: open a file, read an 8-bit string from it, and
convert the string with ``unicode(str, encoding)``. However, the
manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode
character can be represented by several bytes. If you want to read
the file in arbitrary-sized chunks (say, 1K or 4K), you need to write
error-handling code to catch the case where only part of the bytes
encoding a single Unicode character are read at the end of a chunk.
One solution would be to read the entire file into memory and then
perform the decoding, but that prevents you from working with files
that are extremely large; if you need to read a 2 GB file, you need
2 GB of RAM. (More, really, since for at least a moment you'd need
to have both the encoded string and its Unicode version in memory.)
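
For example, here's a sketch of what goes wrong when a multi-byte sequence is
split across two reads; the byte string is the UTF-8 encoding of ``u'caf\xe9'``,
and the exact wording of the error message may vary between Python versions::

    >>> data = u'caf\xe9'.encode('utf-8')
    >>> data
    'caf\xc3\xa9'
    >>> chunk1, chunk2 = data[:4], data[4:]    # split inside the \xc3\xa9 pair
    >>> chunk1.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 3:
                        unexpected end of data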

The solution would be to use the low-level decoding interface to catch
the case of partial coding sequences. The work of implementing this
has already been done for you: the ``codecs`` module includes a
version of the ``open()`` function that returns a file-like object
that assumes the file's contents are in a specified encoding and
accepts Unicode parameters for methods such as ``.read()`` and
``.write()``.

The function's parameters are
``open(filename, mode='rb', encoding=None, errors='strict', buffering=1)``.
``mode`` can be ``'r'``, ``'w'``, or ``'a'``, just like the corresponding
parameter to the regular built-in ``open()`` function; add a ``'+'`` to
update the file. ``buffering`` is similarly parallel to the standard
function's parameter. ``encoding`` is a string giving the encoding to use;
if it's left as ``None``, a regular Python file object that accepts 8-bit
strings is returned. Otherwise, a wrapper object is returned, and data
written to or read from the wrapper object will be converted as needed.
``errors`` specifies the action for encoding errors and can be one of the
usual values of 'strict', 'ignore', and 'replace'.

Reading Unicode from a file is therefore simple::

    import codecs
    f = codecs.open('unicode.rst', encoding='utf-8')
    for line in f:
        print repr(line)

It's also possible to open files in update mode,
allowing both reading and writing::

    f = codecs.open('test', encoding='utf-8', mode='w+')
    f.write(u'\u4500 blah blah blah\n')
    f.seek(0)
    print repr(f.readline()[:1])
    f.close()

Unicode character U+FEFF is used as a byte-order mark (BOM),
and is often written as the first character of a file in order
to assist with autodetection of the file's byte ordering.
Some encodings, such as UTF-16, expect a BOM to be present at
the start of a file; when such an encoding is used,
the BOM will be automatically written as the first character
and will be silently dropped when the file is read. There are
variants of these encodings, such as 'utf-16-le' and 'utf-16-be'
for little-endian and big-endian encodings, that specify
one particular byte ordering and don't
skip the BOM.
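
For instance, here's a small sketch of the underlying codecs at work: encoding
a short string as UTF-16 adds the BOM automatically, while the endian-specific
variant leaves it out (the byte order shown is what a little-endian machine
produces)::

    >>> u'abc'.encode('utf-16')       # the BOM, here 0xff 0xfe, comes first
    '\xff\xfea\x00b\x00c\x00'
    >>> u'abc'.encode('utf-16-le')    # endian-specific variant: no BOM
    'a\x00b\x00c\x00'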


Unicode filenames
'''''''''''''''''''''''''

Most of the operating systems in common use today support filenames
that contain arbitrary Unicode characters. Usually this is
implemented by converting the Unicode string into some encoding that
varies depending on the system. For example, Mac OS X uses UTF-8 while
Windows uses a configurable encoding; on Windows, Python uses the name
"mbcs" to refer to whatever the currently configured encoding is. On
Unix systems, there will only be a filesystem encoding if you've set
the ``LANG`` or ``LC_CTYPE`` environment variables; if you haven't,
the default encoding is ASCII.

The ``sys.getfilesystemencoding()`` function returns the encoding to
use on your current system, in case you want to do the encoding
manually, but there's not much reason to bother. When opening a file
for reading or writing, you can usually just provide the Unicode
string as the filename, and it will be automatically converted to the
right encoding for you::

    filename = u'filename\u4500abc'
    f = open(filename, 'w')
    f.write('blah\n')
    f.close()

Functions in the ``os`` module such as ``os.stat()`` will also accept
Unicode filenames.

``os.listdir()``, which returns filenames, raises an issue: should it
return the Unicode version of filenames, or should it return 8-bit
strings containing the encoded versions? ``os.listdir()`` will do
both, depending on whether you provided the directory path as an 8-bit
string or a Unicode string. If you pass a Unicode string as the path,
filenames will be decoded using the filesystem's encoding and a list
of Unicode strings will be returned, while passing an 8-bit path will
return the 8-bit versions of the filenames. For example, assuming the
default filesystem encoding is UTF-8, running the following program::

    fn = u'filename\u4500abc'
    f = open(fn, 'w')
    f.close()

    import os
    print os.listdir('.')
    print os.listdir(u'.')

will produce the following output::

    amk:~$ python t.py
    ['.svn', 'filename\xe4\x94\x80abc', ...]
    [u'.svn', u'filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list
contains the Unicode versions.



Tips for Writing Unicode-aware Programs
''''''''''''''''''''''''''''''''''''''''''''

This section provides some suggestions on writing software that
deals with Unicode.

The most important tip is:

    Software should only work with Unicode strings internally,
    converting to a particular encoding on output.

If you attempt to write processing functions that accept both
Unicode and 8-bit strings, you will find your program vulnerable to
bugs wherever you combine the two different kinds of strings. Python's
default encoding is ASCII, so whenever a character with a value greater
than 127 is in the input data, you'll get a ``UnicodeDecodeError``
because that character can't be handled by the ASCII encoding.
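
For example, here's a sketch of the kind of failure this causes: combining a
Unicode string with an 8-bit string containing non-ASCII data raises an
exception, because Python tries to convert the 8-bit string using the default
ASCII encoding::

    >>> u'caf' + '\xe9'     # the 8-bit string holds a Latin-1 byte
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0:
                        ordinal not in range(128)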

It's easy to miss such problems if you only test your software
with data that doesn't contain any
accents; everything will seem to work, but there's actually a bug in your
program waiting for the first user who attempts to use characters >127.
A second tip, therefore, is:

    Include characters >127 and, even better, characters >255 in your
    test data.

When using data coming from a web browser or some other untrusted source,
a common technique is to check for illegal characters in a string
before using the string in a generated command line or storing it in a
database. If you're doing this, be careful to check
the string once it's in the form that will be used or stored; it's
possible for encodings to be used to disguise characters. This is especially
true if the input data also specifies the encoding;
many encodings leave the commonly checked-for characters alone,
but Python includes some encodings such as ``'base64'``
that modify every single character.

For example, let's say you have a content management system that takes a
Unicode filename, and you want to disallow paths with a '/' character.
You might write this code::

    def read_file(filename, encoding):
        if '/' in filename:
            raise ValueError("'/' not allowed in filenames")
        unicode_name = filename.decode(encoding)
        f = open(unicode_name, 'r')
        # ... return contents of file ...

However, if an attacker could specify the ``'base64'`` encoding,
they could pass ``'L2V0Yy9wYXNzd2Q='``, which is the base-64
encoded form of the string ``'/etc/passwd'``, to read a
system file. The above code looks for ``'/'`` characters
in the encoded form and misses the dangerous character
in the resulting decoded form.
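
A safer sketch of the same function decodes first and then checks the
resulting Unicode string, so a ``'/'`` disguised by the encoding can't slip
through (the function and its arguments are just illustrative)::

    def read_file(filename, encoding):
        # Decode to Unicode *before* checking for illegal characters.
        unicode_name = filename.decode(encoding)
        if u'/' in unicode_name:
            raise ValueError("'/' not allowed in filenames")
        f = open(unicode_name, 'r')
        # ... return contents of file ...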

References
''''''''''''''

The PDF slides for Marc-André Lemburg's presentation "Writing
Unicode-aware Applications in Python" are available at
<http://www.egenix.com/files/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
and discuss questions of character encodings as well as how to
internationalize and localize an application.


Revision History and Acknowledgements
------------------------------------------

Thanks to the following people who have noted errors or offered
suggestions on this article: Nicholas Bastin,
Marius Gedminas, Kent Johnson, Ken Krugler,
Marc-André Lemburg, Martin von Löwis.

Version 1.0: posted August 5 2005.

Version 1.01: posted August 7 2005. Corrects factual and markup
errors; adds several links.

Version 1.02: posted August 16 2005. Corrects factual errors.


.. comment Additional topic: building Python w/ UCS2 or UCS4 support
.. comment Describe obscure -U switch somewhere?
.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter

.. comment
   Original outline:

   - [ ] Unicode introduction
   - [ ] ASCII
   - [ ] Terms
   - [ ] Character
   - [ ] Code point
   - [ ] Encodings
   - [ ] Common encodings: ASCII, Latin-1, UTF-8
   - [ ] Unicode Python type
   - [ ] Writing unicode literals
   - [ ] Obscurity: -U switch
   - [ ] Built-ins
   - [ ] unichr()
   - [ ] ord()
   - [ ] unicode() constructor
   - [ ] Unicode type
   - [ ] encode(), decode() methods
   - [ ] Unicodedata module for character properties
   - [ ] I/O
   - [ ] Reading/writing Unicode data into files
   - [ ] Byte-order marks
   - [ ] Unicode filenames
   - [ ] Writing Unicode programs
   - [ ] Do everything in Unicode
   - [ ] Declaring source code encodings (PEP 263)
   - [ ] Other issues
   - [ ] Building Python (UCS2, UCS4)