*****************
  Unicode HOWTO
*****************

:Release: 1.02

This HOWTO discusses Python's support for Unicode, and explains various problems
that people commonly encounter when trying to work with Unicode.

Introduction to Unicode
=======================

History of Character Codes
--------------------------

In 1968, the American Standard Code for Information Interchange, better known by
its acronym ASCII, was standardized. ASCII defined numeric codes for various
characters, with the numeric values running from 0 to 127. For example, the
lowercase letter 'a' is assigned 97 as its code value.

ASCII was an American-developed standard, so it only defined unaccented
characters. There was an 'e', but no 'é' or 'Í'. This meant that languages
which required accented characters couldn't be faithfully represented in ASCII.
(Actually the missing accents matter for English, too, which contains words such
as 'naïve' and 'café', and some publications have house styles which require
spellings such as 'coöperate'.)

For a while people just wrote programs that didn't display accents. I remember
looking at Apple ][ BASIC programs, published in French-language publications in
the mid-1980s, that had lines like these::

   PRINT "FICHIER EST COMPLETE."
   PRINT "CARACTERE NON ACCEPTE."

Those messages should contain accents, and they just look wrong to someone who
can read French.

In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
hold values ranging from 0 to 255. ASCII codes only went up to 127, so some
machines assigned values between 128 and 255 to accented characters. Different
machines had different codes, however, which led to problems exchanging files.
Eventually various commonly used sets of values for the 128-255 range emerged.
Some were true standards, defined by the International Organization for
Standardization, and some were de facto conventions that were invented by one
company or another and managed to catch on.

256 values aren't very many. For example, you can't fit both the accented
characters used in Western Europe and the Cyrillic alphabet used for Russian
into the 128-255 range because there are more than 128 such characters.

You could write files using different codes (all your Russian files in a coding
system called KOI8, all your French files in a different coding system called
Latin1), but what if you wanted to write a French document that quotes some
Russian text? In the 1980s people began to want to solve this problem, and the
Unicode standardization effort began.

Unicode started out using 16-bit characters instead of 8-bit characters. 16
bits means you have 2^16 = 65,536 distinct values available, making it possible
to represent many different characters from many different alphabets; an initial
goal was to have Unicode contain the alphabets for every single human language.
It turns out that even 16 bits isn't enough to meet that goal, and the modern
Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in
base-16).

There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with the 1.1
revision of Unicode.

(This discussion of Unicode's history is highly simplified. I don't think the
average Python programmer needs to worry about the historical details; consult
the Unicode consortium site listed in the References for more information.)


Definitions
-----------

A **character** is the smallest possible component of a text. 'A', 'B', 'C',
etc., are all different characters. So are 'È' and 'Í'. Characters are
abstractions, and vary depending on the language or context you're talking
about. For example, the symbol for ohms (Ω) is usually drawn much like the
capital letter omega (Ω) in the Greek alphabet (they may even be the same in
some fonts), but these are two different characters that have different
meanings.

The Unicode standard describes how characters are represented by **code
points**. A code point is an integer value, usually denoted in base 16. In the
standard, a code point is written using the notation U+12ca to mean the
character with value 0x12ca (4810 decimal). The Unicode standard contains a lot
of tables listing characters and their corresponding code points::

   0061    'a'; LATIN SMALL LETTER A
   0062    'b'; LATIN SMALL LETTER B
   0063    'c'; LATIN SMALL LETTER C
   ...
   007B    '{'; LEFT CURLY BRACKET

Strictly, these definitions imply that it's meaningless to say 'this is
character U+12ca'. U+12ca is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In
informal contexts, this distinction between code points and characters will
sometimes be forgotten.
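
As a small illustration of the distinction between a code point and the
character it stands for, the sketch below prints a code point in the standard's
U+ notation next to the character name from the Unicode database. It is written
so that it should run unchanged on Python 2.6+ and Python 3; the helper name
``describe`` is made up for this example.

```python
import unicodedata

def describe(ch):
    """Return the U+ notation and the Unicode name of a one-character string."""
    # ord() gives the integer code point; %04x is the usual hex spelling.
    return 'U+%04x %s' % (ord(ch), unicodedata.name(ch))

print(describe(u'\u12ca'))   # the code point discussed above
print(describe(u'a'))
```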

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**. The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used. Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.


Encodings
---------

To summarize the previous section: a Unicode string is a sequence of code
points, which are numbers from 0 to 0x10ffff. This sequence needs to be
represented as a sequence of bytes (meaning, values from 0-255) in memory. The
rules for translating a Unicode string into a sequence of bytes are called an
**encoding**.

The first encoding you might think of is an array of 32-bit integers. In this
representation, the string "Python" would look like this::

      P           y           t           h           o           n
   0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
      0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space. In most texts, the majority of the code points
   are less than 128, or less than 256, so a lot of space is occupied by zero
   bytes. The above string takes 24 bytes compared to the 6 bytes needed for an
   ASCII representation. Increased RAM usage doesn't matter too much (desktop
   computers have megabytes of RAM, and strings aren't usually that large), but
   expanding our usage of disk and network bandwidth by a factor of 4 is
   intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new
   family of wide string functions would need to be used.

4. Many Internet standards are defined in terms of textual data, and can't
   handle content with embedded zero bytes.

Generally people don't use this encoding, instead choosing other encodings that
are more efficient and convenient.
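
To make the first two problems concrete, here is a minimal sketch that lays out
"Python" as 32-bit integers in both byte orders. It assumes the ``utf-32-le``
and ``utf-32-be`` codecs are available (Python 2.6 or later) and uses them only
as a convenient stand-in for the array-of-integers layout described above.

```python
text = u'Python'

little = text.encode('utf-32-le')   # least significant byte first
big = text.encode('utf-32-be')      # most significant byte first

# Six code points take 24 bytes, four times the size of the ASCII form.
print(len(little))
print(len(text.encode('ascii')))

# The same string produces different byte sequences for the two byte orders.
print(repr(little[:4]))
print(repr(big[:4]))
```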

Encodings don't have to handle every possible Unicode character, and most
encodings don't. For example, Python's default encoding is the 'ascii'
encoding. The rules for converting a Unicode string into the ASCII encoding are
simple; for each code point:

1. If the code point is < 128, each byte is the same as the value of the code
   point.

2. If the code point is 128 or greater, the Unicode string can't be represented
   in this encoding. (Python raises a :exc:`UnicodeEncodeError` exception in this
   case.)

Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points
0-255 are identical to the Latin-1 values, so converting to this encoding simply
requires converting code points to byte values; if a code point larger than 255
is encountered, the string can't be encoded into Latin-1.
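
The two rules are simple enough to write out by hand. The following is only a
sketch of the idea, not how Python's codecs are implemented: the function name
``encode_single_byte`` and its ``limit`` parameter are invented here, and it
raises a plain ``ValueError`` where the real codecs raise
:exc:`UnicodeEncodeError`.

```python
def encode_single_byte(text, limit):
    """Encode one code point per byte; limit is 128 for ASCII, 256 for Latin-1."""
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point >= limit:
            # Rule 2: this code point has no representation in the encoding.
            raise ValueError('cannot encode U+%04x' % code_point)
        # Rule 1: the byte value is the code point itself.
        out.append(code_point)
    return bytes(out)

# Both comparisons against the built-in codecs should print True.
print(encode_single_byte(u'abc', 128) == u'abc'.encode('ascii'))
print(encode_single_byte(u'caf\xe9', 256) == u'caf\xe9'.encode('latin-1'))
```

Passing ``u'caf\xe9'`` with a limit of 128 raises the error, which mirrors the
failure you would see from the 'ascii' codec.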

Encodings don't have to be simple one-to-one mappings like Latin-1. Consider
IBM's EBCDIC, which was used on IBM mainframes. Letter values weren't in one
block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145
through 153. If you wanted to use EBCDIC as an encoding, you'd probably use
some sort of lookup table to perform the conversion, but this is largely an
internal detail.

UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode
Transformation Format", and the '8' means that 8-bit numbers are used in the
encoding. (There's also a UTF-16 encoding, but it's less frequently used than
UTF-8.) UTF-8 uses the following rules:

1. If the code point is <128, it's represented by the corresponding byte value.
2. If the code point is between 128 and 0x7ff, it's turned into two byte values
   between 128 and 255.
3. Code points >0x7ff are turned into three- or four-byte sequences, where each
   byte of the sequence is between 128 and 255.
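
The rules above can be sketched in a few lines of bit manipulation. The bit
patterns in the comments are the standard UTF-8 layout, which the rules
summarize rather than spell out; the function name ``utf8_bytes`` is invented
for this example, and the sketch does no validation (it doesn't reject
surrogates or values above 0x10ffff), so treat it as an illustration rather
than a replacement for the built-in codec.

```python
def utf8_bytes(code_point):
    """Encode a single integer code point as UTF-8 (illustrative only)."""
    if code_point < 0x80:
        # Rule 1: one byte, same value as the code point.
        return bytes(bytearray([code_point]))
    if code_point < 0x800:
        # Rule 2: two bytes, 110xxxxx 10xxxxxx.
        return bytes(bytearray([0xC0 | (code_point >> 6),
                                0x80 | (code_point & 0x3F)]))
    if code_point < 0x10000:
        # Rule 3: three bytes, 1110xxxx 10xxxxxx 10xxxxxx.
        return bytes(bytearray([0xE0 | (code_point >> 12),
                                0x80 | ((code_point >> 6) & 0x3F),
                                0x80 | (code_point & 0x3F)]))
    # Rule 3, continued: four bytes, 11110xxx followed by three 10xxxxxx bytes.
    return bytes(bytearray([0xF0 | (code_point >> 18),
                            0x80 | ((code_point >> 12) & 0x3F),
                            0x80 | ((code_point >> 6) & 0x3F),
                            0x80 | (code_point & 0x3F)]))

# Compare against the built-in codec for a one-, two- and three-byte case.
for ch in (u'a', u'\xe9', u'\ua000'):
    print(utf8_bytes(ord(ch)) == ch.encode('utf-8'))
```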

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a string of bytes that contains zero bytes
   only where they represent the null character (U+0000). Because the encoding
   is byte-oriented, there are no byte-ordering issues, and UTF-8 strings can
   be processed by C functions such as ``strcpy()`` and sent through protocols
   that can't handle zero bytes.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the majority of commonly used characters can be
   represented with one or two bytes, and values less than 128 occupy only a
   single byte.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize. It's also unlikely that
   random 8-bit data will look like valid UTF-8.



References
----------

The Unicode Consortium site at <http://www.unicode.org> has character charts, a
glossary, and PDF versions of the Unicode specification. Be prepared for some
difficult reading. <http://www.unicode.org/history/> is a chronology of the
origin and development of Unicode.

To help understand the standard, Jukka Korpela has written an introductory guide
to reading the Unicode character tables, available at
<http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.

Two other good introductory articles were written by Joel Spolsky
<http://www.joelonsoftware.com/articles/Unicode.html> and Jason Orendorff
<http://www.jorendorff.com/articles/unicode/>. If this introduction didn't make
things clear to you, you should try reading one of these alternate articles
before continuing.

Wikipedia entries are often helpful; see the entries for "character encoding"
<http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
<http://en.wikipedia.org/wiki/UTF-8>, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.


The Unicode Type
----------------

Unicode strings are expressed as instances of the :class:`unicode` type, one of
Python's repertoire of built-in types. It derives from an abstract type called
:class:`basestring`, which is also an ancestor of the :class:`str` type; you can
therefore check if a value is a string type with ``isinstance(value,
basestring)``. Under the hood, Python represents Unicode strings as either 16-
or 32-bit integers, depending on how the Python interpreter was compiled.

The :func:`unicode` constructor has the signature ``unicode(string[, encoding,
errors])``. All of its arguments should be 8-bit strings. The first argument
is converted to Unicode using the specified encoding; if you leave off the
``encoding`` argument, the ASCII encoding is used for the conversion, so
characters greater than 127 will be treated as errors::

   >>> unicode('abcdef')
   u'abcdef'
   >>> s = unicode('abcdef')
   >>> type(s)
   <type 'unicode'>
   >>> unicode('abcdef' + chr(255))
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
                       ordinal not in range(128)

The ``errors`` argument specifies the response when the input string can't be
converted according to the encoding's rules. Legal values for this argument are
'strict' (raise a ``UnicodeDecodeError`` exception), 'replace' (add U+FFFD,
'REPLACEMENT CHARACTER'), or 'ignore' (just leave the character out of the
Unicode result). The following examples show the differences::

   >>> unicode('\x80abc', errors='strict')
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
                       ordinal not in range(128)
   >>> unicode('\x80abc', errors='replace')
   u'\ufffdabc'
   >>> unicode('\x80abc', errors='ignore')
   u'abc'

Encodings are specified as strings containing the encoding's name. Python 2.4
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list. Some encodings
have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all
synonyms for the same encoding.

One-character Unicode strings can also be created with the :func:`unichr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point. The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

   >>> unichr(40960)
   u'\ua000'
   >>> ord(u'\ua000')
   40960

Instances of the :class:`unicode` type have many of the same methods as the
8-bit string type for operations such as searching and formatting::

   >>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
   >>> s.count('e')
   5
   >>> s.find('feather')
   9
   >>> s.find('bird')
   -1
   >>> s.replace('feather', 'sand')
   u'Was ever sand so lightly blown to and fro as this multitude?'
   >>> s.upper()
   u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'

Note that the arguments to these methods can be Unicode strings or 8-bit
strings. 8-bit strings will be converted to Unicode before carrying out the
operation; Python's default ASCII encoding will be used, so characters greater
than 127 will cause an exception::

   >>> s.find('Was\x9f')
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
   >>> s.find(u'Was\x9f')
   -1

Much Python code that operates on strings will therefore work with Unicode
strings without requiring any changes to the code. (Input and output code needs
more updating for Unicode; more on this later.)

Another important method is ``.encode([encoding], [errors='strict'])``, which
returns an 8-bit string version of the Unicode string, encoded in the requested
encoding. The ``errors`` parameter is the same as the parameter of the
``unicode()`` constructor, with one additional possibility; as well as 'strict',
'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which uses XML's
character references. The following example shows the different results::

   >>> u = unichr(40960) + u'abcd' + unichr(1972)
   >>> u.encode('utf-8')
   '\xea\x80\x80abcd\xde\xb4'
   >>> u.encode('ascii')
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in position 0: ordinal not in range(128)
   >>> u.encode('ascii', 'ignore')
   'abcd'
   >>> u.encode('ascii', 'replace')
   '?abcd?'
   >>> u.encode('ascii', 'xmlcharrefreplace')
   '&#40960;abcd&#1972;'

Python's 8-bit strings have a ``.decode([encoding], [errors])`` method that
interprets the string using the given encoding::

   >>> u = unichr(40960) + u'abcd' + unichr(1972)   # Assemble a string
   >>> utf8_version = u.encode('utf-8')             # Encode as UTF-8
   >>> type(utf8_version), utf8_version
   (<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
   >>> u2 = utf8_version.decode('utf-8')            # Decode using UTF-8
   >>> u == u2                                      # The two strings match
   True

The low-level routines for registering and accessing the available encodings are
found in the :mod:`codecs` module. However, the encoding and decoding functions
returned by this module are usually more low-level than is comfortable, so I'm
not going to describe the :mod:`codecs` module here. If you need to implement a
completely new encoding, you'll need to learn about the :mod:`codecs` module
interfaces, but implementing encodings is a specialized task that also won't be
covered here. Consult the Python documentation to learn more about this module.

The most commonly used part of the :mod:`codecs` module is the
:func:`codecs.open` function which will be discussed in the section on input and
output.


Unicode Literals in Python Source Code
--------------------------------------

In Python source code, Unicode literals are written as strings prefixed with the
'u' or 'U' character: ``u'abcdefghijk'``. Specific code points can be written
using the ``\u`` escape sequence, which is followed by four hex digits giving
the code point. The ``\U`` escape sequence is similar, but expects 8 hex
digits, not 4.

Unicode literals can also use the same escape sequences as 8-bit strings,
including ``\x``, but ``\x`` only takes two hex digits so it can't express an
arbitrary code point. Octal escapes can go up to U+01ff, which is octal 777.

::

   >>> s = u"a\xac\u1234\u20ac\U00008000"
              ^^^^ two-digit hex escape
                  ^^^^^^ four-digit Unicode escape
                              ^^^^^^^^^^ eight-digit Unicode escape
   >>> for c in s:  print ord(c),
   ...
   97 172 4660 8364 32768

Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language. You
can also assemble strings using the :func:`unichr` built-in function, but this is
even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding. You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing Unicode literals in any encoding, but you have to
declare the encoding being used. This is done by including a special comment as
either the first or second line of the source file::

   #!/usr/bin/env python
   # -*- coding: latin-1 -*-

   u = u'abcdé'
   print ord(u[-1])

The syntax is inspired by Emacs's notation for specifying variables local to a
file. Emacs supports many different variables, but Python only supports
'coding'. The ``-*-`` symbols indicate that the comment is special; within
them, you must supply the name ``coding`` and the name of your chosen encoding,
separated by ``':'``.

If you don't include such a comment, the default encoding used will be ASCII.
Versions of Python before 2.4 were Euro-centric and assumed Latin-1 as a default
encoding for string literals; in Python 2.4, characters greater than 127 still
work but result in a warning. For example, the following program has no
encoding declaration::

   #!/usr/bin/env python
   u = u'abcdé'
   print ord(u[-1])

When you run it with Python 2.4, it will output the following warning::

   amk:~$ python p263.py
   sys:1: DeprecationWarning: Non-ASCII character '\xe9'
        in file p263.py on line 2, but no encoding declared;
        see http://www.python.org/peps/pep-0263.html for details


Unicode Properties
------------------

The Unicode specification includes a database of information about code points.
For each code point that's defined, the information includes the character's
name, its category, and the numeric value if applicable (Unicode has characters
representing the Roman numerals and fractions such as one-third and
four-fifths). There are also properties related to the code point's use in
bidirectional text and other display-related properties.

The following program displays some information about several characters, and
prints the numeric value of one particular character::

   import unicodedata

   u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)

   for i, c in enumerate(u):
       print i, '%04x' % ord(c), unicodedata.category(c),
       print unicodedata.name(c)

   # Get numeric value of second character
   print unicodedata.numeric(u[1])

When run, this prints::

   0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
   1 0bf2 No TAMIL NUMBER ONE THOUSAND
   2 0f84 Mn TIBETAN MARK HALANTA
   3 1770 Lo TAGBANWA LETTER SA
   4 33af So SQUARE RAD OVER S SQUARED
   1000.0

The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories. To take the codes
from the above output, ``'Ll'`` means "Letter, lowercase", ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other". See
<http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values> for a
list of category codes.
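
Because the first letter of a category code names the broad group, it's easy to
sort characters into those groups. The sketch below is one way to do it; the
table ``MAJOR_CLASSES`` and the function ``major_class`` are made up for this
example, and the table is deliberately short, so anything not listed is simply
reported as 'Other'.

```python
import unicodedata

# First letter of a category code -> the broad group it belongs to.
MAJOR_CLASSES = {'L': 'Letter', 'M': 'Mark', 'N': 'Number',
                 'P': 'Punctuation', 'S': 'Symbol'}

def major_class(ch):
    """Return the broad group for a character, or 'Other' if not in the table."""
    return MAJOR_CLASSES.get(unicodedata.category(ch)[0], 'Other')

for ch in (u'\xe9', u'\u0bf2', u'\u0f84', u'\u33af', u'!'):
    print('%04x %s %s' % (ord(ch), unicodedata.category(ch), major_class(ch)))
```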

References
----------

The Unicode and 8-bit string types are described in the Python library reference
at :ref:`typesseq`.

The documentation for the :mod:`unicodedata` module.

The documentation for the :mod:`codecs` module.

Marc-André Lemburg gave a presentation at EuroPython 2002 titled "Python and
Unicode". A PDF version of his slides is available at
<http://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>, and is an
excellent overview of the design of Python's Unicode features.


Reading and Writing Unicode Data
================================

Once you've written some code that works with Unicode data, the next problem is
input/output. How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively. XML parsers often return Unicode
data, for example. Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket. It's possible to do all the work
yourself: open a file, read an 8-bit string from it, and convert the string with
``unicode(str, encoding)``. However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes. If you want to read the file in arbitrary-sized
chunks (say, 1K or 4K), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk. One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2 GB file, you need 2 GB of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)

The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences. The work of implementing this has already been
done for you: the :mod:`codecs` module includes a version of the :func:`open`
function that returns a file-like object that assumes the file's contents are in
a specified encoding and accepts Unicode parameters for methods such as
``.read()`` and ``.write()``.
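
If you're curious what handling partial sequences looks like, here is a minimal
sketch using an incremental decoder from the :mod:`codecs` module (available
from Python 2.5, as far as I know). The two-chunk split is contrived so that
the first chunk ends in the middle of a three-byte sequence; in real code the
chunks would come from ``read()`` calls on a file or socket.

```python
import codecs

encoded = u'\ua000abcd\u07b4'.encode('utf-8')   # 9 bytes in total

# Split in the middle of the first three-byte sequence.
chunks = [encoded[:2], encoded[2:]]

decoder = codecs.getincrementaldecoder('utf-8')()
pieces = []
for chunk in chunks:
    # The decoder holds back an incomplete sequence until the rest arrives.
    pieces.append(decoder.decode(chunk))
# final=True turns a dangling partial sequence into an error.
pieces.append(decoder.decode(b'', final=True))

print(repr(pieces))
print(u''.join(pieces) == u'\ua000abcd\u07b4')
```

Decoding the first chunk on its own with ``encoded[:2].decode('utf-8')`` would
raise a :exc:`UnicodeDecodeError`, which is exactly the case the wrapper
returned by :func:`codecs.open` takes care of for you.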

The function's parameters are ``open(filename, mode='rb', encoding=None,
errors='strict', buffering=1)``. ``mode`` can be ``'r'``, ``'w'``, or ``'a'``,
just like the corresponding parameter to the regular built-in ``open()``
function; add a ``'+'`` to update the file. ``buffering`` is similarly parallel
to the standard function's parameter. ``encoding`` is a string giving the
encoding to use; if it's left as ``None``, a regular Python file object that
accepts 8-bit strings is returned. Otherwise, a wrapper object is returned, and
data written to or read from the wrapper object will be converted as needed.
``errors`` specifies the action for encoding errors and can be one of the usual
values of 'strict', 'ignore', and 'replace'.

Reading Unicode from a file is therefore simple::

   import codecs
   f = codecs.open('unicode.rst', encoding='utf-8')
   for line in f:
       print repr(line)

It's also possible to open files in update mode, allowing both reading and
writing::

   f = codecs.open('test', encoding='utf-8', mode='w+')
   f.write(u'\u4500 blah blah blah\n')
   f.seek(0)
   print repr(f.readline()[:1])
   f.close()

Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read. There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.
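
The BOM behaviour is easy to observe on in-memory strings. This is a small
sketch, written to run on Python 2.6+ and Python 3, that compares the generic
'utf-16' codec with the 'utf-16-le' variant; the variable names are arbitrary.

```python
import codecs

text = u'abc'

with_bom = text.encode('utf-16')     # byte order chosen by the platform
little = text.encode('utf-16-le')    # explicit byte order, no BOM written

# The generic codec prepends U+FEFF in whichever byte order it picked.
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(len(with_bom) - len(little))   # the BOM accounts for the extra bytes

# Decoding with 'utf-16' drops the BOM; 'utf-16-le' keeps it as a character.
print(repr(with_bom.decode('utf-16')))
print(repr((codecs.BOM_UTF16_LE + little).decode('utf-16-le')))
```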


Unicode filenames
-----------------

Most of the operating systems in common use today support filenames that contain
arbitrary Unicode characters. Usually this is implemented by converting the
Unicode string into some encoding that varies depending on the system. For
example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on
Windows, Python uses the name "mbcs" to refer to whatever the currently
configured encoding is. On Unix systems, there will only be a filesystem
encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
you haven't, the default encoding is ASCII.

The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother. When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::

   filename = u'filename\u4500abc'
   f = open(filename, 'w')
   f.write('blah\n')
   f.close()

Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.
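
For completeness, doing the encoding by hand might look like the sketch below.
It only computes the encoded name and doesn't touch the filesystem. The
fallback to 'ascii' and the use of the 'replace' error handler are choices made
for this example so that it runs anywhere; real code would more likely keep the
default 'strict' handler and deal with the exception.

```python
import sys

filename = u'filename\u4500abc'

# On some Python 2 systems this can be None, so fall back to ASCII here.
fs_encoding = sys.getfilesystemencoding() or 'ascii'

# 'replace' keeps the sketch from raising where the filesystem encoding
# can't represent U+4500.
encoded_name = filename.encode(fs_encoding, 'replace')

print(fs_encoding)
print(repr(encoded_name))
```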
				590
				591	:func:`os.listdir`, which returns filenames, raises an issue: should it return
				592	the Unicode version of filenames, or should it return 8-bit strings containing
				593	the encoded versions? :func:`os.listdir` will do both, depending on whether you
				594	provided the directory path as an 8-bit string or a Unicode string. If you pass
				595	a Unicode string as the path, filenames will be decoded using the filesystem's
				596	encoding and a list of Unicode strings will be returned, while passing an 8-bit
				597	path will return the 8-bit versions of the filenames. For example, assuming the
				598	default filesystem encoding is UTF-8, running the following program::
				599
				600	fn = u'filename\u4500abc'
				601	f = open(fn, 'w')
				602	f.close()
				603
				604	import os
				605	print os.listdir('.')
				606	print os.listdir(u'.')
				607
				608	will produce the following output::
				609
				610	amk:~$ python t.py
				611	['.svn', 'filename\xe4\x94\x80abc', ...]
				612	[u'.svn', u'filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.
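
The same behaviour can be checked programmatically. The sketch below is only an
illustration: ``list_both_ways`` is a hypothetical helper, the demo file
deliberately has an ASCII name so that the result does not depend on the
filesystem encoding, and the code assumes Python 2.6+ or Python 3.3+.

```python
import os
import sys
import tempfile


def list_both_ways(directory):
    """Return (byte_names, unicode_names) for a directory given as Unicode."""
    # Assumption: the path itself can be encoded with the filesystem encoding.
    encoding = sys.getfilesystemencoding() or 'ascii'
    byte_names = os.listdir(directory.encode(encoding))   # 8-bit path in
    unicode_names = os.listdir(directory)                 # Unicode path in
    return byte_names, unicode_names


demo_dir = tempfile.mkdtemp()
if isinstance(demo_dir, bytes):
    # Python 2 hands back an 8-bit string; decode it so both calls can be made.
    demo_dir = demo_dir.decode(sys.getfilesystemencoding() or 'ascii')
open(os.path.join(demo_dir, u'example.txt'), 'w').close()

byte_names, unicode_names = list_both_ways(demo_dir)
```

Here ``byte_names`` holds 8-bit strings and ``unicode_names`` holds Unicode
strings, so the type of the path you pass in decides the type of every name you
get back.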



Tips for Writing Unicode-aware Programs
---------------------------------------

This section provides some suggestions on writing software that deals with
Unicode.

The most important tip is:

    Software should only work with Unicode strings internally, converting to a
    particular encoding on output.

If you attempt to write processing functions that accept both Unicode and 8-bit
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings. Python's default encoding is ASCII, so whenever
an 8-bit string containing a byte with a value > 127 is combined with a Unicode
string, you'll get a :exc:`UnicodeDecodeError` because that byte can't be
handled by the ASCII encoding.
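
One way to follow this tip is to decode at the point where data enters the
program and encode only where it leaves. The following is a rough sketch rather
than a prescribed pattern: ``shout`` and ``decodes_as_ascii`` are hypothetical
names, UTF-8 is an assumed input encoding, and the code is meant to run on
Python 2.6+ and Python 3.3+.

```python
def shout(raw_bytes, encoding='utf-8'):
    """Decode on input, work on Unicode, encode on output."""
    text = raw_bytes.decode(encoding)      # 8-bit string -> Unicode on the way in
    text = text.upper() + u'!'             # all processing happens on Unicode
    return text.encode(encoding)           # Unicode -> 8-bit string on the way out


def decodes_as_ascii(raw_bytes):
    """Return True if the ASCII codec can decode the data without error."""
    try:
        raw_bytes.decode('ascii')
    except UnicodeDecodeError:
        # Any byte > 127 lands here, which is the failure described above.
        return False
    return True
```

For the UTF-8 bytes of ``u'caf\xe9'``, ``decodes_as_ascii`` returns ``False``,
while ``shout`` handles the same data without trouble because it never mixes the
two string types.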

It's easy to miss such problems if you only test your software with data that
doesn't contain any accents; everything will seem to work, but there's actually
a bug in your program waiting for the first user who attempts to use characters
> 127. A second tip, therefore, is:

    Include characters > 127 and, even better, characters > 255 in your test
    data.
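
As an illustration of why both ranges are worth covering, here is a small
sketch of such test data. The sample strings and the ``survives_round_trip``
helper are made up for this example; the point is that a character > 255 exposes
code that quietly assumes Latin-1, which a character such as 'é' would not.

```python
SAMPLES = [
    u'plain ascii',        # every character < 128
    u'caf\xe9',            # U+00E9 is > 127 but still fits in Latin-1
    u'filename\u4500abc',  # U+4500 is > 255, outside Latin-1
]


def survives_round_trip(text, encoding):
    """Return True if text encodes to `encoding` and decodes back unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        # Raised when the encoding cannot represent one of the characters.
        return False
```

All three samples survive a round trip through UTF-8, only the first two
survive Latin-1, and only the first survives ASCII.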

When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database. If you're doing
this, be careful to check the string once it's in the form that will be used or
stored; it's possible for encodings to be used to disguise characters. This is
especially true if the input data also specifies the encoding; many encodings
leave the commonly checked-for characters alone, but Python includes some
encodings such as ``'base64'`` that modify every single character.

For example, let's say you have a content management system that takes an
encoded filename along with the name of its encoding, and you want to disallow
paths with a '/' character. You might write this code::

    def read_file(filename, encoding):
        # The check runs on the still-encoded form of the name.
        if '/' in filename:
            raise ValueError("'/' not allowed in filenames")
        unicode_name = filename.decode(encoding)
        f = open(unicode_name, 'r')
        # ... return contents of file ...

However, if an attacker could specify the ``'base64'`` encoding, they could pass
``'L2V0Yy9wYXNzd2Q='``, which is the base-64 encoded form of the string
``'/etc/passwd'``, to read a system file. The above code looks for ``'/'``
characters in the encoded form and misses the dangerous character in the
resulting decoded form.
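
A safer ordering is to decode first and validate the decoded result. The sketch
below shows one possible way to do that; ``decode_filename`` is a hypothetical
helper, :func:`codecs.decode` is used so the same code runs on Python 2.6+ and
Python 3.4+, and treating the output of a byte-oriented codec as ASCII text is
an assumption made purely for the illustration.

```python
import codecs


def decode_filename(raw_name, encoding):
    """Decode first, then validate the form that will actually be used."""
    name = codecs.decode(raw_name, encoding)
    # Codecs such as 'base64' hand back an 8-bit string; treating it as ASCII
    # text is an assumption made for this sketch.
    if isinstance(name, bytes):
        name = name.decode('ascii')
    if u'/' in name:
        raise ValueError("'/' not allowed in filenames")
    return name
```

With this ordering, passing ``'L2V0Yy9wYXNzd2Q='`` with the ``'base64'`` encoding
raises :exc:`ValueError`, because the check now sees ``'/etc/passwd'``. In a real
application it would also be reasonable to accept only a short list of known
text encodings instead of whatever the client names.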

References
----------

The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
Applications in Python" are available at
<http://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
and discuss questions of character encodings as well as how to internationalize
and localize an application.


Revision History and Acknowledgements
=====================================

Thanks to the following people who have noted errors or offered suggestions on
this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler,
Marc-André Lemburg, Martin von Löwis, Chad Whitacre.

Version 1.0: posted August 5 2005.

Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds
several links.

Version 1.02: posted August 16 2005. Corrects factual errors.


.. comment Additional topic: building Python w/ UCS2 or UCS4 support
.. comment Describe obscure -U switch somewhere?
.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter

.. comment
   Original outline:

   - [ ] Unicode introduction
       - [ ] ASCII
       - [ ] Terms
           - [ ] Character
           - [ ] Code point
       - [ ] Encodings
           - [ ] Common encodings: ASCII, Latin-1, UTF-8
   - [ ] Unicode Python type
       - [ ] Writing unicode literals
           - [ ] Obscurity: -U switch
       - [ ] Built-ins
           - [ ] unichr()
           - [ ] ord()
           - [ ] unicode() constructor
       - [ ] Unicode type
           - [ ] encode(), decode() methods
       - [ ] Unicodedata module for character properties
   - [ ] I/O
       - [ ] Reading/writing Unicode data into files
           - [ ] Byte-order marks
       - [ ] Unicode filenames
   - [ ] Writing Unicode programs
       - [ ] Do everything in Unicode
       - [ ] Declaring source code encodings (PEP 263)
   - [ ] Other issues
       - [ ] Building Python (UCS2, UCS4)