*****************
  Unicode HOWTO
*****************

:Release: 1.02

This HOWTO discusses Python's support for Unicode, and explains various problems
that people commonly encounter when trying to work with Unicode.

Introduction to Unicode
=======================

History of Character Codes
--------------------------

In 1968, the American Standard Code for Information Interchange, better known by
its acronym ASCII, was standardized. ASCII defined numeric codes for various
characters, with the numeric values running from 0 to 127. For example, the
lowercase letter 'a' is assigned 97 as its code value.

ASCII was an American-developed standard, so it only defined unaccented
characters. There was an 'e', but no 'é' or 'Í'. This meant that languages
which required accented characters couldn't be faithfully represented in ASCII.
(Actually the missing accents matter for English, too, which contains words such
as 'naïve' and 'café', and some publications have house styles which require
spellings such as 'coöperate'.)

For a while people just wrote programs that didn't display accents. I remember
looking at Apple ][ BASIC programs, published in French-language publications in
the mid-1980s, that had lines like these::

    PRINT "FICHIER EST COMPLETE."
    PRINT "CARACTERE NON ACCEPTE."

Those messages should contain accents, and they just look wrong to someone who
can read French.

In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
hold values ranging from 0 to 255. ASCII codes only went up to 127, so some
machines assigned values between 128 and 255 to accented characters. Different
machines had different codes, however, which led to problems exchanging files.
Eventually various commonly used sets of values for the 128-255 range emerged.
Some were true standards, defined by the International Standards Organization,
and some were de facto conventions that were invented by one company or
another and managed to catch on.

A byte offers only 256 values, which isn't very many. For example, you can't
fit both the accented characters used in Western Europe and the Cyrillic
alphabet used for Russian into the 128-255 range because there are more than
128 such characters.

You could write files using different codes (all your Russian files in a coding
system called KOI8, all your French files in a different coding system called
Latin1), but what if you wanted to write a French document that quotes some
Russian text? In the 1980s people began to want to solve this problem, and the
Unicode standardization effort began.

Unicode started out using 16-bit characters instead of 8-bit characters. 16
bits means you have 2^16 = 65,536 distinct values available, making it possible
to represent many different characters from many different alphabets; an initial
goal was to have Unicode contain the alphabets for every single human language.
It turns out that even 16 bits isn't enough to meet that goal, and the modern
Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in
base-16).

There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
originally separate efforts, but the specifications were merged with the 1.1
revision of Unicode.

(This discussion of Unicode's history is highly simplified. I don't think the
average Python programmer needs to worry about the historical details; consult
the Unicode consortium site listed in the References for more information.)


Definitions
-----------

A **character** is the smallest possible component of a text. 'A', 'B', 'C',
etc., are all different characters. So are 'È' and 'Í'. Characters are
abstractions, and vary depending on the language or context you're talking
about. For example, the symbol for ohms (Ω) is usually drawn much like the
capital letter omega (Ω) in the Greek alphabet (they may even be the same in
some fonts), but these are two different characters that have different
meanings.

The Unicode standard describes how characters are represented by **code
points**. A code point is an integer value, usually denoted in base 16. In the
standard, a code point is written using the notation U+12ca to mean the
character with value 0x12ca (4810 decimal). The Unicode standard contains a lot
of tables listing characters and their corresponding code points::

    0061    'a'; LATIN SMALL LETTER A
    0062    'b'; LATIN SMALL LETTER B
    0063    'c'; LATIN SMALL LETTER C
    ...
    007B    '{'; LEFT CURLY BRACKET

Strictly, these definitions imply that it's meaningless to say 'this is
character U+12ca'. U+12ca is a code point, which represents some particular
character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In
informal contexts, this distinction between code points and characters will
sometimes be forgotten.
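
To make the notation concrete, here is a small sketch that asks the
:mod:`unicodedata` module about a code point. The helper name ``describe`` is
made up for this illustration, and I've assumed you want code that runs
unchanged on Python 2.6+ and 3.3+, so it avoids ``print`` and ``unichr()``.

```python
import unicodedata

def describe(ch):
    """Return (U+xxxx notation, decimal value, official name) for one character."""
    code_point = ord(ch)
    return ('U+%04x' % code_point, code_point, unicodedata.name(ch))

# The code point discussed above, written with a \u escape.
info = describe(u'\u12ca')
```

The tuple pairs the hexadecimal notation used in the standard with the plain
integer that Python works with internally.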

A character is represented on a screen or on paper by a set of graphical
elements that's called a **glyph**. The glyph for an uppercase A, for example,
is two diagonal strokes and a horizontal stroke, though the exact details will
depend on the font being used. Most Python code doesn't need to worry about
glyphs; figuring out the correct glyph to display is generally the job of a GUI
toolkit or a terminal's font renderer.


Encodings
---------

To summarize the previous section: a Unicode string is a sequence of code
points, which are numbers from 0 to 0x10ffff. This sequence needs to be
represented as a sequence of bytes (meaning, values from 0-255) in memory. The
rules for translating a Unicode string into a sequence of bytes are called an
**encoding**.

The first encoding you might think of is an array of 32-bit integers. In this
representation, the string "Python" would look like this::

       P           y           t           h           o           n
    0x50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
       0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23

This representation is straightforward but using it presents a number of
problems.

1. It's not portable; different processors order the bytes differently.

2. It's very wasteful of space. In most texts, the majority of the code points
   are less than 128, or less than 256, so a lot of space is occupied by zero
   bytes. The above string takes 24 bytes compared to the 6 bytes needed for an
   ASCII representation. Increased RAM usage doesn't matter too much (desktop
   computers have megabytes of RAM, and strings aren't usually that large), but
   expanding our usage of disk and network bandwidth by a factor of 4 is
   intolerable.

3. It's not compatible with existing C functions such as ``strlen()``, so a new
   family of wide string functions would need to be used.

4. Many Internet standards are defined in terms of textual data, and can't
   handle content with embedded zero bytes.

Generally people don't use this encoding, instead choosing other encodings that
are more efficient and convenient.
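
The first two problems can be seen directly with Python's own codecs. This is
only a sketch: I'm assuming the ``'utf-32-le'`` and ``'utf-32-be'`` codecs are
a fair stand-in for "an array of 32-bit integers", and the variable names are
mine.

```python
text = u'Python'

little = text.encode('utf-32-le')   # least significant byte first
big = text.encode('utf-32-be')      # most significant byte first
plain = text.encode('ascii')

# bytearray() exposes the integer byte values on both Python 2 and Python 3.
first_little = list(bytearray(little[:4]))
first_big = list(bytearray(big[:4]))
```

The two 32-bit encodings hold the same code points but disagree on the byte
order of every character, and both are four times the size of the ASCII
version.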

Encodings don't have to handle every possible Unicode character, and most
encodings don't. For example, Python's default encoding is the 'ascii'
encoding. The rules for converting a Unicode string into the ASCII encoding are
simple; for each code point:

1. If the code point is < 128, each byte is the same as the value of the code
   point.

2. If the code point is 128 or greater, the Unicode string can't be represented
   in this encoding. (Python raises a :exc:`UnicodeEncodeError` exception in this
   case.)

Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points
0-255 are identical to the Latin-1 values, so converting to this encoding simply
requires converting code points to byte values; if a code point larger than 255
is encountered, the string can't be encoded into Latin-1.
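
Both rules fit in a few lines of code. The function below is a hand-written
sketch, not how Python's codecs are implemented; ``encode_one_to_one`` and its
``limit`` parameter are names I made up, and it raises a plain
:exc:`ValueError` where the real codecs raise :exc:`UnicodeEncodeError`.

```python
def encode_one_to_one(text, limit):
    """Encode text so that each code point below `limit` becomes one byte."""
    result = bytearray()
    for position, ch in enumerate(text):
        code_point = ord(ch)
        if code_point >= limit:
            raise ValueError('code point U+%04x at position %d is out of range'
                             % (code_point, position))
        result.append(code_point)
    return bytes(result)

ascii_bytes = encode_one_to_one(u'abc', 128)        # the ASCII rule
latin1_bytes = encode_one_to_one(u'caf\xe9', 256)   # the Latin-1 rule
```

With a limit of 128 the function follows the ASCII rules, and with a limit of
256 it follows the Latin-1 rules; the only difference between the two encodings
is where the cut-off lies.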

Encodings don't have to be simple one-to-one mappings like Latin-1. Consider
IBM's EBCDIC, which was used on IBM mainframes. Letter values weren't in one
block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145
through 153. If you wanted to use EBCDIC as an encoding, you'd probably use
some sort of lookup table to perform the conversion, but this is largely an
internal detail.

UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode
Transformation Format", and the '8' means that 8-bit numbers are used in the
encoding. (There's also a UTF-16 encoding, but it's less frequently used than
UTF-8.) UTF-8 uses the following rules:

1. If the code point is <128, it's represented by the corresponding byte value.
2. If the code point is between 128 and 0x7ff, it's turned into two byte values
   between 128 and 255.
3. Code points >0x7ff are turned into three- or four-byte sequences, where each
   byte of the sequence is between 128 and 255.
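
The three rules can be written out by hand. The following is a sketch for
illustration only: the function name is mine, the bit patterns come from the
UTF-8 definition rather than from this HOWTO, and unlike a real codec it makes
no attempt to reject the surrogate range.

```python
def utf8_encode_code_point(cp):
    """Encode a single code point (an integer) as UTF-8 bytes."""
    if cp < 0x80:                       # rule 1: one byte, same value
        units = [cp]
    elif cp < 0x800:                    # rule 2: two bytes
        units = [0xC0 | (cp >> 6),
                 0x80 | (cp & 0x3F)]
    elif cp < 0x10000:                  # rule 3: three bytes
        units = [0xE0 | (cp >> 12),
                 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F)]
    elif cp <= 0x10FFFF:                # rule 3: four bytes
        units = [0xF0 | (cp >> 18),
                 0x80 | ((cp >> 12) & 0x3F),
                 0x80 | ((cp >> 6) & 0x3F),
                 0x80 | (cp & 0x3F)]
    else:
        raise ValueError('code point out of range: %r' % (cp,))
    return bytes(bytearray(units))

two_bytes = utf8_encode_code_point(0xE9)       # LATIN SMALL LETTER E WITH ACUTE
three_bytes = utf8_encode_code_point(0xA000)
```

In practice you'd simply call ``.encode('utf-8')``; the point of the sketch is
that every byte of a multi-byte sequence has its high bit set, which is why
such bytes can never be mistaken for ASCII.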

UTF-8 has several convenient properties:

1. It can handle any Unicode code point.
2. A Unicode string is turned into a string of bytes containing no embedded zero
   bytes (unless the string itself contains U+0000), so UTF-8 strings can be
   processed by C functions such as ``strcpy()`` and sent through protocols
   that can't handle zero bytes. Because the encoding is defined as a sequence
   of single bytes, it also avoids byte-ordering issues.
3. A string of ASCII text is also valid UTF-8 text.
4. UTF-8 is fairly compact; the most commonly used characters are turned into
   one or two bytes, and values less than 128 occupy only a single byte.
5. If bytes are corrupted or lost, it's possible to determine the start of the
   next UTF-8-encoded code point and resynchronize. It's also unlikely that
   random 8-bit data will look like valid UTF-8.
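
Resynchronizing is simple because continuation bytes are always in the range
0x80-0xBF, while the first byte of a sequence never is. That detail comes from
the UTF-8 definition, not from the rules listed earlier, and the helper below
(``next_code_point_start`` is a made-up name) is only a sketch of the idea.

```python
def next_code_point_start(data, pos):
    """Return the first index >= pos that is not a UTF-8 continuation byte."""
    data = bytearray(data)      # integer byte values on Python 2 and 3
    while pos < len(data) and 0x80 <= data[pos] <= 0xBF:
        pos += 1
    return pos

encoded = u'a\ua000b'.encode('utf-8')   # 'a', three bytes for U+A000, 'b'
damaged = encoded[2:]                   # starts in the middle of U+A000
start = next_code_point_start(damaged, 0)
recovered = damaged[start:].decode('utf-8')
```

Only the damaged character is lost; everything after it decodes normally.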



References
----------

The Unicode Consortium site at <http://www.unicode.org> has character charts, a
glossary, and PDF versions of the Unicode specification. Be prepared for some
difficult reading. <http://www.unicode.org/history/> is a chronology of the
origin and development of Unicode.

To help understand the standard, Jukka Korpela has written an introductory guide
to reading the Unicode character tables, available at
<http://www.cs.tut.fi/~jkorpela/unicode/guide.html>.

Another good introductory article was written by Joel Spolsky
<http://www.joelonsoftware.com/articles/Unicode.html>.
If this introduction didn't make things clear to you, you should try reading this
alternate article before continuing.

.. Jason Orendorff XXX http://www.jorendorff.com/articles/unicode/ is broken

Wikipedia entries are often helpful; see the entries for "character encoding"
<http://en.wikipedia.org/wiki/Character_encoding> and UTF-8
<http://en.wikipedia.org/wiki/UTF-8>, for example.


Python's Unicode Support
========================

Now that you've learned the rudiments of Unicode, we can look at Python's
Unicode features.


The Unicode Type
----------------

Unicode strings are expressed as instances of the :class:`unicode` type, one of
Python's repertoire of built-in types. It derives from an abstract type called
:class:`basestring`, which is also an ancestor of the :class:`str` type; you can
therefore check if a value is a string type with ``isinstance(value,
basestring)``. Under the hood, Python represents Unicode strings as either 16-
or 32-bit integers, depending on how the Python interpreter was compiled.

The :func:`unicode` constructor has the signature ``unicode(string[, encoding,
errors])``. All of its arguments should be 8-bit strings. The first argument
is converted to Unicode using the specified encoding; if you leave off the
``encoding`` argument, the ASCII encoding is used for the conversion, so
characters greater than 127 will be treated as errors::

    >>> unicode('abcdef')
    u'abcdef'
    >>> s = unicode('abcdef')
    >>> type(s)
    <type 'unicode'>
    >>> unicode('abcdef' + chr(255))
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
                        ordinal not in range(128)

The ``errors`` argument specifies the response when the input string can't be
converted according to the encoding's rules. Legal values for this argument are
'strict' (raise a ``UnicodeDecodeError`` exception), 'replace' (add U+FFFD,
'REPLACEMENT CHARACTER'), or 'ignore' (just leave the character out of the
Unicode result). The following examples show the differences::

    >>> unicode('\x80abc', errors='strict')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
                        ordinal not in range(128)
    >>> unicode('\x80abc', errors='replace')
    u'\ufffdabc'
    >>> unicode('\x80abc', errors='ignore')
    u'abc'

Encodings are specified as strings containing the encoding's name. Python 2.4
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list. Some encodings
have multiple names; for example, 'latin-1', 'iso_8859_1' and '8859' are all
synonyms for the same encoding.
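
If you're ever unsure whether two names refer to the same encoding, the
:mod:`codecs` module can tell you. This sketch relies on ``codecs.lookup()``
returning an object with a ``name`` attribute, which I believe has been the
case since Python 2.5; the variable names are mine.

```python
import codecs

aliases = ['latin-1', 'iso_8859_1', '8859']

# Each alias resolves to a codec whose .name is the canonical spelling.
canonical = set(codecs.lookup(alias).name for alias in aliases)
```

A set with a single element means all three names resolved to one codec.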

One-character Unicode strings can also be created with the :func:`unichr`
built-in function, which takes integers and returns a Unicode string of length 1
that contains the corresponding code point. The reverse operation is the
built-in :func:`ord` function that takes a one-character Unicode string and
returns the code point value::

    >>> unichr(40960)
    u'\ua000'
    >>> ord(u'\ua000')
    40960

Instances of the :class:`unicode` type have many of the same methods as the
8-bit string type for operations such as searching and formatting::

    >>> s = u'Was ever feather so lightly blown to and fro as this multitude?'
    >>> s.count('e')
    5
    >>> s.find('feather')
    9
    >>> s.find('bird')
    -1
    >>> s.replace('feather', 'sand')
    u'Was ever sand so lightly blown to and fro as this multitude?'
    >>> s.upper()
    u'WAS EVER FEATHER SO LIGHTLY BLOWN TO AND FRO AS THIS MULTITUDE?'

Note that the arguments to these methods can be Unicode strings or 8-bit
strings. 8-bit strings will be converted to Unicode before carrying out the
operation; Python's default ASCII encoding will be used, so characters greater
than 127 will cause an exception::

    >>> s.find('Was\x9f')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
    >>> s.find(u'Was\x9f')
    -1

Much Python code that operates on strings will therefore work with Unicode
strings without requiring any changes to the code. (Input and output code needs
more updating for Unicode; more on this later.)

Another important method is ``.encode([encoding], [errors='strict'])``, which
returns an 8-bit string version of the Unicode string, encoded in the requested
encoding. The ``errors`` parameter is the same as the parameter of the
``unicode()`` constructor, with one additional possibility; as well as 'strict',
'ignore', and 'replace', you can also pass 'xmlcharrefreplace' which uses XML's
character references. The following example shows the different results::

    >>> u = unichr(40960) + u'abcd' + unichr(1972)
    >>> u.encode('utf-8')
    '\xea\x80\x80abcd\xde\xb4'
    >>> u.encode('ascii')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in position 0: ordinal not in range(128)
    >>> u.encode('ascii', 'ignore')
    'abcd'
    >>> u.encode('ascii', 'replace')
    '?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    '&#40960;abcd&#1972;'

Python's 8-bit strings have a ``.decode([encoding], [errors])`` method that
interprets the string using the given encoding::

    >>> u = unichr(40960) + u'abcd' + unichr(1972)   # Assemble a string
    >>> utf8_version = u.encode('utf-8')             # Encode as UTF-8
    >>> type(utf8_version), utf8_version
    (<type 'str'>, '\xea\x80\x80abcd\xde\xb4')
    >>> u2 = utf8_version.decode('utf-8')            # Decode using UTF-8
    >>> u == u2                                      # The two strings match
    True

The low-level routines for registering and accessing the available encodings are
found in the :mod:`codecs` module. However, the encoding and decoding functions
returned by this module are usually more low-level than is comfortable, so I'm
not going to describe the :mod:`codecs` module here. If you need to implement a
completely new encoding, you'll need to learn about the :mod:`codecs` module
interfaces, but implementing encodings is a specialized task that also won't be
covered here. Consult the Python documentation to learn more about this module.

The most commonly used part of the :mod:`codecs` module is the
:func:`codecs.open` function which will be discussed in the section on input and
output.


Unicode Literals in Python Source Code
--------------------------------------

In Python source code, Unicode literals are written as strings prefixed with the
'u' or 'U' character: ``u'abcdefghijk'``. Specific code points can be written
using the ``\u`` escape sequence, which is followed by four hex digits giving
the code point. The ``\U`` escape sequence is similar, but expects 8 hex
digits, not 4.

Unicode literals can also use the same escape sequences as 8-bit strings,
including ``\x``, but ``\x`` only takes two hex digits so it can't express an
arbitrary code point. Octal escapes can go up to U+01ff, which is octal 777.

::

    >>> s = u"a\xac\u1234\u20ac\U00008000"
               ^^^^ two-digit hex escape
                   ^^^^^^ four-digit Unicode escape
                               ^^^^^^^^^^ eight-digit Unicode escape
    >>> for c in s:  print ord(c),
    ...
    97 172 4660 8364 32768

Using escape sequences for code points greater than 127 is fine in small doses,
but becomes an annoyance if you're using many accented characters, as you would
in a program with messages in French or some other accent-using language. You
can also assemble strings using the :func:`unichr` built-in function, but this is
even more tedious.

Ideally, you'd want to be able to write literals in your language's natural
encoding. You could then edit Python source code with your favorite editor
which would display the accented characters naturally, and have the right
characters used at runtime.

Python supports writing Unicode literals in any encoding, but you have to
declare the encoding being used. This is done by including a special comment as
either the first or second line of the source file::

    #!/usr/bin/env python
    # -*- coding: latin-1 -*-

    u = u'abcdé'
    print ord(u[-1])

The syntax is inspired by Emacs's notation for specifying variables local to a
file. Emacs supports many different variables, but Python only supports
'coding'. The ``-*-`` symbols indicate to Emacs that the comment is special;
they have no significance to Python but are a convention. Python looks for
``coding: name`` or ``coding=name`` in the comment.
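
To show what "looks for ``coding: name`` or ``coding=name``" amounts to, here
is a rough sketch of the search. The regular expression is modelled on the one
given in PEP 263, but the function is a simplification of my own
(``find_declared_encoding`` is a made-up name) and doesn't reproduce every rule
the interpreter applies.

```python
import re

# Simplified pattern modelled on the one in PEP 263.
CODING_RE = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)')

def find_declared_encoding(source_text):
    """Return the encoding named in the first two lines, or None."""
    for line in source_text.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None

declared = find_declared_encoding(
    '#!/usr/bin/env python\n# -*- coding: latin-1 -*-\n')
```

Because only the ``coding`` part is matched, other editor conventions that
happen to contain it, such as Vim's ``fileencoding=utf-8``, are recognized too.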

If you don't include such a comment, the default encoding used will be ASCII.
Versions of Python before 2.4 were Euro-centric and assumed Latin-1 as a default
encoding for string literals; in Python 2.4, characters greater than 127 still
work but result in a warning. For example, the following program has no
encoding declaration::

    #!/usr/bin/env python
    u = u'abcdé'
    print ord(u[-1])

When you run it with Python 2.4, it will output the following warning::

    amk:~$ python p263.py
    sys:1: DeprecationWarning: Non-ASCII character '\xe9'
         in file p263.py on line 2, but no encoding declared;
         see http://www.python.org/peps/pep-0263.html for details


Unicode Properties
------------------

The Unicode specification includes a database of information about code points.
For each code point that's defined, the information includes the character's
name, its category, and the numeric value if applicable (Unicode has characters
representing the Roman numerals and fractions such as one-third and
four-fifths). There are also properties related to the code point's use in
bidirectional text and other display-related properties.

The following program displays some information about several characters, and
prints the numeric value of one particular character::

    import unicodedata

    u = unichr(233) + unichr(0x0bf2) + unichr(3972) + unichr(6000) + unichr(13231)

    for i, c in enumerate(u):
        print i, '%04x' % ord(c), unicodedata.category(c),
        print unicodedata.name(c)

    # Get numeric value of second character
    print unicodedata.numeric(u[1])

When run, this prints::

    0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
    1 0bf2 No TAMIL NUMBER ONE THOUSAND
    2 0f84 Mn TIBETAN MARK HALANTA
    3 1770 Lo TAGBANWA LETTER SA
    4 33af So SQUARE RAD OVER S SQUARED
    1000.0

The category codes are abbreviations describing the nature of the character.
These are grouped into categories such as "Letter", "Number", "Punctuation", or
"Symbol", which in turn are broken up into subcategories. To take the codes
from the above output, ``'Ll'`` means "Letter, lowercase", ``'No'`` means
"Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
other". See
<http://unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values> for a
list of category codes.
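
Since the first letter of a category code identifies the major class, it's easy
to group characters coarsely. The table and the ``major_class`` helper below are
my own sketch; the class names follow the Unicode general category values, and
the code is written to run on either Python 2.6+ or 3.3+.

```python
import unicodedata

# First letter of the two-letter category code -> major class name.
MAJOR_CLASSES = {
    'L': 'Letter', 'M': 'Mark', 'N': 'Number', 'P': 'Punctuation',
    'S': 'Symbol', 'Z': 'Separator', 'C': 'Other',
}

def major_class(ch):
    """Return the major class of a single character."""
    return MAJOR_CLASSES[unicodedata.category(ch)[0]]

# The same five characters as in the program above.
classes = [major_class(c) for c in u'\xe9\u0bf2\u0f84\u1770\u33af']
```

This is often all you need when, for instance, you want to treat every kind of
letter alike regardless of case or script.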

References
----------

The Unicode and 8-bit string types are described in the Python library reference
at :ref:`typesseq`.

The documentation for the :mod:`unicodedata` module.

The documentation for the :mod:`codecs` module.

Marc-André Lemburg gave a presentation at EuroPython 2002 titled "Python and
Unicode". A PDF version of his slides is available at
<http://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>, and is an
excellent overview of the design of Python's Unicode features.


Reading and Writing Unicode Data
================================

Once you've written some code that works with Unicode data, the next problem is
input/output. How do you get Unicode strings into your program, and how do you
convert Unicode into a form suitable for storage or transmission?

It's possible that you may not need to do anything depending on your input
sources and output destinations; you should check whether the libraries used in
your application support Unicode natively. XML parsers often return Unicode
data, for example. Many relational databases also support Unicode-valued
columns and can return Unicode values from an SQL query.

Unicode data is usually converted to a particular encoding before it gets
written to disk or sent over a socket. It's possible to do all the work
yourself: open a file, read an 8-bit string from it, and convert the string with
``unicode(str, encoding)``. However, the manual approach is not recommended.

One problem is the multi-byte nature of encodings; one Unicode character can be
represented by several bytes. If you want to read the file in arbitrary-sized
chunks (say, 1K or 4K), you need to write error-handling code to catch the case
where only part of the bytes encoding a single Unicode character are read at the
end of a chunk. One solution would be to read the entire file into memory and
then perform the decoding, but that prevents you from working with files that
are extremely large; if you need to read a 2 GB file, you need 2 GB of RAM.
(More, really, since for at least a moment you'd need to have both the encoded
string and its Unicode version in memory.)

The solution would be to use the low-level decoding interface to catch the case
of partial coding sequences. The work of implementing this has already been
done for you: the :mod:`codecs` module includes a version of the :func:`open`
function that returns a file-like object that assumes the file's contents are in
a specified encoding and accepts Unicode parameters for methods such as
``.read()`` and ``.write()``.
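
If you're curious what catching partial sequences looks like, the sketch below
uses ``codecs.getincrementaldecoder()``, which I believe is available from
Python 2.5 onwards. The ``decode_chunks`` helper is a made-up name, and the
chunks are deliberately cut in the middle of multi-byte characters.

```python
import codecs

def decode_chunks(chunks, encoding='utf-8'):
    """Decode an iterable of byte chunks that may split characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pieces = []
    for chunk in chunks:
        # Incomplete trailing bytes are buffered until the next call.
        pieces.append(decoder.decode(chunk))
    # final=True raises an error if bytes are still left over at the end.
    pieces.append(decoder.decode(b'', final=True))
    return u''.join(pieces)

data = u'\ua000abcd\u07b4'.encode('utf-8')
chunks = [data[:2], data[2:8], data[8:]]    # both cuts fall inside a character
text = decode_chunks(chunks)
```

Decoding ``data[:2]`` on its own with ``.decode('utf-8')`` would raise a
:exc:`UnicodeDecodeError`; the incremental decoder simply waits for the rest of
the character.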

The function's parameters are ``open(filename, mode='rb', encoding=None,
errors='strict', buffering=1)``. ``mode`` can be ``'r'``, ``'w'``, or ``'a'``,
just like the corresponding parameter to the regular built-in ``open()``
function; add a ``'+'`` to update the file. ``buffering`` is similarly parallel
to the standard function's parameter. ``encoding`` is a string giving the
encoding to use; if it's left as ``None``, a regular Python file object that
accepts 8-bit strings is returned. Otherwise, a wrapper object is returned, and
data written to or read from the wrapper object will be converted as needed.
``errors`` specifies the action for encoding errors and can be one of the usual
values of 'strict', 'ignore', and 'replace'.

Reading Unicode from a file is therefore simple::

    import codecs
    f = codecs.open('unicode.rst', encoding='utf-8')
    for line in f:
        print repr(line)

It's also possible to open files in update mode, allowing both reading and
writing::

    f = codecs.open('test', encoding='utf-8', mode='w+')
    f.write(u'\u4500 blah blah blah\n')
    f.seek(0)
    print repr(f.readline()[:1])
    f.close()

Unicode character U+FEFF is used as a byte-order mark (BOM), and is often
written as the first character of a file in order to assist with autodetection
of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be
present at the start of a file; when such an encoding is used, the BOM will be
automatically written as the first character and will be silently dropped when
the file is read. There are variants of these encodings, such as 'utf-16-le'
and 'utf-16-be' for little-endian and big-endian encodings, that specify one
particular byte ordering and don't skip the BOM.
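
The difference between the variants is easy to see without touching a file.
This sketch uses the ``.encode()`` and ``.decode()`` methods rather than
:func:`codecs.open`, on the assumption that the same codecs are at work in both
cases; the variable names are mine.

```python
import codecs

text = u'abc'

with_bom = text.encode('utf-16')      # native byte order, BOM written first
no_bom = text.encode('utf-16-le')     # fixed byte order, no BOM written

has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# 'utf-16' drops the BOM on reading; 'utf-16-le' keeps it as U+FEFF.
dropped = with_bom.decode('utf-16')
kept = (codecs.BOM_UTF16_LE + no_bom).decode('utf-16-le')
```

If you read a file with an explicit byte-order variant and find a stray
``u'\ufeff'`` at the start of your data, this is the reason.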
				564
				565
Unicode filenames
-----------------

Most of the operating systems in common use today support filenames that contain
arbitrary Unicode characters. Usually this is implemented by converting the
Unicode string into some encoding that varies depending on the system. For
example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on
Windows, Python uses the name "mbcs" to refer to whatever the currently
configured encoding is. On Unix systems, there will only be a filesystem
encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
you haven't, the default encoding is ASCII.

The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
not much reason to bother. When opening a file for reading or writing, you can
usually just provide the Unicode string as the filename, and it will be
automatically converted to the right encoding for you::

    filename = u'filename\u4500abc'
    f = open(filename, 'w')
    f.write('blah\n')
    f.close()

Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.
				591
:func:`os.listdir`, which returns filenames, raises an issue: should it return
the Unicode version of filenames, or should it return 8-bit strings containing
the encoded versions? :func:`os.listdir` will do both, depending on whether you
provided the directory path as an 8-bit string or a Unicode string. If you pass
a Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing an 8-bit
path will return the 8-bit versions of the filenames. For example, assuming the
default filesystem encoding is UTF-8, running the following program::

    fn = u'filename\u4500abc'
    f = open(fn, 'w')
    f.close()

    import os
    print os.listdir('.')
    print os.listdir(u'.')

will produce the following output::

    amk:~$ python t.py
    ['.svn', 'filename\xe4\x94\x80abc', ...]
    [u'.svn', u'filename\u4500abc', ...]

The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.
				617
				618
Tips for Writing Unicode-aware Programs
---------------------------------------

This section provides some suggestions on writing software that deals with
Unicode.

The most important tip is:

    Software should only work with Unicode strings internally, converting to a
    particular encoding on output.

If you attempt to write processing functions that accept both Unicode and 8-bit
strings, you will find your program vulnerable to bugs wherever you combine the
two different kinds of strings. Python's default encoding is ASCII, so whenever
an 8-bit string containing a byte with a value > 127 is combined with a Unicode
string, you'll get a :exc:`UnicodeDecodeError` because that byte can't be
decoded by the ASCII encoding.
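
To make the failure concrete: the implicit conversion that happens when the two
string types are mixed behaves like an explicit ``decode('ascii')``. The sketch
below uses the explicit form so that it runs unchanged on Python 2 and 3 (on
Python 3, mixing the two types raises ``TypeError`` instead); the sample bytes
are an assumption chosen for illustration.

```python
# UTF-8 bytes for u'caf\xe9', standing in for data read from a file or socket.
data = b'caf\xc3\xa9'

# ASCII cannot decode bytes above 127, so this fails just as the implicit
# conversion would.
try:
    data.decode('ascii')
    failed = False
except UnicodeDecodeError:
    failed = True

# Decoding once at the boundary, with the real encoding, avoids the problem.
text = data.decode('utf-8')
greeting = u'Un ' + text
```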
				637
It's easy to miss such problems if you only test your software with data that
doesn't contain any accents; everything will seem to work, but there's actually
a bug in your program waiting for the first user who attempts to use characters
> 127. A second tip, therefore, is:

    Include characters > 127 and, even better, characters > 255 in your test
    data.
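
One way to act on this tip is to keep a small set of sample strings that cover
each range and run them through whatever encodings the program claims to
support. The ``SAMPLES`` list and ``round_trips`` helper below are hypothetical
names for such a check, not part of any library.

```python
# Hypothetical test data: pure ASCII, a character > 127, and a character > 255.
SAMPLES = [u'plain', u'caf\xe9', u'\u4500 blah']

def round_trips(text, encoding):
    """Return True if text survives encoding and decoding unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False

utf8_results = [round_trips(s, 'utf-8') for s in SAMPLES]
latin1_results = [round_trips(s, 'latin-1') for s in SAMPLES]
ascii_results = [round_trips(s, 'ascii') for s in SAMPLES]
```

Only the third sample exposes code that quietly assumes Latin-1, which is why
characters > 255 are worth including.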
				645
When using data coming from a web browser or some other untrusted source, a
common technique is to check for illegal characters in a string before using the
string in a generated command line or storing it in a database. If you're doing
this, be careful to check the string once it's in the form that will be used or
stored; it's possible for encodings to be used to disguise characters. This is
especially true if the input data also specifies the encoding; many encodings
leave the commonly checked-for characters alone, but Python includes some
encodings such as ``'base64'`` that modify every single character.

For example, let's say you have a content management system that takes a Unicode
filename, and you want to disallow paths with a '/' character. You might write
this code::

    def read_file(filename, encoding):
        # The check runs on the still-encoded form of the filename.
        if '/' in filename:
            raise ValueError("'/' not allowed in filenames")
        unicode_name = filename.decode(encoding)
        f = open(unicode_name, 'r')
        # ... return contents of file ...

However, if an attacker could specify the ``'base64'`` encoding, they could pass
``'L2V0Yy9wYXNzd2Q='``, which is the base-64 encoded form of the string
``'/etc/passwd'``, to read a system file. The above code looks for ``'/'``
characters in the encoded form and misses the dangerous character in the
resulting decoded form.
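
A safer ordering is to decode first and validate the decoded result. The sketch
below is one possible shape for that check, using a hypothetical helper name. It
demonstrates the problem with UTF-7 rather than ``'base64'``: UTF-7 is a text
encoding that can also spell ``'/'`` without using that byte, and the example
then runs on both Python 2 and 3.

```python
def decode_filename(filename, encoding):
    """Decode an 8-bit filename, then reject path separators in the result."""
    unicode_name = filename.decode(encoding)
    # Validate the form that will actually be passed to open().
    if u'/' in unicode_name:
        raise ValueError("'/' not allowed in filenames")
    return unicode_name

ok = decode_filename(b'report.txt', 'utf-8')

# '+AC8-' is a UTF-7 spelling of '/', so the encoded bytes contain no slash.
disguised = b'+AC8-etc+AC8-passwd'
try:
    decode_filename(disguised, 'utf-7')
    rejected = False
except ValueError:
    rejected = True
```

A real application would probably also restrict which encodings a caller may
name at all, rather than accepting any codec.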
				671
References
----------

The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
Applications in Python" are available at
<http://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
and discuss questions of character encodings as well as how to internationalize
and localize an application.
				680
				681
Revision History and Acknowledgements
=====================================

Thanks to the following people who have noted errors or offered suggestions on
this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler,
Marc-André Lemburg, Martin von Löwis, Chad Whitacre.

Version 1.0: posted August 5 2005.

Version 1.01: posted August 7 2005. Corrects factual and markup errors; adds
several links.

Version 1.02: posted August 16 2005. Corrects factual errors.
				695
				696
.. comment Additional topic: building Python w/ UCS2 or UCS4 support
.. comment Describe obscure -U switch somewhere?
.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter
				700
.. comment
   Original outline:

   - [ ] Unicode introduction
       - [ ] ASCII
       - [ ] Terms
           - [ ] Character
           - [ ] Code point
       - [ ] Encodings
           - [ ] Common encodings: ASCII, Latin-1, UTF-8
   - [ ] Unicode Python type
       - [ ] Writing unicode literals
           - [ ] Obscurity: -U switch
       - [ ] Built-ins
           - [ ] unichr()
           - [ ] ord()
           - [ ] unicode() constructor
       - [ ] Unicode type
           - [ ] encode(), decode() methods
   - [ ] Unicodedata module for character properties
   - [ ] I/O
       - [ ] Reading/writing Unicode data into files
           - [ ] Byte-order marks
       - [ ] Unicode filenames
   - [ ] Writing Unicode programs
       - [ ] Do everything in Unicode
       - [ ] Declaring source code encodings (PEP 263)
   - [ ] Other issues
       - [ ] Building Python (UCS2, UCS4)