Blame - Doc/library/unicodedata.rst - platform/external/python/cpython2

Georg Brandl	8ec7f65	2007-08-15 14:28:01 +0000	[diff] [blame]	1
				2	:mod:`unicodedata` --- Unicode Database
				3	=======================================
				4
				5	.. module:: unicodedata
				6	:synopsis: Access the Unicode Database.
				7	.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
				8	.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
				9	.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>
				10
				11
				12	.. index::
				13	single: Unicode
				14	single: character
				15	pair: Unicode; database
				16
				17	This module provides access to the Unicode Character Database, which defines
				18	character properties for all Unicode characters. The data in this database is
				19	based on version 5.2.0 of the :file:`UnicodeData.txt` file, which is publicly
				20	available from ftp://ftp.unicode.org/.
				21
				22	The module uses the same names and symbols as defined by the UnicodeData File
				23	Format 5.2.0 (see http://www.unicode.org/reports/tr44/tr44-4.html).
				24	It defines the following functions:
				25
				26
				27	.. function:: lookup(name)
				28
				29	Look up a character by name. If a character with the given name is found,
				30	return the corresponding Unicode character; otherwise, raise :exc:`KeyError`.
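
As a minimal sketch of the behaviour described above, the helper below wraps
``lookup()`` so that an unknown name yields ``None`` instead of an exception.
The helper name ``lookup_or_none`` and the deliberately bogus character name
are illustrative assumptions, not part of the module.

```python
import unicodedata


def lookup_or_none(name):
    """Return the character called `name`, or None if the name is unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # lookup() signals an unknown character name with KeyError.
        return None


brace = lookup_or_none('LEFT CURLY BRACKET')         # the character '{'
missing = lookup_or_none('NO SUCH CHARACTER NAME')   # None
```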
				31
				32
				33	.. function:: name(unichr[, default])
				34
				35	Returns the name assigned to the Unicode character *unichr* as a string. If no
				36	name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
				37	raised.
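
The optional *default* argument can be illustrated with a character that has
no name in the database. This sketch assumes U+0000, a control character, as
the unnamed example; the ``'<unnamed>'`` fallback string is an arbitrary
choice. The ``u''`` prefixes follow this document's Python 2 examples and are
also accepted by Python 3.3 and later.

```python
import unicodedata

# A named character: the name is returned as a string.
solidus = unicodedata.name(u'/')

# U+0000 is a control character with no name in the database, so the
# supplied default is returned instead of raising ValueError.
fallback = unicodedata.name(u'\u0000', '<unnamed>')

# Without a default, the same call raises ValueError.
try:
    unicodedata.name(u'\u0000')
    raised = False
except ValueError:
    raised = True
```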
				38
				39
				40	.. function:: decimal(unichr[, default])
				41
				42	Returns the decimal value assigned to the Unicode character *unichr* as an
				43	integer. If no such value is defined, *default* is returned, or, if not given,
				44	:exc:`ValueError` is raised.
				45
				46
				47	.. function:: digit(unichr[, default])
				48
				49	Returns the digit value assigned to the Unicode character *unichr* as an
				50	integer. If no such value is defined, *default* is returned, or, if not given,
				51	:exc:`ValueError` is raised.
				52
				53
				54	.. function:: numeric(unichr[, default])
				55
				56	Returns the numeric value assigned to the Unicode character *unichr* as a
				57	float. If no such value is defined, *default* is returned, or, if not given,
				58	:exc:`ValueError` is raised.
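
The three numeric lookups (``decimal()``, ``digit()`` and ``numeric()``) cover
progressively larger sets of characters. The sketch below, which assumes
``None`` as the *default* and uses a hypothetical helper named
``numeric_properties``, shows one character from each set.

```python
import unicodedata


def numeric_properties(ch):
    """Return (decimal, digit, numeric) for `ch`, using None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


# DIGIT NINE has all three values.
nine = numeric_properties(u'9')
# SUPERSCRIPT TWO has a digit and a numeric value, but no decimal value.
superscript_two = numeric_properties(u'\u00b2')
# VULGAR FRACTION ONE HALF has only a numeric value.
one_half = numeric_properties(u'\u00bd')
```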
				59
				60
				61	.. function:: category(unichr)
				62
				63	Returns the general category assigned to the Unicode character *unichr* as a
				64	string.
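
General categories are two-letter codes whose first letter gives the major
class (for example ``L`` for letters, ``N`` for numbers). The following sketch
classifies each character of a short, arbitrarily chosen sample string.

```python
import unicodedata

text = u'Ab1 ,'

# One two-letter general category per character.
categories = [unicodedata.category(ch) for ch in text]

# The first letter of the category identifies the major class.
letters = [ch for ch in text if unicodedata.category(ch).startswith('L')]
```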
				65
				66
				67	.. function:: bidirectional(unichr)
				68
				69	Returns the bidirectional class assigned to the Unicode character *unichr* as
				70	a string. If no such value is defined, an empty string is returned.
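
For illustration, the sketch below collects the bidirectional class of a few
sample characters; the selection of characters is an assumption made for this
example.

```python
import unicodedata

samples = [
    u'A',        # LATIN CAPITAL LETTER A
    u'\u05d0',   # HEBREW LETTER ALEF
    u'\u0627',   # ARABIC LETTER ALEF
    u'1',        # DIGIT ONE
    u'\u0660',   # ARABIC-INDIC DIGIT ZERO
]

# Map each sample character to its bidirectional class abbreviation.
bidi_classes = [unicodedata.bidirectional(ch) for ch in samples]
```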
				71
				72
				73	.. function:: combining(unichr)
				74
				75	Returns the canonical combining class assigned to the Unicode character
				76	*unichr* as an integer. Returns ``0`` if no combining class is defined.
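
A common use of the combining class is to tell base characters (class ``0``)
apart from combining marks. The sketch below shows the raw values and then a
hypothetical helper, ``strip_marks``, that removes combining marks after
canonical decomposition. It is one possible approach, not a complete
accent-folding solution.

```python
import unicodedata

base = unicodedata.combining(u'C')          # base letters have class 0
cedilla = unicodedata.combining(u'\u0327')  # COMBINING CEDILLA
acute = unicodedata.combining(u'\u0301')    # COMBINING ACUTE ACCENT


def strip_marks(text):
    """Decompose `text` and drop every character with a nonzero combining class."""
    decomposed = unicodedata.normalize('NFD', text)
    return u''.join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


plain = strip_marks(u'\u00c7a')  # C WITH CEDILLA followed by 'a'
```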
				77
				78
				79	.. function:: east_asian_width(unichr)
				80
				81	Returns the East Asian width assigned to the Unicode character *unichr* as a
				82	string.
				83
				84	.. versionadded:: 2.4
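
The width property is often used to estimate how many terminal columns a
string occupies. The sketch below shows the raw property values and a
hypothetical ``display_width`` helper that counts fullwidth (``F``) and wide
(``W``) characters as two columns. Treating every other class as one column
is a simplifying assumption that ignores ambiguous-width and combining
characters.

```python
import unicodedata

narrow = unicodedata.east_asian_width(u'A')          # LATIN CAPITAL LETTER A
fullwidth = unicodedata.east_asian_width(u'\uff21')  # FULLWIDTH LATIN CAPITAL LETTER A
wide = unicodedata.east_asian_width(u'\u4e2d')       # a CJK unified ideograph
halfwidth = unicodedata.east_asian_width(u'\uff71')  # HALFWIDTH KATAKANA LETTER A


def display_width(text):
    """Rough column count: 2 for 'F' and 'W' characters, 1 for all others."""
    return sum(2 if unicodedata.east_asian_width(ch) in ('F', 'W') else 1
               for ch in text)


columns = display_width(u'A\u4e2d')
```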
				85
				86
				87	.. function:: mirrored(unichr)
				88
				89	Returns the mirrored property assigned to the Unicode character *unichr* as an
				90	integer. Returns ``1`` if the character has been identified as a "mirrored"
				91	character in bidirectional text, ``0`` otherwise.
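
As a brief illustration, paired punctuation such as parentheses is mirrored in
right-to-left text, while ordinary letters are not. The sample characters are
chosen for this sketch only.

```python
import unicodedata

left_paren = unicodedata.mirrored(u'(')   # mirrored in bidirectional text
less_than = unicodedata.mirrored(u'<')    # also mirrored
letter = unicodedata.mirrored(u'A')       # not mirrored
```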
				92
				93
				94	.. function:: decomposition(unichr)
				95
				96	Returns the character decomposition mapping assigned to the Unicode character
				97	*unichr* as a string. An empty string is returned in case no such mapping is
				98	defined.
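
The mapping is returned as space-separated hexadecimal code points, optionally
preceded by a tag such as ``<compat>`` for compatibility decompositions. The
sketch below shows one canonical mapping, one compatibility mapping, and a
character without a mapping.

```python
import unicodedata

# Canonical decomposition: C WITH CEDILLA -> C + COMBINING CEDILLA.
canonical = unicodedata.decomposition(u'\u00c7')

# Compatibility decomposition: ROMAN NUMERAL ONE -> LATIN CAPITAL LETTER I.
compat = unicodedata.decomposition(u'\u2160')

# A plain letter has no decomposition mapping.
none_defined = unicodedata.decomposition(u'A')

# The code points can be turned back into integers if needed.
code_points = [int(part, 16) for part in canonical.split()]
```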
				99
				100
				101	.. function:: normalize(form, unistr)
				102
				103	Return the normal form *form* for the Unicode string *unistr*. Valid values for
				104	*form* are 'NFC', 'NFKC', 'NFD', and 'NFKD'.
				105
				106	The Unicode standard defines various normalization forms of a Unicode string,
				107	based on the definition of canonical equivalence and compatibility equivalence.
				108	In Unicode, several characters can be expressed in various ways. For example, the
				109	character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
				110	the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
				111
				112	For each character, there are two normal forms: normal form C and normal form D.
				113	Normal form D (NFD) is also known as canonical decomposition, and translates
				114	each character into its decomposed form. Normal form C (NFC) first applies a
				115	canonical decomposition, then composes pre-combined characters again.
				116
				117	In addition to these two forms, there are two additional normal forms based on
				118	compatibility equivalence. In Unicode, certain characters are supported which
				119	normally would be unified with other characters. For example, U+2160 (ROMAN
				120	NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
				121	However, it is supported in Unicode for compatibility with existing character
				122	sets (e.g. gb2312).
				123
				124	The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
				125	replace all compatibility characters with their equivalents. The normal form KC
				126	(NFKC) first applies the compatibility decomposition, followed by the canonical
				127	composition.
				128
				129	Even if two Unicode strings are normalized and look the same to
				130	a human reader, if one has combining characters and the other
				131	doesn't, they may not compare equal.
				132
				133	.. versionadded:: 2.3
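
The four normal forms described above can be sketched with the two example
characters from the text, U+00C7 and U+2160. The variable names are
illustrative. The last lines show why strings should be brought to the same
normal form before they are compared.

```python
import unicodedata

composed = u'\u00c7'      # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = u'C\u0327'   # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# NFD decomposes; NFC decomposes and then recomposes.
nfd = unicodedata.normalize('NFD', composed)
nfc = unicodedata.normalize('NFC', decomposed)

# Compatibility forms additionally replace compatibility characters.
roman_one = u'\u2160'     # ROMAN NUMERAL ONE
nfkd = unicodedata.normalize('NFKD', roman_one)
nfc_roman = unicodedata.normalize('NFC', roman_one)  # canonical forms leave it alone

# The two spellings differ as raw strings but agree once both are normalized
# to the same form.
equal_raw = composed == decomposed
equal_normalized = (unicodedata.normalize('NFC', composed) ==
                    unicodedata.normalize('NFC', decomposed))
```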
				134
				135	In addition, the module exposes the following constant:
				136
				137
				138	.. data:: unidata_version
				139
				140	The version of the Unicode database used in this module.
				141
				142	.. versionadded:: 2.3
				143
				144
				145	.. data:: ucd_3_2_0
				146
				147	This is an object that has the same methods as the entire module, but uses the
				148	Unicode database version 3.2 instead, for applications that require this
				149	specific version of the Unicode database (such as IDNA).
				150
				151	.. versionadded:: 2.5
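
As a sketch of how the two databases can differ, the example below queries a
character that was added to Unicode after version 3.2. It assumes U+1E9E
(LATIN CAPITAL LETTER SHARP S, introduced in Unicode 5.1) as the sample and
requires that the module's main database is at least that recent, which holds
for the 5.2.0 data described in this document.

```python
import unicodedata

old_db = unicodedata.ucd_3_2_0

current_version = unicodedata.unidata_version  # version of the main database
old_version = old_db.unidata_version           # version of the frozen database

sharp_s = u'\u1e9e'  # LATIN CAPITAL LETTER SHARP S, added after Unicode 3.2

# The main database knows the character as an uppercase letter, while the
# 3.2 database reports it as unassigned.
current_category = unicodedata.category(sharp_s)
old_category = old_db.category(sharp_s)
```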
				152
				153	Examples:
				154
				155	>>> import unicodedata
				156	>>> unicodedata.lookup('LEFT CURLY BRACKET')
				157	u'{'
				158	>>> unicodedata.name(u'/')
				159	'SOLIDUS'
				160	>>> unicodedata.decimal(u'9')
				161	9
				162	>>> unicodedata.decimal(u'a')
				163	Traceback (most recent call last):
				164	File "<stdin>", line 1, in ?
				165	ValueError: not a decimal
				166	>>> unicodedata.category(u'A') # 'L'etter, 'u'ppercase
				167	'Lu'
				168	>>> unicodedata.bidirectional(u'\u0660') # 'A'rabic, 'N'umber
				169	'AN'
				170