\section{\module{unicodedata} ---
         Unicode Database}

\declaremodule{standard}{unicodedata}
\modulesynopsis{Access the Unicode Database.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Martin v. L\"owis}{martin@v.loewis.de}

\index{Unicode}
\index{character}
\indexii{Unicode}{database}

This module provides access to the Unicode Character Database, which
defines character properties for all Unicode characters. The data in
this database is based on version 3.2.0 of the \file{UnicodeData.txt}
file, which is publicly available from \url{ftp://ftp.unicode.org/}.

The module uses the same names and symbols as defined by the
UnicodeData File Format 3.2.0 (see
\url{http://www.unicode.org/Public/UNIDATA/UnicodeData.html}). It
defines the following functions:

\begin{funcdesc}{lookup}{name}
  Looks up a character by name. If a character with the
  given name is found, returns the corresponding Unicode
  character. If not found, \exception{KeyError} is raised.
\end{funcdesc}

\begin{funcdesc}{name}{unichr\optional{, default}}
  Returns the name assigned to the Unicode character
  \var{unichr} as a string. If no name is defined,
  \var{default} is returned, or, if not given,
  \exception{ValueError} is raised.
\end{funcdesc}
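
As a minimal sketch of how \function{lookup()} and \function{name()}
mirror each other, the following round trip uses the cedilla character
discussed later in this section. It assumes a current Python 3
interpreter, where every string is a Unicode string; the variable names
and the \code{"<no name>"} default are illustrative only.

```python
import unicodedata

# Round trip: character name -> character -> character name.
cedilla = unicodedata.lookup("LATIN CAPITAL LETTER C WITH CEDILLA")
cedilla_name = unicodedata.name(cedilla)

# U+0000 is a control character without a name, so the default is returned
# instead of ValueError being raised.
unnamed = unicodedata.name("\u0000", "<no name>")

# A name that is not in the database raises KeyError.
try:
    unicodedata.lookup("NO SUCH CHARACTER NAME")
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(repr(cedilla), cedilla_name, unnamed, lookup_failed)
```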

\begin{funcdesc}{decimal}{unichr\optional{, default}}
  Returns the decimal value assigned to the Unicode character
  \var{unichr} as an integer. If no such value is defined,
  \var{default} is returned, or, if not given,
  \exception{ValueError} is raised.
\end{funcdesc}

\begin{funcdesc}{digit}{unichr\optional{, default}}
  Returns the digit value assigned to the Unicode character
  \var{unichr} as an integer. If no such value is defined,
  \var{default} is returned, or, if not given,
  \exception{ValueError} is raised.
\end{funcdesc}

\begin{funcdesc}{numeric}{unichr\optional{, default}}
  Returns the numeric value assigned to the Unicode character
  \var{unichr} as a float. If no such value is defined, \var{default} is
  returned, or, if not given, \exception{ValueError} is raised.
\end{funcdesc}
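
The three value functions differ in which characters they accept. The
sketch below, again assuming a current Python 3 interpreter, picks one
character for each case; the variable names are illustrative only.

```python
import unicodedata

nine = unicodedata.decimal("9")        # DIGIT NINE has a decimal value
squared = unicodedata.digit("\u00b2")  # SUPERSCRIPT TWO has a digit value
half = unicodedata.numeric("\u00bd")   # VULGAR FRACTION ONE HALF is numeric only

# SUPERSCRIPT TWO has no decimal value, so the default is returned.
no_decimal = unicodedata.decimal("\u00b2", None)

# Without a default, a missing value raises ValueError.
try:
    unicodedata.digit("\u00bd")
    digit_failed = False
except ValueError:
    digit_failed = True

print(nine, squared, half, no_decimal, digit_failed)
```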

\begin{funcdesc}{category}{unichr}
  Returns the general category assigned to the Unicode character
  \var{unichr} as a string.
\end{funcdesc}

\begin{funcdesc}{bidirectional}{unichr}
  Returns the bidirectional category assigned to the Unicode character
  \var{unichr} as a string. If no such value is defined, an empty string
  is returned.
\end{funcdesc}

\begin{funcdesc}{combining}{unichr}
  Returns the canonical combining class assigned to the Unicode
  character \var{unichr} as an integer. Returns \code{0} if no combining
  class is defined.
\end{funcdesc}
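
To give a feel for the values these three functions return, the
following sketch tabulates them for a handful of characters. It assumes
a current Python 3 interpreter; the choice of sample characters and the
\code{rows} name are illustrative only.

```python
import unicodedata

# A letter, a digit, a space, COMBINING CEDILLA and HEBREW LETTER ALEF.
samples = ["A", "1", " ", "\u0327", "\u05d0"]

# One row per character: code point, general category,
# bidirectional category and canonical combining class.
rows = [
    (
        "U+%04X" % ord(ch),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        unicodedata.combining(ch),
    )
    for ch in samples
]

for row in rows:
    print(row)
```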

\begin{funcdesc}{east_asian_width}{unichr}
  Returns the East Asian width assigned to the Unicode character
  \var{unichr} as a string.
\end{funcdesc}

\begin{funcdesc}{mirrored}{unichr}
  Returns the mirrored property assigned to the Unicode character
  \var{unichr} as an integer. Returns \code{1} if the character has been
  identified as a ``mirrored'' character in bidirectional text,
  \code{0} otherwise.
\end{funcdesc}
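
A short sketch of both properties follows, assuming a current Python 3
interpreter. The width codes shown (\code{'Na'} for narrow, \code{'F'}
for fullwidth, \code{'W'} for wide) come from the Unicode East Asian
Width data rather than from this module's description, and the variable
names are illustrative only.

```python
import unicodedata

narrow = unicodedata.east_asian_width("A")          # LATIN CAPITAL LETTER A
fullwidth = unicodedata.east_asian_width("\uff21")  # FULLWIDTH LATIN CAPITAL LETTER A
wide = unicodedata.east_asian_width("\u3042")       # HIRAGANA LETTER A

# Parentheses are mirrored in right-to-left text; letters are not.
paren_mirrored = unicodedata.mirrored("(")
letter_mirrored = unicodedata.mirrored("A")

print(narrow, fullwidth, wide, paren_mirrored, letter_mirrored)
```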

\begin{funcdesc}{decomposition}{unichr}
  Returns the character decomposition mapping assigned to the Unicode
  character \var{unichr} as a string. An empty string is returned if
  no such mapping is defined.
\end{funcdesc}
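
The mapping is returned as a string of space-separated hexadecimal code
points, optionally preceded by a tag such as \code{<compat>}. The
sketch below, assuming a current Python 3 interpreter, shows the three
cases and rebuilds the decomposed sequence from a canonical mapping; the
variable names are illustrative only.

```python
import unicodedata

canonical = unicodedata.decomposition("\u00c7")  # LATIN CAPITAL LETTER C WITH CEDILLA
compat = unicodedata.decomposition("\u2160")     # ROMAN NUMERAL ONE
unmapped = unicodedata.decomposition("A")        # no mapping defined

# A canonical mapping carries no tag, so every field is a hex code point.
rebuilt = "".join(chr(int(field, 16)) for field in canonical.split())

print(repr(canonical), repr(compat), repr(unmapped), repr(rebuilt))
```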

\begin{funcdesc}{normalize}{form, unistr}

Return the normal form \var{form} for the Unicode string \var{unistr}.
Valid values for \var{form} are \code{'NFC'}, \code{'NFKC'},
\code{'NFD'}, and \code{'NFKD'}.

The Unicode standard defines various normalization forms of a Unicode
string, based on the definition of canonical equivalence and
compatibility equivalence. In Unicode, several characters can be
expressed in various ways. For example, the character U+00C7 (LATIN
CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence
U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

For each character, there are two normal forms: normal form C and
normal form D. Normal form D (NFD) is also known as canonical
decomposition, and translates each character into its decomposed form.
Normal form C (NFC) first applies a canonical decomposition, then
composes pre-combined characters again.

In addition to these two forms, there are two additional normal forms
based on compatibility equivalence. In Unicode, certain characters are
supported which normally would be unified with other characters. For
example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049
(LATIN CAPITAL LETTER I). However, it is supported in Unicode for
compatibility with existing character sets (e.g. gb2312).

The normal form KD (NFKD) will apply the compatibility decomposition,
i.e. replace all compatibility characters with their equivalents. The
normal form KC (NFKC) first applies the compatibility decomposition,
followed by the canonical composition.

\versionadded{2.3}
\end{funcdesc}
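
The two examples used in the description of \function{normalize()} can
be checked directly. The following is a minimal sketch, assuming a
current Python 3 interpreter; the variable names are illustrative only.

```python
import unicodedata

composed = "\u00c7"     # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "C\u0327"  # LATIN CAPITAL LETTER C + COMBINING CEDILLA
roman_one = "\u2160"    # ROMAN NUMERAL ONE

# Canonical forms: NFD splits the character, NFC puts it back together.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)

# Canonical forms leave compatibility characters alone ...
nfc_roman = unicodedata.normalize("NFC", roman_one)

# ... while the compatibility forms replace them with their equivalents.
nfkd_roman = unicodedata.normalize("NFKD", roman_one)

# NFKC: compatibility decomposition followed by canonical composition.
nfkc_mixed = unicodedata.normalize("NFKC", roman_one + decomposed)

print(repr(nfd), repr(nfc), repr(nfc_roman), repr(nfkd_roman), repr(nfkc_mixed))
```

Comparing strings only after normalizing both to the same form avoids
treating \code{composed} and \code{decomposed} as different text.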

In addition, the module exposes the following constant:

\begin{datadesc}{unidata_version}
The version of the Unicode database used in this module.

\versionadded{2.3}
\end{datadesc}
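
As a small sketch of reading the constant, the snippet below assumes a
current Python 3 interpreter and assumes that the value is a dotted
version string such as \code{'3.2.0'}; the exact value depends on the
Python release in use.

```python
import unicodedata

version = unicodedata.unidata_version

# Split the dotted string to compare against a minimum major version.
major = int(version.split(".")[0])

print(version, major)
```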