:mod:`unicodedata` --- Unicode Database
=======================================

.. module:: unicodedata
   :synopsis: Access the Unicode Database.
.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: character
   pair: Unicode; database

This module provides access to the Unicode Character Database which defines
character properties for all Unicode characters. The data in this database is
based on the :file:`UnicodeData.txt` file version 5.2.0 which is publicly
available from ftp://ftp.unicode.org/.

The module uses the same names and symbols as defined by the UnicodeData File
Format 5.2.0 (see http://www.unicode.org/reports/tr44/tr44-4.html).
It defines the following functions:


.. function:: lookup(name)

   Look up a character by name. If a character with the given name is found,
   return the corresponding character. If not found, :exc:`KeyError` is raised.


.. function:: name(chr[, default])

   Returns the name assigned to the character *chr* as a string. If no
   name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
   raised.

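As a minimal sketch of how :func:`lookup` and :func:`name` report missing
entries, the following wraps each call so that a failure yields a fallback
value instead of an exception. The helper names ``describe`` and ``find`` are
illustrative only and are not part of the module.

```python
import unicodedata


def describe(ch, default='<unnamed>'):
    """Hypothetical helper: name of *ch*, or *default* if it has none."""
    # Passing a default suppresses the ValueError raised for unnamed characters.
    return unicodedata.name(ch, default)


def find(name):
    """Hypothetical helper: character called *name*, or None if unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


print(describe('/'))                # SOLIDUS
print(describe('\n'))               # <unnamed> (control characters have no name)
print(find('LEFT CURLY BRACKET'))   # {
print(find('NO SUCH CHARACTER'))    # None
```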

.. function:: decimal(chr[, default])

   Returns the decimal value assigned to the character *chr* as an integer.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: digit(chr[, default])

   Returns the digit value assigned to the character *chr* as an integer.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: numeric(chr[, default])

   Returns the numeric value assigned to the character *chr* as a float.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.

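The three numeric lookups differ in which characters they accept. The sketch
below collects all three values for a character, passing ``None`` as the
default so that an undefined value does not raise. The helper name
``numeric_properties`` is an assumption made for illustration.

```python
import unicodedata


def numeric_properties(ch):
    """Hypothetical helper: (decimal, digit, numeric), None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


print(numeric_properties('9'))        # (9, 9, 9.0)
print(numeric_properties('\u00b2'))   # SUPERSCRIPT TWO: (None, 2, 2.0)
print(numeric_properties('\u00bd'))   # VULGAR FRACTION ONE HALF: (None, None, 0.5)
print(numeric_properties('a'))        # (None, None, None)
```

An ordinary digit has all three values, a superscript digit has no decimal
value, and a fraction has only a numeric value.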

.. function:: category(chr)

   Returns the general category assigned to the character *chr* as a
   string.


.. function:: bidirectional(chr)

   Returns the bidirectional category assigned to the character *chr* as a
   string. If no such value is defined, an empty string is returned.


.. function:: combining(chr)

   Returns the canonical combining class assigned to the character *chr*
   as an integer. Returns ``0`` if no combining class is defined.

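To see how these three properties relate, the following sketch gathers them
into one tuple per character. The helper name ``classify`` is illustrative and
not part of the module.

```python
import unicodedata


def classify(ch):
    """Hypothetical helper: (general category, bidi class, combining class)."""
    return (
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        unicodedata.combining(ch),
    )


print(classify('A'))        # ('Lu', 'L', 0)
print(classify('\u0660'))   # ARABIC-INDIC DIGIT ZERO: ('Nd', 'AN', 0)
print(classify('\u0327'))   # COMBINING CEDILLA: ('Mn', 'NSM', 202)
```

Only the combining mark has a non-zero combining class. Base characters such
as letters and digits report ``0``.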

.. function:: east_asian_width(chr)

   Returns the East Asian width assigned to the character *chr* as a
   string.


.. function:: mirrored(chr)

   Returns the mirrored property assigned to the character *chr* as an
   integer. Returns ``1`` if the character has been identified as a "mirrored"
   character in bidirectional text, ``0`` otherwise.

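One possible use of :func:`east_asian_width` is estimating how many terminal
columns a string occupies. The sketch below is a rough approximation under a
stated assumption: wide (``'W'``) and fullwidth (``'F'``) characters take two
columns and everything else takes one. The helper name ``display_width`` is
hypothetical.

```python
import unicodedata


def display_width(text):
    """Hypothetical helper: estimate terminal columns needed for *text*."""
    # Assumption: 'W' and 'F' characters take two columns, all others one.
    # Combining marks and control characters are not special-cased here.
    return sum(
        2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
        for ch in text
    )


print(unicodedata.east_asian_width('A'))        # Na (narrow)
print(unicodedata.east_asian_width('\uff21'))   # F (fullwidth)
print(display_width('abc'))                     # 3
print(display_width('\u4e2d\u6587'))            # 4

# mirrored() returns an integer flag rather than a bool.
print(unicodedata.mirrored('('))                # 1
print(unicodedata.mirrored('A'))                # 0
```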

.. function:: decomposition(chr)

   Returns the character decomposition mapping assigned to the character
   *chr* as a string. An empty string is returned in case no such mapping is
   defined.

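The string returned by :func:`decomposition` consists of hexadecimal code
points separated by spaces, optionally preceded by a tag such as ``<compat>``.
As a sketch, it can be split into its parts as follows. The helper name
``parse_decomposition`` is an assumption made for illustration.

```python
import unicodedata


def parse_decomposition(ch):
    """Hypothetical helper: split the mapping into (tag, [characters])."""
    fields = unicodedata.decomposition(ch).split()
    tag = None
    # Compatibility mappings start with a tag in angle brackets.
    if fields and fields[0].startswith('<'):
        tag = fields.pop(0)
    return tag, [chr(int(field, 16)) for field in fields]


print(unicodedata.decomposition('\u00c7'))   # 0043 0327
print(parse_decomposition('\u00c7'))         # (None, ['C', '\u0327'])
print(parse_decomposition('\u2160'))         # ('<compat>', ['I'])
print(parse_decomposition('A'))              # (None, [])
```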

.. function:: normalize(form, unistr)

   Return the normal form *form* for the Unicode string *unistr*. Valid values for
   *form* are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

   The Unicode standard defines various normalization forms of a Unicode string,
   based on the definition of canonical equivalence and compatibility equivalence.
   In Unicode, several characters can be expressed in various ways. For example, the
   character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
   the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

   For each character, there are two normal forms: normal form C and normal form D.
   Normal form D (NFD) is also known as canonical decomposition, and translates
   each character into its decomposed form. Normal form C (NFC) first applies a
   canonical decomposition, then composes pre-combined characters again.

   In addition to these two forms, there are two additional normal forms based on
   compatibility equivalence. In Unicode, certain characters are supported which
   normally would be unified with other characters. For example, U+2160 (ROMAN
   NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
   However, it is supported in Unicode for compatibility with existing character
   sets (e.g. gb2312).

   The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
   replace all compatibility characters with their equivalents. The normal form KC
   (NFKC) first applies the compatibility decomposition, followed by the canonical
   composition.

   Even if two Unicode strings are normalized and look the same to
   a human reader, if one has combining characters and the other
   doesn't, they may not compare equal.


In addition, the module exposes the following constants:

.. data:: unidata_version

   The version of the Unicode database used in this module.


.. data:: ucd_3_2_0

   This is an object that has the same methods as the entire module, but uses the
   Unicode database version 3.2 instead, for applications that require this
   specific version of the Unicode database (such as IDNA).

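As a sketch of how the two databases differ, the following queries the same
character through the module and through :data:`ucd_3_2_0`. It assumes that
U+1E9E (LATIN CAPITAL LETTER SHARP S), which was added to Unicode after
version 3.2, is named in the module's current database. The variable names are
illustrative.

```python
import unicodedata

# The module-level value depends on the Python build in use.
print(unicodedata.unidata_version)
print(unicodedata.ucd_3_2_0.unidata_version)   # 3.2.0

sharp_s = '\u1e9e'
# Assumption: the current database is recent enough to name U+1E9E.
current_name = unicodedata.name(sharp_s, None)
# The 3.2 database has no entry for it, so the default is returned.
legacy_name = unicodedata.ucd_3_2_0.name(sharp_s, None)

print(current_name)   # LATIN CAPITAL LETTER SHARP S
print(legacy_name)    # None
```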

Examples:

   >>> import unicodedata
   >>> unicodedata.lookup('LEFT CURLY BRACKET')
   '{'
   >>> unicodedata.name('/')
   'SOLIDUS'
   >>> unicodedata.decimal('9')
   9
   >>> unicodedata.decimal('a')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: not a decimal
   >>> unicodedata.category('A')  # 'L'etter, 'u'ppercase
   'Lu'
   >>> unicodedata.bidirectional('\u0660')  # 'A'rabic, 'N'umber
   'AN'
