:mod:`unicodedata` --- Unicode Database
=======================================

.. module:: unicodedata
   :synopsis: Access the Unicode Database.
.. moduleauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: character
   pair: Unicode; database

This module provides access to the Unicode Character Database (UCD), which
defines character properties for all Unicode characters. The data contained in
this database is compiled from the `UCD version 6.2.0
<http://www.unicode.org/Public/6.2.0/ucd>`_.

The module uses the same names and symbols as defined by Unicode
Standard Annex #44, `"Unicode Character Database"
<http://www.unicode.org/reports/tr44/tr44-6.html>`_. It defines the
following functions:


.. function:: lookup(name)

   Look up a character by name. If a character with the given name is found,
   return the corresponding character. If not found, :exc:`KeyError` is raised.

   .. versionchanged:: 3.3
      Support for name aliases [#]_ and named sequences [#]_ has been added.

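   The name aliases and named sequences mentioned above can be illustrated
   with a minimal sketch. The specific names used here are assumptions taken
   from the Unicode data files referenced in the footnotes, and the sketch
   assumes Python 3.3 or later:

   .. code-block:: python

      import unicodedata

      # A regular character name resolves to a single character.
      brace = unicodedata.lookup('LEFT CURLY BRACKET')

      # A name alias (a formal correction of the name of U+01A2).
      gha = unicodedata.lookup('LATIN CAPITAL LETTER GHA')

      # A named sequence resolves to a string of more than one code point.
      r_tilde = unicodedata.lookup('LATIN SMALL LETTER R WITH TILDE')

      # An unknown name raises KeyError.
      try:
          unicodedata.lookup('NO SUCH CHARACTER NAME')
          found = True
      except KeyError:
          found = False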

.. function:: name(chr[, default])

   Returns the name assigned to the character *chr* as a string. If no
   name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
   raised.

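   As a brief illustration of the *default* argument (the fallback string
   ``'<unnamed>'`` is an arbitrary choice for this sketch):

   .. code-block:: python

      import unicodedata

      named = unicodedata.name('\u00e9')

      # Control characters such as the line feed have no name in the database,
      # so the supplied default is returned instead of raising ValueError.
      fallback = unicodedata.name('\n', '<unnamed>')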

.. function:: decimal(chr[, default])

   Returns the decimal value assigned to the character *chr* as an integer.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: digit(chr[, default])

   Returns the digit value assigned to the character *chr* as an integer.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: numeric(chr[, default])

   Returns the numeric value assigned to the character *chr* as a float.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.

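   The difference between :func:`decimal`, :func:`digit` and :func:`numeric`
   is easiest to see side by side. The helper ``numeric_properties`` below is
   hypothetical and not part of the module; it is only a sketch that passes
   ``None`` as *default* to avoid exceptions:

   .. code-block:: python

      import unicodedata

      def numeric_properties(ch):
          """Return (decimal, digit, numeric) for ch, with None where undefined."""
          return (unicodedata.decimal(ch, None),
                  unicodedata.digit(ch, None),
                  unicodedata.numeric(ch, None))

      # DIGIT NINE has all three values.
      nine = numeric_properties('9')
      # SUPERSCRIPT TWO has a digit and a numeric value, but no decimal value.
      superscript = numeric_properties('\u00b2')
      # VULGAR FRACTION ONE HALF only has a numeric value.
      half = numeric_properties('\u00bd')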

.. function:: category(chr)

   Returns the general category assigned to the character *chr* as a
   string.


.. function:: bidirectional(chr)

   Returns the bidirectional class assigned to the character *chr* as a
   string. If no such value is defined, an empty string is returned.


.. function:: combining(chr)

   Returns the canonical combining class assigned to the character *chr*
   as an integer. Returns ``0`` if no combining class is defined.

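   For instance, combining marks carry a nonzero class while ordinary letters
   do not. This is a small sketch, and the characters are chosen only for
   illustration:

   .. code-block:: python

      import unicodedata

      acute = unicodedata.combining('\u0301')    # COMBINING ACUTE ACCENT
      cedilla = unicodedata.combining('\u0327')  # COMBINING CEDILLA
      letter = unicodedata.combining('a')        # no combining class defined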

.. function:: east_asian_width(chr)

   Returns the East Asian width assigned to the character *chr* as a
   string.

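   A short sketch of typical return values follows. The abbreviations are
   those of the Unicode East Asian Width property, and the sample characters
   are chosen only for illustration:

   .. code-block:: python

      import unicodedata

      narrow = unicodedata.east_asian_width('A')       # narrow
      full = unicodedata.east_asian_width('\uff21')    # FULLWIDTH LATIN CAPITAL LETTER A
      wide = unicodedata.east_asian_width('\u4e2d')    # a CJK ideograph
      half = unicodedata.east_asian_width('\uff71')    # HALFWIDTH KATAKANA LETTER A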

.. function:: mirrored(chr)

   Returns the mirrored property assigned to the character *chr* as an
   integer. Returns ``1`` if the character has been identified as a "mirrored"
   character in bidirectional text, ``0`` otherwise.

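   As a small sketch, paired punctuation such as parentheses is typically
   mirrored, while letters are not:

   .. code-block:: python

      import unicodedata

      opening = unicodedata.mirrored('(')  # LEFT PARENTHESIS is mirrored
      letter = unicodedata.mirrored('a')   # a plain letter is not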

.. function:: decomposition(chr)

   Returns the character decomposition mapping assigned to the character
   *chr* as a string. An empty string is returned in case no such mapping is
   defined.

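   The returned string lists code points in hexadecimal. For compatibility
   mappings, the list is preceded by a tag in angle brackets. A minimal sketch,
   with characters chosen only for illustration:

   .. code-block:: python

      import unicodedata

      # Canonical mapping: C WITH CEDILLA decomposes to C + COMBINING CEDILLA.
      canonical = unicodedata.decomposition('\u00c7')
      # Compatibility mapping: ROMAN NUMERAL ONE maps to LATIN CAPITAL LETTER I.
      compat = unicodedata.decomposition('\u2160')
      # A plain letter has no decomposition mapping.
      plain = unicodedata.decomposition('a')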

.. function:: normalize(form, unistr)

   Return the normal form *form* for the Unicode string *unistr*. Valid values for
   *form* are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

   The Unicode standard defines various normalization forms of a Unicode string,
   based on the definition of canonical equivalence and compatibility equivalence.
   In Unicode, several characters can be expressed in various ways. For example, the
   character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
   the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

   For each character, there are two normal forms: normal form C and normal form D.
   Normal form D (NFD) is also known as canonical decomposition, and translates
   each character into its decomposed form. Normal form C (NFC) first applies a
   canonical decomposition, then composes pre-combined characters again.

   In addition to these two forms, there are two additional normal forms based on
   compatibility equivalence. In Unicode, certain characters are supported which
   normally would be unified with other characters. For example, U+2160 (ROMAN
   NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
   However, it is supported in Unicode for compatibility with existing character
   sets (e.g. gb2312).

   The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
   replace all compatibility characters with their equivalents. The normal form KC
   (NFKC) first applies the compatibility decomposition, followed by the canonical
   composition.

   Even if two Unicode strings are normalized and look the same to
   a human reader, if one has combining characters and the other
   doesn't, they may not compare equal.

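   The forms described above can be sketched with the two example characters
   from the text. The variable names are illustrative only:

   .. code-block:: python

      import unicodedata

      composed = '\u00c7'     # LATIN CAPITAL LETTER C WITH CEDILLA
      decomposed = 'C\u0327'  # LATIN CAPITAL LETTER C + COMBINING CEDILLA

      # The two spellings look alike but are different code point sequences.
      equal_raw = composed == decomposed

      # NFD decomposes, NFC composes again.
      nfd = unicodedata.normalize('NFD', composed)
      nfc = unicodedata.normalize('NFC', decomposed)

      # Normalizing both sides to the same form makes the comparison reliable.
      equal_nfc = (unicodedata.normalize('NFC', composed)
                   == unicodedata.normalize('NFC', decomposed))

      # Only the compatibility forms replace ROMAN NUMERAL ONE with the letter I.
      roman_nfc = unicodedata.normalize('NFC', '\u2160')
      roman_nfkc = unicodedata.normalize('NFKC', '\u2160')

   Normalizing both operands to one form before comparing them is a common
   way to avoid the mismatch described in the last paragraph above.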

In addition, the module exposes the following constants:

.. data:: unidata_version

   The version of the Unicode database used in this module.


.. data:: ucd_3_2_0

   This is an object that has the same methods as the entire module, but uses the
   Unicode database version 3.2 instead, for applications that require this
   specific version of the Unicode database (such as IDNA).
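
   A minimal sketch of using the two constants together (the variable names
   are illustrative only):

   .. code-block:: python

      import unicodedata

      # Version string of the database used by the module-level functions.
      current = unicodedata.unidata_version

      # The legacy object reports its own, fixed database version.
      legacy = unicodedata.ucd_3_2_0.unidata_version

      # The legacy object offers the same functions as the module.
      legacy_category = unicodedata.ucd_3_2_0.category('A')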

Examples:

   >>> import unicodedata
   >>> unicodedata.lookup('LEFT CURLY BRACKET')
   '{'
   >>> unicodedata.name('/')
   'SOLIDUS'
   >>> unicodedata.decimal('9')
   9
   >>> unicodedata.decimal('a')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: not a decimal
   >>> unicodedata.category('A')  # 'L'etter, 'u'ppercase
   'Lu'
   >>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
   'AN'


.. rubric:: Footnotes

.. [#] http://www.unicode.org/Public/6.2.0/ucd/NameAliases.txt

.. [#] http://www.unicode.org/Public/6.2.0/ucd/NamedSequences.txt