:mod:`unicodedata` --- Unicode Database
=======================================

.. module:: unicodedata
   :synopsis: Access the Unicode Database.

.. moduleauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>

.. index::
   single: Unicode
   single: character
   pair: Unicode; database

--------------

This module provides access to the Unicode Character Database (UCD), which
defines character properties for all Unicode characters. The data contained in
this database is compiled from the `UCD version 11.0.0
<http://www.unicode.org/Public/11.0.0/ucd>`_.

The module uses the same names and symbols as defined by Unicode
Standard Annex #44, `"Unicode Character Database"
<http://www.unicode.org/reports/tr44/tr44-6.html>`_. It defines the
following functions:


.. function:: lookup(name)

   Look up a character by name. If a character with the given name is found,
   return the corresponding character. If not found, :exc:`KeyError` is raised.

   .. versionchanged:: 3.3
      Support for name aliases [#]_ and named sequences [#]_ has been added.

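As a small illustration of :func:`lookup`, the sketch below resolves names to
characters and turns the :exc:`KeyError` case into a default value. The helper
``safe_lookup`` is a hypothetical name used only for this example; the named
sequence shown is taken from the Unicode ``NamedSequences.txt`` file and
assumes Python 3.3 or later.

```python
import unicodedata

def safe_lookup(name, default=None):
    """Return the character called *name*, or *default* if the name is unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return default

# A regular character name resolves to a single character.
cedilla = safe_lookup('LATIN CAPITAL LETTER C WITH CEDILLA')   # U+00C7

# A named sequence resolves to a string of more than one code point.
r_tilde = safe_lookup('LATIN SMALL LETTER R WITH TILDE')       # 'r' + U+0303

# An unknown name falls back to the default instead of raising KeyError.
missing = safe_lookup('NO SUCH CHARACTER NAME')
```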

.. function:: name(chr[, default])

   Returns the name assigned to the character *chr* as a string. If no
   name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
   raised.

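The following sketch shows :func:`name` with and without a *default*, and
checks that a name obtained this way can be passed back to :func:`lookup`.
The placeholder string ``'<unnamed>'`` is an arbitrary choice for the example.

```python
import unicodedata

# A character with an assigned name.
cedilla_name = unicodedata.name('\u00c7')

# Control characters such as the line feed have no name, so the default is used.
newline_name = unicodedata.name('\n', '<unnamed>')

# Without a default, the missing name is reported as ValueError.
try:
    unicodedata.name('\n')
    raised = False
except ValueError:
    raised = True

# name() and lookup() are inverses for characters that have a name.
round_trip = unicodedata.lookup(unicodedata.name('/'))
```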

.. function:: decimal(chr[, default])

   Returns the decimal value assigned to the character *chr* as an integer.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: digit(chr[, default])

   Returns the digit value assigned to the character *chr* as an integer.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: numeric(chr[, default])

   Returns the numeric value assigned to the character *chr* as a float.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.

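The three functions :func:`decimal`, :func:`digit` and :func:`numeric` cover
progressively larger sets of characters. The sketch below compares them on a
few characters, passing ``None`` as *default* so that missing values do not
raise. The helper ``numeric_properties`` is a hypothetical name.

```python
import unicodedata

def numeric_properties(ch):
    """Return (decimal, digit, numeric) for *ch*, with None for undefined values."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

nine = numeric_properties('9')                  # all three values are defined
superscript_two = numeric_properties('\u00b2')  # SUPERSCRIPT TWO: no decimal value
one_half = numeric_properties('\u00bd')         # VULGAR FRACTION ONE HALF: numeric only
letter = numeric_properties('a')                # no numeric properties at all
```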

.. function:: category(chr)

   Returns the general category assigned to the character *chr* as a
   string.

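General categories are two-letter codes whose first letter gives the major
class (for instance ``L`` for letters and ``N`` for numbers). As a minimal
sketch, the hypothetical helper ``is_letter`` below tests only that first
letter.

```python
import unicodedata

# Map a few sample characters to their general category.
categories = {ch: unicodedata.category(ch) for ch in 'Aa1 !'}

def is_letter(ch):
    """Return True if the general category of *ch* is in the major class 'L'."""
    return unicodedata.category(ch).startswith('L')

letters_only = ''.join(ch for ch in 'a1 b2!' if is_letter(ch))
```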

.. function:: bidirectional(chr)

   Returns the bidirectional class assigned to the character *chr* as a
   string. If no such value is defined, an empty string is returned.

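The sketch below lists the bidirectional class of a few characters and uses
it for a rough right-to-left check. The helper ``contains_rtl`` is a
hypothetical name, and testing only the classes ``R`` and ``AL`` is a
simplification, not an implementation of the Unicode Bidirectional Algorithm.

```python
import unicodedata

bidi_classes = {
    'A': unicodedata.bidirectional('A'),            # Latin letter
    '1': unicodedata.bidirectional('1'),            # European number
    '\u05d0': unicodedata.bidirectional('\u05d0'),  # HEBREW LETTER ALEF
    '\u0627': unicodedata.bidirectional('\u0627'),  # ARABIC LETTER ALEF
    '\u0660': unicodedata.bidirectional('\u0660'),  # ARABIC-INDIC DIGIT ZERO
}

def contains_rtl(text):
    """Return True if any character has a strong right-to-left class."""
    return any(unicodedata.bidirectional(ch) in ('R', 'AL') for ch in text)
```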

.. function:: combining(chr)

   Returns the canonical combining class assigned to the character *chr*
   as an integer. Returns ``0`` if no combining class is defined.

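A non-zero combining class identifies a combining mark. The sketch below
shows a few values and one common use: removing marks after a canonical
decomposition with :func:`normalize`. The helper ``strip_marks`` is a
hypothetical name, and dropping marks this way is an illustration only, since
it changes the meaning of text in many languages.

```python
import unicodedata

cedilla_class = unicodedata.combining('\u0327')  # COMBINING CEDILLA
acute_class = unicodedata.combining('\u0301')    # COMBINING ACUTE ACCENT
base_class = unicodedata.combining('C')          # ordinary base letter

def strip_marks(text):
    """Decompose *text* (NFD) and drop characters with a non-zero combining class."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
```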

.. function:: east_asian_width(chr)

   Returns the East Asian width assigned to the character *chr* as a
   string.

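The East Asian width is often used to estimate how many columns a string
occupies in a fixed-width display. The sketch below is a rough approximation
under stated assumptions: wide (``W``) and fullwidth (``F``) characters count
as two columns and everything else as one, ignoring combining marks and
ambiguous-width characters. The helper ``display_width`` is a hypothetical
name.

```python
import unicodedata

widths = {
    'A': unicodedata.east_asian_width('A'),            # narrow
    '\uff21': unicodedata.east_asian_width('\uff21'),  # FULLWIDTH LATIN CAPITAL LETTER A
    '\u4e2d': unicodedata.east_asian_width('\u4e2d'),  # CJK ideograph
}

def display_width(text):
    """Estimate display columns: 2 for wide/fullwidth characters, 1 otherwise."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
        for ch in text
    )
```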

.. function:: mirrored(chr)

   Returns the mirrored property assigned to the character *chr* as an
   integer. Returns ``1`` if the character has been identified as a "mirrored"
   character in bidirectional text, ``0`` otherwise.

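Paired punctuation such as brackets is typically mirrored, while letters are
not. A minimal sketch, with ``mirrored_chars`` as a hypothetical helper name:

```python
import unicodedata

paren_flag = unicodedata.mirrored('(')   # brackets are mirrored in right-to-left text
letter_flag = unicodedata.mirrored('a')  # letters are not

def mirrored_chars(text):
    """Return the characters of *text* whose mirrored property is set."""
    return [ch for ch in text if unicodedata.mirrored(ch)]
```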

.. function:: decomposition(chr)

   Returns the character decomposition mapping assigned to the character
   *chr* as a string. An empty string is returned if no such mapping is
   defined.

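The returned mapping is a space-separated list of hexadecimal code points,
optionally preceded by a tag in angle brackets for compatibility mappings.
The sketch below splits that string into its parts; ``parse_decomposition``
is a hypothetical helper name.

```python
import unicodedata

canonical = unicodedata.decomposition('\u00c7')      # C WITH CEDILLA
compatibility = unicodedata.decomposition('\u2160')  # ROMAN NUMERAL ONE
undefined = unicodedata.decomposition('A')           # no mapping defined

def parse_decomposition(ch):
    """Return (tag, characters) for the decomposition mapping of *ch*.

    *tag* is None for canonical mappings and for characters without a mapping.
    """
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith('<'):
        tag = fields.pop(0)
    return tag, [chr(int(code, 16)) for code in fields]
```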

.. function:: normalize(form, unistr)

   Return the normal form *form* for the Unicode string *unistr*. Valid values for
   *form* are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

   The Unicode standard defines various normalization forms of a Unicode string,
   based on the definition of canonical equivalence and compatibility equivalence.
   In Unicode, several characters can be expressed in various ways. For example, the
   character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
   the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

   For each character, there are two normal forms: normal form C and normal form D.
   Normal form D (NFD) is also known as canonical decomposition, and translates
   each character into its decomposed form. Normal form C (NFC) first applies a
   canonical decomposition, then composes pre-combined characters again.

   In addition to these two forms, there are two additional normal forms based on
   compatibility equivalence. In Unicode, certain characters are supported which
   normally would be unified with other characters. For example, U+2160 (ROMAN
   NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
   However, it is supported in Unicode for compatibility with existing character
   sets (e.g. gb2312).

   The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
   replace all compatibility characters with their equivalents. The normal form KC
   (NFKC) first applies the compatibility decomposition, followed by the canonical
   composition.

   Even if two Unicode strings are normalized and look the same to
   a human reader, if one has combining characters and the other
   doesn't (for example, because they were normalized to different forms),
   they may not compare equal.

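The normalization forms described for :func:`normalize` can be sketched with
the two examples from the text, U+00C7 and U+2160. The helper
``canonically_equal`` is a hypothetical name; comparing strings after
normalizing both to NFC is one reasonable choice, not the only one.

```python
import unicodedata

composed = '\u00c7'     # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = 'C\u0327'  # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# Canonical forms: NFD decomposes, NFC decomposes and then recomposes.
nfd = unicodedata.normalize('NFD', composed)
nfc = unicodedata.normalize('NFC', decomposed)

# Compatibility forms also replace compatibility characters.
roman_one = '\u2160'    # ROMAN NUMERAL ONE
nfkd = unicodedata.normalize('NFKD', roman_one)
nfkc = unicodedata.normalize('NFKC', roman_one)
nfc_roman = unicodedata.normalize('NFC', roman_one)  # canonical forms keep it

# The two spellings look the same but are different strings.
naive_equal = (composed == decomposed)

def canonically_equal(a, b):
    """Compare two strings after normalizing both to the same form (NFC)."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)
```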

In addition, the module exposes the following constants:

.. data:: unidata_version

   The version of the Unicode database used in this module.


.. data:: ucd_3_2_0

   This is an object that has the same methods as the entire module, but uses the
   Unicode database version 3.2 instead, for applications that require this
   specific version of the Unicode database (such as IDNA).

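The sketch below compares the module-level database with ``ucd_3_2_0``. The
value of ``unidata_version`` depends on the Python build in use, so it is not
shown here. The example assumes a Python version whose database includes
U+1F600 (GRINNING FACE), a character added to Unicode after version 3.2.

```python
import unicodedata

current_version = unicodedata.unidata_version           # depends on the Python build
legacy_version = unicodedata.ucd_3_2_0.unidata_version  # fixed at version 3.2

emoji = '\U0001F600'  # GRINNING FACE, not assigned in Unicode 3.2

# The current database knows the character; the 3.2 database does not,
# so the default (None) is returned there.
current_name = unicodedata.name(emoji, None)
legacy_name = unicodedata.ucd_3_2_0.name(emoji, None)
```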

Examples:

   >>> import unicodedata
   >>> unicodedata.lookup('LEFT CURLY BRACKET')
   '{'
   >>> unicodedata.name('/')
   'SOLIDUS'
   >>> unicodedata.decimal('9')
   9
   >>> unicodedata.decimal('a')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: not a decimal
   >>> unicodedata.category('A')  # 'L'etter, 'u'ppercase
   'Lu'
   >>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
   'AN'


.. rubric:: Footnotes

.. [#] http://www.unicode.org/Public/11.0.0/ucd/NameAliases.txt

.. [#] http://www.unicode.org/Public/11.0.0/ucd/NamedSequences.txt