:mod:`unicodedata` --- Unicode Database
=======================================

.. module:: unicodedata
   :synopsis: Access the Unicode Database.
.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: character
   pair: Unicode; database

This module provides access to the Unicode Character Database, which defines
character properties for all Unicode characters. The data in this database is
based on version 5.2.0 of the :file:`UnicodeData.txt` file, which is publicly
available from ftp://ftp.unicode.org/.

The module uses the same names and symbols as defined by the UnicodeData File
Format 5.2.0 (see http://www.unicode.org/reports/tr44/tr44-4.html).
It defines the following functions:


.. function:: lookup(name)

   Look up a character by name. If a character with the given name is found,
   return the corresponding Unicode character. If not found, :exc:`KeyError` is
   raised.
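
   As a minimal sketch of handling the :exc:`KeyError` case, the hypothetical
   helper ``char_or_none`` below (not part of the module) returns ``None`` for
   an unknown name:

   .. code-block:: python

      import unicodedata

      def char_or_none(name):
          """Return the character called *name*, or None if there is none."""
          try:
              return unicodedata.lookup(name)
          except KeyError:
              # lookup() signals an unknown name with KeyError.
              return None

      left_brace = char_or_none('LEFT CURLY BRACKET')    # the character '{'
      missing = char_or_none('NO SUCH CHARACTER NAME')   # None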


.. function:: name(unichr[, default])

   Returns the name assigned to the Unicode character *unichr* as a string. If no
   name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
   raised.
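
   The following sketch shows both behaviours for a character without a name.
   It assumes that the control character U+0000 has no name in the database;
   the ``u'<unnamed>'`` fallback is an arbitrary value chosen for illustration:

   .. code-block:: python

      import unicodedata

      solidus = unicodedata.name(u'/')   # 'SOLIDUS'

      # With a default, no exception is raised for an unnamed character.
      fallback = unicodedata.name(u'\x00', u'<unnamed>')

      # Without a default, ValueError is raised instead.
      raised = False
      try:
          unicodedata.name(u'\x00')
      except ValueError:
          raised = True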


.. function:: decimal(unichr[, default])

   Returns the decimal value assigned to the Unicode character *unichr* as an
   integer. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: digit(unichr[, default])

   Returns the digit value assigned to the Unicode character *unichr* as an
   integer. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: numeric(unichr[, default])

   Returns the numeric value assigned to the Unicode character *unichr* as a
   float. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.
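
   The three numeric properties are not interchangeable. The sketch below
   compares :func:`decimal`, :func:`digit` and :func:`numeric` on three
   characters, passing ``None`` as *default* so that a missing value does not
   raise. The helper name ``numeric_properties`` is hypothetical:

   .. code-block:: python

      import unicodedata

      def numeric_properties(char):
          """Return (decimal, digit, numeric) for *char*, None where undefined."""
          return (unicodedata.decimal(char, None),
                  unicodedata.digit(char, None),
                  unicodedata.numeric(char, None))

      nine = numeric_properties(u'9')                  # (9, 9, 9.0)
      superscript_two = numeric_properties(u'\u00b2')  # (None, 2, 2.0)
      one_half = numeric_properties(u'\u00bd')         # (None, None, 0.5)

   A plain digit has all three values, SUPERSCRIPT TWO has a digit value but no
   decimal value, and VULGAR FRACTION ONE HALF has only a numeric value.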


.. function:: category(unichr)

   Returns the general category assigned to the Unicode character *unichr* as
   a string.
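
   For illustration, the general categories of a few ASCII characters can be
   collected like this (the sample text is arbitrary):

   .. code-block:: python

      import unicodedata

      text = u'Ab 1.'
      categories = [unicodedata.category(ch) for ch in text]
      # Uppercase letter, lowercase letter, space separator,
      # decimal digit number, other punctuation:
      # ['Lu', 'Ll', 'Zs', 'Nd', 'Po']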


.. function:: bidirectional(unichr)

   Returns the bidirectional class assigned to the Unicode character *unichr* as
   a string. If no such value is defined, an empty string is returned.
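
   A short sketch with one character from each of four common classes (the
   choice of sample characters is ours):

   .. code-block:: python

      import unicodedata

      # Latin A, Hebrew alef, ASCII digit one, Arabic-Indic digit zero.
      samples = [u'A', u'\u05d0', u'1', u'\u0660']
      classes = [unicodedata.bidirectional(ch) for ch in samples]
      # Left-to-right, right-to-left, European number, Arabic number:
      # ['L', 'R', 'EN', 'AN']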


.. function:: combining(unichr)

   Returns the canonical combining class assigned to the Unicode character
   *unichr* as an integer. Returns ``0`` if no combining class is defined.
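
   As a sketch, the combining class can be used to count the characters in a
   string that attach to a preceding base character. The helper
   ``count_combining`` is hypothetical, and it only counts characters with a
   non-zero combining class, which is not the same as every mark character:

   .. code-block:: python

      import unicodedata

      base = unicodedata.combining(u'C')            # 0
      cedilla = unicodedata.combining(u'\u0327')    # 202, attached below
      acute = unicodedata.combining(u'\u0301')      # 230, above

      def count_combining(text):
          """Count characters in *text* with a non-zero combining class."""
          return sum(1 for ch in text if unicodedata.combining(ch) != 0)

      marks = count_combining(u'C\u0327a\u0301')    # 2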


.. function:: east_asian_width(unichr)

   Returns the East Asian width assigned to the Unicode character *unichr* as
   a string.

   .. versionadded:: 2.4
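
   One common use is a rough estimate of how many terminal columns a string
   occupies. The hypothetical helper ``display_width`` below is only an
   approximation: it counts wide (``'W'``) and fullwidth (``'F'``) characters
   as two columns and everything else as one, ignoring combining characters
   and ambiguous-width characters:

   .. code-block:: python

      import unicodedata

      narrow = unicodedata.east_asian_width(u'A')          # 'Na'
      wide = unicodedata.east_asian_width(u'\u3042')       # 'W' (HIRAGANA LETTER A)
      fullwidth = unicodedata.east_asian_width(u'\uff21')  # 'F' (FULLWIDTH LATIN CAPITAL LETTER A)

      def display_width(text):
          """Approximate the column width of *text* (see caveats above)."""
          return sum(2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
                     for ch in text)

      columns = display_width(u'A\u3042\uff21')            # 1 + 2 + 2 = 5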


.. function:: mirrored(unichr)

   Returns the mirrored property assigned to the Unicode character *unichr* as
   an integer. Returns ``1`` if the character has been identified as a "mirrored"
   character in bidirectional text, ``0`` otherwise.
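
   For example, a parenthesis is displayed mirrored in right-to-left text,
   while a letter is not:

   .. code-block:: python

      import unicodedata

      paren = unicodedata.mirrored(u'(')    # 1
      letter = unicodedata.mirrored(u'a')   # 0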


.. function:: decomposition(unichr)

   Returns the character decomposition mapping assigned to the Unicode character
   *unichr* as a string. An empty string is returned if no such mapping is
   defined.
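
   The mapping is returned as space-separated hexadecimal code points,
   optionally preceded by a tag in angle brackets for compatibility
   decompositions. The sketch below shows three cases and a hypothetical
   helper, ``decomposed_code_points``, that parses the string into integers:

   .. code-block:: python

      import unicodedata

      c_cedilla = unicodedata.decomposition(u'\u00c7')  # '0043 0327'
      one_half = unicodedata.decomposition(u'\u00bd')   # '<fraction> 0031 2044 0032'
      plain_a = unicodedata.decomposition(u'a')         # ''

      def decomposed_code_points(char):
          """Return the code points of the mapping for *char*, without the tag."""
          fields = unicodedata.decomposition(char).split()
          return [int(field, 16) for field in fields
                  if not field.startswith('<')]

      code_points = decomposed_code_points(u'\u00c7')   # [0x43, 0x327]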


.. function:: normalize(form, unistr)

   Return the normal form *form* for the Unicode string *unistr*. Valid values for
   *form* are ``'NFC'``, ``'NFKC'``, ``'NFD'``, and ``'NFKD'``.

   The Unicode standard defines various normalization forms of a Unicode string,
   based on the definition of canonical equivalence and compatibility equivalence.
   In Unicode, several characters can be expressed in various ways. For example, the
   character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
   the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

   For each character, there are two normal forms: normal form C and normal form D.
   Normal form D (NFD) is also known as canonical decomposition, and translates
   each character into its decomposed form. Normal form C (NFC) first applies a
   canonical decomposition, then composes pre-combined characters again.
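
   Using the C-with-cedilla example from above, the two canonical forms can be
   sketched as follows (the variable names are ours):

   .. code-block:: python

      import unicodedata

      composed = u'\u00c7'      # LATIN CAPITAL LETTER C WITH CEDILLA
      decomposed = u'C\u0327'   # LATIN CAPITAL LETTER C + COMBINING CEDILLA

      # NFD splits the pre-combined character into base plus combining mark.
      nfd = unicodedata.normalize('NFD', composed)

      # NFC decomposes first, then recombines into the single character.
      nfc = unicodedata.normalize('NFC', decomposed)

   Here ``nfd`` equals ``decomposed`` (two code points) and ``nfc`` equals
   ``composed`` (one code point).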

   In addition to these two forms, there are two additional normal forms based on
   compatibility equivalence. In Unicode, certain characters are supported which
   normally would be unified with other characters. For example, U+2160 (ROMAN
   NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
   However, it is supported in Unicode for compatibility with existing character
   sets (e.g. gb2312).

   The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
   replace all compatibility characters with their equivalents. The normal form KC
   (NFKC) first applies the compatibility decomposition, followed by the canonical
   composition.
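
   A brief sketch of the difference, using the ROMAN NUMERAL ONE example from
   above: the canonical forms leave the character alone, while the
   compatibility forms replace it with the Latin letter.

   .. code-block:: python

      import unicodedata

      roman_one = u'\u2160'     # ROMAN NUMERAL ONE

      # Canonical normalization keeps the compatibility character.
      nfc = unicodedata.normalize('NFC', roman_one)

      # Compatibility normalization replaces it with U+0049.
      nfkd = unicodedata.normalize('NFKD', roman_one)
      nfkc = unicodedata.normalize('NFKC', roman_one)

      # NFKC still composes canonically after decomposing.
      recomposed = unicodedata.normalize('NFKC', u'C\u0327')

   After this, ``nfc`` is unchanged, ``nfkd`` and ``nfkc`` are both ``u'I'``,
   and ``recomposed`` is the single character U+00C7.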

   Even if two Unicode strings are normalized and look the same to a human
   reader, they may not compare equal if one contains combining characters and
   the other doesn't.
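
   One way to sidestep this, sketched below, is to bring both strings to the
   same normal form before comparing them. The helper ``same_text`` and its
   default of ``'NFC'`` are assumptions made for illustration, not part of the
   module:

   .. code-block:: python

      import unicodedata

      def same_text(first, second, form='NFC'):
          """Compare two Unicode strings after normalizing both to *form*."""
          return (unicodedata.normalize(form, first) ==
                  unicodedata.normalize(form, second))

      composed = u'\u00c7'      # already in NFC
      decomposed = u'C\u0327'   # already in NFD

      naive = (composed == decomposed)           # False
      normalized = same_text(composed, decomposed)   # True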

   .. versionadded:: 2.3

In addition, the module exposes the following constant:


.. data:: unidata_version

   The version of the Unicode database used in this module.

   .. versionadded:: 2.3
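
   The value is a dotted version string. As a sketch, it can be turned into a
   tuple of integers for comparisons; the names ``version_info`` and
   ``has_5_2`` are ours, and the code assumes the string consists only of
   dot-separated integers:

   .. code-block:: python

      import unicodedata

      version_info = tuple(int(part)
                           for part in unicodedata.unidata_version.split('.'))

      # True when the database is at least version 5.2.0.
      has_5_2 = version_info >= (5, 2, 0)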


.. data:: ucd_3_2_0

   This is an object that has the same methods as the entire module, but uses the
   Unicode database version 3.2 instead, for applications that require this
   specific version of the Unicode database (such as IDNA).

   .. versionadded:: 2.5
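
   The sketch below contrasts the two databases. It relies on the assumption
   that U+0221 (LATIN SMALL LETTER D WITH CURL) was assigned after Unicode 3.2,
   so the old database reports it as unassigned:

   .. code-block:: python

      import unicodedata

      old = unicodedata.ucd_3_2_0

      old_version = old.unidata_version                  # '3.2.0'

      # Same method, different database snapshot.
      new_category = unicodedata.category(u'\u0221')     # 'Ll'
      old_category = old.category(u'\u0221')             # 'Cn' (unassigned)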

Examples:

   >>> import unicodedata
   >>> unicodedata.lookup('LEFT CURLY BRACKET')
   u'{'
   >>> unicodedata.name(u'/')
   'SOLIDUS'
   >>> unicodedata.decimal(u'9')
   9
   >>> unicodedata.decimal(u'a')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: not a decimal
   >>> unicodedata.category(u'A')  # 'L'etter, 'u'ppercase
   'Lu'
   >>> unicodedata.bidirectional(u'\u0660')  # 'A'rabic, 'N'umber
   'AN'