:mod:`unicodedata` --- Unicode Database
=======================================

.. module:: unicodedata
   :synopsis: Access the Unicode Database.
.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: character
   pair: Unicode; database

This module provides access to the Unicode Character Database, which defines
character properties for all Unicode characters. The data in this database is
based on the :file:`UnicodeData.txt` file version 5.2.0, which is publicly
available from ftp://ftp.unicode.org/.

The module uses the same names and symbols as defined by the UnicodeData File
Format 5.2.0 (see http://www.unicode.org/reports/tr44/tr44-4.html).
It defines the following functions:


.. function:: lookup(name)

   Look up a character by name.  If a character with the given name is found,
   return the corresponding Unicode character.  If not found, :exc:`KeyError` is
   raised.

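A minimal sketch of handling a failed lookup. The helper name
``lookup_or_none`` and the nonexistent character name are hypothetical, chosen
only for illustration; they are not part of the module.

```python
import unicodedata

def lookup_or_none(name):
    """Return the character called *name*, or None if no such name exists."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # lookup() signals an unknown name with KeyError.
        return None

left_brace = lookup_or_none('LEFT CURLY BRACKET')    # the character '{'
missing = lookup_or_none('NO SUCH CHARACTER NAME')   # None
```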

.. function:: name(unichr[, default])

   Return the name assigned to the Unicode character *unichr* as a string. If no
   name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
   raised.

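The effect of *default* can be sketched as follows. The placeholder string
``'<unnamed>'`` is an arbitrary choice for this example.

```python
import unicodedata

solidus = unicodedata.name(u'/')   # 'SOLIDUS'

# Control characters such as LINE FEED have no name in the database,
# so passing a default avoids the ValueError.
fallback = unicodedata.name(u'\n', '<unnamed>')

try:
    unicodedata.name(u'\n')
    raised = False
except ValueError:
    raised = True
```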

.. function:: decimal(unichr[, default])

   Return the decimal value assigned to the Unicode character *unichr* as an
   integer. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: digit(unichr[, default])

   Return the digit value assigned to the Unicode character *unichr* as an
   integer. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: numeric(unichr[, default])

   Return the numeric value assigned to the Unicode character *unichr* as a
   float. If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.

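The three numeric properties differ in which characters they cover. The
following sketch compares them; ``numeric_properties`` is a hypothetical helper
written for this example, and ``None`` is passed as *default* to mark an
undefined value.

```python
import unicodedata

def numeric_properties(ch):
    """Return (decimal, digit, numeric) for *ch*, with None where undefined."""
    return (unicodedata.decimal(ch, None),
            unicodedata.digit(ch, None),
            unicodedata.numeric(ch, None))

nine = numeric_properties(u'9')                  # DIGIT NINE: all three defined
superscript_two = numeric_properties(u'\u00b2')  # SUPERSCRIPT TWO: no decimal value
one_half = numeric_properties(u'\u00bd')         # VULGAR FRACTION ONE HALF: numeric only
```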

.. function:: category(unichr)

   Return the general category assigned to the Unicode character *unichr* as a
   string.

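The first letter of a general category code identifies the major class (for
example ``L`` for letters and ``N`` for numbers). As an illustrative sketch,
with a sample string made up for this example, letters can be selected by
testing that first letter:

```python
import unicodedata

text = u'Hello, World 42!'

# Keep only characters whose general category starts with 'L' (letters).
letters = [ch for ch in text if unicodedata.category(ch).startswith('L')]
letters_only = u''.join(letters)

digit_category = unicodedata.category(u'4')   # 'Nd': number, decimal digit
space_category = unicodedata.category(u' ')   # 'Zs': separator, space
```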

.. function:: bidirectional(unichr)

   Return the bidirectional class assigned to the Unicode character *unichr* as
   a string. If no such value is defined, an empty string is returned.


.. function:: combining(unichr)

   Return the canonical combining class assigned to the Unicode character
   *unichr* as an integer. Return ``0`` if no combining class is defined.

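A short sketch of both properties, using characters chosen for illustration:

```python
import unicodedata

# Bidirectional class: 'L' is left-to-right, 'AN' is Arabic number.
latin_letter = unicodedata.bidirectional(u'A')
arabic_digit = unicodedata.bidirectional(u'\u0660')   # ARABIC-INDIC DIGIT ZERO

# Canonical combining class: non-zero only for combining marks.
cedilla_class = unicodedata.combining(u'\u0327')      # COMBINING CEDILLA
base_class = unicodedata.combining(u'C')              # ordinary base letter
```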

.. function:: east_asian_width(unichr)

   Return the East Asian width assigned to the Unicode character *unichr* as a
   string.

   .. versionadded:: 2.4

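One possible use is estimating how many terminal columns a string occupies.
The sketch below assumes that wide (``'W'``) and fullwidth (``'F'``) characters
take two columns and everything else takes one; ``display_width`` is a
hypothetical helper, and the rule ignores combining marks and ambiguous-width
characters.

```python
import unicodedata

def display_width(text):
    """Rough column count: 2 for wide/fullwidth characters, 1 otherwise."""
    return sum(2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
               for ch in text)

samples = [u'A', u'\uff21', u'\u4e2d']   # A, FULLWIDTH A, a CJK ideograph
widths = [unicodedata.east_asian_width(ch) for ch in samples]
columns = display_width(u'A\uff21\u4e2d')
```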

.. function:: mirrored(unichr)

   Return the mirrored property assigned to the Unicode character *unichr* as an
   integer. Return ``1`` if the character has been identified as a "mirrored"
   character in bidirectional text, ``0`` otherwise.


.. function:: decomposition(unichr)

   Return the character decomposition mapping assigned to the Unicode character
   *unichr* as a string. An empty string is returned if no such mapping is
   defined.

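The following sketch shows the shape of the values returned by ``mirrored`` and
``decomposition``. The decomposition string lists code points in hexadecimal,
preceded by a tag such as ``<compat>`` for compatibility mappings.

```python
import unicodedata

paren_mirrored = unicodedata.mirrored(u'(')    # parentheses are mirrored
letter_mirrored = unicodedata.mirrored(u'a')   # letters are not

canonical = unicodedata.decomposition(u'\u00c7')   # C WITH CEDILLA
compat = unicodedata.decomposition(u'\u2160')      # ROMAN NUMERAL ONE
undefined = unicodedata.decomposition(u'a')        # no mapping
```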

.. function:: normalize(form, unistr)

   Return the normal form *form* for the Unicode string *unistr*. Valid values for
   *form* are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

   The Unicode standard defines various normalization forms of a Unicode string,
   based on the definition of canonical equivalence and compatibility equivalence.
   In Unicode, several characters can be expressed in various ways. For example,
   the character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be
   expressed as the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING
   CEDILLA).

   For each character, there are two normal forms: normal form C and normal form D.
   Normal form D (NFD) is also known as canonical decomposition, and translates
   each character into its decomposed form. Normal form C (NFC) first applies a
   canonical decomposition, then composes pre-combined characters again.

   In addition to these two forms, there are two additional normal forms based on
   compatibility equivalence. In Unicode, certain characters are supported which
   normally would be unified with other characters. For example, U+2160 (ROMAN
   NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
   However, it is supported in Unicode for compatibility with existing character
   sets (e.g. gb2312).

   The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
   replace all compatibility characters with their equivalents. The normal form KC
   (NFKC) first applies the compatibility decomposition, followed by the canonical
   composition.

   Even if two Unicode strings look the same to a human reader, they may not
   compare equal if one contains combining characters and the other does not,
   for example because they were normalized to different forms.

   .. versionadded:: 2.3

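The normalization forms described for ``normalize`` can be sketched with the
characters mentioned there. The variable names are chosen for this example
only.

```python
import unicodedata

composed = u'\u00c7'           # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = u'\u0043\u0327'   # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# The two spellings look alike but are different strings.
equal_before = (composed == decomposed)

# NFD decomposes; NFC decomposes and then recomposes.
nfd = unicodedata.normalize('NFD', composed)
nfc = unicodedata.normalize('NFC', decomposed)

# Normalizing both sides to the same form makes them comparable.
equal_after = (unicodedata.normalize('NFC', composed) ==
               unicodedata.normalize('NFC', decomposed))

# NFKD replaces the compatibility character ROMAN NUMERAL ONE with 'I'.
roman_one = unicodedata.normalize('NFKD', u'\u2160')
```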
In addition, the module exposes the following constants:


.. data:: unidata_version

   The version of the Unicode database used in this module.

   .. versionadded:: 2.3


.. data:: ucd_3_2_0

   This is an object that has the same methods as the entire module, but uses the
   Unicode database version 3.2 instead, for applications that require this
   specific version of the Unicode database (such as IDNA).

   .. versionadded:: 2.5

Examples:

   >>> import unicodedata
   >>> unicodedata.lookup('LEFT CURLY BRACKET')
   u'{'
   >>> unicodedata.name(u'/')
   'SOLIDUS'
   >>> unicodedata.decimal(u'9')
   9
   >>> unicodedata.decimal(u'a')
   Traceback (most recent call last):
     File "<stdin>", line 1, in ?
   ValueError: not a decimal
   >>> unicodedata.category(u'A')  # 'L'etter, 'u'ppercase
   'Lu'
   >>> unicodedata.bidirectional(u'\u0660') # 'A'rabic, 'N'umber
   'AN'

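The ``ucd_3_2_0`` object described above can be used in place of the module
when the older database is required. A minimal sketch, with variable names
chosen for illustration:

```python
import unicodedata

current_version = unicodedata.unidata_version   # depends on the Python build

legacy = unicodedata.ucd_3_2_0
legacy_version = legacy.unidata_version          # always the 3.2 database

# The object offers the same functions as the module itself.
legacy_category = legacy.category(u'A')
legacy_nfd = legacy.normalize('NFD', u'\u00c7')
```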