:mod:`unicodedata` --- Unicode Database
=======================================

.. module:: unicodedata
   :synopsis: Access the Unicode Database.
.. moduleauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: character
   pair: Unicode; database

This module provides access to the Unicode Character Database (UCD), which
defines character properties for all Unicode characters. The data contained in
this database is compiled from the `UCD version 7.0.0
<http://www.unicode.org/Public/7.0.0/ucd>`_.

The module uses the same names and symbols as defined by Unicode
Standard Annex #44, `"Unicode Character Database"
<http://www.unicode.org/reports/tr44/tr44-6.html>`_.  It defines the
following functions:


.. function:: lookup(name)

   Look up a character by name.  If a character with the given name is found,
   return the corresponding character.  If not found, :exc:`KeyError` is raised.

   .. versionchanged:: 3.3
      Support for name aliases [#]_ and named sequences [#]_ has been added.

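The lookup behaviour described above can be sketched as follows. The helper
``lookup_or_none`` is a hypothetical name introduced only for this
illustration; it turns the :exc:`KeyError` into ``None``. The alias and named
sequence used here are taken from the Unicode data files and assume
Python 3.3 or later.

```python
import unicodedata


def lookup_or_none(name):
    """Return the string for *name*, or None if the name is unknown.

    Hypothetical helper, not part of the unicodedata module.
    """
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


# A regular character name.
brace = lookup_or_none('LEFT CURLY BRACKET')

# A name alias: a corrected name for U+01A2.
gha = lookup_or_none('LATIN CAPITAL LETTER GHA')

# A named sequence: the result is a string of two code points.
r_tilde = lookup_or_none('LATIN SMALL LETTER R WITH TILDE')

# An unknown name yields None instead of raising KeyError.
missing = lookup_or_none('NO SUCH CHARACTER NAME')
```

Note that a named sequence makes ``lookup()`` return a string longer than one
character, so callers should not assume ``len(result) == 1``.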

.. function:: name(chr[, default])

   Returns the name assigned to the character *chr* as a string. If no
   name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
   raised.

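As a small illustration of the *default* argument of ``name()``, the sketch
below names every character in a string. The function ``describe`` and the
placeholder ``'<unnamed>'`` are assumptions made for this example only.

```python
import unicodedata


def describe(text):
    """Return the Unicode name of each character in *text*.

    Characters without a name (for example control characters) are
    reported with a placeholder instead of raising ValueError.
    """
    return [unicodedata.name(ch, '<unnamed>') for ch in text]


names = describe('a/\n')
```

Without the second argument, ``unicodedata.name('\n')`` would raise
:exc:`ValueError`, because control characters have no name in the database.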

.. function:: decimal(chr[, default])

   Returns the decimal value assigned to the character *chr* as an integer.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: digit(chr[, default])

   Returns the digit value assigned to the character *chr* as an integer.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: numeric(chr[, default])

   Returns the numeric value assigned to the character *chr* as a float.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.

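The three functions above differ in which characters they accept. The
following sketch compares them side by side; ``numeric_properties`` is a
hypothetical helper, and ``None`` is passed as *default* so that a missing
value does not raise :exc:`ValueError`.

```python
import unicodedata


def numeric_properties(ch):
    """Return (decimal, digit, numeric) for *ch*, using None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


nine = numeric_properties('9')            # DIGIT NINE
superscript = numeric_properties('\u00b2')  # SUPERSCRIPT TWO
half = numeric_properties('\u00bd')       # VULGAR FRACTION ONE HALF
letter = numeric_properties('a')          # no numeric value at all
```

In this example an ordinary digit has all three values, the superscript has a
digit and a numeric value but no decimal value, and the fraction has only a
numeric value.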

.. function:: category(chr)

   Returns the general category assigned to the character *chr* as a
   string.

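One common use of the general category is filtering text by character class.
The sketch below keeps only letters by testing the first letter of the
two-letter category code; ``letters_only`` is a hypothetical helper name.

```python
import unicodedata


def letters_only(text):
    """Keep only characters whose general category starts with 'L' (letters)."""
    return ''.join(
        ch for ch in text if unicodedata.category(ch).startswith('L')
    )


categories = [unicodedata.category(ch) for ch in 'Aa1 !']
filtered = letters_only('Ab1 c!')
```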

.. function:: bidirectional(chr)

   Returns the bidirectional class assigned to the character *chr* as a
   string. If no such value is defined, an empty string is returned.


.. function:: combining(chr)

   Returns the canonical combining class assigned to the character *chr*
   as an integer. Returns ``0`` if no combining class is defined.


.. function:: east_asian_width(chr)

   Returns the East Asian width assigned to the character *chr* as a
   string.


.. function:: mirrored(chr)

   Returns the mirrored property assigned to the character *chr* as an
   integer. Returns ``1`` if the character has been identified as a "mirrored"
   character in bidirectional text, ``0`` otherwise.


.. function:: decomposition(chr)

   Returns the character decomposition mapping assigned to the character
   *chr* as a string. An empty string is returned if no such mapping is
   defined.

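The five property functions above can be queried together for a single
character. This is only a sketch; ``layout_properties`` and its dictionary
keys are names chosen for this example.

```python
import unicodedata


def layout_properties(ch):
    """Collect the layout-related properties of a single character."""
    return {
        'bidirectional': unicodedata.bidirectional(ch),
        'combining': unicodedata.combining(ch),
        'east_asian_width': unicodedata.east_asian_width(ch),
        'mirrored': unicodedata.mirrored(ch),
        'decomposition': unicodedata.decomposition(ch),
    }


cedilla = layout_properties('\u0327')    # COMBINING CEDILLA
c_cedilla = layout_properties('\u00c7')  # LATIN CAPITAL LETTER C WITH CEDILLA
paren = layout_properties('(')           # LEFT PARENTHESIS
```

The decomposition mapping is returned as a string of hexadecimal code points,
for example ``'0043 0327'`` for U+00C7, not as the decomposed characters
themselves.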

.. function:: normalize(form, unistr)

   Return the normal form *form* for the Unicode string *unistr*. Valid values for
   *form* are ``'NFC'``, ``'NFKC'``, ``'NFD'``, and ``'NFKD'``.

   The Unicode standard defines various normalization forms of a Unicode string,
   based on the definition of canonical equivalence and compatibility equivalence.
   In Unicode, several characters can be expressed in various ways. For example, the
   character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
   the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

   For each character, there are two normal forms: normal form C and normal form D.
   Normal form D (NFD) is also known as canonical decomposition, and translates
   each character into its decomposed form. Normal form C (NFC) first applies a
   canonical decomposition, then composes pre-combined characters again.

   In addition to these two forms, there are two additional normal forms based on
   compatibility equivalence. In Unicode, certain characters are supported which
   normally would be unified with other characters. For example, U+2160 (ROMAN
   NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
   However, it is supported in Unicode for compatibility with existing character
   sets (e.g. gb2312).

   The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
   replace all compatibility characters with their equivalents. The normal form KC
   (NFKC) first applies the compatibility decomposition, followed by the canonical
   composition.

   Even if two Unicode strings are normalized and look the same to
   a human reader, if one has combining characters and the other
   doesn't, they may not compare equal.

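The four normal forms can be illustrated with the two characters mentioned in
the description of ``normalize()``. The helper ``nfc_equal`` is a hypothetical
name used only in this sketch; it compares two strings after bringing both to
the same normal form.

```python
import unicodedata

composed = '\u00c7'     # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = 'C\u0327'  # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# The two spellings look the same but are different strings.
raw_equal = composed == decomposed

# NFD decomposes, NFC recomposes.
nfd = unicodedata.normalize('NFD', composed)
nfc = unicodedata.normalize('NFC', decomposed)

# Compatibility forms also replace compatibility characters.
roman_one = '\u2160'    # ROMAN NUMERAL ONE
nfc_roman = unicodedata.normalize('NFC', roman_one)
nfkc_roman = unicodedata.normalize('NFKC', roman_one)


def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)


normalized_equal = nfc_equal(composed, decomposed)
```

Normalizing both operands to the same form before comparing is one way to
avoid the mismatch between composed and decomposed spellings.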

In addition, the module exposes the following constants:

.. data:: unidata_version

   The version of the Unicode database used in this module.


.. data:: ucd_3_2_0

   This is an object that has the same methods as the entire module, but uses the
   Unicode database version 3.2 instead, for applications that require this
   specific version of the Unicode database (such as IDNA).

Examples:

   >>> import unicodedata
   >>> unicodedata.lookup('LEFT CURLY BRACKET')
   '{'
   >>> unicodedata.name('/')
   'SOLIDUS'
   >>> unicodedata.decimal('9')
   9
   >>> unicodedata.decimal('a')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: not a decimal
   >>> unicodedata.category('A')  # 'L'etter, 'u'ppercase
   'Lu'
   >>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
   'AN'

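The ``ucd_3_2_0`` object can be used in the same way as the module itself.
The following sketch compares the two databases for U+1F600 (GRINNING FACE),
a character that was added to Unicode after version 3.2; the variable names
are chosen for this example only.

```python
import unicodedata

face = '\U0001F600'  # GRINNING FACE, not assigned in Unicode 3.2

# Version strings of the current and the 3.2 database.
current_version = unicodedata.unidata_version
legacy_version = unicodedata.ucd_3_2_0.unidata_version

# The current database knows the character ...
current_name = unicodedata.name(face, None)
current_category = unicodedata.category(face)

# ... while the 3.2 database treats it as unassigned.
legacy_name = unicodedata.ucd_3_2_0.name(face, None)
legacy_category = unicodedata.ucd_3_2_0.category(face)
```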

.. rubric:: Footnotes

.. [#] http://www.unicode.org/Public/7.0.0/ucd/NameAliases.txt

.. [#] http://www.unicode.org/Public/7.0.0/ucd/NamedSequences.txt