:mod:`unicodedata` --- Unicode Database
=======================================

.. module:: unicodedata
   :synopsis: Access the Unicode Database.
.. moduleauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-André Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: character
   pair: Unicode; database

This module provides access to the Unicode Character Database (UCD), which
defines character properties for all Unicode characters. The data contained in
this database is compiled from the `UCD version 6.2.0
<http://www.unicode.org/Public/6.2.0/ucd>`_.

The module uses the same names and symbols as defined by Unicode
Standard Annex #44, `"Unicode Character Database"
<http://www.unicode.org/reports/tr44/tr44-6.html>`_. It defines the
following functions:


.. function:: lookup(name)

   Look up a character by name. If a character with the given name is found,
   return the corresponding character. If not found, :exc:`KeyError` is raised.

   .. versionchanged:: 3.3
      Support for name aliases [#]_ and named sequences [#]_ has been added.

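As a minimal sketch of the :exc:`KeyError` behaviour described for ``lookup()``, the snippet below wraps the call in a small helper. The helper name ``lookup_or_none`` is hypothetical and not part of the module.

```python
import unicodedata

# Look up a character by its Unicode name.
brace = unicodedata.lookup('LEFT CURLY BRACKET')


def lookup_or_none(name):
    """Return the character called *name*, or None if the name is unknown."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # lookup() signals an unknown name with KeyError.
        return None


missing = lookup_or_none('NO SUCH CHARACTER NAME')
print(repr(brace), missing)
```

Returning ``None`` is only one possible policy; callers that need to distinguish errors may prefer to let the exception propagate.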

.. function:: name(chr[, default])

   Return the name assigned to the character *chr* as a string. If no
   name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
   raised.

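The *default* argument of ``name()`` can be illustrated with a character that has no name, such as a control character. This is a sketch only; the ``describe`` helper and the ``'<unnamed>'`` placeholder are assumptions made for the example.

```python
import unicodedata


def describe(text):
    """Return (character, name) pairs, with a placeholder for unnamed characters."""
    return [(ch, unicodedata.name(ch, '<unnamed>')) for ch in text]


# '/' has a name; the newline control character does not.
pairs = describe('/\n')
print(pairs)
```

Without the second argument, ``unicodedata.name('\n')`` would raise :exc:`ValueError` instead.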

.. function:: decimal(chr[, default])

   Return the decimal value assigned to the character *chr* as an integer.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: digit(chr[, default])

   Return the digit value assigned to the character *chr* as an integer.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: numeric(chr[, default])

   Return the numeric value assigned to the character *chr* as a float.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.

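The three functions above differ in which characters they accept. The following sketch compares them on three sample characters chosen for illustration, passing ``None`` as *default* so that missing values do not raise.

```python
import unicodedata

# DIGIT NINE, SUPERSCRIPT TWO, VULGAR FRACTION ONE HALF
samples = ['9', '\u00b2', '\u00bd']

values = {}
for ch in samples:
    # Each entry is (decimal, digit, numeric); None marks "no such value".
    values[ch] = (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

for ch, triple in values.items():
    print(unicodedata.name(ch), triple)
```

In this sample, only the ASCII digit has all three values; the superscript has a digit and numeric value but no decimal value, and the fraction has only a numeric value.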

.. function:: category(chr)

   Return the general category assigned to the character *chr* as a
   string.


.. function:: bidirectional(chr)

   Return the bidirectional class assigned to the character *chr* as a
   string. If no such value is defined, an empty string is returned.


.. function:: combining(chr)

   Return the canonical combining class assigned to the character *chr*
   as an integer. Returns ``0`` if no combining class is defined.

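A short sketch of how these three properties differ between a base letter and a combining mark. The ``properties`` helper is hypothetical and exists only to group the calls.

```python
import unicodedata


def properties(ch):
    """Return (general category, bidirectional class, combining class) for *ch*."""
    return (
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        unicodedata.combining(ch),
    )


letter = properties('A')        # LATIN CAPITAL LETTER A
accent = properties('\u0301')   # COMBINING ACUTE ACCENT
print(letter, accent)
```

A non-zero combining class, as returned for the accent, is what normalization uses to order combining marks.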

.. function:: east_asian_width(chr)

   Return the East Asian width assigned to the character *chr* as a
   string.

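One common use of this property is estimating how many terminal columns a string occupies. The sketch below is a simplification under stated assumptions: it counts wide (``'W'``) and fullwidth (``'F'``) characters as two columns and everything else as one, ignoring combining marks and ambiguous-width characters. The ``display_width`` helper is hypothetical.

```python
import unicodedata


def display_width(text):
    """Estimate column width: 'W' and 'F' characters count as two columns."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
        for ch in text
    )


print(display_width('abc'))            # three narrow characters
print(display_width('\u4e2d\u6587'))   # two wide CJK ideographs
```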

.. function:: mirrored(chr)

   Return the mirrored property assigned to the character *chr* as an
   integer. Returns ``1`` if the character has been identified as a "mirrored"
   character in bidirectional text, ``0`` otherwise.


.. function:: decomposition(chr)

   Return the character decomposition mapping assigned to the character
   *chr* as a string. An empty string is returned if no such mapping is
   defined.

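The string returned by ``decomposition()`` holds hexadecimal code points, optionally preceded by a tag such as ``<compat>`` for compatibility mappings. The following sketch parses that string; the ``parse_decomposition`` helper is an assumption made for illustration, not part of the module.

```python
import unicodedata


def parse_decomposition(ch):
    """Split the mapping into (tag, characters); tag is None for canonical mappings."""
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith('<'):
        # A leading "<...>" field marks a compatibility decomposition.
        tag = fields.pop(0)
    return tag, ''.join(chr(int(code, 16)) for code in fields)


print(unicodedata.decomposition('\u00c7'))   # raw mapping string
print(parse_decomposition('\u00c7'))         # LATIN CAPITAL LETTER C WITH CEDILLA
print(parse_decomposition('\u2160'))         # ROMAN NUMERAL ONE
```

A character without a mapping, such as ``'A'``, yields ``(None, '')`` from this helper.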

.. function:: normalize(form, unistr)

   Return the normal form *form* for the Unicode string *unistr*. Valid values for
   *form* are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

   The Unicode standard defines various normalization forms of a Unicode string,
   based on the definition of canonical equivalence and compatibility equivalence.
   In Unicode, several characters can be expressed in various ways. For example, the
   character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
   the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

   For each character, there are two normal forms: normal form C and normal form D.
   Normal form D (NFD) is also known as canonical decomposition, and translates
   each character into its decomposed form. Normal form C (NFC) first applies a
   canonical decomposition, then composes pre-combined characters again.

   In addition to these two forms, there are two additional normal forms based on
   compatibility equivalence. In Unicode, certain characters are supported which
   normally would be unified with other characters. For example, U+2160 (ROMAN
   NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
   However, it is supported in Unicode for compatibility with existing character
   sets (e.g. gb2312).

   The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
   replace all compatibility characters with their equivalents. The normal form KC
   (NFKC) first applies the compatibility decomposition, followed by the canonical
   composition.

   Even if two Unicode strings are normalized and look the same to
   a human reader, if one has combining characters and the other
   doesn't, they may not compare equal.

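The examples from the description of ``normalize()`` can be sketched in code as follows, using the same characters (U+00C7, U+0043 U+0327, and U+2160). The variable names are chosen for illustration only.

```python
import unicodedata

composed = '\u00c7'      # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = 'C\u0327'   # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# The two spellings render alike but are different strings.
print(composed == decomposed)

# NFD decomposes; NFC decomposes and then recomposes.
nfd = unicodedata.normalize('NFD', composed)
nfc = unicodedata.normalize('NFC', decomposed)

# Canonical forms keep ROMAN NUMERAL ONE; compatibility forms replace it.
roman_nfc = unicodedata.normalize('NFC', '\u2160')
roman_nfkc = unicodedata.normalize('NFKC', '\u2160')

print(nfd == decomposed, nfc == composed, roman_nfkc)
```

Normalizing both operands to the same form before comparing is one way to make visually identical strings compare equal.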

In addition, the module exposes the following constants:

.. data:: unidata_version

   The version of the Unicode database used in this module.


.. data:: ucd_3_2_0

   This is an object that has the same methods as the entire module, but uses the
   Unicode database version 3.2 instead, for applications that require this
   specific version of the Unicode database (such as IDNA).

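A brief sketch of how the two database versions can differ. It assumes U+1F600 (GRINNING FACE) as a sample character that was assigned after Unicode 3.2, so the older database should report it as unassigned.

```python
import unicodedata

current_version = unicodedata.unidata_version
old_version = unicodedata.ucd_3_2_0.unidata_version

face = '\U0001F600'  # GRINNING FACE, assumed absent from Unicode 3.2

# The same query against the current and the 3.2 database.
current_category = unicodedata.category(face)
old_category = unicodedata.ucd_3_2_0.category(face)
old_name = unicodedata.ucd_3_2_0.name(face, None)

print(current_version, old_version)
print(current_category, old_category, old_name)
```

The value of ``unidata_version`` depends on the Python release in use, so it is printed here rather than assumed.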

Examples:

   >>> import unicodedata
   >>> unicodedata.lookup('LEFT CURLY BRACKET')
   '{'
   >>> unicodedata.name('/')
   'SOLIDUS'
   >>> unicodedata.decimal('9')
   9
   >>> unicodedata.decimal('a')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: not a decimal
   >>> unicodedata.category('A')  # 'L'etter, 'u'ppercase
   'Lu'
   >>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
   'AN'


.. rubric:: Footnotes

.. [#] http://www.unicode.org/Public/6.1.0/ucd/NameAliases.txt

.. [#] http://www.unicode.org/Public/6.1.0/ucd/NamedSequences.txt