:mod:`unicodedata` --- Unicode Database
=======================================

.. module:: unicodedata
   :synopsis: Access the Unicode Database.
.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: character
   pair: Unicode; database

This module provides access to the Unicode Character Database which defines
character properties for all Unicode characters. The data in this database is
based on the :file:`UnicodeData.txt` file version 5.1.0 which is publicly
available from ftp://ftp.unicode.org/.

The module uses the same names and symbols as defined by the UnicodeData File
Format 5.1.0 (see http://www.unicode.org/Public/5.1.0/ucd/UCD.html). It defines
the following functions:


.. function:: lookup(name)

   Look up character by name. If a character with the given name is found, return
   the corresponding character. If not found, :exc:`KeyError` is raised.


.. function:: name(chr[, default])

   Returns the name assigned to the character *chr* as a string. If no
   name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
   raised.

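As a minimal sketch of how :func:`lookup` and :func:`name` handle missing
entries, the snippet below wraps both calls. The helper names ``describe`` and
``find`` are hypothetical and not part of the module; they only illustrate the
two error-handling styles (a *default* argument versus catching
:exc:`KeyError`).

```python
import unicodedata


def describe(ch, fallback="<unnamed>"):
    """Return the Unicode name of ch, or fallback if no name is defined."""
    # name() accepts a default, so no exception handling is needed here.
    return unicodedata.name(ch, fallback)


def find(char_name):
    """Return the character called char_name, or None if the name is unknown."""
    # lookup() has no default parameter; it signals failure with KeyError.
    try:
        return unicodedata.lookup(char_name)
    except KeyError:
        return None


slash = find("SOLIDUS")
missing = find("NO SUCH CHARACTER NAME")
slash_name = describe("/")
newline_name = describe("\n")  # control characters have no name
```

Here ``slash`` is ``'/'`` and ``missing`` is ``None``, while ``newline_name``
falls back to the placeholder because control characters carry no name in the
database.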

.. function:: decimal(chr[, default])

   Returns the decimal value assigned to the character *chr* as an integer.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: digit(chr[, default])

   Returns the digit value assigned to the character *chr* as an integer.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: numeric(chr[, default])

   Returns the numeric value assigned to the character *chr* as a float.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.

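The three numeric functions are not interchangeable: a character may have a
numeric value without having a digit or decimal value. The sketch below
compares them side by side; ``numeric_profile`` is a hypothetical helper, and
``None`` is passed as *default* so that a missing value does not raise.

```python
import unicodedata


def numeric_profile(ch):
    """Return (decimal, digit, numeric) for ch, using None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


nine = numeric_profile("9")           # DIGIT NINE
superscript = numeric_profile("\u00b2")  # SUPERSCRIPT TWO
half = numeric_profile("\u00bd")      # VULGAR FRACTION ONE HALF
letter = numeric_profile("a")         # no numeric properties at all
```

``"9"`` has all three values, SUPERSCRIPT TWO has a digit and numeric value
but no decimal value, and VULGAR FRACTION ONE HALF has only the numeric value
``0.5``.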

.. function:: category(chr)

   Returns the general category assigned to the character *chr* as a
   string.


.. function:: bidirectional(chr)

   Returns the bidirectional category assigned to the character *chr* as a
   string. If no such value is defined, an empty string is returned.


.. function:: combining(chr)

   Returns the canonical combining class assigned to the character *chr*
   as an integer. Returns ``0`` if no combining class is defined.


.. function:: east_asian_width(chr)

   Returns the East Asian width assigned to the character *chr* as a
   string.


.. function:: mirrored(chr)

   Returns the mirrored property assigned to the character *chr* as an
   integer. Returns ``1`` if the character has been identified as a "mirrored"
   character in bidirectional text, ``0`` otherwise.


.. function:: decomposition(chr)

   Returns the character decomposition mapping assigned to the character
   *chr* as a string. An empty string is returned in case no such mapping is
   defined.

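To see the property functions together, the following sketch gathers their
results for a single character into a dictionary. The ``properties`` helper is
hypothetical; the values noted afterwards assume a reasonably recent Unicode
database and are unlikely to differ between versions for these characters.

```python
import unicodedata


def properties(ch):
    """Collect the per-character properties of ch into a dict."""
    return {
        "category": unicodedata.category(ch),
        "bidirectional": unicodedata.bidirectional(ch),
        "combining": unicodedata.combining(ch),
        "east_asian_width": unicodedata.east_asian_width(ch),
        "mirrored": unicodedata.mirrored(ch),
        "decomposition": unicodedata.decomposition(ch),
    }


letter = properties("A")          # LATIN CAPITAL LETTER A
cedilla = properties("\u0327")    # COMBINING CEDILLA
c_cedilla = properties("\u00c7")  # LATIN CAPITAL LETTER C WITH CEDILLA
paren = properties("(")           # LEFT PARENTHESIS
```

For ``"A"`` the category is ``'Lu'``, the combining class is ``0`` and the
decomposition is empty. COMBINING CEDILLA has a non-zero combining class,
U+00C7 decomposes to ``'0043 0327'``, and the left parenthesis is reported as
mirrored.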

.. function:: normalize(form, unistr)

   Return the normal form *form* for the Unicode string *unistr*. Valid values for
   *form* are 'NFC', 'NFKC', 'NFD', and 'NFKD'.

   The Unicode standard defines various normalization forms of a Unicode string,
   based on the definition of canonical equivalence and compatibility equivalence.
   In Unicode, several characters can be expressed in various ways. For example, the
   character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
   the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

   For each character, there are two normal forms: normal form C and normal form D.
   Normal form D (NFD) is also known as canonical decomposition, and translates
   each character into its decomposed form. Normal form C (NFC) first applies a
   canonical decomposition, then composes pre-combined characters again.

   In addition to these two forms, there are two additional normal forms based on
   compatibility equivalence. In Unicode, certain characters are supported which
   normally would be unified with other characters. For example, U+2160 (ROMAN
   NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
   However, it is supported in Unicode for compatibility with existing character
   sets (e.g. gb2312).

   The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
   replace all compatibility characters with their equivalents. The normal form KC
   (NFKC) first applies the compatibility decomposition, followed by the canonical
   composition.

   Even if two Unicode strings are normalized and look the same to
   a human reader, if one has combining characters and the other
   doesn't, they may not compare equal.

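The behaviour of the four forms can be checked with the two examples from the
description of :func:`normalize`. This is only an illustrative sketch; the
variable names are arbitrary.

```python
import unicodedata

composed = "\u00c7"      # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "C\u0327"   # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# The two spellings render alike but are different code point sequences.
equal_before = composed == decomposed

# NFD splits the precomposed character; NFC recombines the sequence.
nfd_of_composed = unicodedata.normalize("NFD", composed)
nfc_of_decomposed = unicodedata.normalize("NFC", decomposed)

# Canonical forms leave compatibility characters alone; the K forms replace them.
roman_one = "\u2160"     # ROMAN NUMERAL ONE
nfc_roman = unicodedata.normalize("NFC", roman_one)
nfkc_roman = unicodedata.normalize("NFKC", roman_one)
```

Comparing strings after normalizing both to the same form avoids the mismatch
shown by ``equal_before``, which is ``False`` for the raw strings.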

In addition, the module exposes the following constants:

.. data:: unidata_version

   The version of the Unicode database used in this module.


.. data:: ucd_3_2_0

   This is an object that has the same methods as the entire module, but uses the
   Unicode database version 3.2 instead, for applications that require this
   specific version of the Unicode database (such as IDNA).

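A short sketch of how the two databases can be queried side by side. It
assumes that the module's main database is version 5.1.0 or later, in which
U+0F3A (TIBETAN MARK GUG RTAGS GYON) is marked as mirrored, whereas it was not
in version 3.2; the choice of this character is an illustrative assumption.

```python
import unicodedata

old = unicodedata.ucd_3_2_0

# Both the module and the ucd_3_2_0 object report their database version.
current_version = unicodedata.unidata_version
old_version = old.unidata_version

# The same method called on each database may give different answers.
ch = "\u0f3a"
mirrored_now = unicodedata.mirrored(ch)
mirrored_in_3_2 = old.mirrored(ch)
```

``old_version`` is ``'3.2.0'``, and the two ``mirrored`` results differ for
this character under the stated assumption.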
Examples:

   >>> import unicodedata
   >>> unicodedata.lookup('LEFT CURLY BRACKET')
   '{'
   >>> unicodedata.name('/')
   'SOLIDUS'
   >>> unicodedata.decimal('9')
   9
   >>> unicodedata.decimal('a')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: not a decimal
   >>> unicodedata.category('A')  # 'L'etter, 'u'ppercase
   'Lu'
   >>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
   'AN'
