:mod:`unicodedata` --- Unicode Database
=======================================

.. module:: unicodedata
   :synopsis: Access the Unicode Database.
.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>


.. index::
   single: Unicode
   single: character
   pair: Unicode; database

This module provides access to the Unicode Character Database, which defines
character properties for all Unicode characters. The data in this database is
based on the :file:`UnicodeData.txt` file version 5.2.0, which is publicly
available from ftp://ftp.unicode.org/.

The module uses the same names and symbols as defined by the UnicodeData File
Format 5.2.0 (see http://www.unicode.org/reports/tr44/tr44-4.html).
It defines the following functions:


.. function:: lookup(name)

   Look up a character by name. If a character with the given name is found, return
   the corresponding character. If not found, :exc:`KeyError` is raised.


.. function:: name(chr[, default])

   Returns the name assigned to the character *chr* as a string. If no
   name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
   raised.

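As a minimal sketch of how ``lookup()`` and ``name()`` relate, the following
maps characters to names and back. The helper ``describe`` and the
``"<unnamed>"`` placeholder are illustrative choices, not part of the module.

```python
import unicodedata


def describe(text):
    """Return (character, name) pairs; characters without a name get a placeholder."""
    return [(ch, unicodedata.name(ch, "<unnamed>")) for ch in text]


# Control characters such as "\n" have no name, so the default is returned.
pairs = describe("A/\n")

# lookup() is the inverse of name() for characters that do have a name.
solidus = unicodedata.lookup("SOLIDUS")

# An unknown name raises KeyError rather than returning a default.
try:
    unicodedata.lookup("NO SUCH CHARACTER NAME")
    missing = False
except KeyError:
    missing = True
```

Passing a *default* to ``name()`` avoids a ``try``/``except`` block when
scanning arbitrary text that may contain unnamed characters.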

.. function:: decimal(chr[, default])

   Returns the decimal value assigned to the character *chr* as an integer.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: digit(chr[, default])

   Returns the digit value assigned to the character *chr* as an integer.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.


.. function:: numeric(chr[, default])

   Returns the numeric value assigned to the character *chr* as a float.
   If no such value is defined, *default* is returned, or, if not given,
   :exc:`ValueError` is raised.

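The three numeric functions above differ in which characters they accept. The
sketch below queries all three for a few sample characters, using ``None`` as
the *default* so that a missing value is visible instead of raising. The
helper ``numeric_properties`` is a hypothetical name used only here.

```python
import unicodedata


def numeric_properties(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )


nine = numeric_properties("9")              # ASCII digit: all three are defined
superscript = numeric_properties("\u00b2")  # SUPERSCRIPT TWO: a digit, but not a decimal
half = numeric_properties("\u00bd")         # VULGAR FRACTION ONE HALF: numeric only
letter = numeric_properties("a")            # no numeric properties at all
```

In this sample, every character with a decimal value also has a digit value,
and every character with a digit value also has a numeric value.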

.. function:: category(chr)

   Returns the general category assigned to the character *chr* as a
   string.


.. function:: bidirectional(chr)

   Returns the bidirectional category assigned to the character *chr* as a
   string. If no such value is defined, an empty string is returned.


.. function:: combining(chr)

   Returns the canonical combining class assigned to the character *chr*
   as an integer. Returns ``0`` if no combining class is defined.

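To illustrate the three classification functions above together, this sketch
collects their results for a letter, a combining mark, and a non-Latin digit.
The helper ``classify`` is an illustrative name, not part of the module.

```python
import unicodedata


def classify(ch):
    """Return (general category, bidirectional category, combining class)."""
    return (
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
        unicodedata.combining(ch),
    )


letter = classify("A")       # uppercase letter, left-to-right, not combining
cedilla = classify("\u0327")  # COMBINING CEDILLA: nonspacing mark
digit = classify("\u0660")    # ARABIC-INDIC DIGIT ZERO: decimal digit, Arabic number
```

A nonzero combining class, as for the cedilla, marks a character that attaches
to a preceding base character.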

.. function:: east_asian_width(chr)

   Returns the East Asian width assigned to the character *chr* as a
   string.

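One common use of ``east_asian_width()`` is estimating how many terminal
columns a string occupies. The following is a rough sketch under the
assumption that wide (``'W'``) and fullwidth (``'F'``) characters take two
columns and everything else takes one; it ignores zero-width combining marks.
The helper ``display_width`` is hypothetical.

```python
import unicodedata


def display_width(text):
    """Approximate column count: 2 for wide/fullwidth characters, else 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


narrow = unicodedata.east_asian_width("A")          # ordinary Latin letter
fullwidth = unicodedata.east_asian_width("\uff21")  # FULLWIDTH LATIN CAPITAL LETTER A
wide = unicodedata.east_asian_width("\u4e2d")       # a CJK ideograph

ascii_columns = display_width("abc")
cjk_columns = display_width("\u4e2d\u6587")
```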

.. function:: mirrored(chr)

   Returns the mirrored property assigned to the character *chr* as an
   integer. Returns ``1`` if the character has been identified as a "mirrored"
   character in bidirectional text, ``0`` otherwise.

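Because ``mirrored()`` returns ``1`` or ``0``, its result can be used directly
as a truth value. A small sketch that picks out the mirrored characters of a
sample string:

```python
import unicodedata

# Brackets and comparison signs are mirrored in right-to-left text; letters are not.
mirrored_chars = [ch for ch in "(a<b)" if unicodedata.mirrored(ch)]

paren_flag = unicodedata.mirrored("(")
letter_flag = unicodedata.mirrored("a")
```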

.. function:: decomposition(chr)

   Returns the character decomposition mapping assigned to the character
   *chr* as a string. An empty string is returned if no such mapping is
   defined.

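The string returned by ``decomposition()`` holds hexadecimal code points,
optionally preceded by a tag in angle brackets such as ``<compat>``. The
sketch below turns that string into a tag and the actual characters. The
helper ``parse_decomposition`` is an illustrative name, not part of the
module.

```python
import unicodedata


def parse_decomposition(ch):
    """Split a decomposition mapping into (tag or None, decomposed characters)."""
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith("<"):
        # A leading <...> field marks a compatibility decomposition.
        tag = fields.pop(0)
    return tag, "".join(chr(int(field, 16)) for field in fields)


raw = unicodedata.decomposition("\u00c7")    # LATIN CAPITAL LETTER C WITH CEDILLA
canonical = parse_decomposition("\u00c7")
compat = parse_decomposition("\u2160")       # ROMAN NUMERAL ONE
undecomposable = parse_decomposition("a")
```

A mapping without a tag is a canonical decomposition; a tagged one is a
compatibility decomposition.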

.. function:: normalize(form, unistr)

   Return the normal form *form* for the Unicode string *unistr*. Valid values for
   *form* are ``'NFC'``, ``'NFKC'``, ``'NFD'``, and ``'NFKD'``.

   The Unicode standard defines various normalization forms of a Unicode string,
   based on the definition of canonical equivalence and compatibility equivalence.
   In Unicode, several characters can be expressed in various ways. For example, the
   character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
   the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).

   For each character, there are two normal forms: normal form C and normal form D.
   Normal form D (NFD) is also known as canonical decomposition, and translates
   each character into its decomposed form. Normal form C (NFC) first applies a
   canonical decomposition, then composes pre-combined characters again.

   In addition to these two forms, there are two additional normal forms based on
   compatibility equivalence. In Unicode, certain characters are supported which
   normally would be unified with other characters. For example, U+2160 (ROMAN
   NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
   However, it is supported in Unicode for compatibility with existing character
   sets (e.g. gb2312).

   The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
   replace all compatibility characters with their equivalents. The normal form KC
   (NFKC) first applies the compatibility decomposition, followed by the canonical
   composition.

   Even if two Unicode strings are normalized and look the same to
   a human reader, if one has combining characters and the other
   doesn't, they may not compare equal.

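The effect of the normal forms described above can be sketched with the two
spellings of C with cedilla. The helper ``same_text``, which normalizes both
operands to one form before comparing, is a hypothetical convenience and not
part of the module.

```python
import unicodedata

composed = "\u00c7"     # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "C\u0327"  # LATIN CAPITAL LETTER C followed by COMBINING CEDILLA

# The two spellings render alike but are different strings.
raw_equal = composed == decomposed

nfd = unicodedata.normalize("NFD", composed)    # canonical decomposition
nfc = unicodedata.normalize("NFC", decomposed)  # decomposition, then composition

# Compatibility forms additionally replace compatibility characters.
roman_nfc = unicodedata.normalize("NFC", "\u2160")    # ROMAN NUMERAL ONE is kept
roman_nfkd = unicodedata.normalize("NFKD", "\u2160")  # replaced by "I"


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


normalized_equal = same_text(composed, decomposed)
```

Normalizing both operands to the same form before comparing avoids the
mismatch between composed and decomposed spellings.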

In addition, the module exposes the following constants:

.. data:: unidata_version

   The version of the Unicode database used in this module.


.. data:: ucd_3_2_0

   This is an object that has the same methods as the entire module, but uses the
   Unicode database version 3.2 instead, for applications that require this
   specific version of the Unicode database (such as IDNA).

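As a brief sketch of the difference between the module and ``ucd_3_2_0``, the
following queries both for U+1E9E (LATIN CAPITAL LETTER SHARP S), a character
that was added to Unicode after version 3.2. It assumes that the frozen
database reports such a character as unassigned (category ``'Cn'``).

```python
import unicodedata

current_version = unicodedata.unidata_version
legacy = unicodedata.ucd_3_2_0
legacy_version = legacy.unidata_version

sharp_s = "\u1e9e"  # LATIN CAPITAL LETTER SHARP S, added after Unicode 3.2

# The module-level function sees an uppercase letter ...
current_category = unicodedata.category(sharp_s)
# ... while the 3.2 database does not know the character yet.
legacy_category = legacy.category(sharp_s)
```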
Examples:

   >>> import unicodedata
   >>> unicodedata.lookup('LEFT CURLY BRACKET')
   '{'
   >>> unicodedata.name('/')
   'SOLIDUS'
   >>> unicodedata.decimal('9')
   9
   >>> unicodedata.decimal('a')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: not a decimal
   >>> unicodedata.category('A')  # 'L'etter, 'u'ppercase
   'Lu'
   >>> unicodedata.bidirectional('\u0660') # 'A'rabic, 'N'umber
   'AN'
