Hye-Shik Chang | 3e2a306 | 2004-01-17 14:29:29 +0000

Notes on cjkcodecs
------------------
This directory contains the source files for the cjkcodecs extension modules.
They are based on CJKCodecs (http://cjkpython.i18n.org/#CJKCodecs)
as of Jan 17, 2004.



To generate or modify mapping headers
-------------------------------------
The mapping headers are imported from CJKCodecs in pre-generated form.
If you need to tweak them or add something, please look at the tools/
subdirectory of the CJKCodecs distribution.



Notes on implementation characteristics of each codec
-----------------------------------------------------

1) Big5 codec

The big5 codec maps the following characters the way cp950 does, rather
than following Unicode.org's mapping, which maps them to 0xFFFD.

    BIG5        Unicode     Description

    0xA15A      0x2574      SPACING UNDERSCORE
    0xA1C3      0xFFE3      SPACING HEAVY OVERSCORE
    0xA1C5      0x02CD      SPACING HEAVY UNDERSCORE
    0xA1FE      0xFF0F      LT DIAG UP RIGHT TO LOW LEFT
    0xA240      0xFF3C      LT DIAG UP LEFT TO LOW RIGHT
    0xA2CC      0x5341      HANGZHOU NUMERAL TEN
    0xA2CE      0x5345      HANGZHOU NUMERAL THIRTY

Because Unicode 0x5341, 0x5345, 0xFF0F and 0xFF3C are already mapped to
other big5 codes, roundtrip compatibility is not guaranteed for them.
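
The roundtrip caveat can be checked empirically. The following is a minimal
sketch, not part of cjkcodecs: the helper name roundtrip_report is made up
for this note, and the result depends on the mapping tables of the Python
build at hand, so no particular output is claimed.

```python
def roundtrip_report(codec, byte_seqs):
    """Decode each byte sequence, re-encode it, and record whether the bytes survive.

    Values: True (roundtrips), False (re-encodes to different bytes),
    None (this build cannot decode or re-encode the sequence).
    """
    report = {}
    for raw in byte_seqs:
        try:
            text = raw.decode(codec)
            back = text.encode(codec)
        except UnicodeError:
            report[raw] = None
            continue
        report[raw] = (back == raw)
    return report


# Big5 codes taken from the table above.
codes = [bytes.fromhex(h) for h in
         ("A15A", "A1C3", "A1C5", "A1FE", "A240", "A2CC", "A2CE")]

for raw, ok in roundtrip_report("big5", codes).items():
    print(raw.hex().upper(), ok)
```

A False entry means the decoded character is encoded back to a different
big5 code, which is the situation the note above describes.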


2) cp932 codec

To conform to the mapping that Windows actually uses, the cp932 codec maps
the following code points in addition to the official cp932 mapping.

    CP932       Unicode     Description

    0x80        0x80        UNDEFINED
    0xA0        0xF8F0      UNDEFINED
    0xFD        0xF8F1      UNDEFINED
    0xFE        0xF8F2      UNDEFINED
    0xFF        0xF8F3      UNDEFINED
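
As an illustration, the extra single-byte mappings can be compared against
a running interpreter. This is a sketch under assumptions: the names
README_CP932_EXTRAS and check_cp932_extras are hypothetical, the expected
values are copied from the table above, and a given Python build may or may
not agree with them.

```python
# Expected code points, copied from the table above.
README_CP932_EXTRAS = {
    0x80: 0x0080,
    0xA0: 0xF8F0,
    0xFD: 0xF8F1,
    0xFE: 0xF8F2,
    0xFF: 0xF8F3,
}


def check_cp932_extras(codec="cp932"):
    """Return {byte: bool}: does decoding the single byte give the table's code point?"""
    results = {}
    for byte, expected in README_CP932_EXTRAS.items():
        try:
            text = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            # The codec rejects this byte, so it does not match the table.
            results[byte] = False
            continue
        results[byte] = (text == chr(expected))
    return results


for byte, ok in check_cp932_extras().items():
    print(f"0x{byte:02X} matches table: {ok}")
```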


3) euc-jisx0213 codec

The euc-jisx0213 codec maps JIS X 0213 Plane 1 code 0x2140 to Unicode
U+FF3C instead of U+005C as in unicode.org's mapping. Because
euc-jisx0213 already has REVERSE SOLIDUS at 0x5c and 0xA1C0 (the EUC
form of 0x2140) is displayed as a full-width character, mapping it to
U+FF3C makes more sense.

The euc-jisx0213 codec can also decode JIS X 0212 codes on codeset 2.
Because JIS X 0212 and JIS X 0213 Plane 2 do not overlap each other,
this does not break standard conformance (and JIS X 0213 Plane 2 is
intended to be used this way). When encoding, the codec tries to
encode kanji characters in this order:

    JIS X 0213 Plane 1 -> JIS X 0213 Plane 2 -> JIS X 0212
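
The relation between a JIS code and its EUC byte form, and the shape of an
encoded character, can be sketched as follows. The helpers jis_to_euc and
euc_form are hypothetical names for this note, and the lead-byte
classification assumes the usual EUC layout (0x8E and 0x8F as single-shift
bytes); it is not taken from the codec source.

```python
def jis_to_euc(code):
    """Convert a two-byte JIS code (e.g. 0x2140) to EUC bytes by setting each high bit."""
    return bytes([(code >> 8) | 0x80, (code & 0xFF) | 0x80])


def euc_form(encoded):
    """Classify one encoded character by its lead byte (assumed usual EUC layout)."""
    lead = encoded[0]
    if lead < 0x80:
        return "ascii"
    if lead == 0x8E:
        return "single-shift 0x8E"
    if lead == 0x8F:
        return "single-shift 0x8F (three-byte form)"
    return "two-byte"


raw = jis_to_euc(0x2140)
try:
    decoded = raw.decode("euc_jisx0213")
    print(raw.hex(), "->", [f"U+{ord(c):04X}" for c in decoded])
except UnicodeDecodeError as exc:
    print(raw.hex(), "is not decodable in this build:", exc)

# Show which form the encoder picked for a couple of characters.
for ch in ("A", "あ"):
    print(repr(ch), euc_form(ch.encode("euc_jisx0213")))
```

Applying euc_form to the encoder's output for a given kanji shows which of
the candidate character sets was chosen for it.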


4) euc-jp codec

The euc-jp codec is a compatibility variant on these points:
- U+FF3C FULLWIDTH REVERSE SOLIDUS is mapped to EUC-JP 0xA1C0 (and vice versa)
- U+00A5 YEN SIGN is mapped to EUC-JP 0x5c (one way)
- U+203E OVERLINE is mapped to EUC-JP 0x7e (one way)
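
"One way" here means that the character can be encoded, but decoding the
resulting bytes yields a different character. A minimal sketch of how to
observe this follows; mapping_kind is a hypothetical helper, and what it
reports depends on the Python build in use.

```python
def mapping_kind(codec, ch):
    """Classify ch as 'roundtrip', 'one way', or 'unmapped' for the given codec."""
    try:
        raw = ch.encode(codec)
    except UnicodeEncodeError:
        return "unmapped"
    try:
        back = raw.decode(codec)
    except UnicodeDecodeError:
        # Encodable but not decodable: treated as one way for this sketch.
        return "one way"
    return "roundtrip" if back == ch else "one way"


# The three characters listed above.
for ch in ("\uff3c", "\u00a5", "\u203e"):
    print(f"U+{ord(ch):04X}", mapping_kind("euc_jp", ch))
```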


5) shift-jis codec

The shift-jis codec maps the 0x20-0x7e area directly to U+20-U+7E
instead of using JIS X 0201, for compatibility. The differences are:
- U+005C REVERSE SOLIDUS is mapped to SHIFT-JIS 0x5c.
- U+007E TILDE is mapped to SHIFT-JIS 0x7e.
- U+FF3C FULLWIDTH REVERSE SOLIDUS is mapped to SHIFT-JIS 0x815f.
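
The direct mapping of the 0x20-0x7e area can be checked with a short loop.
This is only a sketch: ascii_range_differences is a hypothetical helper
name, and the list it returns depends on the codec tables of the Python
build in use.

```python
def ascii_range_differences(codec):
    """List (byte, decoded) pairs in 0x20-0x7e that do not decode to the same code point.

    decoded is None when the codec rejects the byte.
    """
    diffs = []
    for byte in range(0x20, 0x7F):
        try:
            text = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            diffs.append((byte, None))
            continue
        if text != chr(byte):
            diffs.append((byte, text))
    return diffs


print(ascii_range_differences("shift_jis"))
```

An empty list means every byte in the range decodes to the code point with
the same value, which is the behavior described above.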