<?xml version="1.0"?>
<!DOCTYPE kanjidic2 [
<!-- Version 1.3
This is the DTD of the XML-format kanji file combining information from
the KANJIDIC and KANJD212 files. It is intended to be largely
self-documenting, with each field being accompanied by an explanatory
comment.

The file covers the following kanji:
(a) the 6,355 kanji from JIS X 0208;
(b) the 5,801 kanji from JIS X 0212;
(c) the 3,625 kanji from JIS X 0213 as follows:
    (i) the 2,741 kanji which are also in JIS X 0212 have
        JIS X 0213 code-points (kuten) added to the existing entry;
    (ii) the 884 "new" kanji have new entries.

At the end of the explanation for a number of fields there is a tag
with the format [N]. This indicates the leading letter(s) of the
equivalent field in the KANJIDIC and KANJD212 files.

The KANJIDIC documentation should also be read for additional
details about the data in the file.
--><!ELEMENT kanjidic2 (header , character*)>
<!ELEMENT header (file_version , database_version , date_of_creation)>
<!--
The single header element will contain identification information
about the version of the file.
--><!ELEMENT file_version (#PCDATA)>
<!--
This field denotes the version of the kanjidic2 structure, as more
than one version may exist.
--><!ELEMENT database_version (#PCDATA)>
<!--
The version of the file, in the format YYYY-NN, where NN will be
a number starting with 01 for the first version released in a
calendar year, then increasing for each version in that year.
--><!ELEMENT date_of_creation (#PCDATA)>
<!--
The date the file was created in international format (YYYY-MM-DD).
--><!ELEMENT character (literal , codepoint , radical , misc , dic_number? , query_code? , reading_meaning? , nanori?)*>
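<!--
Illustrative sketch, not part of the original DTD: assuming the header
declarations above, a header element might be written as follows. All
three values are hypothetical placeholders chosen only to show the
expected formats.

<header>
<file_version>1</file_version>
<database_version>2005-01</database_version>
<date_of_creation>2005-01-25</date_of_creation>
</header>
-->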
<!ELEMENT literal (#PCDATA)>
<!--
The character itself in UTF-8 encoding.
--><!ELEMENT codepoint (cp_value)+>
<!--
The codepoint element states the code of the character in the various
character set standards.
--><!ELEMENT cp_value (#PCDATA)>
<!--
The cp_value contains the codepoint of the character in a particular
standard. The standard will be identified in the cp_type attribute.
--><!ATTLIST cp_value cp_type CDATA #REQUIRED>
<!--
The cp_type attribute states the coding standard applying to the
element. The values assigned so far are:
    jis208 - JIS X 0208-1997 - kuten coding (nn-nn)
    jis212 - JIS X 0212-1990 - kuten coding (nn-nn)
    jis213 - JIS X 0213-2000 - kuten coding (p-nn-nn)
    ucs - Unicode 4.0 - hex coding (4 or 5 hexadecimal digits)
--><!ELEMENT radical (rad_value)+>
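<!--
Illustrative sketch, not part of the original DTD: assuming the codepoint
declarations above, an entry carrying both a Unicode and a JIS X 0208
code might look like this. The values are examples only and should be
checked against the actual data file.

<codepoint>
<cp_value cp_type="ucs">4e9c</cp_value>
<cp_value cp_type="jis208">16-01</cp_value>
</codepoint>
-->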
<!ELEMENT rad_value (#PCDATA)>
<!--
The radical number, in the range 1 to 214. The particular
classification type is stated in the rad_type attribute.
--><!ATTLIST rad_value rad_type CDATA #REQUIRED>
<!--
The rad_type attribute states the type of radical classification.
    classical - as recorded in the KangXi Zidian.
    nelson - as used in the Nelson "Modern Japanese-English
        Character Dictionary" (i.e. the Classic, not the New Nelson).
        This will only be used where Nelson reclassified the kanji.
--><!ELEMENT misc (grade? , stroke_count+ , variant* , freq* , rad_name*)>
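<!--
Illustrative sketch, not part of the original DTD: assuming the radical
declarations above, a kanji that Nelson reclassified would carry two
rad_value elements, one per classification. The radical numbers shown
are example values only.

<radical>
<rad_value rad_type="classical">7</rad_value>
<rad_value rad_type="nelson">1</rad_value>
</radical>
-->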
<!ELEMENT grade (#PCDATA)>
<!--
The Jouyou Kanji grade level. 1 through 6 indicate the grade in which
the kanji is taught in Japanese schools. 8 indicates it is one of the
remaining Jouyou Kanji to be learned in junior high school, and 9
indicates it is a Jinmeiyou (for use in names) kanji. [G]
--><!ELEMENT stroke_count (#PCDATA)>
<!--
The stroke count of the kanji, including the radical. If more than
one, the first is considered the accepted count, while subsequent ones
are common miscounts. (See Appendix E of the KANJIDIC documentation
for some of the rules applied when counting strokes in some of the
radicals.) [S]
--><!ELEMENT variant (#PCDATA)>
<!--
A cross-reference code to another kanji, usually regarded as a variant.
The type of cross-reference is given in the var_type attribute.
--><!ATTLIST variant var_type CDATA #REQUIRED>
<!--
The var_type attribute indicates the type of variant code. The current
values are:
    jis208 - in JIS X 0208 - kuten coding
    jis212 - in JIS X 0212 - kuten coding
    jis213 - in JIS X 0213 - kuten coding
    deroo - De Roo number - numeric
    njecd - Halpern NJECD index number - numeric
    s_h - The Kanji Dictionary (Spahn & Hadamitzky) - descriptor
    nelson - "Classic" Nelson - numeric
    oneill - Japanese Names (O'Neill) - numeric
--><!ELEMENT freq (#PCDATA)>
<!--
A frequency-of-use ranking. The 2,500 most-used characters have a
ranking; those characters that lack this field are not ranked. The
frequency is a number from 1 to 2,500 that expresses the relative
frequency of occurrence of a character in modern Japanese. This is
based on a survey in newspapers, so it is biased towards kanji
used in newspaper articles. The discrimination between the less
frequently used kanji is not strong.
--><!ELEMENT rad_name (#PCDATA)>
<!--
When the kanji is itself a radical and has a name, this element
contains the name (in hiragana). [T2]
--><!ELEMENT dic_number (dic_ref)+>
<!--
This element contains the index numbers and similar unstructured
information, such as page numbers, in a number of published
dictionaries and instructional books on kanji.
--><!ELEMENT dic_ref (#PCDATA)>
<!--
Each dic_ref contains an index number. The particular dictionary,
etc. is defined by the dr_type attribute.
--><!ATTLIST dic_ref dr_type CDATA #REQUIRED>
<!--
The dr_type defines the dictionary or reference book, etc. to which
the dic_ref element applies. The initial allocation is:
    nelson_c - "Modern Reader's Japanese-English Character Dictionary",
        edited by Andrew Nelson (now published as the "Classic"
        Nelson).
    nelson_n - "The New Nelson Japanese-English Character Dictionary",
        edited by John Haig.
    halpern_njecd - "New Japanese-English Character Dictionary",
        edited by Jack Halpern.
    halpern_kkld - "Kanji Learner's Dictionary" (Kodansha) edited by
        Jack Halpern.
    heisig - "Remembering The Kanji" by James Heisig.
    gakken - "A New Dictionary of Kanji Usage" (Gakken)
    oneill_names - "Japanese Names", by P.G. O'Neill.
    oneill_kk - "Essential Kanji" by P.G. O'Neill.
    moro - "Daikanwajiten" compiled by Morohashi. For some kanji two
        additional attributes are used: m_vol: the volume of the
        dictionary in which the kanji is found, and m_page: the page
        number in the volume.
    henshall - "A Guide To Remembering Japanese Characters" by
        Kenneth G. Henshall.
    sh_kk - "Kanji and Kana" by Spahn and Hadamitzky.
    sakade - "A Guide To Reading and Writing Japanese" edited by
        Florence Sakade.
    henshall3 - "A Guide To Reading and Writing Japanese" 3rd
        edition, edited by Henshall, Seeley and De Groot.
    tutt_cards - Tuttle Kanji Cards, compiled by Alexander Kask.
    crowley - "The Kanji Way to Japanese Language Power" by
        Dale Crowley.
    kanji_in_context - "Kanji in Context" by Nishiguchi and Kono.
    busy_people - "Japanese For Busy People" vols I-III, published
        by the AJLT. The codes are the volume.chapter.
    kodansha_compact - the "Kodansha Compact Kanji Guide".
--><!ATTLIST dic_ref m_vol CDATA #IMPLIED>
<!--
See above under "moro".
--><!ATTLIST dic_ref m_page CDATA #IMPLIED>
<!--
See above under "moro".
--><!ELEMENT query_code (q_code)+>
<!--
These codes contain information relating to the glyph, and can be used
for finding a required kanji. The type of code is defined by the
qc_type attribute.
--><!ELEMENT q_code (#PCDATA)>
<!--
The q_code contains the actual query-code value, according to the
qc_type attribute.
--><!ATTLIST q_code qc_type CDATA #REQUIRED>
<!--
The qc_type attribute defines the type of query code. The current values
are:
    skip - Halpern's SKIP (System of Kanji Indexing by Patterns)
        code. The format is n-nn-nn. See the KANJIDIC documentation
        for a description of the code and restrictions on the
        commercial use of this data. [P]
    sh_desc - the descriptor codes for The Kanji Dictionary (Tuttle
        1996) by Spahn and Hadamitzky. They are in the form nxnn.n,
        e.g. 3k11.2, where the kanji has 3 strokes in the
        identifying radical, it is radical "k" in the SH
        classification system, there are 11 other strokes, and it is
        the 2nd kanji in the 3k11 sequence. (I am very grateful to
        Mark Spahn for providing the list of these descriptor codes
        for the kanji in this file.) [I]
    four_corner - the "Four Corner" code for the kanji. This is a code
        invented by Wang Chen in 1928. See the KANJIDIC documentation
        for an overview of the Four Corner System. [Q]
    deroo - the codes developed by the late Father Joseph De Roo, and
        published in his book "2001 Kanji" (Bojinsha). Fr De Roo
        gave his permission for these codes to be included. [DR]
    misclass - a possible misclassification of the kanji according
        to one of the code types. (See the "Z" codes in the KANJIDIC
        documentation for more details.)
--><!ELEMENT reading_meaning (rmgroup* , nanori*)>
<!--
The readings for the kanji in several languages, and the meanings, also
in several languages. The readings and meanings are grouped to enable
the handling of the situation where the meaning is differentiated by
reading. [T1]
--><!ELEMENT nanori (#PCDATA)>
<!--
Japanese readings that are now only associated with names.
--><!ELEMENT rmgroup (reading* , meaning*)>
<!ELEMENT reading (#PCDATA)>
<!--
The reading element contains the reading or pronunciation
of the kanji.
--><!ATTLIST reading r_type CDATA #REQUIRED>
<!--
The r_type attribute defines the type of reading in the reading
element. The current values are:
    pinyin - the modern PinYin romanization of the Chinese reading
        of the kanji. The tones are represented by a concluding
        digit. [Y]
    korean_r - the romanized form of the Korean reading(s) of the
        kanji. The readings are in the (Republic of Korea) Ministry
        of Education style of romanization. [W]
    korean_h - the Korean reading(s) of the kanji in hangul.
    ja_on - the "on" Japanese reading of the kanji, in katakana. A
        second attribute r_status, if present with a value of "jy",
        indicates that the reading is approved for a
        "Jouyou kanji".
    ja_kun - the "kun" Japanese reading of the kanji, in hiragana.
        Where relevant the okurigana is also included, separated by a
        ".". Readings associated with prefixes and suffixes are
        marked with a "-". A second attribute r_status, if present
        with a value of "jy", indicates that the reading is
        approved for a "Jouyou kanji".
--><!ATTLIST reading r_status CDATA #IMPLIED>
<!--
See under ja_on and ja_kun above.
--><!ELEMENT meaning (#PCDATA)>
<!--
The meaning associated with the kanji.
--><!ATTLIST meaning m_lang CDATA #IMPLIED>
<!--
The m_lang attribute defines the target language of the meaning. It
will be coded using the two-letter language code from the ISO 639-1
standard. When absent, the value "en" (i.e. English) is implied. [{}]
-->]>
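<!--
Illustrative sketch, not part of the original file: assuming the
reading_meaning, rmgroup, reading, meaning and nanori declarations in
the DTD above, a grouped set of readings and meanings might look like
this. The readings, the r_status flag and the French gloss are example
values only and should be checked against the actual data file.

<reading_meaning>
<rmgroup>
<reading r_type="pinyin">ya4</reading>
<reading r_type="ja_on" r_status="jy">ア</reading>
<reading r_type="ja_kun">つ.ぐ</reading>
<meaning>Asia</meaning>
<meaning m_lang="fr">Asie</meaning>
</rmgroup>
<nanori>や</nanori>
</reading_meaning>

The first meaning omits m_lang and is therefore read as English; the
second states its language explicitly.
-->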
<kanjidic2>
</kanjidic2>