\section{Standard Module \sectcode{xmllib}}
% Author: Sjoerd Mullender
\label{module-xmllib}
\stmodindex{xmllib}
\index{XML}

This module defines a class \code{XMLParser} which serves as the basis
for parsing text files formatted in XML (Extensible Markup Language).

The \code{XMLParser} class must be instantiated without arguments.  It
has the following interface methods:

\renewcommand{\indexsubitem}{(XMLParser method)}

\begin{funcdesc}{reset}{}
Reset the instance.  Loses all unprocessed data.  This is called
implicitly at instantiation time.
\end{funcdesc}

\begin{funcdesc}{setnomoretags}{}
Stop processing tags.  Treat all following input as literal input
(CDATA).
\end{funcdesc}

\begin{funcdesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{funcdesc}

\begin{funcdesc}{feed}{data}
Feed some text to the parser.  It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \code{close()} is called.
\end{funcdesc}

\begin{funcdesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark.  This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call \code{XMLParser.close()}.
\end{funcdesc}
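The buffering performed by \code{feed()} and \code{close()} can be
illustrated with a small stand-in.  The class \code{BufferingSketch}
below is hypothetical and is not the \code{XMLParser} implementation:
it only recognizes complete \code{<...>} tags, and it assumes that
\code{close()} flushes whatever is left in the buffer as plain data.

```python
class BufferingSketch:
    """Hypothetical stand-in for illustration; not xmllib.XMLParser."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.rawdata = ''   # input that has not been processed yet
        self.events = []    # (kind, text) pairs processed so far

    def feed(self, data):
        self.rawdata += data
        self._process(end=False)

    def close(self):
        # Assumption of this sketch: leftovers are flushed as data.
        self._process(end=True)

    def _process(self, end):
        buf = self.rawdata
        while buf:
            if buf[0] == '<':
                stop = buf.find('>')
                if stop < 0:
                    # Incomplete tag: keep it buffered unless at end of input.
                    if end:
                        self.events.append(('data', buf))
                        buf = ''
                    break
                self.events.append(('tag', buf[1:stop]))
                buf = buf[stop + 1:]
            else:
                stop = buf.find('<')
                if stop < 0:
                    stop = len(buf)
                self.events.append(('data', buf[:stop]))
                buf = buf[stop:]
        self.rawdata = buf
```

Feeding \code{'<a>hi<b'} processes the complete tag and the text, and
keeps \code{'<b'} buffered until a later call supplies the closing
\code{'>'}.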

\begin{funcdesc}{translate_references}{data}
Translate all entity and character references in \code{data} and
return the translated string.
\end{funcdesc}
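The kind of translation described here can be sketched as a standalone
function.  The name \code{translate_references_sketch}, the regular
expression, and the \code{ENTITYDEFS} table are assumptions made for
illustration and are not taken from the module; the sketch also
assumes that unknown entity references are left untouched.

```python
import re

# Assumed table for the five predefined XML entities.
ENTITYDEFS = {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': "'"}

# Group 1: numeric character reference; group 2: entity name.
_REF = re.compile(r'&(?:#(x[0-9a-fA-F]+|[0-9]+)|([A-Za-z_:][\w.:-]*));')


def translate_references_sketch(data, entitydefs=ENTITYDEFS):
    """Return data with character and known entity references replaced."""

    def replace(match):
        charref, name = match.groups()
        if charref is not None:
            if charref[0] == 'x':
                return chr(int(charref[1:], 16))   # hexadecimal form
            return chr(int(charref))               # decimal form
        # Unknown entities are kept as written (assumption of this sketch).
        return entitydefs.get(name, match.group(0))

    return _REF.sub(replace, data)
```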

\begin{funcdesc}{handle_xml}{encoding\, standalone}
This method is called when the \code{<?xml ...?>} tag is processed.
The arguments are the values of the encoding and standalone attributes
in the tag.  Both encoding and standalone are optional.  The values
passed to \code{handle_xml} default to \code{None} and the string
\code{'no'} respectively.
\end{funcdesc}

\begin{funcdesc}{handle_doctype}{tag\, data}
This method is called when the \code{<!DOCTYPE...>} tag is processed.
The arguments are the name of the root element and the uninterpreted
contents of the tag, starting after the white space after the name of
the root element.
\end{funcdesc}

\begin{funcdesc}{handle_starttag}{tag\, method\, attributes}
This method is called to handle start tags for which a
\code{start_\var{tag}()} method has been defined.  The \code{tag}
argument is the name of the tag, and the \code{method} argument is the
bound method which should be used to support semantic interpretation
of the start tag.  The \var{attributes} argument is a dictionary of
attributes, the key being the \var{name} and the value being the
\var{value} of the attribute found inside the tag's \code{<>} brackets.
Character and entity references in the \var{value} have
been interpreted.  For instance, for the tag
\code{<A HREF="http://www.cwi.nl/">}, this method would be called as
\code{handle_starttag('A', self.start_A, \{'HREF': 'http://www.cwi.nl/'\})}.
The base implementation simply calls \code{method} with \code{attributes}
as the only argument.
\end{funcdesc}
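The dispatch described above can be sketched with a hypothetical
stand-in class.  \code{DispatchSketch}, its \code{dispatch_start}
helper, and the \code{log} attribute are illustrative names and not
part of the module; only the relationship between
\code{handle_starttag}, \code{start_\var{tag}} and
\code{unknown_starttag} follows the text.

```python
class DispatchSketch:
    """Hypothetical stand-in for illustration; not xmllib.XMLParser."""

    def __init__(self):
        self.log = []

    def handle_starttag(self, tag, method, attributes):
        # Base behaviour as described: call method with the attributes.
        method(attributes)

    def unknown_starttag(self, tag, attributes):
        self.log.append(('unknown', tag, attributes))

    def dispatch_start(self, tag, attributes):
        # Illustrative helper: look for a start_<tag> method by name.
        method = getattr(self, 'start_' + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)


class LinkCollector(DispatchSketch):
    def start_A(self, attributes):
        self.log.append(('link', attributes.get('HREF')))
```

With this sketch, a start tag \code{A} reaches \code{start_A} through
\code{handle_starttag}, while a tag without a matching method falls
through to \code{unknown_starttag}.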

\begin{funcdesc}{handle_endtag}{tag\, method}
This method is called to handle end tags for which an
\code{end_\var{tag}()} method has been defined.  The \code{tag}
argument is the name of the tag, and the
\code{method} argument is the bound method which should be used to
support semantic interpretation of the end tag.  If no
\code{end_\var{tag}()} method is defined for the closing element, this
handler is not called.  The base implementation simply calls
\code{method}.
\end{funcdesc}

\begin{funcdesc}{handle_data}{data}
This method is called to process arbitrary data.  It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{funcdesc}

\begin{funcdesc}{handle_charref}{ref}
This method is called to process a character reference of the form
``\code{\&\#\var{ref};}''.  \var{ref} can either be a decimal number,
or a hexadecimal number when preceded by \code{x}.
In the base implementation, \var{ref} must be a number in the
range 0--255.  It translates the character to \ASCII{} and calls the
method \code{handle_data()} with the character as argument.  If
\var{ref} is invalid or out of range, the method
\code{unknown_charref(\var{ref})} is called to handle the error.  A
subclass must override this method to provide support for character
references outside of the \ASCII{} range.
\end{funcdesc}
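The decoding rule for \var{ref} can be sketched as a standalone
helper.  The function name \code{charref_to_char} is hypothetical, and
returning \code{None} stands in for the point at which
\code{unknown_charref()} would be called.

```python
def charref_to_char(ref):
    """Return the character for ref, or None if invalid or out of range."""
    try:
        if ref[:1] == 'x':
            number = int(ref[1:], 16)   # hexadecimal form, e.g. 'x41'
        else:
            number = int(ref)           # decimal form, e.g. '65'
    except ValueError:
        return None
    if not 0 <= number <= 255:
        return None
    return chr(number)
```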

\begin{funcdesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the form
``\code{\&\var{ref};}'' where \var{ref} is the name of a general
entity.  It looks for \var{ref} in the instance (or class)
variable \code{entitydefs} which should be a mapping from entity names
to corresponding translations.
If a translation is found, it calls the method \code{handle_data()}
with the translation; otherwise, it calls the method
\code{unknown_entityref(\var{ref})}.  The default \code{entitydefs}
defines translations for \code{\&amp;}, \code{\&apos;}, \code{\&gt;},
\code{\&lt;}, and \code{\&quot;}.
\end{funcdesc}
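The lookup in \code{entitydefs} can be sketched with a hypothetical
stand-in class.  \code{EntitySketch} and \code{WithCopyright} are
illustrative names, and the table contents are assumed values for the
five predefined entities rather than the module's actual mapping.

```python
class EntitySketch:
    """Hypothetical stand-in for illustration; not xmllib.XMLParser."""

    # Assumed translations for the five predefined entities.
    entitydefs = {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': "'"}

    def __init__(self):
        self.data = []
        self.unknown = []

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)

    def handle_entityref(self, ref):
        # Instance attribute lookup falls back to the class variable.
        table = self.entitydefs
        if ref in table:
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)


class WithCopyright(EntitySketch):
    # A derived class can extend the table with its own entities.
    entitydefs = dict(EntitySketch.entitydefs, copy='(c)')
```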

\begin{funcdesc}{handle_comment}{comment}
This method is called when a comment is encountered.  The
\code{comment} argument is a string containing the text between the
``\code{<!--}'' and ``\code{-->}'' delimiters, but not the delimiters
themselves.  For example, the comment ``\code{<!--text-->}'' will
cause this method to be called with the argument \code{'text'}.  The
default method does nothing.
\end{funcdesc}

\begin{funcdesc}{handle_cdata}{data}
This method is called when a CDATA section is encountered.  The
\code{data} argument is a string containing the text between the
``\code{<![CDATA[}'' and ``\code{]]>}'' delimiters, but not the delimiters
themselves.  For example, the section ``\code{<![CDATA[text]]>}'' will
cause this method to be called with the argument \code{'text'}.  The
default method does nothing.
\end{funcdesc}

\begin{funcdesc}{handle_proc}{name\, data}
This method is called when a processing instruction (PI) is encountered.  The
\code{name} is the PI target, and the \code{data} argument is a
string containing the text between the PI target and the closing delimiter,
but not the delimiter itself.  For example, the instruction
``\code{<?XML text?>}'' will cause this method to be called with the
arguments \code{'XML'} and \code{'text'}.  The default method does
nothing.  Note that if a document starts with a \code{<?xml ...?>}
tag, \code{handle_xml} is called to handle it.
\end{funcdesc}
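How a processing instruction separates into the two arguments can be
sketched as follows.  The helper \code{split_pi} is hypothetical, and
it assumes that the white space between the target and the data is not
part of \code{data}, as the example above suggests.

```python
def split_pi(text):
    """Split '<?target data?>' into (name, data); a sketch only."""
    if not (text.startswith('<?') and text.endswith('?>')):
        raise ValueError('not a processing instruction: %r' % text)
    body = text[2:-2]
    parts = body.split(None, 1)   # split once on the first white space
    if not parts:
        raise ValueError('missing PI target')
    name = parts[0]
    data = parts[1] if len(parts) > 1 else ''
    return name, data
```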

\begin{funcdesc}{handle_special}{data}
This method is called when a declaration is encountered.  The
\code{data} argument is a string containing the text between the
``\code{<!}'' and ``\code{>}'' delimiters, but not the delimiters
themselves.  For example, the declaration ``\code{<!ENTITY text>}'' will
cause this method to be called with the argument \code{'ENTITY text'}.  The
default method does nothing.  Note that \code{<!DOCTYPE ...>} is
handled separately if it is located at the start of the document.
\end{funcdesc}

\begin{funcdesc}{syntax_error}{message}
This method is called when a syntax error is encountered.  The
\code{message} is a description of what was wrong.  The default method
raises a \code{RuntimeError} exception.  If this method is overridden,
it is permissible for it to return.  This method is only called when
the error can be recovered from.  Unrecoverable errors raise a
\code{RuntimeError} without first calling \code{syntax_error}.
\end{funcdesc}
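The difference between raising and returning from
\code{syntax_error} can be sketched with two hypothetical stand-in
classes.  \code{StrictSketch}, \code{LenientSketch} and the
\code{check_attribute} helper are illustrative; the recoverable
condition (an attribute without a value) is an assumption chosen for
the example.

```python
class StrictSketch:
    """Hypothetical stand-in for illustration; not xmllib.XMLParser."""

    def syntax_error(self, message):
        # Default behaviour as described: raise RuntimeError.
        raise RuntimeError('Syntax error: ' + message)

    def check_attribute(self, name, value):
        # Assumed recoverable condition: the attribute has no value.
        if value is None:
            self.syntax_error('no value for attribute %r' % name)
            value = name   # recovery: only reached if syntax_error returns
        return value


class LenientSketch(StrictSketch):
    def __init__(self):
        self.errors = []

    def syntax_error(self, message):
        # Returning instead of raising lets processing continue.
        self.errors.append(message)
```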

\begin{funcdesc}{unknown_starttag}{tag\, attributes}
This method is called to process an unknown start tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{funcdesc}

\begin{funcdesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{funcdesc}

\begin{funcdesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references.  It is intended to be overridden by a derived class; the
base class implementation does nothing.
\end{funcdesc}

\begin{funcdesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference.  It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{funcdesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods and variables of the following form to
define processing of specific tags.  Tag names in the input stream are
case-sensitive; the \var{tag} occurring in method names must be in the
correct case:

\begin{funcdesc}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}.  The
\var{attributes} argument has the same meaning as described for
\code{handle_starttag()} above.  In fact, the base implementation of
\code{handle_starttag} calls this method.
\end{funcdesc}

\begin{funcdesc}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{funcdesc}

\begin{datadesc}{\var{tag}_attributes}
If a class or instance variable \code{\var{tag}_attributes} exists, it
should be a list or a dictionary.  If a list, the elements of the list
are the valid attributes for the element \var{tag}; if a dictionary,
the keys are the valid attributes for the element \var{tag}, and the
values the default values of the attributes, or \code{None} if there
is no default.
In addition to the attributes that were present in the tag, the
attribute dictionary that is passed to \code{handle_starttag} and
\code{unknown_starttag} contains values for all attributes that have a
default value.
\end{datadesc}
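The way defaults from \code{\var{tag}_attributes} end up in the
attribute dictionary can be sketched as a standalone function.  The
name \code{apply_tag_attributes} is hypothetical; the sketch merely
reports attributes that are not listed as valid, since how the parser
reacts to them is not described here.

```python
def apply_tag_attributes(spec, found):
    """Return (attributes, unexpected) for one start tag; a sketch only.

    spec:  list of valid names, or dict mapping names to default values
    found: dict of the attributes actually present in the tag
    """
    if isinstance(spec, dict):
        defaults = spec
    else:
        defaults = dict.fromkeys(spec)   # a list carries no defaults
    attributes = dict(found)
    for name, default in defaults.items():
        # Fill in a default only when one exists and the tag omitted it.
        if default is not None and name not in attributes:
            attributes[name] = default
    unexpected = [name for name in found if name not in defaults]
    return attributes, unexpected
```

For instance, a specification of
\code{\{'align': 'left', 'id': None\}} adds \code{'align'} to a tag
that only carries \code{id}, and leaves \code{id} without a default.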