\section{\module{xml.sax.xmlreader} ---
         Interface for XML parsers}

\declaremodule{standard}{xml.sax.xmlreader}
\modulesynopsis{Interface which SAX-compliant XML parsers must implement.}
\sectionauthor{Martin v. L\"owis}{loewis@informatik.hu-berlin.de}
\moduleauthor{Lars Marius Garshol}{larsga@garshol.priv.no}

\versionadded{2.0}


SAX parsers implement the \class{XMLReader} interface.  Each parser is
implemented in a Python module, which must provide a function
\function{create_parser()}.  This function is invoked by
\function{xml.sax.make_parser()} with no arguments to create a new
parser object.
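
As a minimal sketch of the mechanism described above, the following
obtains a parser through \function{xml.sax.make_parser()} and checks
that the result implements the \class{XMLReader} interface.  It assumes
that a default driver (such as the bundled Expat driver) is installed;
the name \code{is_reader} is only illustrative.

```python
import xml.sax
from xml.sax import xmlreader

# make_parser() locates a driver module and calls its create_parser()
# function with no arguments.
parser = xml.sax.make_parser()

# The object returned by the driver is expected to implement XMLReader.
is_reader = isinstance(parser, xmlreader.XMLReader)
print(is_reader)
```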

\begin{classdesc}{XMLReader}{}
  Base class which can be inherited by SAX parsers.
\end{classdesc}

\begin{classdesc}{IncrementalParser}{}
  In some cases, it is desirable not to parse an input source all at
  once, but to feed chunks of the document as they become available.
  Note that the reader will normally not read the entire file at once,
  but read it in chunks as well; still, \method{parse()} won't return
  until the entire document is processed.  These interfaces should
  therefore be used if the blocking behaviour of \method{parse()} is
  not desirable.

  When the parser is instantiated, it is ready to begin accepting data
  from the \method{feed()} method immediately.  After parsing has been
  finished with a call to \method{close()}, the \method{reset()} method
  must be called to make the parser ready to accept new data, either
  from \method{feed()} or using the \method{parse()} method.

  Note that these methods must \emph{not} be called during parsing,
  that is, after \method{parse()} has been called and before it returns.

  By default, the class also implements the \method{parse()} method of
  the \class{XMLReader} interface using the \method{feed()},
  \method{close()} and \method{reset()} methods of the
  \class{IncrementalParser} interface as a convenience to SAX 2.0
  driver writers.
\end{classdesc}

\begin{classdesc}{Locator}{}
  Interface for associating a SAX event with a document location.  A
  locator object will return valid results only during calls to
  \class{ContentHandler} methods; at any other time, the results are
  unpredictable.  If information is not available, methods may return
  \code{None}.
\end{classdesc}

\begin{classdesc}{InputSource}{\optional{systemId}}
  Encapsulation of the information needed by the \class{XMLReader} to
  read entities.

  This class may include information about the public identifier,
  system identifier, byte stream (possibly with character encoding
  information) and/or the character stream of an entity.

  Applications will create objects of this class for use in the
  \method{XMLReader.parse()} method and for returning from
  \method{EntityResolver.resolveEntity()}.

  An \class{InputSource} belongs to the application; the
  \class{XMLReader} is not allowed to modify \class{InputSource} objects
  passed to it from the application, although it may make copies and
  modify those.
\end{classdesc}

\begin{classdesc}{AttributesImpl}{attrs}
  This is an implementation of the \ulink{\class{Attributes}
  interface}{attributes-objects.html} (see
  section~\ref{attributes-objects}).  This is a dictionary-like
  object which represents the element attributes in a
  \method{startElement()} call.  In addition to the most useful
  dictionary operations, it supports a number of other methods as
  described by the interface.  Objects of this class should be
  instantiated by readers; \var{attrs} must be a dictionary-like
  object containing a mapping from attribute names to attribute
  values.
\end{classdesc}

\begin{classdesc}{AttributesNSImpl}{attrs, qnames}
  Namespace-aware variant of \class{AttributesImpl}, which will be
  passed to \method{startElementNS()}.  It is derived from
  \class{AttributesImpl}, but understands attribute names as
  two-tuples of \var{namespaceURI} and \var{localname}.  In addition,
  it provides a number of methods expecting qualified names as they
  appear in the original document.  This class implements the
  \ulink{\class{AttributesNS} interface}{attributes-ns-objects.html}
  (see section~\ref{attributes-ns-objects}).
\end{classdesc}


\subsection{XMLReader Objects \label{xmlreader-objects}}

The \class{XMLReader} interface supports the following methods:

\begin{methoddesc}[XMLReader]{parse}{source}
  Process an input source, producing SAX events.  The \var{source}
  object can be a system identifier (a string identifying the
  input source -- typically a file name or a URL), a file-like
  object, or an \class{InputSource} object.  When \method{parse()}
  returns, the input is completely processed, and the parser object
  can be discarded or reset.  As a limitation, the current implementation
  only accepts byte streams; processing of character streams is for
  further study.
\end{methoddesc}
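
The following sketch illustrates a call to \method{parse()} with a
file-like object that delivers bytes.  The handler class
\class{ElementCounter} and the sample document are hypothetical and
serve only to make the SAX events visible; the default driver returned
by \function{xml.sax.make_parser()} is assumed.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class ElementCounter(ContentHandler):
    """Hypothetical handler that counts start-element events."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


parser = xml.sax.make_parser()
handler = ElementCounter()
parser.setContentHandler(handler)

# A byte stream wrapped in a file-like object is passed as the source.
# parse() returns only after the whole document has been processed.
parser.parse(io.BytesIO(b"<root><item/><item/></root>"))
element_count = handler.count
print(element_count)
```

A system identifier (for example a file name) or an
\class{InputSource} object could be passed in place of the file-like
object.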

\begin{methoddesc}[XMLReader]{getContentHandler}{}
  Return the current \class{ContentHandler}.
\end{methoddesc}

\begin{methoddesc}[XMLReader]{setContentHandler}{handler}
  Set the current \class{ContentHandler}.  If no
  \class{ContentHandler} is set, content events will be discarded.
\end{methoddesc}

\begin{methoddesc}[XMLReader]{getDTDHandler}{}
  Return the current \class{DTDHandler}.
\end{methoddesc}

\begin{methoddesc}[XMLReader]{setDTDHandler}{handler}
  Set the current \class{DTDHandler}.  If no \class{DTDHandler} is
  set, DTD events will be discarded.
\end{methoddesc}

\begin{methoddesc}[XMLReader]{getEntityResolver}{}
  Return the current \class{EntityResolver}.
\end{methoddesc}

\begin{methoddesc}[XMLReader]{setEntityResolver}{handler}
  Set the current \class{EntityResolver}.  If no
  \class{EntityResolver} is set, attempts to resolve an external
  entity will result in opening the system identifier for the entity,
  and fail if it is not available.
\end{methoddesc}

\begin{methoddesc}[XMLReader]{getErrorHandler}{}
  Return the current \class{ErrorHandler}.
\end{methoddesc}

\begin{methoddesc}[XMLReader]{setErrorHandler}{handler}
  Set the current \class{ErrorHandler}.  If no \class{ErrorHandler} is
  set, errors will be raised as exceptions, and warnings will be printed.
\end{methoddesc}
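
As an illustration of \method{setErrorHandler()}, the sketch below
installs a hypothetical \class{CollectingErrorHandler} that records the
message of a fatal error before re-raising it.  The truncated sample
document is chosen only to provoke a well-formedness error, and the
default driver is assumed.

```python
import io
import xml.sax
from xml.sax import SAXParseException
from xml.sax.handler import ErrorHandler


class CollectingErrorHandler(ErrorHandler):
    """Hypothetical handler that records fatal errors, then re-raises."""

    def __init__(self):
        self.messages = []

    def fatalError(self, exception):
        self.messages.append(exception.getMessage())
        raise exception


parser = xml.sax.make_parser()
error_handler = CollectingErrorHandler()
parser.setErrorHandler(error_handler)

try:
    # The root element is never closed, so the document is not
    # well-formed and the parser reports a fatal error.
    parser.parse(io.BytesIO(b"<open>"))
    parsed_ok = True
except SAXParseException:
    parsed_ok = False

print(parsed_ok, len(error_handler.messages))
```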

\begin{methoddesc}[XMLReader]{setLocale}{locale}
  Allow an application to set the locale for errors and warnings.

  SAX parsers are not required to provide localization for errors and
  warnings; if they cannot support the requested locale, however, they
  must raise a SAX exception.  Applications may request a locale change
  in the middle of a parse.
\end{methoddesc}

\begin{methoddesc}[XMLReader]{getFeature}{featurename}
  Return the current setting for feature \var{featurename}.  If the
  feature is not recognized, \exception{SAXNotRecognizedException} is
  raised.  The well-known feature names are listed in the module
  \module{xml.sax.handler}.
\end{methoddesc}

\begin{methoddesc}[XMLReader]{setFeature}{featurename, value}
  Set the \var{featurename} to \var{value}.  If the feature is not
  recognized, \exception{SAXNotRecognizedException} is raised.  If the
  feature or its setting is not supported by the parser,
  \exception{SAXNotSupportedException} is raised.
\end{methoddesc}
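
A short sketch of the feature methods follows.  It uses the
\code{feature_namespaces} constant from \module{xml.sax.handler} and
assumes the default driver supports that feature; the URI used to
provoke \exception{SAXNotRecognizedException} is made up for the
example.

```python
import xml.sax
from xml.sax import SAXNotRecognizedException
from xml.sax.handler import feature_namespaces

parser = xml.sax.make_parser()

# Turn on namespace processing, then read the setting back.
parser.setFeature(feature_namespaces, True)
namespaces_on = parser.getFeature(feature_namespaces)

# An unknown feature name (hypothetical URI) is rejected.
try:
    parser.getFeature("http://example.org/no-such-feature")
    unknown_recognized = True
except SAXNotRecognizedException:
    unknown_recognized = False

print(namespaces_on, unknown_recognized)
```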

\begin{methoddesc}[XMLReader]{getProperty}{propertyname}
  Return the current setting for property \var{propertyname}.  If the
  property is not recognized, a \exception{SAXNotRecognizedException}
  is raised.  The well-known property names are listed in the module
  \module{xml.sax.handler}.
\end{methoddesc}

\begin{methoddesc}[XMLReader]{setProperty}{propertyname, value}
  Set the \var{propertyname} to \var{value}.  If the property is not
  recognized, \exception{SAXNotRecognizedException} is raised.  If the
  property or its setting is not supported by the parser,
  \exception{SAXNotSupportedException} is raised.
\end{methoddesc}


\subsection{IncrementalParser Objects
            \label{incremental-parser-objects}}

Instances of \class{IncrementalParser} offer the following additional
methods:

\begin{methoddesc}[IncrementalParser]{feed}{data}
  Process a chunk of \var{data}.
\end{methoddesc}

\begin{methoddesc}[IncrementalParser]{close}{}
  Assume the end of the document.  That will check well-formedness
  conditions that can be checked only at the end, invoke handlers, and
  may clean up resources allocated during parsing.
\end{methoddesc}

\begin{methoddesc}[IncrementalParser]{reset}{}
  This method is called after \method{close()} has been called to
  reset the parser so that it is ready to parse new documents.  The
  results of calling \method{parse()} or \method{feed()} after
  \method{close()} without calling \method{reset()} are undefined.
\end{methoddesc}
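
The \method{feed()}, \method{close()} and \method{reset()} cycle can be
sketched as follows.  The example assumes that the parser returned by
\function{xml.sax.make_parser()} implements
\class{IncrementalParser}, which holds for the bundled Expat driver;
\class{NameRecorder} is a hypothetical handler.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class NameRecorder(ContentHandler):
    """Hypothetical handler that records element names in order."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


parser = xml.sax.make_parser()
handler = NameRecorder()
parser.setContentHandler(handler)

# Feed the first document in arbitrary chunks; chunk boundaries need
# not coincide with tag boundaries.
for chunk in (b"<doc><a", b"/><b/>", b"</doc>"):
    parser.feed(chunk)
parser.close()

# reset() makes the parser ready for a new document.
parser.reset()
parser.feed(b"<second/>")
parser.close()

names = handler.names
print(names)
```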


\subsection{Locator Objects \label{locator-objects}}

Instances of \class{Locator} provide these methods:

\begin{methoddesc}[Locator]{getColumnNumber}{}
  Return the column number where the current event ends.
\end{methoddesc}

\begin{methoddesc}[Locator]{getLineNumber}{}
  Return the line number where the current event ends.
\end{methoddesc}

\begin{methoddesc}[Locator]{getPublicId}{}
  Return the public identifier for the current event.
\end{methoddesc}

\begin{methoddesc}[Locator]{getSystemId}{}
  Return the system identifier for the current event.
\end{methoddesc}
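
The sketch below shows one way a handler might use a locator.  It
assumes that the driver passes a \class{Locator} to the handler's
\method{setDocumentLocator()} method before parsing starts, as the
bundled Expat driver does; \class{LineTracker} is a hypothetical
handler, and the locator is queried only from within a handler
callback.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class LineTracker(ContentHandler):
    """Hypothetical handler that records the line of each start tag."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.locator = None
        self.lines = {}

    def setDocumentLocator(self, locator):
        # Keep a reference; the locator is only valid during callbacks.
        self.locator = locator

    def startElement(self, name, attrs):
        self.lines[name] = self.locator.getLineNumber()


handler = LineTracker()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(b"<doc>\n  <child/>\n</doc>"))

lines = handler.lines
print(lines)
```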


\subsection{InputSource Objects \label{input-source-objects}}

\begin{methoddesc}[InputSource]{setPublicId}{id}
  Set the public identifier of this \class{InputSource}.
\end{methoddesc}

\begin{methoddesc}[InputSource]{getPublicId}{}
  Return the public identifier of this \class{InputSource}.
\end{methoddesc}

\begin{methoddesc}[InputSource]{setSystemId}{id}
  Set the system identifier of this \class{InputSource}.
\end{methoddesc}

\begin{methoddesc}[InputSource]{getSystemId}{}
  Return the system identifier of this \class{InputSource}.
\end{methoddesc}

\begin{methoddesc}[InputSource]{setEncoding}{encoding}
  Set the character encoding of this \class{InputSource}.

  The encoding must be a string acceptable for an XML encoding
  declaration (see section 4.3.3 of the XML recommendation).

  The encoding attribute of the \class{InputSource} is ignored if the
  \class{InputSource} also contains a character stream.
\end{methoddesc}

\begin{methoddesc}[InputSource]{getEncoding}{}
  Get the character encoding of this \class{InputSource}.
\end{methoddesc}

\begin{methoddesc}[InputSource]{setByteStream}{bytefile}
  Set the byte stream (a Python file-like object which does not
  perform byte-to-character conversion) for this input source.

  The SAX parser will ignore this if there is also a character stream
  specified, but it will use a byte stream in preference to opening a
  URI connection itself.

  If the application knows the character encoding of the byte stream,
  it should set it with the \method{setEncoding()} method.
\end{methoddesc}

\begin{methoddesc}[InputSource]{getByteStream}{}
  Get the byte stream for this input source.

  The \method{getEncoding()} method will return the character encoding
  for this byte stream, or \code{None} if unknown.
\end{methoddesc}

\begin{methoddesc}[InputSource]{setCharacterStream}{charfile}
  Set the character stream for this input source.  (The stream must be
  a Python 1.6 Unicode-wrapped file-like object that performs
  conversion to Unicode strings.)

  If there is a character stream specified, the SAX parser will ignore
  any byte stream and will not attempt to open a URI connection to the
  system identifier.
\end{methoddesc}

\begin{methoddesc}[InputSource]{getCharacterStream}{}
  Get the character stream for this input source.
\end{methoddesc}
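
The following sketch builds an \class{InputSource} around an in-memory
byte stream and declares its encoding with \method{setEncoding()}.  The
system identifier \code{memory:example}, the sample text and the
\class{TextCollector} handler are all made up for the example; the
default driver is assumed to honour the declared encoding.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import InputSource


class TextCollector(ContentHandler):
    """Hypothetical handler that gathers character data."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


# The system identifier is only a label here; because a byte stream
# is supplied, the parser does not try to open it.
source = InputSource("memory:example")
source.setByteStream(io.BytesIO("<p>caf\u00e9</p>".encode("iso-8859-1")))
source.setEncoding("iso-8859-1")

handler = TextCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse(source)

text = "".join(handler.chunks)
print(source.getSystemId(), source.getEncoding())
```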


\subsection{The \class{Attributes} Interface \label{attributes-objects}}

\class{Attributes} objects implement a portion of the mapping
protocol, including the methods \method{copy()}, \method{get()},
\method{has_key()}, \method{items()}, \method{keys()}, and
\method{values()}.  The following methods are also provided:

\begin{methoddesc}[Attributes]{getLength}{}
  Return the number of attributes.
\end{methoddesc}

\begin{methoddesc}[Attributes]{getNames}{}
  Return the names of the attributes.
\end{methoddesc}

\begin{methoddesc}[Attributes]{getType}{name}
  Return the type of the attribute \var{name}, which is normally
  \code{'CDATA'}.
\end{methoddesc}

\begin{methoddesc}[Attributes]{getValue}{name}
  Return the value of attribute \var{name}.
\end{methoddesc}
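
A brief sketch of these methods, using \class{AttributesImpl} built
directly from a dictionary.  Readers normally create such objects
themselves; the attribute names and values here are invented for
illustration.

```python
from xml.sax.xmlreader import AttributesImpl

# Map attribute names to attribute values, as a reader would.
attrs = AttributesImpl({"id": "n1", "lang": "en"})

length = attrs.getLength()
names = sorted(attrs.getNames())
value = attrs.getValue("id")
attr_type = attrs.getType("id")

# Part of the mapping protocol is available as well.
fallback = attrs.get("missing", "default")
print(length, names, value, attr_type, fallback)
```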

% getValueByQName, getNameByQName, getQNameByName, getQNames available
% here already, but documented only for derived class.


\subsection{The \class{AttributesNS} Interface \label{attributes-ns-objects}}

This interface is a subtype of the \ulink{\class{Attributes}
interface}{attributes-objects.html} (see
section~\ref{attributes-objects}).  All methods supported by that
interface are also available on \class{AttributesNS} objects.

The following methods are also available:

\begin{methoddesc}[AttributesNS]{getValueByQName}{name}
  Return the value for a qualified name.
\end{methoddesc}

\begin{methoddesc}[AttributesNS]{getNameByQName}{name}
  Return the \code{(\var{namespace}, \var{localname})} pair for a
  qualified \var{name}.
\end{methoddesc}

\begin{methoddesc}[AttributesNS]{getQNameByName}{name}
  Return the qualified name for a \code{(\var{namespace},
  \var{localname})} pair.
\end{methoddesc}

\begin{methoddesc}[AttributesNS]{getQNames}{}
  Return the qualified names of all attributes.
\end{methoddesc}
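
The qualified-name methods can be sketched with
\class{AttributesNSImpl}, which takes one mapping from
\code{(\var{namespace}, \var{localname})} pairs to values and a second
mapping from the same pairs to qualified names.  The namespace URI,
prefix and value below are chosen only for illustration.

```python
from xml.sax.xmlreader import AttributesNSImpl

XLINK = "http://www.w3.org/1999/xlink"

attrs = AttributesNSImpl(
    {(XLINK, "href"): "chapter2.xml"},   # (namespace, localname) -> value
    {(XLINK, "href"): "xlink:href"},     # (namespace, localname) -> qname
)

value = attrs.getValueByQName("xlink:href")
name = attrs.getNameByQName("xlink:href")
qname = attrs.getQNameByName((XLINK, "href"))
qnames = attrs.getQNames()
print(value, name, qname, qnames)
```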