\section{\module{xml.dom.minidom} ---
         Lightweight DOM implementation}

\declaremodule{standard}{xml.dom.minidom}
\modulesynopsis{Lightweight Document Object Model (DOM) implementation.}
\moduleauthor{Paul Prescod}{paul@prescod.net}
\sectionauthor{Paul Prescod}{paul@prescod.net}
\sectionauthor{Martin v. L\"owis}{martin@v.loewis.de}

\versionadded{2.0}

\module{xml.dom.minidom} is a lightweight implementation of the
Document Object Model interface.  It is intended to be
simpler than the full DOM and also significantly smaller.

DOM applications typically start by parsing some XML into a DOM.  With
\module{xml.dom.minidom}, this is done through the parse functions:

\begin{verbatim}
from xml.dom.minidom import parse, parseString

dom1 = parse('c:\\temp\\mydata.xml') # parse an XML file by name

datasource = open('c:\\temp\\mydata.xml')
dom2 = parse(datasource)   # parse an open file

dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')
\end{verbatim}

The \function{parse()} function can take either a filename or an open
file object.

\begin{funcdesc}{parse}{filename_or_file\optional{, parser}}
  Return a \class{Document} from the given input.  \var{filename_or_file}
  may be either a file name or a file-like object.  \var{parser}, if
  given, must be a SAX2 parser object.  This function will change the
  document handler of the parser and activate namespace support; other
  parser configuration (like setting an entity resolver) must have been
  done in advance.
\end{funcdesc}
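
As a minimal sketch of the file-like case, assuming a current Python~3
interpreter, the following passes an in-memory binary stream to
\function{parse()} in place of an open file.  The sample XML and the
variable names are illustrative assumptions only.

```python
import io
from xml.dom.minidom import parse

# An in-memory binary stream stands in for an open file (illustrative data).
stream = io.BytesIO(b"<myxml>Some data<empty/> some more data</myxml>")

doc = parse(stream)            # parse() accepts any file-like object
root_tag = doc.documentElement.tagName
print(root_tag)
```

The same call should work with an object returned by \function{open()},
since only the stream's \method{read()} method is used here.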

If you have XML in a string, you can use the
\function{parseString()} function instead:

\begin{funcdesc}{parseString}{string\optional{, parser}}
  Return a \class{Document} that represents the \var{string}.  This
  function creates a \class{StringIO} object for the string and passes
  that on to \function{parse()}.
\end{funcdesc}
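
The following sketch, written for Python~3, shows one way to inspect
what \function{parseString()} returns for mixed content.  The names
\code{kinds} and \code{texts} are hypothetical helpers introduced only
for this illustration.

```python
from xml.dom.minidom import parseString

doc = parseString("<myxml>Some data<empty/> some more data</myxml>")
root = doc.documentElement

# Mixed content becomes a sequence of text and element nodes.
kinds = [child.nodeType for child in root.childNodes]
texts = [child.data for child in root.childNodes
         if child.nodeType == child.TEXT_NODE]
print(kinds, texts)
```

The text on either side of the empty element ends up in two separate
text nodes, so code that collects text usually has to walk all children.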

Both functions return a \class{Document} object representing the
content of the document.

What the \function{parse()} and \function{parseString()} functions do
is connect an XML parser with a ``DOM builder'' that can accept parse
events from any SAX parser and convert them into a DOM tree.  The names
of the functions are perhaps misleading, but they are easy to grasp when
learning the interfaces.  The parsing of the document will be
completed before these functions return; it's simply that these
functions do not provide a parser implementation themselves.

You can also create a \class{Document} by calling a method on a ``DOM
Implementation'' object.  You can get this object by calling
the \function{getDOMImplementation()} function in either the
\refmodule{xml.dom} package or the \module{xml.dom.minidom} module.
Using the implementation from the \module{xml.dom.minidom} module will
always return a \class{Document} instance from the minidom
implementation, while the version from \refmodule{xml.dom} may provide
an alternate implementation (this is likely if you have the
\ulink{PyXML package}{http://pyxml.sourceforge.net/} installed).  Once
you have a \class{Document}, you can add child nodes to it to populate
the DOM:

\begin{verbatim}
from xml.dom.minidom import getDOMImplementation

impl = getDOMImplementation()

newdoc = impl.createDocument(None, "some_tag", None)
top_element = newdoc.documentElement
text = newdoc.createTextNode('Some textual content.')
top_element.appendChild(text)
\end{verbatim}
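
A slightly larger sketch of the same approach, assuming Python~3, adds
a child element with an attribute and serializes the result.  The
element name \code{item} and the attribute \code{id} are illustrative
assumptions, not part of any real schema.

```python
from xml.dom.minidom import getDOMImplementation

impl = getDOMImplementation()
newdoc = impl.createDocument(None, "some_tag", None)
top_element = newdoc.documentElement

# Nodes are created through factory methods on the Document and then
# attached explicitly; "item" and "id" are illustrative names.
item = newdoc.createElement("item")
item.setAttribute("id", "1")
item.appendChild(newdoc.createTextNode("Some textual content."))
top_element.appendChild(item)

serialized = top_element.toxml()
print(serialized)
```

Creating nodes through the \class{Document} rather than instantiating
node classes directly keeps each node associated with its owner document.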

Once you have a DOM document object, you can access the parts of your
XML document through its properties and methods.  These properties are
defined in the DOM specification.  The main property of the document
object is the \member{documentElement} property.  It gives you the
main element in the XML document: the one that holds all others.  Here
is an example program:

\begin{verbatim}
dom3 = parseString("<myxml>Some data</myxml>")
assert dom3.documentElement.tagName == "myxml"
\end{verbatim}

When you are finished with a DOM, you should clean it up.  This is
necessary because some versions of Python do not support garbage
collection of objects that refer to each other in a cycle.  Until this
restriction is removed from all versions of Python, it is safest to
write your code as if cycles would not be cleaned up.

The way to clean up a DOM is to call its \method{unlink()} method:

\begin{verbatim}
dom1.unlink()
dom2.unlink()
dom3.unlink()
\end{verbatim}

\method{unlink()} is a \module{xml.dom.minidom}-specific extension to
the DOM API.  After calling \method{unlink()} on a node, the node and
its descendants are essentially useless.
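
One way to make sure the cleanup always happens is a
\keyword{try}/\keyword{finally} block.  The sketch below assumes
Python~3; the names \code{tag} and \code{remaining} are hypothetical
and exist only to show the state before and after the call.

```python
from xml.dom.minidom import parseString

doc = parseString("<myxml>Some data</myxml>")
try:
    # Copy out what is needed while the tree is still intact.
    tag = doc.documentElement.tagName
finally:
    doc.unlink()               # break the internal reference cycles

# After unlink() the document no longer holds its children.
remaining = len(doc.childNodes)
print(tag, remaining)
```

Any values needed later should be copied out of the tree before
\method{unlink()} is called, as done with \code{tag} here.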

\begin{seealso}
  \seetitle[http://www.w3.org/TR/REC-DOM-Level-1/]{Document Object
            Model (DOM) Level 1 Specification}
           {The W3C recommendation for the
            DOM supported by \module{xml.dom.minidom}.}
\end{seealso}


\subsection{DOM Objects \label{dom-objects}}

The definition of the DOM API for Python is given as part of the
\refmodule{xml.dom} module documentation.  This section lists the
differences between the API and \refmodule{xml.dom.minidom}.


\begin{methoddesc}{unlink}{}
Break internal references within the DOM so that it will be garbage
collected on versions of Python without cyclic GC.  Even when cyclic
GC is available, using this can make large amounts of memory available
sooner, so calling this on DOM objects as soon as they are no longer
needed is good practice.  This only needs to be called on the
\class{Document} object, but may be called on child nodes to discard
children of that node.
\end{methoddesc}

\begin{methoddesc}{writexml}{writer}
Write XML to the writer object.  The writer should have a
\method{write()} method which matches that of the file object
interface.

\versionchanged[To support pretty output, new keyword parameters
\var{indent}, \var{addindent}, and \var{newl} have been added]{2.1}

\versionchanged[For the \class{Document} node, an additional keyword
argument \var{encoding} can be used to specify the encoding field of the
XML header]{2.3}

\end{methoddesc}
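
As a sketch of the pretty-output parameters, assuming Python~3, the
following writes an element to an in-memory text buffer.  Using
\class{io.StringIO} as the writer is an assumption made for the
example; any object with a suitable \method{write()} method should do.

```python
import io
from xml.dom.minidom import parseString

doc = parseString("<root><a/><b/></root>")

# Any object with a write() method will do; StringIO is used here.
writer = io.StringIO()
doc.documentElement.writexml(writer, indent="", addindent="  ", newl="\n")

output = writer.getvalue()
print(output)
```

Here \var{indent} is the prefix for the current node, \var{addindent}
is the extra indentation applied to each nested level, and \var{newl}
terminates each line.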

\begin{methoddesc}{toxml}{\optional{encoding}}
Return the XML that the DOM represents as a string.

\versionchanged[The \var{encoding} argument was introduced]{2.3}

With no argument, the XML header does not specify an encoding, and the
result is a Unicode string if the default encoding cannot represent all
characters in the document.  Encoding this string in an encoding other
than UTF-8 is likely incorrect, since UTF-8 is the default encoding of
XML.

With an explicit \var{encoding} argument, the result is a byte string
in the specified encoding.  It is recommended that this argument
always be specified.  To avoid \exception{UnicodeError} exceptions in
case of unrepresentable text data, the \var{encoding} argument should be
specified as \code{"utf-8"}.

\end{methoddesc}
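
The difference between the two call forms can be sketched as follows.
This assumes Python~3, where the result without an argument is a text
string and the result with an encoding is a byte string; the sample
document is illustrative.

```python
from xml.dom.minidom import parseString

doc = parseString("<greeting>Gr\u00fc\u00dfe</greeting>")

as_text = doc.toxml()             # no encoding named in the XML header
as_bytes = doc.toxml("utf-8")     # header declares encoding="utf-8"

print(type(as_text).__name__, type(as_bytes).__name__)
print(as_bytes.decode("utf-8"))
```

Passing the encoding explicitly keeps the XML header and the actual
byte encoding of the output in agreement.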

\begin{methoddesc}{toprettyxml}{\optional{indent\optional{, newl\optional{, encoding}}}}

Return a pretty-printed version of the document.  \var{indent} specifies
the indentation string and defaults to a tab character; \var{newl} specifies
the string emitted at the end of each line and defaults to \code{\e n}.

\versionadded{2.1}

\versionchanged[The \var{encoding} argument was introduced; see
\method{toxml()}]{2.3}

\end{methoddesc}
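
A short sketch, assuming Python~3, of overriding the default
indentation; the two-space indent is an arbitrary choice for the
example.

```python
from xml.dom.minidom import parseString

doc = parseString("<root><a/><b/></root>")

# Two spaces per nesting level instead of the default tab.
pretty = doc.documentElement.toprettyxml(indent="  ", newl="\n")
print(pretty)
```

Calling the method on the document element rather than on the
\class{Document} leaves out the XML header.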

The following standard DOM methods have special considerations with
\refmodule{xml.dom.minidom}:

\begin{methoddesc}{cloneNode}{deep}
Although this method was present in the version of
\refmodule{xml.dom.minidom} packaged with Python 2.0, it was seriously
broken.  This has been corrected for subsequent releases.
\end{methoddesc}
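
For reference, the effect of the \var{deep} argument can be sketched as
below, assuming a current Python~3 release; the sample document and the
names \code{deep_copy} and \code{shallow_copy} are illustrative.

```python
from xml.dom.minidom import parseString

doc = parseString('<root><item id="1">text</item></root>')
item = doc.documentElement.firstChild

deep_copy = item.cloneNode(True)      # copies attributes and children
shallow_copy = item.cloneNode(False)  # copies attributes only

print(deep_copy.toxml())      # <item id="1">text</item>
print(shallow_copy.toxml())   # <item id="1"/>
```

Neither copy is attached to the tree; it has to be inserted with a
method such as \method{appendChild()} before it appears in the document.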


\subsection{DOM Example \label{dom-example}}

The following program is a fairly realistic, if simple, example.  In
this particular case, we do not take much advantage of the flexibility
of the DOM.

\verbatiminput{minidom-example.py}


\subsection{minidom and the DOM standard \label{minidom-and-dom}}

The \refmodule{xml.dom.minidom} module is essentially a DOM
1.0-compatible DOM with some DOM 2 features (primarily namespace
features).

Usage of the DOM interface in Python is straightforward.  The
following mapping rules apply:

\begin{itemize}
\item Interfaces are accessed through instance objects.  Applications
      should not instantiate the classes themselves; they should use
      the creator functions available on the \class{Document} object.
      Derived interfaces support all operations (and attributes) from
      the base interfaces, plus any new operations.

\item Operations are used as methods.  Since the DOM uses only
      \keyword{in} parameters, the arguments are passed in normal
      order (from left to right).  There are no optional
      arguments.  \keyword{void} operations return \code{None}.

\item IDL attributes map to instance attributes.  For compatibility
      with the OMG IDL language mapping for Python, an attribute
      \code{foo} can also be accessed through accessor methods
      \method{_get_foo()} and \method{_set_foo()}.  \keyword{readonly}
      attributes must not be changed; this is not enforced at
      runtime.

\item The types \code{short int}, \code{unsigned int}, \code{unsigned
      long long}, and \code{boolean} all map to Python integer
      objects.

\item The type \code{DOMString} maps to Python strings.
      \refmodule{xml.dom.minidom} supports either byte or Unicode
      strings, but will normally produce Unicode strings.  Values
      of type \code{DOMString} may also be \code{None} where allowed
      to have the IDL \code{null} value by the DOM specification from
      the W3C.

\item \keyword{const} declarations map to variables in their
      respective scope
      (e.g. \code{xml.dom.minidom.Node.PROCESSING_INSTRUCTION_NODE});
      they must not be changed.

\item \code{DOMException} is currently not supported in
      \refmodule{xml.dom.minidom}.  Instead,
      \refmodule{xml.dom.minidom} uses standard Python exceptions such
      as \exception{TypeError} and \exception{AttributeError}.

\item \class{NodeList} objects are implemented using Python's built-in
      list type.  Starting with Python 2.2, these objects provide the
      interface defined in the DOM specification, but with earlier
      versions of Python they do not support the official API.  They
      are, however, much more ``Pythonic'' than the interface defined
      in the W3C recommendations.
\end{itemize}
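
Several of the mapping rules above can be seen together in a short
sketch.  It assumes Python~3, and the sample document and variable
names are illustrative only.

```python
from xml.dom.minidom import Node, parseString

doc = parseString('<list><item id="a"/><item id="b"/></list>')
items = doc.getElementsByTagName("item")

# IDL attributes are plain instance attributes; constants live on Node.
first = items.item(0)                       # DOM-style access
is_element = first.nodeType == Node.ELEMENT_NODE

# NodeList also behaves like a Python sequence.
ids = [node.getAttribute("id") for node in items]
count_matches = items.length == len(items)
print(is_element, ids, count_matches)
```

The DOM-style \method{item()} and \member{length} and the Python
sequence operations can be mixed freely on the same \class{NodeList}.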


The following interfaces have no implementation in
\refmodule{xml.dom.minidom}:

\begin{itemize}
\item \class{DOMTimeStamp}

\item \class{DocumentType} (added in Python 2.1)

\item \class{DOMImplementation} (added in Python 2.1)

\item \class{CharacterData}

\item \class{CDATASection}

\item \class{Notation}

\item \class{Entity}

\item \class{EntityReference}

\item \class{DocumentFragment}
\end{itemize}

Most of these reflect information in the XML document that is not of
general utility to DOM users.