\section{\module{xml.dom.minidom} ---
Lightweight DOM implementation}

\declaremodule{standard}{xml.dom.minidom}
\modulesynopsis{Lightweight Document Object Model (DOM) implementation.}
\moduleauthor{Paul Prescod}{paul@prescod.net}
\sectionauthor{Paul Prescod}{paul@prescod.net}
\sectionauthor{Martin v. L\"owis}{martin@v.loewis.de}

\versionadded{2.0}

\module{xml.dom.minidom} is a lightweight implementation of the
Document Object Model interface.  It is intended to be
simpler than the full DOM and also significantly smaller.

DOM applications typically start by parsing some XML into a DOM.  With
\module{xml.dom.minidom}, this is done through the parse functions:

\begin{verbatim}
from xml.dom.minidom import parse, parseString

dom1 = parse('c:\\temp\\mydata.xml') # parse an XML file by name

datasource = open('c:\\temp\\mydata.xml')
dom2 = parse(datasource)   # parse an open file

dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')
\end{verbatim}

The \function{parse()} function can take either a filename or an open
file object.

\begin{funcdesc}{parse}{filename_or_file\optional{, parser}}
Return a \class{Document} from the given input. \var{filename_or_file}
may be either a file name or a file-like object. \var{parser}, if
given, must be a SAX2 parser object. This function will change the
document handler of the parser and activate namespace support; other
parser configuration (like setting an entity resolver) must have been
done in advance.
\end{funcdesc}

If you have XML in a string, you can use the
\function{parseString()} function instead:

\begin{funcdesc}{parseString}{string\optional{, parser}}
Return a \class{Document} that represents the \var{string}. This
function creates a \class{StringIO} object for the string and passes
that on to \function{parse()}.
\end{funcdesc}

Both functions return a \class{Document} object representing the
content of the document.
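
As a minimal, self-contained sketch of the two entry points, the
following parses the same markup once through a file-like object and
once from a string.  The sample markup and the use of
\class{io.BytesIO} as a stand-in for an open file are assumptions made
for illustration, and the code assumes a current Python 3 interpreter.

```python
import io
from xml.dom.minidom import parse, parseString

# Illustrative XML payload; any well-formed document would do.
xml_text = "<myxml>Some data<empty/> some more data</myxml>"

# parse() accepts a file-like object; BytesIO stands in for an open file.
dom_from_file = parse(io.BytesIO(xml_text.encode("utf-8")))

# parseString() takes the XML text directly.
dom_from_string = parseString(xml_text)

print(dom_from_file.documentElement.tagName)
print(dom_from_string.documentElement.tagName)
```

Using an in-memory buffer keeps the sketch independent of any
particular file system path.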

What the \function{parse()} and \function{parseString()} functions do
is connect an XML parser with a ``DOM builder'' that can accept parse
events from any SAX parser and convert them into a DOM tree.  The names
of the functions are perhaps misleading, but are easy to grasp when
learning the interfaces.  The parsing of the document will be
completed before these functions return; it's simply that these
functions do not provide a parser implementation themselves.
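
The division of labour between the parser and the DOM builder can be
sketched by supplying the SAX parser explicitly.  This assumes a
current Python 3 interpreter and uses \function{xml.sax.make_parser()}
to obtain the parser; the sample markup is illustrative only.

```python
import xml.sax
from xml.dom.minidom import parseString

# Create a SAX2 parser explicitly. Any extra configuration (for example
# an entity resolver) would be applied here, before parsing starts.
sax_parser = xml.sax.make_parser()

# minidom attaches its DOM builder to the supplied parser and runs the
# parse to completion before returning the Document.
doc = parseString("<myxml>Some data</myxml>", sax_parser)

print(doc.documentElement.tagName)
```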

You can also create a \class{Document} by calling a method on a ``DOM
Implementation'' object.  You can get this object by calling
the \function{getDOMImplementation()} function in either the
\refmodule{xml.dom} package or the \module{xml.dom.minidom} module.
Using the implementation from the \module{xml.dom.minidom} module will
always return a \class{Document} instance from the minidom
implementation, while the version from \refmodule{xml.dom} may provide
an alternate implementation (this is likely if you have the
\ulink{PyXML package}{http://pyxml.sourceforge.net/} installed).  Once
you have a \class{Document}, you can add child nodes to it to populate
the DOM:

\begin{verbatim}
from xml.dom.minidom import getDOMImplementation

impl = getDOMImplementation()

newdoc = impl.createDocument(None, "some_tag", None)
top_element = newdoc.documentElement
text = newdoc.createTextNode('Some textual content.')
top_element.appendChild(text)
\end{verbatim}
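
A slightly larger sketch of populating a new document might also
create a nested element with an attribute and serialize the result.
The element name \code{child} and the attribute \code{name} are
hypothetical, chosen only for illustration; a current Python 3
interpreter is assumed.

```python
from xml.dom.minidom import getDOMImplementation

impl = getDOMImplementation()
newdoc = impl.createDocument(None, "some_tag", None)
top_element = newdoc.documentElement

# Hypothetical child element and attribute, for illustration only.
child = newdoc.createElement("child")
child.setAttribute("name", "value")
child.appendChild(newdoc.createTextNode("Some textual content."))
top_element.appendChild(child)

# Serialize just the top element to inspect what was built.
print(top_element.toxml())
```

Nodes are always created through the \class{Document} and only then
attached to the tree with \method{appendChild()}.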

Once you have a DOM document object, you can access the parts of your
XML document through its properties and methods.  These properties are
defined in the DOM specification.  The main property of the document
object is the \member{documentElement} property.  It gives you the
main element in the XML document: the one that holds all others.  Here
is an example program:

\begin{verbatim}
dom3 = parseString("<myxml>Some data</myxml>")
assert dom3.documentElement.tagName == "myxml"
\end{verbatim}

When you are finished with a DOM, you should clean it up.  This is
necessary because some versions of Python do not support garbage
collection of objects that refer to each other in a cycle.  Until this
restriction is removed from all versions of Python, it is safest to
write your code as if cycles would not be cleaned up.

The way to clean up a DOM is to call its \method{unlink()} method:

\begin{verbatim}
dom1.unlink()
dom2.unlink()
dom3.unlink()
\end{verbatim}

\method{unlink()} is a \module{xml.dom.minidom}-specific extension to
the DOM API.  After calling \method{unlink()} on a node, the node and
its descendants are essentially useless.
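
One way to make sure the cleanup always happens is to pair the parse
with a \keyword{try}/\keyword{finally} block.  This is only a sketch
under the assumption of a current Python 3 interpreter; the variable
names are illustrative.

```python
from xml.dom.minidom import parseString

dom = parseString("<myxml>Some data</myxml>")
try:
    # Copy out what is needed while the tree is still intact.
    tag_name = dom.documentElement.tagName
finally:
    # Break the internal references so the tree can be reclaimed promptly.
    dom.unlink()

print(tag_name)
# After unlink() the document no longer holds any child nodes.
print(dom.hasChildNodes())
```

Any values needed later are copied out of the tree before
\method{unlink()} is called, since the nodes are not usable afterwards.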

\begin{seealso}
\seetitle[http://www.w3.org/TR/REC-DOM-Level-1/]{Document Object
Model (DOM) Level 1 Specification}
{The W3C recommendation for the
DOM supported by \module{xml.dom.minidom}.}
\end{seealso}


\subsection{DOM Objects \label{dom-objects}}

The definition of the DOM API for Python is given as part of the
\refmodule{xml.dom} module documentation.  This section lists the
differences between the API and \refmodule{xml.dom.minidom}.


\begin{methoddesc}[Node]{unlink}{}
Break internal references within the DOM so that it will be garbage
collected on versions of Python without cyclic GC.  Even when cyclic
GC is available, using this can make large amounts of memory available
sooner, so calling this on DOM objects as soon as they are no longer
needed is good practice.  This only needs to be called on the
\class{Document} object, but may be called on child nodes to discard
children of that node.
\end{methoddesc}

\begin{methoddesc}[Node]{writexml}{writer\optional{,indent=""\optional{,addindent=""\optional{,newl=""}}}}
Write XML to the writer object.  The writer should have a
\method{write()} method which matches that of the file object
interface.  The \var{indent} parameter is the indentation of the current
node.  The \var{addindent} parameter is the incremental indentation to use
for subnodes of the current one.  The \var{newl} parameter specifies the
string to use to terminate lines.

\versionchanged[The optional keyword parameters
\var{indent}, \var{addindent}, and \var{newl} were added to support
pretty-printed output]{2.1}

\versionchanged[For the \class{Document} node, an additional keyword
argument \var{encoding} can be used to specify the encoding field of the XML
header]{2.3}
\end{methoddesc}
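
As a rough sketch of how the three layout parameters interact, the
following writes a small document into an in-memory buffer.  The use
of \class{io.StringIO} as the writer and the sample markup are
assumptions for illustration; a current Python 3 interpreter is
assumed, and the exact layout of the output may differ between Python
versions.

```python
import io
from xml.dom.minidom import parseString

doc = parseString("<root><item>x</item></root>")

# Any object with a file-like write() method can act as the writer.
buffer = io.StringIO()

# Start at no indentation, add two spaces per nesting level, and
# terminate each line with a newline character.
doc.writexml(buffer, indent="", addindent="  ", newl="\n")

output = buffer.getvalue()
print(output)
```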

\begin{methoddesc}[Node]{toxml}{\optional{encoding}}
Return the XML that the DOM represents as a string.

With no argument, the XML header does not specify an encoding, and the
result is a Unicode string if the default encoding cannot represent all
characters in the document. Encoding this string in an encoding other
than UTF-8 is likely incorrect, since UTF-8 is the default encoding of
XML.

With an explicit \var{encoding} argument, the result is a byte string
in the specified encoding. It is recommended that this argument
always be specified. To avoid \exception{UnicodeError} exceptions in case of
unrepresentable text data, the encoding argument should be specified
as "utf-8".

\versionchanged[the \var{encoding} argument was introduced]{2.3}
\end{methoddesc}
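
The effect of the \var{encoding} argument can be sketched as follows.
This assumes a current Python 3 interpreter, where the result without
an encoding is a \class{str} and the result with an encoding is a
\class{bytes} object; the sample markup is illustrative only.

```python
from xml.dom.minidom import parseString

# The text content contains a non-ASCII character (e with acute accent).
doc = parseString("<greeting>caf\u00e9</greeting>")

# No encoding: the XML declaration carries no encoding field.
text = doc.toxml()

# Explicit encoding: the result is encoded, and the declaration names
# the encoding that was used.
data = doc.toxml(encoding="utf-8")

print(text)
print(data)
```

Passing \code{"utf-8"} keeps the declaration and the actual bytes in
agreement, which is the situation the recommendation above aims for.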

\begin{methoddesc}[Node]{toprettyxml}{\optional{indent\optional{, newl\optional{, encoding}}}}
Return a pretty-printed version of the document. \var{indent} specifies
the indentation string and defaults to a tab character; \var{newl} specifies
the string emitted at the end of each line and defaults to \code{\e n}.

\versionadded{2.1}
\versionchanged[the \var{encoding} argument was introduced; see \method{toxml()}]{2.3}
\end{methoddesc}
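
A short sketch of pretty-printing with a non-default indentation
string follows.  The sample markup is an assumption for illustration,
a current Python 3 interpreter is assumed, and the exact placement of
text nodes in the output may vary between Python versions.

```python
from xml.dom.minidom import parseString

doc = parseString("<root><item>x</item></root>")

# Use two spaces per nesting level instead of the default tab.
pretty = doc.toprettyxml(indent="  ", newl="\n")

print(pretty)
```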

The following standard DOM methods have special considerations with
\refmodule{xml.dom.minidom}:

\begin{methoddesc}[Node]{cloneNode}{deep}
Although this method was present in the version of
\refmodule{xml.dom.minidom} packaged with Python 2.0, it was seriously
broken.  This has been corrected for subsequent releases.
\end{methoddesc}
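
For reference, the difference between a deep and a shallow clone can
be sketched as below.  The sample markup is illustrative, and the
behaviour shown is that of a current Python 3 interpreter rather than
of the Python 2.0 release mentioned above.

```python
from xml.dom.minidom import parseString

doc = parseString('<a id="1"><b/></a>')
original = doc.documentElement

# A true value for deep copies the element together with its subtree.
deep_copy = original.cloneNode(True)

# A false value copies the element and its attributes, but no children.
shallow_copy = original.cloneNode(False)

print(deep_copy.toxml())
print(shallow_copy.toxml())
```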


\subsection{DOM Example \label{dom-example}}

The following program is a fairly realistic example of a simple DOM
application. In this particular case, we do not take much advantage
of the flexibility of the DOM.

\verbatiminput{minidom-example.py}


\subsection{minidom and the DOM standard \label{minidom-and-dom}}

The \refmodule{xml.dom.minidom} module is essentially a DOM
1.0-compatible DOM with some DOM 2 features (primarily namespace
features).

Usage of the DOM interface in Python is straightforward.  The
following mapping rules apply:

\begin{itemize}
\item Interfaces are accessed through instance objects. Applications
should not instantiate the classes themselves; they should use
the creator functions available on the \class{Document} object.
Derived interfaces support all operations (and attributes) from
the base interfaces, plus any new operations.

\item Operations are used as methods. Since the DOM uses only
\keyword{in} parameters, the arguments are passed in normal
order (from left to right).   There are no optional
arguments. \keyword{void} operations return \code{None}.

\item IDL attributes map to instance attributes. For compatibility
with the OMG IDL language mapping for Python, an attribute
\code{foo} can also be accessed through accessor methods
\method{_get_foo()} and \method{_set_foo()}.  \keyword{readonly}
attributes must not be changed; this is not enforced at
runtime.

\item The types \code{short int}, \code{unsigned int}, \code{unsigned
long long}, and \code{boolean} all map to Python integer
objects.

\item The type \code{DOMString} maps to Python strings.
\refmodule{xml.dom.minidom} supports either byte or Unicode
strings, but will normally produce Unicode strings.  Values
of type \code{DOMString} may also be \code{None} where allowed
to have the IDL \code{null} value by the DOM specification from
the W3C.

\item \keyword{const} declarations map to variables in their
respective scope
(e.g. \code{xml.dom.minidom.Node.PROCESSING_INSTRUCTION_NODE});
they must not be changed.

\item \code{DOMException} is currently not supported in
\refmodule{xml.dom.minidom}.  Instead,
\refmodule{xml.dom.minidom} uses standard Python exceptions such
as \exception{TypeError} and \exception{AttributeError}.

\item \class{NodeList} objects are implemented using Python's built-in
list type.  Starting with Python 2.2, these objects provide the
interface defined in the DOM specification, but with earlier
versions of Python they do not support the official API.  They
are, however, much more ``Pythonic'' than the interface defined
in the W3C recommendations.
\end{itemize}
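
Several of the mapping rules above can be seen together in a short
sketch.  The sample markup and variable names are assumptions for
illustration, and the code assumes a current Python 3 interpreter.

```python
from xml.dom.minidom import Node, parseString

doc = parseString("<list><item>a</item><item>b</item></list>")
root = doc.documentElement

# IDL attributes appear as plain instance attributes, and const
# declarations as class-level names such as Node.ELEMENT_NODE.
print(root.tagName, root.nodeType == Node.ELEMENT_NODE)

# IDL operations are ordinary methods taking positional arguments.
items = root.getElementsByTagName("item")

# A NodeList offers the DOM interface (length, item()) and also
# behaves like a Python sequence.
print(items.length, len(items))
print(items.item(0) is items[0])
for item in items:
    print(item.firstChild.data)
```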


The following interfaces have no implementation in
\refmodule{xml.dom.minidom}:

\begin{itemize}
\item \class{DOMTimeStamp}

\item \class{DocumentType} (added in Python 2.1)

\item \class{DOMImplementation} (added in Python 2.1)

\item \class{CharacterData}

\item \class{CDATASection}

\item \class{Notation}

\item \class{Entity}

\item \class{EntityReference}

\item \class{DocumentFragment}
\end{itemize}

Most of these reflect information in the XML document that is not of
general utility to most DOM users.