\section{\module{xml.dom.minidom} ---
         Lightweight DOM implementation}

\declaremodule{standard}{xml.dom.minidom}
\modulesynopsis{Lightweight Document Object Model (DOM) implementation.}
\moduleauthor{Paul Prescod}{paul@prescod.net}
\sectionauthor{Paul Prescod}{paul@prescod.net}
\sectionauthor{Martin v. L\"owis}{loewis@informatik.hu-berlin.de}

\versionadded{2.0}

\module{xml.dom.minidom} is a lightweight implementation of the
Document Object Model interface. It is intended to be
simpler than the full DOM and also significantly smaller.

DOM applications typically start by parsing some XML into a DOM. With
\module{xml.dom.minidom}, this is done through the parse functions:

\begin{verbatim}
from xml.dom.minidom import parse, parseString

dom1 = parse('c:\\temp\\mydata.xml') # parse an XML file by name

datasource = open('c:\\temp\\mydata.xml')
dom2 = parse(datasource)   # parse an open file

dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')
\end{verbatim}

The \function{parse()} function can take either a filename or an open
file object.

\begin{funcdesc}{parse}{filename_or_file\optional{, parser}}
  Return a \class{Document} from the given input. \var{filename_or_file}
  may be either a file name or a file-like object. \var{parser}, if
  given, must be a SAX2 parser object. This function will change the
  document handler of the parser and activate namespace support; other
  parser configuration (like setting an entity resolver) must have been
  done in advance.
\end{funcdesc}
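As a minimal sketch of the optional \var{parser} argument, the
following passes an explicitly created SAX2 parser to
\function{parse()}. It assumes a current Python version in which
\function{xml.sax.make_parser()} returns a working parser; the
in-memory \class{io.BytesIO} source and its XML content are stand-ins
chosen only for illustration.

```python
import io
import xml.sax
from xml.dom.minidom import parse

# Create the SAX2 parser up front; any extra configuration (for example
# an entity resolver) would have to be applied here, before parse().
sax_parser = xml.sax.make_parser()

# A file-like object stands in for a real file (illustrative content).
source = io.BytesIO(b"<inventory><item/><item/></inventory>")

dom = parse(source, sax_parser)
root_name = dom.documentElement.tagName
item_count = len(dom.getElementsByTagName("item"))
dom.unlink()
```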

If you have XML in a string, you can use the
\function{parseString()} function instead:

\begin{funcdesc}{parseString}{string\optional{, parser}}
  Return a \class{Document} that represents the \var{string}. This
  function creates a \class{StringIO} object for the string and passes
  that on to \function{parse()}.
\end{funcdesc}

Both functions return a \class{Document} object representing the
content of the document.

You can also create a \class{Document} by instantiating the class
directly and then adding child nodes to populate the DOM:

\begin{verbatim}
from xml.dom.minidom import Document

newdoc = Document()
newel = newdoc.createElement("some_tag")
newdoc.appendChild(newel)
\end{verbatim}
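A slightly fuller sketch of the same idea adds an attribute and a text
node before attaching the element to the document. The attribute name
and the text content are illustrative assumptions, and the code is
written for a current Python version.

```python
from xml.dom.minidom import Document

doc = Document()

# Nodes are created through the factory methods on the document...
root = doc.createElement("some_tag")
root.setAttribute("id", "1")                    # illustrative attribute
root.appendChild(doc.createTextNode("hello"))   # illustrative text

# ...and only become part of the tree once they are appended.
doc.appendChild(root)

fragment = root.toxml()
```

Serializing the element rather than the whole document leaves out the
XML declaration, which makes the result easier to inspect.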

Once you have a DOM document object, you can access the parts of your
XML document through its properties and methods. These properties are
defined in the DOM specification. The main property of the document
object is the \member{documentElement} property. It gives you the
main element in the XML document: the one that holds all others. Here
is an example:

\begin{verbatim}
dom3 = parseString("<myxml>Some data</myxml>")
assert dom3.documentElement.tagName == "myxml"
\end{verbatim}
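Starting from \member{documentElement}, the rest of the tree can be
reached through the standard DOM properties. The sketch below, written
for a current Python version, walks the direct children of the root
element and records what kind of node each one is; the variable names
are illustrative.

```python
from xml.dom.minidom import parseString, Node

dom = parseString("<myxml>Some data<empty/> some more data</myxml>")
root = dom.documentElement

# Each direct child here is either a text node or an element node.
summary = []
for child in root.childNodes:
    if child.nodeType == Node.TEXT_NODE:
        summary.append(("text", child.data))
    elif child.nodeType == Node.ELEMENT_NODE:
        summary.append(("element", child.tagName))

dom.unlink()
```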

When you are finished with a DOM, you should clean it up. This is
necessary because some versions of Python do not support garbage
collection of objects that refer to each other in a cycle. Until this
restriction is removed from all versions of Python, it is safest to
write your code as if cycles would not be cleaned up.

The way to clean up a DOM is to call its \method{unlink()} method:

\begin{verbatim}
dom1.unlink()
dom2.unlink()
dom3.unlink()
\end{verbatim}

\method{unlink()} is an \module{xml.dom.minidom}-specific extension to
the DOM API. After calling \method{unlink()} on a node, the node and
its descendants are essentially useless.

\begin{seealso}
  \seetitle[http://www.w3.org/TR/REC-DOM-Level-1/]{Document Object
            Model (DOM) Level 1 Specification}
           {The W3C recommendation for the
            DOM supported by \module{xml.dom.minidom}.}
\end{seealso}


\subsection{DOM objects \label{dom-objects}}

The definition of the DOM API for Python is given as part of the
\refmodule{xml.dom} module documentation. This section lists the
differences between the API and \refmodule{xml.dom.minidom}.


\begin{methoddesc}{unlink}{}
Break internal references within the DOM so that it will be garbage
collected on versions of Python without cyclic GC. Even when cyclic
GC is available, using this can make large amounts of memory available
sooner, so calling this on DOM objects as soon as they are no longer
needed is good practice. This only needs to be called on the
\class{Document} object, but may be called on child nodes to discard
children of that node.
\end{methoddesc}
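One way to make sure \method{unlink()} is always called is to wrap the
work in a \keyword{try}/\keyword{finally} block and copy out any
needed values before the tree is discarded. The helper
\function{first_text()} below is hypothetical and exists only for this
illustration; the sketch assumes a current Python version.

```python
from xml.dom.minidom import parseString

def first_text(xml_text, tag):
    """Return the text of the first <tag> element, or None.

    Hypothetical helper used only for this illustration.
    """
    dom = parseString(xml_text)
    try:
        matches = dom.getElementsByTagName(tag)
        if not matches or matches[0].firstChild is None:
            return None
        # Copy the data out; the nodes are unusable after unlink().
        return matches[0].firstChild.data
    finally:
        dom.unlink()

title = first_text("<book><title>Dune</title></book>", "title")
missing = first_text("<book/>", "title")
```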

\begin{methoddesc}{writexml}{writer}
Write XML to the writer object. The writer should have a
\method{write()} method which matches that of the file object
interface.

\versionchanged[To support pretty output, the keyword parameters
\var{indent}, \var{addindent}, and \var{newl} have been added]{2.1}

\versionchanged[For the \class{Document} node, an additional keyword
argument \var{encoding} can be used to specify the encoding field of
the XML header]{2.3}

\end{methoddesc}
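As a sketch of \method{writexml()}, the following writes a small
document into an in-memory buffer using the \var{indent},
\var{addindent}, and \var{newl} keyword parameters. It assumes a
current Python version; \class{io.StringIO} is used here only because
it provides a suitable \method{write()} method, and the two-space
indentation is an arbitrary choice.

```python
import io
from xml.dom.minidom import parseString

dom = parseString("<root><a/><b/></root>")

# Any object with a file-like write() method can serve as the writer.
buffer = io.StringIO()
dom.writexml(buffer, indent="", addindent="  ", newl="\n")
text = buffer.getvalue()

dom.unlink()
```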

\begin{methoddesc}{toxml}{\optional{encoding}}
Return the XML that the DOM represents as a string.

\versionadded[the \var{encoding} argument]{2.3}

With no argument, the XML header does not specify an encoding, and the
result is a Unicode string if the default encoding cannot represent all
characters in the document. Encoding this string in an encoding other
than UTF-8 is likely incorrect, since UTF-8 is the default encoding of
XML.

With an explicit \var{encoding} argument, the result is a byte string
in the specified encoding. It is recommended that this argument
always be specified. To avoid \exception{UnicodeError} exceptions in
case of unrepresentable text data, the encoding argument should be
specified as \code{"utf-8"}.

\end{methoddesc}
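The difference between the two call styles can be sketched as follows.
The example assumes a current Python version, in which calling
\method{toxml()} without arguments returns a text string; the document
content is illustrative.

```python
from xml.dom.minidom import parseString

dom = parseString("<greeting>Gr\u00fc\u00dfe</greeting>")

as_text = dom.toxml()           # XML header names no encoding
as_bytes = dom.toxml("utf-8")   # byte string; header names the encoding

dom.unlink()
```

Passing \code{"utf-8"} explicitly keeps the header and the actual
encoding of the data in agreement.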

\begin{methoddesc}{toprettyxml}{\optional{indent\optional{, newl}}}

Return a pretty-printed version of the document. \var{indent} specifies
the indentation string and defaults to a tab character; \var{newl}
specifies the string emitted at the end of each line and defaults to
\code{\e n}.

\versionadded{2.1}

\versionadded[the encoding argument; see \method{toxml()}]{2.3}

\end{methoddesc}
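A brief sketch of \method{toprettyxml()} with its default arguments
and with an explicit two-space indent, assuming a current Python
version. The sample document contains only elements, since the exact
treatment of text nodes in pretty-printed output is not described
here.

```python
from xml.dom.minidom import parseString

dom = parseString("<root><a/><b/></root>")

default_pretty = dom.toprettyxml()                        # tab indent
custom_pretty = dom.toprettyxml(indent="  ", newl="\n")   # two spaces

dom.unlink()
```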

The following standard DOM methods have special considerations with
\refmodule{xml.dom.minidom}:

\begin{methoddesc}{cloneNode}{deep}
Although this method was present in the version of
\refmodule{xml.dom.minidom} packaged with Python 2.0, it was seriously
broken. This has been corrected for subsequent releases.
\end{methoddesc}
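For reference, the standard DOM behaviour of \method{cloneNode()} can
be sketched as follows, assuming a release in which the method works
as specified (the example is written for a current Python version).
The element and attribute names are illustrative.

```python
from xml.dom.minidom import parseString

dom = parseString('<list kind="todo"><item/><item/></list>')
original = dom.documentElement

shallow = original.cloneNode(False)  # the element and its attributes only
deep = original.cloneNode(True)      # the element plus all descendants

counts = (len(shallow.childNodes), len(deep.childNodes))
kinds = (shallow.getAttribute("kind"), deep.getAttribute("kind"))

dom.unlink()
```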


\subsection{DOM Example \label{dom-example}}

The following is a fairly realistic example of a simple program. In
this particular case, we do not take much advantage of the
flexibility of the DOM.

\verbatiminput{minidom-example.py}


\subsection{minidom and the DOM standard \label{minidom-and-dom}}

The \refmodule{xml.dom.minidom} module is essentially a DOM
1.0-compatible DOM with some DOM 2 features (primarily namespace
features).

Usage of the DOM interface in Python is straightforward. The
following mapping rules apply:

\begin{itemize}
\item Interfaces are accessed through instance objects. Applications
      should not instantiate the classes themselves; they should use
      the creator functions available on the \class{Document} object.
      Derived interfaces support all operations (and attributes) from
      the base interfaces, plus any new operations.

\item Operations are used as methods. Since the DOM uses only
      \keyword{in} parameters, the arguments are passed in normal
      order (from left to right). There are no optional
      arguments. \keyword{void} operations return \code{None}.

\item IDL attributes map to instance attributes. For compatibility
      with the OMG IDL language mapping for Python, an attribute
      \code{foo} can also be accessed through accessor methods
      \method{_get_foo()} and \method{_set_foo()}. \keyword{readonly}
      attributes must not be changed; this is not enforced at
      runtime.

\item The types \code{short int}, \code{unsigned int}, \code{unsigned
      long long}, and \code{boolean} all map to Python integer
      objects.

\item The type \code{DOMString} maps to Python strings.
      \refmodule{xml.dom.minidom} supports either byte or Unicode
      strings, but will normally produce Unicode strings. Values
      of type \code{DOMString} may also be \code{None} where allowed
      to have the IDL \code{null} value by the DOM specification from
      the W3C.

\item \keyword{const} declarations map to variables in their
      respective scope
      (e.g. \code{xml.dom.minidom.Node.PROCESSING_INSTRUCTION_NODE});
      they must not be changed.

\item \code{DOMException} is currently not supported in
      \refmodule{xml.dom.minidom}. Instead,
      \refmodule{xml.dom.minidom} uses standard Python exceptions such
      as \exception{TypeError} and \exception{AttributeError}.

\item \class{NodeList} objects are implemented using Python's built-in
      list type. Starting with Python 2.2, these objects provide the
      interface defined in the DOM specification, but with earlier
      versions of Python they do not support the official API. They
      are, however, much more ``Pythonic'' than the interface defined
      in the W3C recommendations.
\end{itemize}
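Several of the mapping rules above can be sketched together. The
example assumes a Python version in which \class{NodeList} provides
the DOM interface (it is written for a current Python version); the
document content and variable names are illustrative.

```python
from xml.dom.minidom import parseString, Node

dom = parseString("<doc><p>one</p><p>two</p></doc>")

# const declarations are variables in the scope of the Node class.
is_element = dom.documentElement.nodeType == Node.ELEMENT_NODE

# IDL attributes are plain instance attributes.
root_name = dom.documentElement.tagName

# A NodeList can be used through the DOM interface (length, item())
# or like an ordinary Python sequence (iteration, indexing, len()).
paragraphs = dom.getElementsByTagName("p")
dom_style = [paragraphs.item(i).firstChild.data
             for i in range(paragraphs.length)]
python_style = [p.firstChild.data for p in paragraphs]

dom.unlink()
```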


The following interfaces have no implementation in
\refmodule{xml.dom.minidom}:

\begin{itemize}
\item \class{DOMTimeStamp}

\item \class{DocumentType} (added in Python 2.1)

\item \class{DOMImplementation} (added in Python 2.1)

\item \class{CharacterData}

\item \class{CDATASection}

\item \class{Notation}

\item \class{Entity}

\item \class{EntityReference}

\item \class{DocumentFragment}
\end{itemize}

Most of these reflect information in the XML document that is not of
general utility to most DOM users.