\section{\module{xml.dom.minidom} ---
         Lightweight DOM implementation}

\declaremodule{standard}{xml.dom.minidom}
\modulesynopsis{Lightweight Document Object Model (DOM) implementation.}
\moduleauthor{Paul Prescod}{paul@prescod.net}
\sectionauthor{Paul Prescod}{paul@prescod.net}
\sectionauthor{Martin v. L\"owis}{loewis@informatik.hu-berlin.de}

\versionadded{2.0}

\module{xml.dom.minidom} is a light-weight implementation of the
Document Object Model interface. It is intended to be
simpler than the full DOM and also significantly smaller.

DOM applications typically start by parsing some XML into a DOM. With
\module{xml.dom.minidom}, this is done through the parse functions:

\begin{verbatim}
from xml.dom.minidom import parse, parseString

dom1 = parse('c:\\temp\\mydata.xml') # parse an XML file by name

datasource = open('c:\\temp\\mydata.xml')
dom2 = parse(datasource)   # parse an open file

dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')
\end{verbatim}

The \function{parse()} function can take either a filename or an open
file object.

\begin{funcdesc}{parse}{filename_or_file\optional{, parser}}
  Return a \class{Document} from the given input. \var{filename_or_file}
  may be either a file name or a file-like object. \var{parser}, if
  given, must be a SAX2 parser object. This function will change the
  document handler of the parser and activate namespace support; other
  parser configuration (like setting an entity resolver) must have been
  done in advance.
\end{funcdesc}
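
As a hedged illustration of both arguments, the sketch below parses
from a file-like object and then repeats the parse with an explicitly
created SAX2 parser. It assumes a recent Python interpreter; the
sample data and variable names are invented for the example, and
\function{xml.sax.make_parser()} is just one way to obtain a parser
object.

\begin{verbatim}
import io
import xml.sax
from xml.dom.minidom import parse

# Hypothetical sample data; any well-formed XML would do.
data = b"<myxml>Some data<empty/> some more data</myxml>"

# Any object with a read() method may be passed instead of a file name.
dom_a = parse(io.BytesIO(data))

# An explicitly created SAX2 parser; configure it before calling parse().
sax_parser = xml.sax.make_parser()
dom_b = parse(io.BytesIO(data), sax_parser)

print(dom_a.documentElement.tagName)
print(dom_b.documentElement.tagName)
\end{verbatim}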

If you have XML in a string, you can use the
\function{parseString()} function instead:

\begin{funcdesc}{parseString}{string\optional{, parser}}
  Return a \class{Document} that represents the \var{string}. This
  function creates a \class{StringIO} object for the string and passes
  that on to \function{parse()}.
\end{funcdesc}

Both functions return a \class{Document} object representing the
content of the document.

You can also create a \class{Document} node by instantiating a
document object directly, and then add child nodes to it to populate
the DOM:

\begin{verbatim}
from xml.dom.minidom import Document

newdoc = Document()
newel = newdoc.createElement("some_tag")
newdoc.appendChild(newel)
\end{verbatim}
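
Continuing in the same style, a minimal sketch of populating a new
document might look as follows. The attribute name and the text
content are hypothetical; the nodes are created through the factory
methods of the \class{Document} object and serialized with
\method{toxml()}.

\begin{verbatim}
from xml.dom.minidom import Document

newdoc = Document()
newel = newdoc.createElement("some_tag")
newdoc.appendChild(newel)

# Create further nodes through the document and attach them.
newel.setAttribute("id", "demo")        # hypothetical attribute
newel.appendChild(newdoc.createTextNode("Some data"))

print(newdoc.documentElement.toxml())
\end{verbatim}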

Once you have a DOM document object, you can access the parts of your
XML document through its properties and methods. These properties are
defined in the DOM specification. The main property of the document
object is the \member{documentElement} property. It gives you the
main element in the XML document: the one that holds all others.
For example:

\begin{verbatim}
dom3 = parseString("<myxml>Some data</myxml>")
assert dom3.documentElement.tagName == "myxml"
\end{verbatim}
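
The children of the document element can be visited in the same way.
The following sketch, which reuses the sample string from the parsing
example, assumes only the standard \member{childNodes},
\member{nodeType}, \member{data} and \member{tagName} attributes:

\begin{verbatim}
from xml.dom.minidom import parseString

dom3 = parseString("<myxml>Some data<empty/> some more data</myxml>")
root = dom3.documentElement

# Collect a short description of each child of the root element.
children = []
for node in root.childNodes:
    if node.nodeType == node.TEXT_NODE:
        children.append(("text", node.data))
    elif node.nodeType == node.ELEMENT_NODE:
        children.append(("element", node.tagName))

print(children)
\end{verbatim}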

When you are finished with a DOM, you should clean it up. This is
necessary because some versions of Python do not support garbage
collection of objects that refer to each other in a cycle. Until this
restriction is removed from all versions of Python, it is safest to
write your code as if cycles would not be cleaned up.

The way to clean up a DOM is to call its \method{unlink()} method:

\begin{verbatim}
dom1.unlink()
dom2.unlink()
dom3.unlink()
\end{verbatim}

\method{unlink()} is a \module{xml.dom.minidom}-specific extension to
the DOM API. After calling \method{unlink()} on a node, the node and
its descendants are essentially useless.
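
A small sketch of the effect, assuming a recent Python interpreter
(the exact state of an unlinked node is an implementation detail and
may differ between versions):

\begin{verbatim}
from xml.dom.minidom import parseString

dom = parseString("<myxml>Some data</myxml>")
root = dom.documentElement    # keep a reference to the root element

dom.unlink()

# The references between the nodes have been broken.
print(dom.documentElement)
print(root.parentNode)
print(len(root.childNodes))
\end{verbatim}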

\begin{seealso}
  \seetitle[http://www.w3.org/TR/REC-DOM-Level-1/]{Document Object
            Model (DOM) Level 1 Specification}
           {The W3C recommendation for the
            DOM supported by \module{xml.dom.minidom}.}
\end{seealso}


\subsection{DOM objects \label{dom-objects}}

The definition of the DOM API for Python is given as part of the
\refmodule{xml.dom} module documentation. This section lists the
differences between the API and \refmodule{xml.dom.minidom}.


\begin{methoddesc}{unlink}{}
Break internal references within the DOM so that it will be garbage
collected on versions of Python without cyclic GC. Even when cyclic
GC is available, using this can make large amounts of memory available
sooner, so calling this on DOM objects as soon as they are no longer
needed is good practice. This only needs to be called on the
\class{Document} object, but may be called on child nodes to discard
children of that node.
\end{methoddesc}

\begin{methoddesc}{writexml}{writer}
Write XML to the writer object. The writer should have a
\method{write()} method which matches that of the file object
interface.
\end{methoddesc}

\begin{methoddesc}{toxml}{}
Return the XML that the DOM represents as a string.
\end{methoddesc}
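
As an illustration, the sketch below serializes the same document
through both methods. It assumes a recent Python interpreter and uses
an \class{io.StringIO} object as the writer; any object with a
suitable \method{write()} method should do.

\begin{verbatim}
import io
from xml.dom.minidom import parseString

dom = parseString("<myxml>Some data<empty/> some more data</myxml>")

# writexml() only needs an object with a write() method.
out = io.StringIO()
dom.writexml(out)

# toxml() returns the serialization as a string instead.
text = dom.toxml()

print(text == out.getvalue())
print(dom.documentElement.toxml())
\end{verbatim}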

The following standard DOM methods have special considerations with
\refmodule{xml.dom.minidom}:

\begin{methoddesc}{cloneNode}{deep}
Although this method was present in the version of
\refmodule{xml.dom.minidom} packaged with Python 2.0, it was seriously
broken. This has been corrected for subsequent releases.
\end{methoddesc}
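
With a corrected implementation, shallow and deep clones are expected
to behave as in the following sketch; the sample document is invented
for the example.

\begin{verbatim}
from xml.dom.minidom import parseString

dom = parseString('<a x="1"><b/></a>')
original = dom.documentElement

shallow = original.cloneNode(False)   # the element and its attributes only
deep = original.cloneNode(True)       # the element and its whole subtree

print(shallow.toxml())
print(deep.toxml())
\end{verbatim}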


\subsection{DOM Example \label{dom-example}}

This example is a fairly realistic simple program. In this particular
case, we do not take much advantage of the flexibility of the DOM.

\begin{verbatim}
import xml.dom.minidom

document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>

<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

dom = xml.dom.minidom.parseString(document)

space = " "
def getText(nodelist):
    # Concatenate the data of all text nodes in the list.
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc

def handleSlideshow(slideshow):
    print("<html>")
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")

def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)

def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))

def handleSlideshowTitle(title):
    print("<title>%s</title>" % getText(title.childNodes))

def handleSlideTitle(title):
    print("<h2>%s</h2>" % getText(title.childNodes))

def handlePoints(points):
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")

def handlePoint(point):
    print("<li>%s</li>" % getText(point.childNodes))

def handleToc(slides):
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("<p>%s</p>" % getText(title.childNodes))

handleSlideshow(dom)
\end{verbatim}


\subsection{minidom and the DOM standard \label{minidom-and-dom}}

\refmodule{xml.dom.minidom} is essentially a DOM 1.0-compatible DOM
with some DOM 2 features (primarily namespace features).

Usage of the DOM interface in Python is straightforward. The
following mapping rules apply:

\begin{itemize}
\item Interfaces are accessed through instance objects. Applications
      should not instantiate the classes themselves; they should use
      the creator functions available on the \class{Document} object.
      Derived interfaces support all operations (and attributes) from
      the base interfaces, plus any new operations.

\item Operations are used as methods. Since the DOM uses only
      \keyword{in} parameters, the arguments are passed in normal
      order (from left to right). There are no optional
      arguments. \keyword{void} operations return \code{None}.

\item IDL attributes map to instance attributes. For compatibility
      with the OMG IDL language mapping for Python, an attribute
      \code{foo} can also be accessed through accessor methods
      \method{_get_foo()} and \method{_set_foo()}. \keyword{readonly}
      attributes must not be changed; this is not enforced at
      runtime.

\item The types \code{short int}, \code{unsigned int}, \code{unsigned
      long long}, and \code{boolean} all map to Python integer
      objects.

\item The type \code{DOMString} maps to Python strings.
      \refmodule{xml.dom.minidom} supports either byte or Unicode
      strings, but will normally produce Unicode strings. Attributes
      of type \code{DOMString} may also be \code{None}.

\item \keyword{const} declarations map to variables in their
      respective scope
      (e.g. \code{xml.dom.minidom.Node.PROCESSING_INSTRUCTION_NODE});
      they must not be changed.

\item \code{DOMException} is currently not supported in
      \refmodule{xml.dom.minidom}. Instead,
      \refmodule{xml.dom.minidom} uses standard Python exceptions such
      as \exception{TypeError} and \exception{AttributeError}.

\item \class{NodeList} objects are implemented as Python's built-in
      list type, so they do not support the official API, but are
      much more ``Pythonic.''
\end{itemize}
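
A few of the mapping rules above can be sketched as follows, assuming
a recent Python interpreter; the sample document and variable names
are hypothetical.

\begin{verbatim}
from xml.dom.minidom import Node, parseString

dom = parseString("<list><item>one</item><item>two</item></list>")

# Nodes are created through the Document object; operations are methods.
extra = dom.createElement("item")
extra.appendChild(dom.createTextNode("three"))
dom.documentElement.appendChild(extra)

# IDL attributes are instance attributes; constants live on Node.
root = dom.documentElement
print(root.tagName)
print(root.nodeType == Node.ELEMENT_NODE)

# The result of getElementsByTagName() can be used like a Python list.
items = dom.getElementsByTagName("item")
texts = [item.firstChild.data for item in items]
print(len(items))
print(texts)
\end{verbatim}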


The following interfaces have no implementation in
\refmodule{xml.dom.minidom}:

\begin{itemize}
\item DOMTimeStamp

\item DocumentType (added in Python 2.1)

\item DOMImplementation (added in Python 2.1)

\item CharacterData

\item CDATASection

\item Notation

\item Entity

\item EntityReference

\item DocumentFragment
\end{itemize}

Most of these reflect information in the XML document that is of
little use to most DOM users.