\section{\module{xml.dom.minidom} ---
         The Document Object Model}

\declaremodule{standard}{xml.dom.minidom}
\modulesynopsis{Lightweight Document Object Model (DOM) implementation.}
\moduleauthor{Paul Prescod}{paul@prescod.net}
\sectionauthor{Paul Prescod}{paul@prescod.net}
\sectionauthor{Martin v. L\"owis}{loewis@informatik.hu-berlin.de}

\versionadded{2.0}

The \module{xml.dom.minidom} module provides a light-weight
implementation of the W3C Document Object Model. The DOM is a
cross-language API from the World Wide Web Consortium (W3C) for
accessing and modifying XML documents. A DOM implementation allows you
to convert an XML document into a tree-like structure, or to build such
a structure from scratch. It then gives access to the structure through
a set of objects which provide well-known interfaces. Minidom is
intended to be simpler than the full DOM and also significantly smaller.

The DOM is extremely useful for random-access applications. SAX only
allows you a view of one bit of the document at a time. If you are
looking at one SAX element, you have no access to another. If you are
looking at a text node, you have no access to a containing
element. When you write a SAX application, you need to keep track of
your program's position in the document somewhere in your own
code. SAX does not do it for you. Also, if you need to look ahead in
the XML document, you are just out of luck.

Some applications are simply impossible in an event-driven model with
no access to a tree. Of course you could build some sort of tree
yourself in SAX event handlers, but the DOM allows you to avoid writing
that code. The DOM is a standard tree representation for XML data.

%What if your needs are somewhere between SAX and the DOM? Perhaps you cannot
%afford to load the entire tree in memory but you find the SAX model
%somewhat cumbersome and low-level. There is also an experimental module
%called pulldom that allows you to build trees of only the parts of a
%document that you need structured access to. It also has features that allow
%you to find your way around the DOM.
% See http://www.prescod.net/python/pulldom

DOM applications typically start by parsing some XML into a DOM. This
is done through the parse functions:

\begin{verbatim}
from xml.dom.minidom import parse, parseString

dom1 = parse('c:\\temp\\mydata.xml')  # parse an XML file by name

datasource = open('c:\\temp\\mydata.xml')
dom2 = parse(datasource)  # parse an open file

dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')
\end{verbatim}

The \function{parse()} function can take either a filename or an open
file object.

\begin{funcdesc}{parse}{filename_or_file\optional{, parser}}
  Return a \class{Document} from the given input. \var{filename_or_file}
  may be either a file name, or a file-like object. \var{parser}, if
  given, must be a SAX2 parser object. This function will change the
  document handler of the parser and activate namespace support; other
  parser configuration (like setting an entity resolver) must have been
  done in advance.
\end{funcdesc}

If you have XML in a string, you can use the \function{parseString()}
function instead:

\begin{funcdesc}{parseString}{string\optional{, parser}}
  Return a \class{Document} that represents the \var{string}. This
  function creates a \class{StringIO} object for the string and passes
  that on to \function{parse}.
\end{funcdesc}
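As a minimal sketch of the two entry points, the following parses the
same markup once through a file-like object and once from a string. The
\code{io.BytesIO} object only stands in for an open file and is an
assumption of this example, as are the variable names; it requires an
interpreter recent enough to provide the \module{io} module.

```python
import io
from xml.dom.minidom import parse, parseString

XML_TEXT = "<myxml>Some data<empty/> some more data</myxml>"

# Parse from a file-like object; BytesIO stands in for an open file here.
dom_from_file = parse(io.BytesIO(XML_TEXT.encode("utf-8")))

# Parse the same markup directly from a string.
dom_from_string = parseString(XML_TEXT)

# Both calls produce a document node whose root element is <myxml>.
assert dom_from_file.nodeType == dom_from_file.DOCUMENT_NODE
assert dom_from_file.documentElement.tagName == "myxml"
assert dom_from_string.documentElement.tagName == "myxml"
```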

Both functions return a \class{Document} object representing the
content of the document.

You can also create a document node simply by instantiating a
\class{Document} object, and then add child nodes to it to populate
the DOM:

\begin{verbatim}
from xml.dom.minidom import Document

newdoc = Document()
newel = newdoc.createElement("some_tag")
newdoc.appendChild(newel)
\end{verbatim}

Once you have a DOM document object, you can access the parts of your
XML document through its properties and methods. These properties are
defined in the DOM specification. The main property of the document
object is the \member{documentElement} property. It gives you the main
element in the XML document: the one that holds all others. Here is an
example program:

\begin{verbatim}
from xml.dom.minidom import parseString

dom3 = parseString("<myxml>Some data</myxml>")
assert dom3.documentElement.tagName == "myxml"
\end{verbatim}

When you are finished with a DOM, you should clean it up. This is
necessary because some versions of Python do not support garbage
collection of objects that refer to each other in a cycle. Until this
restriction is removed from all versions of Python, it is safest to
write your code as if cycles would not be cleaned up.

The way to clean up a DOM is to call its \method{unlink()} method:

\begin{verbatim}
dom1.unlink()
dom2.unlink()
dom3.unlink()
\end{verbatim}

\method{unlink()} is a \module{minidom}-specific extension to the DOM
API. After calling \method{unlink()}, the references between the nodes
are broken and the DOM should no longer be used.
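The effect of \method{unlink()} can be observed with a short sketch
like the one below. The checks reflect the behaviour of the
\module{minidom} shipped with current Python releases, which is an
assumption here; the text above only promises that the tree becomes
unusable.

```python
from xml.dom.minidom import parseString

dom = parseString("<myxml>Some data</myxml>")
root = dom.documentElement
assert dom.hasChildNodes()

# Break the parent/child reference cycles once the tree is no longer needed.
dom.unlink()

# The document no longer holds its children, and the old root is detached.
assert not dom.hasChildNodes()
assert root.parentNode is None
```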

\begin{seealso}
  \seetitle[http://www.w3.org/TR/REC-DOM-Level-1/]{DOM Specification}
           {This is the canonical specification for the level of the
            DOM supported by \module{xml.dom.minidom}.}
  \seetitle[http://pyxml.sourceforge.net]{PyXML}{Users that require a
            full-featured implementation of DOM should use the PyXML
            package.}
\end{seealso}


\subsection{DOM objects \label{dom-objects}}

The definitive documentation for the DOM is the DOM specification from
the W3C. This section lists the properties and methods supported by
\refmodule{xml.dom.minidom}.

\begin{classdesc}{Node}{}
All of the components of an XML document are represented by subclasses
of \class{Node}.

\begin{memberdesc}{nodeType}
An integer representing the node type. Symbolic constants for the
types are on the \class{Node} object: \constant{ELEMENT_NODE},
\constant{ATTRIBUTE_NODE}, \constant{TEXT_NODE},
\constant{CDATA_SECTION_NODE}, \constant{ENTITY_NODE},
\constant{PROCESSING_INSTRUCTION_NODE}, \constant{COMMENT_NODE},
\constant{DOCUMENT_NODE}, \constant{DOCUMENT_TYPE_NODE},
\constant{NOTATION_NODE}.
\end{memberdesc}

\begin{memberdesc}{parentNode}
The parent of the current node. \code{None} for the document node.
\end{memberdesc}

\begin{memberdesc}{attributes}
An \class{AttributeList} of attribute objects. Only
elements have this attribute. Others return \code{None}.
\end{memberdesc}

\begin{memberdesc}{previousSibling}
The node that immediately precedes this one with the same parent. For
instance, the element with an end-tag that comes just before the
\var{self} element's start-tag. Of course, XML documents are made
up of more than just elements, so the previous sibling could be text, a
comment, or something else.
\end{memberdesc}

\begin{memberdesc}{nextSibling}
The node that immediately follows this one with the same parent. See
also \member{previousSibling}.
\end{memberdesc}

\begin{memberdesc}{childNodes}
A list of nodes contained within this node.
\end{memberdesc}

\begin{memberdesc}{firstChild}
Equivalent to \code{childNodes[0]}.
\end{memberdesc}

\begin{memberdesc}{lastChild}
Equivalent to \code{childNodes[-1]}.
\end{memberdesc}

\begin{memberdesc}{nodeName}
Has a different meaning for each node type. See the DOM specification
for details. You can always get the information you would get here
from another property such as the \member{tagName} property for
elements or the \member{name} property for attributes.
\end{memberdesc}

\begin{memberdesc}{nodeValue}
Has a different meaning for each node type. See the DOM specification
for details. The situation is similar to that with \member{nodeName}.
\end{memberdesc}

\begin{methoddesc}{unlink}{}
Break internal references within the DOM so that it will be garbage
collected on versions of Python without cyclic GC.
\end{methoddesc}

\begin{methoddesc}{writexml}{writer}
Write XML to the writer object. The writer should have a
\method{write()} method which matches that of the file object
interface.
\end{methoddesc}

\begin{methoddesc}{toxml}{}
Return the XML string that the DOM represents.
\end{methoddesc}

\begin{methoddesc}{hasChildNodes}{}
Returns true if the node has any child nodes.
\end{methoddesc}

\begin{methoddesc}{insertBefore}{newChild, refChild}
Insert a new child node before an existing child. It must be the case
that \var{refChild} is a child of this node; if not,
\exception{ValueError} is raised.
\end{methoddesc}

\begin{methoddesc}{replaceChild}{newChild, oldChild}
Replace an existing node with a new node. It must be the case that
\var{oldChild} is a child of this node; if not,
\exception{ValueError} is raised.
\end{methoddesc}

\begin{methoddesc}{removeChild}{oldChild}
Remove a child node. \var{oldChild} must be a child of this node; if
not, \exception{ValueError} is raised.
\end{methoddesc}

\begin{methoddesc}{appendChild}{newChild}
Add a new child node to the end of this node's list of children.
\end{methoddesc}

\begin{methoddesc}{cloneNode}{deep}
Clone this node. If \var{deep} is true, all children are cloned as
well. Deep cloning is not implemented in Python 2, so the \var{deep}
parameter should always be 0 for now.
\end{methoddesc}

\end{classdesc}
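The navigation members and the child-manipulation methods of
\class{Node} can be sketched together on a small tree. The markup and
the element names below are made up for illustration, and the final
serialization assumes the output format of the \module{minidom} in
current Python releases.

```python
from xml.dom.minidom import parseString

dom = parseString("<root><a/>text<b/></root>")
root = dom.documentElement
a, text, b = root.childNodes

# Navigation: children, siblings and parents.
assert root.firstChild is a and root.lastChild is b
assert a.nextSibling is text and b.previousSibling is text
assert text.nodeType == text.TEXT_NODE and text.nodeValue == "text"
assert a.parentNode is root and dom.parentNode is None

# Modification: insert, replace, remove and append children.
c = dom.createElement("c")
root.insertBefore(c, b)                        # a, text, c, b
root.replaceChild(dom.createElement("d"), a)   # d, text, c, b
root.removeChild(text)                         # d, c, b
root.appendChild(dom.createElement("e"))       # d, c, b, e

names = [child.nodeName for child in root.childNodes]
assert names == ["d", "c", "b", "e"]
assert root.toxml() == "<root><d/><c/><b/><e/></root>"
```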


\begin{classdesc}{Document}{}
Represents an entire XML document, including its constituent elements,
attributes, processing instructions, comments, etc. Remember that it
inherits properties from \class{Node}.

\begin{memberdesc}{documentElement}
The one and only root element of the document.
\end{memberdesc}

\begin{methoddesc}{createElement}{tagName}
Create a new element. The element is not inserted into the document
when it is created. You need to explicitly insert it with one of the
other methods such as \method{insertBefore()} or
\method{appendChild()}.
\end{methoddesc}

\begin{methoddesc}{createTextNode}{data}
Create a text node containing the data passed as a parameter. As with
the other creation methods, this one does not insert the node into the
tree.
\end{methoddesc}

\begin{methoddesc}{createComment}{data}
Create a comment node containing the data passed as a parameter. As
with the other creation methods, this one does not insert the node
into the tree.
\end{methoddesc}

\begin{methoddesc}{createProcessingInstruction}{target, data}
Create a processing instruction node containing the \var{target} and
\var{data} passed as parameters. As with the other creation methods,
this one does not insert the node into the tree.
\end{methoddesc}

\begin{methoddesc}{createAttribute}{name}
Create an attribute node. This method does not associate the
attribute node with any particular element. You must use
\method{setAttributeNode()} on the appropriate \class{Element} object
to use the newly created attribute instance.
\end{methoddesc}

\begin{methoddesc}{createElementNS}{namespaceURI, tagName}
Create a new element with a namespace. The \var{tagName} may have a
prefix. The element is not inserted into the document when it is
created. You need to explicitly insert it with one of the other
methods such as \method{insertBefore()} or \method{appendChild()}.
\end{methoddesc}

\begin{methoddesc}{createAttributeNS}{namespaceURI, qualifiedName}
Create an attribute node with a namespace. The \var{qualifiedName} may
have a prefix. This method does not associate the attribute node with
any particular element. You must use \method{setAttributeNode()} on the
appropriate \class{Element} object to use the newly created attribute
instance.
\end{methoddesc}

\begin{methoddesc}{getElementsByTagName}{tagName}
Search for all descendants (direct children, children's children,
etc.) with a particular element type name.
\end{methoddesc}

\begin{methoddesc}{getElementsByTagNameNS}{namespaceURI, localName}
Search for all descendants (direct children, children's children,
etc.) with a particular namespace URI and localname. The localname is
the part of the qualified name after the prefix.
\end{methoddesc}

\end{classdesc}
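The creation methods of \class{Document} might be combined as in the
following sketch, which builds a small tree from scratch. The tag and
attribute names are invented for the example, assigning to the
\member{value} member of the attribute node is an assumption about how
the node receives its value, and the expected string assumes the
serialization of the \module{minidom} in current Python releases.

```python
from xml.dom.minidom import Document

doc = Document()

# Creation methods only build nodes; each one must be attached explicitly.
root = doc.createElement("slideshow")
doc.appendChild(root)

title = doc.createElement("title")
title.appendChild(doc.createTextNode("Demo slideshow"))
root.appendChild(title)
root.appendChild(doc.createComment("generated"))

# Attribute nodes are attached with setAttributeNode() on an element.
attr = doc.createAttribute("lang")
attr.value = "en"
root.setAttributeNode(attr)

# Searches cover all descendants of the document.
titles = doc.getElementsByTagName("title")
assert len(titles) == 1 and titles[0] is title

markup = root.toxml()
assert markup == ('<slideshow lang="en"><title>Demo slideshow</title>'
                  '<!--generated--></slideshow>')
```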


\begin{classdesc}{Element}{}
\begin{memberdesc}{tagName}
The element type name. In a namespace-using document it may have
colons in it.
\end{memberdesc}

\begin{memberdesc}{localName}
The part of the \member{tagName} following the colon if there is one,
else the entire \member{tagName}.
\end{memberdesc}

\begin{memberdesc}{prefix}
The part of the \member{tagName} preceding the colon if there is one,
else the empty string.
\end{memberdesc}

\begin{memberdesc}{namespaceURI}
The namespace associated with the \member{tagName}.
\end{memberdesc}

\begin{methoddesc}{getAttribute}{attname}
Return an attribute value as a string.
\end{methoddesc}

\begin{methoddesc}{setAttribute}{attname, value}
Set an attribute value from a string.
\end{methoddesc}

\begin{methoddesc}{removeAttribute}{attname}
Remove an attribute by name.
\end{methoddesc}

\begin{methoddesc}{getAttributeNS}{namespaceURI, localName}
Return an attribute value as a string, given a \var{namespaceURI} and
\var{localName}. Note that a localname is the part of a prefixed
attribute name after the colon (if there is one).
\end{methoddesc}

\begin{methoddesc}{setAttributeNS}{namespaceURI, qname, value}
Set an attribute value from a string, given a \var{namespaceURI} and a
\var{qname}. Note that a qname is the whole attribute name, including
any prefix. This differs from \method{getAttributeNS()}, which takes
only the local name.
\end{methoddesc}

\begin{methoddesc}{removeAttributeNS}{namespaceURI, localName}
Remove an attribute by name. Note that it uses a \var{localName}, not
a qname.
\end{methoddesc}

\begin{methoddesc}{getElementsByTagName}{tagName}
Same as equivalent method in the \class{Document} class.
\end{methoddesc}

\begin{methoddesc}{getElementsByTagNameNS}{namespaceURI, localName}
Same as equivalent method in the \class{Document} class.
\end{methoddesc}

\end{classdesc}
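The difference between the plain and the namespace-aware attribute
methods is easiest to see side by side. In this sketch the namespace
URI, the \code{ex} prefix and the attribute names are hypothetical, and
the empty-string result for a missing attribute reflects the behaviour
of the \module{minidom} in current Python releases.

```python
from xml.dom.minidom import parseString

NS = "http://example.com/ns"   # hypothetical namespace URI
dom = parseString('<ex:item xmlns:ex="%s" id="42"/>' % NS)
item = dom.documentElement

# Name-related members of a namespaced element.
assert item.tagName == "ex:item"
assert item.localName == "item"
assert item.prefix == "ex"
assert item.namespaceURI == NS

# Plain attribute access by name.
assert item.getAttribute("id") == "42"
item.setAttribute("id", "43")
assert item.getAttribute("id") == "43"
item.removeAttribute("id")
assert item.getAttribute("id") == ""   # a missing attribute reads as ""

# Namespace-aware access: set with a qname, get and remove with a local name.
item.setAttributeNS(NS, "ex:color", "red")
assert item.getAttributeNS(NS, "color") == "red"
item.removeAttributeNS(NS, "color")
assert item.getAttributeNS(NS, "color") == ""

matches = dom.getElementsByTagNameNS(NS, "item")
assert len(matches) == 1 and matches[0] is item
```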


\begin{classdesc}{Attribute}{}

\begin{memberdesc}{name}
The attribute name. In a namespace-using document it may have colons
in it.
\end{memberdesc}

\begin{memberdesc}{localName}
The part of the name following the colon if there is one, else the
entire name.
\end{memberdesc}

\begin{memberdesc}{prefix}
The part of the name preceding the colon if there is one, else the
empty string.
\end{memberdesc}

\begin{memberdesc}{namespaceURI}
The namespace associated with the attribute name.
\end{memberdesc}

\end{classdesc}


\begin{classdesc}{AttributeList}{}

\begin{memberdesc}{length}
The length of the attribute list.
\end{memberdesc}

\begin{methoddesc}{item}{index}
Return an attribute with a particular index. The order you get the
attributes in is arbitrary but will be consistent for the life of a
DOM. Each item is an attribute node. Get its value with the
\member{value} attribute.
\end{methoddesc}

There are also experimental methods that give this class more
dictionary-like behavior. You can use them or you can use the
standardized \method{getAttribute*()}-family methods.

\end{classdesc}
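Walking an attribute list by index might look like the sketch below.
Because the order of the items is arbitrary, the pairs are sorted before
they are compared. The subscript lookup at the end is one of the
dictionary-like conveniences and is shown here as it behaves in current
Python releases, which is an assumption for other versions.

```python
from xml.dom.minidom import parseString

dom = parseString('<point x="1" y="2"/>')
attrs = dom.documentElement.attributes

assert attrs.length == 2

# item() returns attribute nodes; collect them without relying on order.
pairs = sorted((attrs.item(i).name, attrs.item(i).value)
               for i in range(attrs.length))
assert pairs == [("x", "1"), ("y", "2")]

# Dictionary-style lookup, one of the non-standard conveniences.
assert attrs["x"].value == "1"
```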


\begin{classdesc}{Comment}{}
Represents a comment in the XML document.

\begin{memberdesc}{data}
The content of the comment.
\end{memberdesc}
\end{classdesc}


\begin{classdesc}{Text}{}
Represents text in the XML document.

\begin{memberdesc}{data}
The content of the text node.
\end{memberdesc}
\end{classdesc}


\begin{classdesc}{ProcessingInstruction}{}
Represents a processing instruction in the XML document.

\begin{memberdesc}{target}
The content of the processing instruction up to the first whitespace
character.
\end{memberdesc}

\begin{memberdesc}{data}
The content of the processing instruction following the first
whitespace character.
\end{memberdesc}
\end{classdesc}
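A short sketch shows how the \member{data} and \member{target} members
of these three node types relate to the source markup. The stylesheet
name and the document content are invented for the example.

```python
from xml.dom.minidom import parseString

dom = parseString(
    '<?xml-stylesheet href="style.css"?>'
    '<doc><!-- a comment -->some text</doc>'
)

pi = dom.firstChild
comment, text = dom.documentElement.childNodes

# The target is everything up to the first whitespace; data is the rest.
assert pi.nodeType == pi.PROCESSING_INSTRUCTION_NODE
assert pi.target == "xml-stylesheet"
assert pi.data == 'href="style.css"'

# Comment data excludes the <!-- and --> delimiters.
assert comment.nodeType == comment.COMMENT_NODE
assert comment.data == " a comment "

assert text.nodeType == text.TEXT_NODE
assert text.data == "some text"
```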

Note that DOM attributes may also be manipulated as nodes instead of as
simple strings. It is fairly rare that you must do this, however, so this
usage is not yet documented here.


\begin{seealso}
  \seetitle[http://www.w3.org/TR/REC-DOM-Level-1/]{DOM Specification}
           {This is the canonical specification for the level of the
            DOM supported by \module{xml.dom.minidom}.}
\end{seealso}


\subsection{DOM Example \label{dom-example}}

This example is a fairly realistic instance of a simple program. In
this particular case, we do not take much advantage of the flexibility
of the DOM.

\begin{verbatim}
from xml.dom.minidom import parse, parseString

document = """
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>

<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

dom = parseString(document)

space = " "
def getText(nodelist):
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc

def handleSlideshow(slideshow):
    print "<html>"
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print "</html>"

def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)

def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))

def handleSlideshowTitle(title):
    print "<title>%s</title>" % getText(title.childNodes)

def handleSlideTitle(title):
    print "<h2>%s</h2>" % getText(title.childNodes)

def handlePoints(points):
    print "<ul>"
    for point in points:
        handlePoint(point)
    print "</ul>"

def handlePoint(point):
    print "<li>%s</li>" % getText(point.childNodes)

def handleToc(slides):
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print "<p>%s</p>" % getText(title.childNodes)

handleSlideshow(dom)
\end{verbatim}

\subsection{minidom and the DOM standard \label{minidom-and-dom}}

Minidom is basically a DOM 1.0-compatible DOM with some DOM 2 features
(primarily namespace features).

Usage of the other DOM interfaces in Python is straightforward. The
following mapping rules apply:

\begin{itemize}

\item Interfaces are accessed through instance objects. Applications
should not instantiate the classes themselves; they should use the
creator functions. Derived interfaces support all operations (and
attributes) from the base interfaces, plus any new operations.

\item Operations are used as methods. Since the DOM uses only
\code{in} parameters, the arguments are passed in normal order (from
left to right). There are no optional arguments. \code{void}
operations return \code{None}.

\item IDL attributes map to instance attributes. For compatibility
with the OMG IDL language mapping for Python, an attribute \code{foo}
can also be accessed through accessor functions \code{_get_foo} and
\code{_set_foo}. \code{readonly} attributes must not be changed.

\item The types \code{short int}, \code{unsigned int},
\code{unsigned long long}, and \code{boolean} all map to Python
integer objects.

\item The type \code{DOMString} maps to Python strings. \code{minidom}
supports either byte or Unicode strings, but will normally produce
Unicode strings. Attributes of type \code{DOMString} may also be
\code{None}.

\item \code{const} declarations map to variables in their respective
scope (e.g. \code{xml.dom.minidom.Node.PROCESSING_INSTRUCTION_NODE});
they must not be changed.

\item \code{DOMException} is currently not supported in
\module{minidom}. Instead, \module{minidom} raises standard Python
exceptions such as \exception{TypeError} and
\exception{AttributeError}.

\end{itemize}
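Several of the mapping rules above can be checked directly. The
following is only a sketch: the markup and the attribute name are made
up, and the numeric values are the node type codes defined by the DOM
specification.

```python
from xml.dom.minidom import Node, parseString

dom = parseString("<doc><item/></doc>")
root = dom.documentElement

# const declarations are class-level constants with the DOM's numeric values.
assert Node.ELEMENT_NODE == 1
assert Node.PROCESSING_INSTRUCTION_NODE == 7
assert root.nodeType == Node.ELEMENT_NODE

# IDL attributes are plain instance attributes.
assert root.tagName == "doc"

# Operations are methods; void operations return None.
result = root.setAttribute("lang", "en")
assert result is None

# boolean results can be used as Python integers or truth values.
assert root.hasChildNodes()
assert int(root.hasChildNodes()) == 1
```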

The following interfaces have no equivalent in minidom:

\begin{itemize}

\item DOMTimeStamp

\item DocumentType

\item DOMImplementation

\item CharacterData

\item CDATASection

\item Notation

\item Entity

\item EntityReference

\item DocumentFragment

\end{itemize}

Most of these reflect information in the XML document that is not of
general utility to most DOM users.