| \section{\module{xml.dom.minidom} --- |
| The Document Object Model} |
| |
| \declaremodule{standard}{xml.dom.minidom} |
| \modulesynopsis{Lightweight Document Object Model (DOM) implementation.} |
| \moduleauthor{Paul Prescod}{paul@prescod.net} |
| \sectionauthor{Paul Prescod}{paul@prescod.net} |
| \sectionauthor{Martin v. L\"owis}{loewis@informatik.hu-berlin.de} |
| |
| \versionadded{2.0} |
| |
| \module{xml.dom.minidom} provides a lightweight implementation of |
| the W3C Document Object Model. The DOM is a cross-language API from |
| the World Wide Web Consortium (W3C) for accessing and modifying XML |
| documents. A DOM implementation lets you convert an XML document into |
| a tree-like structure, or build such a structure from scratch. It then |
| gives access to the structure through a set of objects which provide |
| well-known interfaces. Minidom is intended to be simpler than the full |
| DOM and also significantly smaller. |
| |
| The DOM is extremely useful for random-access applications. SAX only |
| allows you a view of one bit of the document at a time. If you are |
| looking at one SAX element, you have no access to another. If you are |
| looking at a text node, you have no access to a containing |
| element. When you write a SAX application, you need to keep track of |
| your program's position in the document somewhere in your own |
| code. SAX does not do it for you. Also, if you need to look ahead in |
| the XML document, you are just out of luck. |
| |
| Some applications are simply impossible in an event-driven model with |
| no access to a tree. Of course you could build some sort of tree |
| yourself in SAX events, but the DOM allows you to avoid writing that |
| code. The DOM is a standard tree representation for XML data. |
| |
| %What if your needs are somewhere between SAX and the DOM? Perhaps you cannot |
| %afford to load the entire tree in memory but you find the SAX model |
| %somewhat cumbersome and low-level. There is also an experimental module |
| %called pulldom that allows you to build trees of only the parts of a |
| %document that you need structured access to. It also has features that allow |
| %you to find your way around the DOM. |
| % See http://www.prescod.net/python/pulldom |
| |
| DOM applications typically start by parsing some XML into a DOM. This |
| is done through the parse functions: |
| |
| \begin{verbatim} |
| from xml.dom.minidom import parse, parseString |
| |
| dom1 = parse('c:\\temp\\mydata.xml') # parse an XML file by name |
| |
| datasource = open('c:\\temp\\mydata.xml') |
| dom2 = parse(datasource) # parse an open file |
| |
| dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>') |
| \end{verbatim} |
| |
| The parse function can take either a filename or an open file object. |
| |
| \begin{funcdesc}{parse}{filename_or_file\optional{, parser}} |
| Return a \class{Document} from the given input. \var{filename_or_file} |
| may be either a file name, or a file-like object. \var{parser}, if |
| given, must be a SAX2 parser object. This function will change the |
| document handler of the parser and activate namespace support; other |
| parser configuration (like setting an entity resolver) must have been |
| done in advance. |
| \end{funcdesc} |
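
As a minimal sketch of the file-like case, the snippet below passes an in-memory byte stream to \function{parse}. It assumes a current Python 3 interpreter, where the standard \module{io} module is available; the XML content is made up for illustration.

```python
import io
from xml.dom.minidom import parse

# Any object with a read() method can stand in for a file on disk.
# The XML content here is invented for this example.
source = io.BytesIO(b"<myxml>Some data<empty/> some more data</myxml>")

dom = parse(source)
root_name = dom.documentElement.tagName

# Release the tree once the needed information has been copied out.
dom.unlink()
```

The same call accepts a file name instead of the stream, as described above.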
| |
| If you have XML in a string, you can use the parseString function |
| instead: |
| |
| \begin{funcdesc}{parseString}{string\optional{, parser}} |
| Return a \class{Document} that represents the \var{string}. This |
| function creates a \class{StringIO} object for the string and passes |
| that on to \function{parse}. |
| \end{funcdesc} |
| |
| Both functions return a document object representing the content of |
| the document. |
| |
| You can also create a document node merely by instantiating a |
| document object. Then you could add child nodes to it to populate |
| the DOM. |
| |
| \begin{verbatim} |
| from xml.dom.minidom import Document |
| |
| newdoc = Document() |
| newel = newdoc.createElement("some_tag") |
| newdoc.appendChild(newel) |
| \end{verbatim} |
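
Extending the idea a little, the following sketch builds a small two-level tree and serializes it. The element names and text are placeholders chosen for this example, and the serialized form shown in the final comment is what a current Python 3 release is expected to produce.

```python
from xml.dom.minidom import Document

newdoc = Document()

# The document may hold only one root element.
root = newdoc.createElement("some_tag")
newdoc.appendChild(root)

# Nodes are created by the document and then attached explicitly.
child = newdoc.createElement("child")
child.appendChild(newdoc.createTextNode("hello"))
root.appendChild(child)

# Serialize the root element (no XML declaration is emitted for elements).
serialized = root.toxml()   # <some_tag><child>hello</child></some_tag>
```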
| |
| Once you have a DOM document object, you can access the parts of your |
| XML document through its properties and methods. These properties are |
| defined in the DOM specification. The main property of the document |
| object is the documentElement property. It gives you the main element |
| in the XML document: the one that holds all others. Here is an |
| example program: |
| |
| \begin{verbatim} |
| dom3 = parseString("<myxml>Some data</myxml>") |
| assert dom3.documentElement.tagName == "myxml" |
| \end{verbatim} |
| |
| When you are finished with a DOM, you should clean it up. This is |
| necessary because some versions of Python do not support garbage |
| collection of objects that refer to each other in a cycle. Until this |
| restriction is removed from all versions of Python, it is safest to |
| write your code as if cycles would not be cleaned up. |
| |
| The way to clean up a DOM is to call its \method{unlink()} method: |
| |
| \begin{verbatim} |
| dom1.unlink() |
| dom2.unlink() |
| dom3.unlink() |
| \end{verbatim} |
| |
| \method{unlink()} is a \module{minidom}-specific extension to the DOM |
| API. After calling \method{unlink()}, a DOM is basically useless. |
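
One way to make sure the cleanup always happens is to wrap the work in a \code{try}/\code{finally} block. This is only a sketch of one possible pattern; the variable names are illustrative.

```python
from xml.dom.minidom import parseString

dom = parseString("<myxml>Some data</myxml>")
try:
    # Copy out everything that is needed while the tree is still intact.
    root_name = dom.documentElement.tagName
    text = dom.documentElement.firstChild.data
finally:
    # Break the reference cycles even if the code above raised.
    dom.unlink()
```

Plain strings copied out of the tree stay valid after \method{unlink()}; only the node objects themselves should no longer be used.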
| |
| \begin{seealso} |
| \seetitle[http://www.w3.org/TR/REC-DOM-Level-1/]{DOM Specification} |
| {This is the canonical specification for the level of the |
| DOM supported by \module{xml.dom.minidom}.} |
| \seetitle[http://pyxml.sourceforge.net]{PyXML}{Users that require a |
| full-featured implementation of DOM should use the PyXML |
| package.} |
| \end{seealso} |
| |
| |
| \subsection{DOM objects \label{dom-objects}} |
| |
| The definitive documentation for the DOM is the DOM specification from |
| the W3C. This section lists the properties and methods supported by |
| \refmodule{xml.dom.minidom}. |
| |
| \begin{classdesc}{Node}{} |
| All of the components of an XML document are subclasses of |
| \class{Node}. |
| |
| \begin{memberdesc}{nodeType} |
| An integer representing the node type. Symbolic constants for the |
| types are on the \class{Node} object: \constant{ELEMENT_NODE}, |
| \constant{ATTRIBUTE_NODE}, \constant{TEXT_NODE}, |
| \constant{CDATA_SECTION_NODE}, \constant{ENTITY_NODE}, |
| \constant{PROCESSING_INSTRUCTION_NODE}, \constant{COMMENT_NODE}, |
| \constant{DOCUMENT_NODE}, \constant{DOCUMENT_TYPE_NODE}, |
| \constant{NOTATION_NODE}. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{parentNode} |
| The parent of the current node. \code{None} for the document node. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{attributes} |
| An \class{AttributeList} of attribute objects. Only |
| elements have this attribute. Others return \code{None}. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{previousSibling} |
| The node that immediately precedes this one with the same parent. For |
| instance the element with an end-tag that comes just before the |
| \var{self} element's start-tag. Of course, XML documents are made |
| up of more than just elements so the previous sibling could be text, a |
| comment, or something else. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{nextSibling} |
| The node that immediately follows this one with the same parent. See |
| also \member{previousSibling}. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{childNodes} |
| A list of nodes contained within this node. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{firstChild} |
| Equivalent to \code{childNodes[0]}. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{lastChild} |
| Equivalent to \code{childNodes[-1]}. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{nodeName} |
| Has a different meaning for each node type. See the DOM specification |
| for details. You can always get the information you would get here |
| from another property such as the \member{tagName} property for |
| elements or the \member{name} property for attributes. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{nodeValue} |
| Has a different meaning for each node type. See the DOM specification |
| for details. The situation is similar to that with \member{nodeName}. |
| \end{memberdesc} |
| |
| \begin{methoddesc}{unlink}{} |
| Break internal references within the DOM so that it will be garbage |
| collected on versions of Python without cyclic GC. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{writexml}{writer} |
| Write XML to the writer object. The writer should have a |
| \method{write()} method which matches that of the file object |
| interface. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{toxml}{} |
| Return the XML string that the DOM represents. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{hasChildNodes}{} |
| Returns true if the node has any child nodes. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{insertBefore}{newChild, refChild} |
| Insert a new child node before an existing child. It must be the case |
| that \var{refChild} is a child of this node; if not, |
| \exception{ValueError} is raised. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{replaceChild}{newChild, oldChild} |
| Replace an existing node with a new node. It must be the case that |
| \var{oldChild} is a child of this node; if not, |
| \exception{ValueError} is raised. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{removeChild}{oldChild} |
| Remove a child node. \var{oldChild} must be a child of this node; if |
| not, \exception{ValueError} is raised. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{appendChild}{newChild} |
| Add a new child node to the end of this node's list of children. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{cloneNode}{deep} |
| Clone this node. Deep means to clone all children also. Deep cloning |
| is not implemented in Python 2 so the deep parameter should always be |
| 0 for now. |
| \end{methoddesc} |
| |
| \end{classdesc} |
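
The navigation properties and tree-modifying methods of \class{Node} can be sketched together as follows. The document content and the element names \code{x} and \code{y} are invented for the example, and the code assumes a current Python 3 interpreter.

```python
from xml.dom.minidom import parseString

dom = parseString("<list><a/><b/><c/></list>")
root = dom.documentElement
a, b, c = root.childNodes

# Navigation: children, siblings and parent.
names = [node.nodeName for node in root.childNodes]
assert root.firstChild is a and root.lastChild is c
assert b.previousSibling is a and b.nextSibling is c
assert b.parentNode is root

# Modification: insert <x/> before <b/>, replace <c/> by <y/>, drop <a/>.
x = dom.createElement("x")
root.insertBefore(x, b)
y = dom.createElement("y")
root.replaceChild(y, c)
root.removeChild(a)

result = root.toxml()   # <list><x/><b/><y/></list>
dom.unlink()
```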
| |
| |
| \begin{classdesc}{Document}{} |
| Represents an entire XML document, including its constituent elements, |
| attributes, processing instructions, comments etc. Remember that it |
| inherits properties from \class{Node}. |
| |
| \begin{memberdesc}{documentElement} |
| The one and only root element of the document. |
| \end{memberdesc} |
| |
| \begin{methoddesc}{createElement}{tagName} |
| Create a new element. The element is not inserted into the document |
| when it is created. You need to explicitly insert it with one of the |
| other methods such as \method{insertBefore()} or |
| \method{appendChild()}. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{createTextNode}{data} |
| Create a text node containing the data passed as a parameter. As with |
| the other creation methods, this one does not insert the node into the |
| tree. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{createComment}{data} |
| Create a comment node containing the data passed as a parameter. As |
| with the other creation methods, this one does not insert the node |
| into the tree. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{createProcessingInstruction}{target, data} |
| Create a processing instruction node containing the \var{target} and |
| \var{data} passed as parameters. As with the other creation methods, |
| this one does not insert the node into the tree. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{createAttribute}{name} |
| Create an attribute node. This method does not associate the |
| attribute node with any particular element. You must use |
| \method{setAttributeNode()} on the appropriate \class{Element} object |
| to use the newly created attribute instance. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{createElementNS}{namespaceURI, tagName} |
| Create a new element with a namespace. The \var{tagName} may have a |
| prefix. The element is not inserted into the document when it is |
| created. You need to explicitly insert it with one of the other |
| methods such as \method{insertBefore()} or \method{appendChild()}. |
| \end{methoddesc} |
| |
| |
| \begin{methoddesc}{createAttributeNS}{namespaceURI, qualifiedName} |
| Create an attribute node with a namespace. The \var{qualifiedName} may have |
| a prefix. This method does not associate the attribute node with any |
| particular element. You must use \method{setAttributeNode()} on the |
| appropriate \class{Element} object to use the newly created attribute |
| instance. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{getElementsByTagName}{tagName} |
| Search for all descendants (direct children, children's children, |
| etc.) with a particular element type name. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{getElementsByTagNameNS}{namespaceURI, localName} |
| Search for all descendants (direct children, children's children, |
| etc.) with a particular namespace URI and localname. The localname is |
| the part of the qualified name after the prefix. |
| \end{methoddesc} |
| |
| \end{classdesc} |
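
A short sketch of the creation methods working together with \method{getElementsByTagName()} is given below. The element names and text are placeholders, and the snippet assumes a current Python 3 interpreter.

```python
from xml.dom.minidom import Document

doc = Document()
root = doc.createElement("slideshow")
doc.appendChild(root)

# Created nodes are detached until they are inserted explicitly.
root.appendChild(doc.createComment("generated"))
for text in ("first", "second"):
    slide = doc.createElement("slide")
    slide.appendChild(doc.createTextNode(text))
    root.appendChild(slide)

# The search covers all descendants of the document.
slides = doc.getElementsByTagName("slide")
texts = [s.firstChild.data for s in slides]
xml_text = root.toxml()
```

The comment node is part of \code{root.childNodes} but is not returned by \method{getElementsByTagName()}, which only matches elements.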
| |
| |
| \begin{classdesc}{Element}{} |
| \begin{memberdesc}{tagName} |
| The element type name. In a namespace-using document it may have |
| colons in it. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{localName} |
| The part of the \member{tagName} following the colon if there is one, |
| else the entire \member{tagName}. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{prefix} |
| The part of the \member{tagName} preceding the colon if there is one, |
| else the empty string. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{namespaceURI} |
| The namespace associated with the tagName. |
| \end{memberdesc} |
| |
| \begin{methoddesc}{getAttribute}{attname} |
| Return an attribute value as a string. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{setAttribute}{attname, value} |
| Set an attribute value from a string. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{removeAttribute}{attname} |
| Remove an attribute by name. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{getAttributeNS}{namespaceURI, localName} |
| Return an attribute value as a string, given a \var{namespaceURI} and |
| \var{localName}. Note that a localname is the part of a prefixed |
| attribute name after the colon (if there is one). |
| \end{methoddesc} |
| |
| \begin{methoddesc}{setAttributeNS}{namespaceURI, qname, value} |
| Set an attribute value from a string, given a \var{namespaceURI} and a |
| \var{qname}. Note that a qname is the whole attribute name, including |
| any prefix, unlike the \var{localName} used by \method{getAttributeNS()}. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{removeAttributeNS}{namespaceURI, localName} |
| Remove an attribute by name. Note that it uses a localName, not a |
| qname. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{getElementsByTagName}{tagName} |
| Same as equivalent method in the \class{Document} class. |
| \end{methoddesc} |
| |
| \begin{methoddesc}{getElementsByTagNameNS}{namespaceURI, localName} |
| Same as equivalent method in the \class{Document} class. |
| \end{methoddesc} |
| |
| \end{classdesc} |
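
The difference between the plain and the namespace-aware attribute methods may be easier to see in a small example. The namespace URI and the element and attribute names below are hypothetical, and the snippet assumes a current Python 3 interpreter.

```python
from xml.dom.minidom import parseString

NS = "http://example.com/ns"  # hypothetical namespace URI
source = '<doc xmlns:x="%s"><x:item x:kind="demo" id="1"/></doc>' % NS
dom = parseString(source)
item = dom.documentElement.firstChild

# Name-related members of a namespaced element.
names = (item.tagName, item.prefix, item.localName, item.namespaceURI)

# Reading: plain name versus (namespace URI, local name).
plain = item.getAttribute("id")
namespaced = item.getAttributeNS(NS, "kind")

# Writing: setAttributeNS() takes the qualified name, prefix included.
item.setAttribute("id", "2")
item.setAttributeNS(NS, "x:kind", "changed")
updated = (item.getAttribute("id"), item.getAttributeNS(NS, "kind"))

dom.unlink()
```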
| |
| |
| \begin{classdesc}{Attribute}{} |
| |
| \begin{memberdesc}{name} |
| The attribute name. In a namespace-using document it may have colons |
| in it. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{localName} |
| The part of the name following the colon if there is one, else the |
| entire name. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{prefix} |
| The part of the name preceding the colon if there is one, else the |
| empty string. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{namespaceURI} |
| The namespace associated with the attribute name. |
| \end{memberdesc} |
| |
| \end{classdesc} |
| |
| |
| \begin{classdesc}{AttributeList}{} |
| |
| \begin{memberdesc}{length} |
| The length of the attribute list. |
| \end{memberdesc} |
| |
| \begin{methoddesc}{item}{index} |
| Return an attribute with a particular index. The order you get the |
| attributes in is arbitrary but will be consistent for the life of a |
| DOM. Each item is an attribute node. Get its value with the |
| \member{value} attribute. |
| \end{methoddesc} |
| |
| There are also experimental methods that give this class more |
| dictionary-like behavior. You can use them or you can use the |
| standardized \method{getAttribute*()}-family methods. |
| |
| \end{classdesc} |
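
As an illustration of the standardized interface, the following sketch walks over all attributes of an element using \member{length} and \method{item()}. The element and its attributes are made up, and the snippet assumes a current Python 3 interpreter.

```python
from xml.dom.minidom import parseString

dom = parseString('<point x="1" y="2"/>')
attrs = dom.documentElement.attributes

# Collect name/value pairs; the iteration order is not significant.
pairs = {}
for i in range(attrs.length):
    attr = attrs.item(i)
    pairs[attr.name] = attr.value

dom.unlink()
```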
| |
| |
| \begin{classdesc}{Comment}{} |
| Represents a comment in the XML document. |
| |
| \begin{memberdesc}{data} |
| The content of the comment. |
| \end{memberdesc} |
| \end{classdesc} |
| |
| |
| \begin{classdesc}{Text}{} |
| Represents text in the XML document. |
| |
| \begin{memberdesc}{data} |
| The content of the text node. |
| \end{memberdesc} |
| \end{classdesc} |
| |
| |
| \begin{classdesc}{ProcessingInstruction}{} |
| Represents a processing instruction in the XML document. |
| |
| \begin{memberdesc}{target} |
| The content of the processing instruction up to the first whitespace |
| character. |
| \end{memberdesc} |
| |
| \begin{memberdesc}{data} |
| The content of the processing instruction following the first |
| whitespace character. |
| \end{memberdesc} |
| \end{classdesc} |
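
The members of these three node classes can be inspected as in the sketch below. The processing instruction, comment and text are invented for the example, and the snippet assumes a current Python 3 interpreter.

```python
from xml.dom.minidom import Node, parseString

dom = parseString('<doc><?render mode="fast"?><!-- note -->text</doc>')
pi, comment, text = dom.documentElement.childNodes

# Each class exposes its content through plain string members.
found = (pi.target, pi.data, comment.data, text.data)
kinds = (pi.nodeType, comment.nodeType, text.nodeType)

dom.unlink()
```

Note that the comment data keeps its surrounding spaces, since they are part of the comment's content.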
| |
| Note that DOM attributes may also be manipulated as nodes instead of as |
| simple strings. It is fairly rare that you must do this, however, so this |
| usage is not yet documented here. |
| |
| |
| |
| |
| \subsection{DOM Example \label{dom-example}} |
| |
| The following program is a fairly realistic example of a simple DOM |
| application: it converts a small slideshow document to HTML. In this |
| particular case, we do not take much advantage of the flexibility of |
| the DOM. |
| |
| \begin{verbatim} |
| from xml.dom.minidom import parse, parseString |
| |
| document=""" |
| <slideshow> |
| <title>Demo slideshow</title> |
| <slide><title>Slide title</title> |
| <point>This is a demo</point> |
| <point>Of a program for processing slides</point> |
| </slide> |
| |
| <slide><title>Another demo slide</title> |
| <point>It is important</point> |
| <point>To have more than</point> |
| <point>one slide</point> |
| </slide> |
| </slideshow> |
| """ |
| |
| dom = parseString(document) |
| |
| space=" " |
| def getText(nodelist): |
|     # Concatenate the data of all text nodes in nodelist. |
|     rc="" |
|     for node in nodelist: |
|         if node.nodeType==node.TEXT_NODE: |
|             rc=rc+node.data |
|     return rc |
| |
| def handleSlideshow(slideshow): |
|     print("<html>") |
|     handleSlideshowTitle(slideshow.getElementsByTagName("title")[0]) |
|     slides = slideshow.getElementsByTagName("slide") |
|     handleToc(slides) |
|     handleSlides(slides) |
|     print("</html>") |
| |
| def handleSlides(slides): |
|     for slide in slides: |
|         handleSlide(slide) |
| |
| def handleSlide(slide): |
|     handleSlideTitle(slide.getElementsByTagName("title")[0]) |
|     handlePoints(slide.getElementsByTagName("point")) |
| |
| def handleSlideshowTitle(title): |
|     print("<title>%s</title>"%getText(title.childNodes)) |
| |
| def handleSlideTitle(title): |
|     print("<h2>%s</h2>"%getText(title.childNodes)) |
| |
| def handlePoints(points): |
|     print("<ul>") |
|     for point in points: |
|         handlePoint(point) |
|     print("</ul>") |
| |
| def handlePoint(point): |
|     print("<li>%s</li>"%getText(point.childNodes)) |
| |
| def handleToc(slides): |
|     # One paragraph per slide title serves as a table of contents. |
|     for slide in slides: |
|         title = slide.getElementsByTagName("title")[0] |
|         print("<p>%s</p>"%getText(title.childNodes)) |
| |
| handleSlideshow(dom) |
| \end{verbatim} |
| |
| \subsection{minidom and the DOM standard \label{minidom-and-dom}} |
| |
| Minidom is basically a DOM 1.0-compatible DOM with some DOM 2 features |
| (primarily namespace features). |
| |
| Usage of the other DOM interfaces in Python is straightforward. The |
| following mapping rules apply: |
| |
| \begin{itemize} |
| |
| \item Interfaces are accessed through instance objects. Applications |
| should |
| not instantiate the classes themselves; they should use the creator |
| functions. Derived interfaces support all operations (and attributes) |
| from the base interfaces, plus any new operations. |
| |
| \item Operations are used as methods. Since the DOM uses only |
| \code{in} |
| parameters, the arguments are passed in normal order (from left to |
| right). |
| There are no optional arguments. \code{void} operations return |
| \code{None}. |
| |
| \item IDL attributes map to instance attributes. For compatibility |
| with |
| the OMG IDL language mapping for Python, an attribute \code{foo} can |
| also be accessed through accessor functions \code{_get_foo} and |
| \code{_set_foo}. \code{readonly} attributes must not be changed. |
| |
| \item The types \code{short int}, \code{unsigned int}, \code{unsigned |
| long long}, |
| and \code{boolean} all map to Python integer objects. |
| |
| \item The type \code{DOMString} maps to Python strings. \code{minidom} |
| supports either byte or Unicode strings, but will normally produce |
| Unicode |
| strings. Attributes of type \code{DOMString} may also be \code{None}. |
| |
| \item \code{const} declarations map to variables in their respective |
| scope |
| (e.g. \code{xml.dom.minidom.Node.PROCESSING_INSTRUCTION_NODE}); they |
| must |
| not be changed. |
| |
| \item \code{DOMException} is currently not supported in |
| \module{minidom}. Instead, minidom raises standard Python exceptions |
| such as \exception{TypeError} and \exception{AttributeError}. |
| |
| \end{itemize} |
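
A few of the mapping rules above can be checked directly, as in this small sketch. It assumes a current Python 3 interpreter, and the document content is invented for the example; the numeric values of the constants are those defined by the DOM specification.

```python
from xml.dom.minidom import Node, parseString

dom = parseString("<doc>text<!--c--></doc>")
root = dom.documentElement

# const declarations are plain class attributes holding integers.
constants = (Node.ELEMENT_NODE, Node.TEXT_NODE, Node.COMMENT_NODE)

# IDL attributes are ordinary instance attributes.
kinds = [child.nodeType for child in root.childNodes]
tag_is_string = isinstance(root.tagName, str)

# Operations without a return value yield None.
unlink_result = dom.unlink()
```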
| |
| The following interfaces have no equivalent in minidom: |
| |
| \begin{itemize} |
| |
| \item DOMTimeStamp |
| |
| \item DocumentType |
| |
| \item DOMImplementation |
| |
| \item CharacterData |
| |
| \item CDATASection |
| |
| \item Notation |
| |
| \item Entity |
| |
| \item EntityReference |
| |
| \item DocumentFragment |
| |
| \end{itemize} |
| |
| Most of these reflect information in the XML document that is not of |
| general utility to most DOM users. |