\section{\module{xml.dom.minidom} ---
         The Document Object Model}
				3
				4	\declaremodule{standard}{xml.dom.minidom}
				5	\modulesynopsis{Lightweight Document Object Model (DOM) implementation.}
				6	\moduleauthor{Paul Prescod}{paul@prescod.net}
				7	\sectionauthor{Paul Prescod}{paul@prescod.net}
				8	\sectionauthor{Martin v. L\"owis}{loewis@informatik.hu-berlin.de}
				9
				10	\versionadded{2.0}
				11
\module{xml.dom.minidom} provides a lightweight implementation of
the W3C Document Object Model. The DOM is a cross-language API from
the World Wide Web Consortium (W3C) for accessing and modifying XML
documents. A DOM implementation converts an XML document into a
tree-like structure, or lets you build such a structure from scratch.
It then gives access to the structure through a set of objects which
provide well-known interfaces. Minidom is intended to be simpler than
the full DOM and also significantly smaller.
				20
The DOM is extremely useful for random-access applications. SAX only
allows you a view of one bit of the document at a time. If you are
looking at one SAX element, you have no access to another. If you are
looking at a text node, you have no access to a containing
element. When you write a SAX application, you need to keep track of
your program's position in the document somewhere in your own
code. SAX does not do it for you. Also, if you need to look ahead in
the XML document, you are just out of luck.
				29
Some applications are simply impossible in an event-driven model with
no access to a tree. Of course you could build some sort of tree
yourself from SAX events, but the DOM allows you to avoid writing that
code. The DOM is a standard tree representation for XML data.
				34
				35	%What if your needs are somewhere between SAX and the DOM? Perhaps you cannot
				36	%afford to load the entire tree in memory but you find the SAX model
				37	%somewhat cumbersome and low-level. There is also an experimental module
				38	%called pulldom that allows you to build trees of only the parts of a
				39	%document that you need structured access to. It also has features that allow
				40	%you to find your way around the DOM.
				41	% See http://www.prescod.net/python/pulldom
				42
				43	DOM applications typically start by parsing some XML into a DOM. This
				44	is done through the parse functions:
				45
\begin{verbatim}
from xml.dom.minidom import parse, parseString

dom1 = parse('c:\\temp\\mydata.xml') # parse an XML file by name

datasource = open('c:\\temp\\mydata.xml')
dom2 = parse(datasource)   # parse an open file

dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')
\end{verbatim}
				56
The \function{parse()} function can take either a filename or an open
file object.

\begin{funcdesc}{parse}{filename_or_file\optional{, parser}}
				60	Return a \class{Document} from the given input. \var{filename_or_file}
				61	may be either a file name, or a file-like object. \var{parser}, if
				62	given, must be a SAX2 parser object. This function will change the
				63	document handler of the parser and activate namespace support; other
				64	parser configuration (like setting an entity resolver) must have been
				65	done in advance.
				66	\end{funcdesc}
				67
				68	If you have XML in a string, you can use the parseString function
				69	instead:
				70
				71	\begin{funcdesc}{parseString}{string\optional{, parser}}
Return a \class{Document} that represents the \var{string}. This
function creates a \class{StringIO} object for the string and passes
that on to \function{parse()}.
				75	\end{funcdesc}
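
The two entry points can be compared side by side. The following is
only a minimal sketch: it assumes a Python version that provides the
\module{io} module, uses an in-memory byte stream in place of an open
file, and the sample document and variable names are illustrative.

```python
import io
from xml.dom.minidom import parse, parseString

XML_TEXT = "<myxml>Some data<empty/> some more data</myxml>"

# parseString() takes the XML text directly.
dom_from_string = parseString(XML_TEXT)

# parse() accepts any file-like object; an in-memory byte stream
# stands in for an open file here (an assumption of this sketch).
dom_from_file = parse(io.BytesIO(XML_TEXT.encode("utf-8")))

root_from_string = dom_from_string.documentElement
root_from_file = dom_from_file.documentElement
print(root_from_string.tagName, root_from_file.tagName)
```

Both calls should yield equivalent \class{Document} objects for the
same input.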
				76
				77	Both functions return a document object representing the content of
				78	the document.
				79
				80	You can also create a document node merely by instantiating a
				81	document object. Then you could add child nodes to it to populate
				82	the DOM.
				83
\begin{verbatim}
from xml.dom.minidom import Document

newdoc = Document()
newel = newdoc.createElement("some_tag")
newdoc.appendChild(newel)
\end{verbatim}
				91
Once you have a DOM document object, you can access the parts of your
XML document through its properties and methods. These properties are
defined in the DOM specification. The main property of the document
object is the \member{documentElement} property. It gives you the main
element in the XML document: the one that holds all others. Here is an
example program:
				98
\begin{verbatim}
dom3 = parseString("<myxml>Some data</myxml>")
assert dom3.documentElement.tagName == "myxml"
\end{verbatim}
				103
				104	When you are finished with a DOM, you should clean it up. This is
				105	necessary because some versions of Python do not support garbage
				106	collection of objects that refer to each other in a cycle. Until this
				107	restriction is removed from all versions of Python, it is safest to
				108	write your code as if cycles would not be cleaned up.
				109
				110	The way to clean up a DOM is to call its \method{unlink()} method:
				111
\begin{verbatim}
dom1.unlink()
dom2.unlink()
dom3.unlink()
\end{verbatim}
				117
				118	\method{unlink()} is a \module{minidom}-specific extension to the DOM
				119	API. After calling \method{unlink()}, a DOM is basically useless.
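
One way to make sure the cleanup always happens is to wrap the work in
\code{try}/\code{finally}. This is a sketch of that pattern, not a
required idiom; the helper name \code{root_tag_name} is hypothetical.

```python
from xml.dom.minidom import parseString

def root_tag_name(xml_text):
    """Return the root element's tag name, then release the DOM."""
    dom = parseString(xml_text)
    try:
        # Copy out what is needed before the DOM is unlinked.
        return dom.documentElement.tagName
    finally:
        # Break the internal reference cycles even if an error occurred.
        dom.unlink()

name = root_tag_name("<myxml>Some data</myxml>")
print(name)
```

Anything needed from the tree is copied out before \method{unlink()}
runs, since the DOM should not be used afterwards.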
				120
				121	\begin{seealso}
				122	\seetitle[http://www.w3.org/TR/REC-DOM-Level-1/]{DOM Specification}
				123	{This is the canonical specification for the level of the
				124	DOM supported by \module{xml.dom.minidom}.}
				125	\seetitle[http://pyxml.sourceforge.net]{PyXML}{Users that require a
				126	full-featured implementation of DOM should use the PyXML
				127	package.}
				128	\end{seealso}
				129
				130
				131	\subsection{DOM objects \label{dom-objects}}
				132
				133	The definitive documentation for the DOM is the DOM specification from
				134	the W3C. This section lists the properties and methods supported by
				135	\refmodule{xml.dom.minidom}.
				136
				137	\begin{classdesc}{Node}{}
				138	All of the components of an XML document are subclasses of
				139	\class{Node}.
				140
				141	\begin{memberdesc}{nodeType}
An integer representing the node type. Symbolic constants for the
types are on the \class{Node} object: \constant{ELEMENT_NODE},
\constant{ATTRIBUTE_NODE}, \constant{TEXT_NODE},
\constant{CDATA_SECTION_NODE}, \constant{ENTITY_NODE},
\constant{PROCESSING_INSTRUCTION_NODE}, \constant{COMMENT_NODE},
\constant{DOCUMENT_NODE}, \constant{DOCUMENT_TYPE_NODE},
\constant{NOTATION_NODE}.
				149	\end{memberdesc}
				150
				151	\begin{memberdesc}{parentNode}
				152	The parent of the current node. \code{None} for the document node.
				153	\end{memberdesc}
				154
				155	\begin{memberdesc}{attributes}
				156	An \class{AttributeList} of attribute objects. Only
				157	elements have this attribute. Others return \code{None}.
				158	\end{memberdesc}
				159
				160	\begin{memberdesc}{previousSibling}
The node that immediately precedes this one with the same parent. For
instance, the previous sibling of an element may be the element whose
end-tag comes just before the \var{self} element's start-tag. Of
course, XML documents are made up of more than just elements, so the
previous sibling could be text, a comment, or something else.
				166	\end{memberdesc}
				167
				168	\begin{memberdesc}{nextSibling}
				169	The node that immediately follows this one with the same parent. See
				170	also \member{previousSibling}.
				171	\end{memberdesc}
				172
				173	\begin{memberdesc}{childNodes}
				174	A list of nodes contained within this node.
				175	\end{memberdesc}
				176
				177	\begin{memberdesc}{firstChild}
				178	Equivalent to \code{childNodes[0]}.
				179	\end{memberdesc}
				180
				181	\begin{memberdesc}{lastChild}
				182	Equivalent to \code{childNodes[-1]}.
				183	\end{memberdesc}
				184
				185	\begin{memberdesc}{nodeName}
				186	Has a different meaning for each node type. See the DOM specification
				187	for details. You can always get the information you would get here
				188	from another property such as the \member{tagName} property for
				189	elements or the \member{name} property for attributes.
				190	\end{memberdesc}
				191
				192	\begin{memberdesc}{nodeValue}
				193	Has a different meaning for each node type. See the DOM specification
				194	for details. The situation is similar to that with \member{nodeName}.
				195	\end{memberdesc}
				196
				197	\begin{methoddesc}{unlink}{}
				198	Break internal references within the DOM so that it will be garbage
				199	collected on versions of Python without cyclic GC.
				200	\end{methoddesc}
				201
				202	\begin{methoddesc}{writexml}{writer}
				203	Write XML to the writer object. The writer should have a
				204	\method{write()} method which matches that of the file object
				205	interface.
				206	\end{methoddesc}
				207
				208	\begin{methoddesc}{toxml}{}
				209	Return the XML string that the DOM represents.
				210	\end{methoddesc}
				211
				212	\begin{methoddesc}{hasChildNodes}{}
Returns true if the node has any child nodes.
				214	\end{methoddesc}
				215
				216	\begin{methoddesc}{insertBefore}{newChild, refChild}
				217	Insert a new child node before an existing child. It must be the case
				218	that \var{refChild} is a child of this node; if not,
				219	\exception{ValueError} is raised.
				220	\end{methoddesc}
				221
				222	\begin{methoddesc}{replaceChild}{newChild, oldChild}
				223	Replace an existing node with a new node. It must be the case that
				224	\var{oldChild} is a child of this node; if not,
				225	\exception{ValueError} is raised.
				226	\end{methoddesc}
				227
				228	\begin{methoddesc}{removeChild}{oldChild}
				229	Remove a child node. \var{oldChild} must be a child of this node; if
				230	not, \exception{ValueError} is raised.
				231	\end{methoddesc}
				232
				233	\begin{methoddesc}{appendChild}{newChild}
Add a new child node to the end of this node's list of children.
				235	\end{methoddesc}
				236
				237	\begin{methoddesc}{cloneNode}{deep}
Clone this node. A true value for \var{deep} means to clone all
children as well. Deep cloning is not implemented in Python 2, so the
\var{deep} parameter should always be 0 for now.
				241	\end{methoddesc}
				242
				243	\end{classdesc}
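
The navigation attributes and tree-editing methods of \class{Node} can
be combined as in the following sketch. The sample document is made
up, and the output format of \method{toxml()} is assumed to be that of
current Python releases.

```python
from xml.dom.minidom import parseString

dom = parseString("<list><a/>text<b/></list>")
root = dom.documentElement

# Children arrive in document order: element, text, element.
kinds = [child.nodeType for child in root.childNodes]

first = root.firstChild        # the <a/> element
last = root.lastChild          # the <b/> element
middle = first.nextSibling     # the text node between them

# Insert a new element before <b/>, then drop the text node.
new_child = dom.createElement("c")
root.insertBefore(new_child, last)
root.removeChild(middle)

result = root.toxml()
print(result)
```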
				244
				245
				246	\begin{classdesc}{Document}{}
Represents an entire XML document, including its constituent elements,
attributes, processing instructions, comments etc. Remember that it
inherits properties from \class{Node}.
				250
				251	\begin{memberdesc}{documentElement}
				252	The one and only root element of the document.
				253	\end{memberdesc}
				254
				255	\begin{methoddesc}{createElement}{tagName}
				256	Create a new element. The element is not inserted into the document
				257	when it is created. You need to explicitly insert it with one of the
				258	other methods such as \method{insertBefore()} or
				259	\method{appendChild()}.
				260	\end{methoddesc}
				261
				262	\begin{methoddesc}{createTextNode}{data}
				263	Create a text node containing the data passed as a parameter. As with
				264	the other creation methods, this one does not insert the node into the
				265	tree.
				266	\end{methoddesc}
				267
				268	\begin{methoddesc}{createComment}{data}
				269	Create a comment node containing the data passed as a parameter. As
				270	with the other creation methods, this one does not insert the node
				271	into the tree.
				272	\end{methoddesc}
				273
				274	\begin{methoddesc}{createProcessingInstruction}{target, data}
				275	Create a processing instruction node containing the \var{target} and
				276	\var{data} passed as parameters. As with the other creation methods,
				277	this one does not insert the node into the tree.
				278	\end{methoddesc}
				279
				280	\begin{methoddesc}{createAttribute}{name}
				281	Create an attribute node. This method does not associate the
				282	attribute node with any particular element. You must use
				283	\method{setAttributeNode()} on the appropriate \class{Element} object
				284	to use the newly created attribute instance.
				285	\end{methoddesc}
				286
				287	\begin{methoddesc}{createElementNS}{namespaceURI, tagName}
				288	Create a new element with a namespace. The \var{tagName} may have a
				289	prefix. The element is not inserted into the document when it is
				290	created. You need to explicitly insert it with one of the other
				291	methods such as \method{insertBefore()} or \method{appendChild()}.
				292	\end{methoddesc}
				293
				294
				295	\begin{methoddesc}{createAttributeNS}{namespaceURI, qualifiedName}
Create an attribute node with a namespace. The \var{qualifiedName} may
have a prefix. This method does not associate the attribute node with
any particular element. You must use \method{setAttributeNode()} on the
appropriate \class{Element} object to use the newly created attribute
instance.
				301	\end{methoddesc}
				302
				303	\begin{methoddesc}{getElementsByTagName}{tagName}
				304	Search for all descendants (direct children, children's children,
				305	etc.) with a particular element type name.
				306	\end{methoddesc}
				307
				308	\begin{methoddesc}{getElementsByTagNameNS}{namespaceURI, localName}
Search for all descendants (direct children, children's children,
etc.) with a particular namespace URI and localname. The localname is
the part of the qualified name after the prefix.
				312	\end{methoddesc}
				313
				314	\end{classdesc}
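
As a sketch of how the creation methods fit together, the following
builds a small document from scratch and then searches it. The element
names are illustrative only.

```python
from xml.dom.minidom import Document

doc = Document()

# Creation methods return detached nodes; appendChild() attaches them.
root = doc.createElement("notes")
doc.appendChild(root)

root.appendChild(doc.createComment("generated"))
for text in ("first", "second"):
    note = doc.createElement("note")
    note.appendChild(doc.createTextNode(text))
    root.appendChild(note)

# Search the whole tree for elements by type name.
notes = doc.getElementsByTagName("note")
xml_text = doc.documentElement.toxml()
print(len(notes), xml_text)
```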
				315
				316
				317	\begin{classdesc}{Element}{}
				318	\begin{memberdesc}{tagName}
				319	The element type name. In a namespace-using document it may have
				320	colons in it.
				321	\end{memberdesc}
				322
				323	\begin{memberdesc}{localName}
				324	The part of the \member{tagName} following the colon if there is one,
				325	else the entire \member{tagName}.
				326	\end{memberdesc}
				327
				328	\begin{memberdesc}{prefix}
				329	The part of the \member{tagName} preceding the colon if there is one,
				330	else the empty string.
				331	\end{memberdesc}
				332
				333	\begin{memberdesc}{namespaceURI}
The namespace associated with the \member{tagName}.
				335	\end{memberdesc}
				336
				337	\begin{methoddesc}{getAttribute}{attname}
				338	Return an attribute value as a string.
				339	\end{methoddesc}
				340
				341	\begin{methoddesc}{setAttribute}{attname, value}
				342	Set an attribute value from a string.
				343	\end{methoddesc}
				344
				345	\begin{methoddesc}{removeAttribute}{attname}
				346	Remove an attribute by name.
				347	\end{methoddesc}
				348
				349	\begin{methoddesc}{getAttributeNS}{namespaceURI, localName}
				350	Return an attribute value as a string, given a \var{namespaceURI} and
				351	\var{localName}. Note that a localname is the part of a prefixed
				352	attribute name after the colon (if there is one).
				353	\end{methoddesc}
				354
				355	\begin{methoddesc}{setAttributeNS}{namespaceURI, qname, value}
Set an attribute value from a string, given a \var{namespaceURI} and a
\var{qname}. Note that a qname is the whole attribute name, including
any prefix. This differs from \method{getAttributeNS()}, which takes a
local name.
				359	\end{methoddesc}
				360
				361	\begin{methoddesc}{removeAttributeNS}{namespaceURI, localName}
				362	Remove an attribute by name. Note that it uses a localName, not a
				363	qname.
				364	\end{methoddesc}
				365
				366	\begin{methoddesc}{getElementsByTagName}{tagName}
				367	Same as equivalent method in the \class{Document} class.
				368	\end{methoddesc}
				369
\begin{methoddesc}{getElementsByTagNameNS}{namespaceURI, localName}
				371	Same as equivalent method in the \class{Document} class.
				372	\end{methoddesc}
				373
				374	\end{classdesc}
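
The attribute methods can be exercised as in the sketch below. The
sample element and the XLink namespace are chosen purely for
illustration. Note how the namespace-aware setter takes a qualified
name while the getter and remover take a local name.

```python
from xml.dom.minidom import parseString

dom = parseString('<item id="42"/>')
item = dom.documentElement

# Plain attribute access by name.
item_id = item.getAttribute("id")
item.setAttribute("status", "new")
item.removeAttribute("id")

# Namespace-aware variants: set with the qualified name,
# read and remove with the local name.
XLINK = "http://www.w3.org/1999/xlink"
item.setAttributeNS(XLINK, "xlink:href", "http://example.com/")
href = item.getAttributeNS(XLINK, "href")
item.removeAttributeNS(XLINK, "href")

remaining = item.toxml()
print(item_id, href, remaining)
```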
				375
				376
				377	\begin{classdesc}{Attribute}{}
				378
				379	\begin{memberdesc}{name}
				380	The attribute name. In a namespace-using document it may have colons
				381	in it.
				382	\end{memberdesc}
				383
				384	\begin{memberdesc}{localName}
				385	The part of the name following the colon if there is one, else the
				386	entire name.
				387	\end{memberdesc}
				388
				389	\begin{memberdesc}{prefix}
				390	The part of the name preceding the colon if there is one, else the
				391	empty string.
				392	\end{memberdesc}
				393
				394	\begin{memberdesc}{namespaceURI}
				395	The namespace associated with the attribute name.
				396	\end{memberdesc}
				397
				398	\end{classdesc}
				399
				400
				401	\begin{classdesc}{AttributeList}{}
				402
				403	\begin{memberdesc}{length}
				404	The length of the attribute list.
				405	\end{memberdesc}
				406
				407	\begin{methoddesc}{item}{index}
Return an attribute with a particular index. The order you get the
attributes in is arbitrary but will be consistent for the life of a
DOM. Each item is an attribute node. Get its value with the
\member{value} attribute.
				412	\end{methoddesc}
				413
				414	There are also experimental methods that give this class more
				415	dictionary-like behavior. You can use them or you can use the
				416	standardized \method{getAttribute*()}-family methods.
				417
				418	\end{classdesc}
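
A sketch of walking an attribute list with \member{length} and
\method{item()} follows. It is written against current Python
releases, where the object returned by \member{attributes} is called
\class{NamedNodeMap} rather than \class{AttributeList}; the sample
element is illustrative.

```python
from xml.dom.minidom import parseString

dom = parseString('<img src="a.png" alt="A"/>')
attrs = dom.documentElement.attributes

# Collect name/value pairs; the order of items is not significant.
pairs = {}
for index in range(attrs.length):
    attr = attrs.item(index)    # an attribute node
    pairs[attr.name] = attr.value

print(sorted(pairs.items()))
```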
				419
				420
				421	\begin{classdesc}{Comment}{}
				422	Represents a comment in the XML document.
				423
				424	\begin{memberdesc}{data}
				425	The content of the comment.
				426	\end{memberdesc}
				427	\end{classdesc}
				428
				429
				430	\begin{classdesc}{Text}{}
				431	Represents text in the XML document.
				432
				433	\begin{memberdesc}{data}
				434	The content of the text node.
				435	\end{memberdesc}
				436	\end{classdesc}
				437
				438
				439	\begin{classdesc}{ProcessingInstruction}{}
				440	Represents a processing instruction in the XML document.
				441
				442	\begin{memberdesc}{target}
				443	The content of the processing instruction up to the first whitespace
				444	character.
				445	\end{memberdesc}
				446
				447	\begin{memberdesc}{data}
				448	The content of the processing instruction following the first
				449	whitespace character.
				450	\end{memberdesc}
				451	\end{classdesc}
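
The \member{data} and \member{target} members of these three classes
can be inspected as in this sketch. The sample document is made up for
illustration.

```python
from xml.dom.minidom import parseString

dom = parseString(
    '<?xml-stylesheet href="s.css"?><doc><!-- note -->hello</doc>')

# The processing instruction is a child of the document itself.
pi = dom.firstChild
# The root element holds a comment followed by a text node.
comment, text = dom.documentElement.childNodes

print(pi.target, pi.data)
print(repr(comment.data), repr(text.data))
```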
				452
				453	Note that DOM attributes may also be manipulated as nodes instead of as
				454	simple strings. It is fairly rare that you must do this, however, so this
				455	usage is not yet documented here.
				456
				457
				458	\begin{seealso}
				459	\seetitle[http://www.w3.org/TR/REC-DOM-Level-1/]{DOM Specification}
				460	{This is the canonical specification for the level of the
				461	DOM supported by \module{xml.dom.minidom}.}
				462	\end{seealso}
				463
				464
				465	\subsection{DOM Example \label{dom-example}}
				466
The following program is a fairly realistic example of a simple DOM
application. It does not take much advantage of the flexibility of
the DOM.
				470
\begin{verbatim}
from xml.dom.minidom import parse, parseString

document="""
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>

<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

dom = parseString(document)

space=" "
def getText(nodelist):
    # Concatenate the data of all text nodes in the list.
    rc=""
    for node in nodelist:
        if node.nodeType==node.TEXT_NODE:
            rc=rc+node.data
    return rc

# print(...) with a single argument works as a statement or a function.
def handleSlideshow(slideshow):
    print("<html>")
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")

def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)

def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))

def handleSlideshowTitle(title):
    print("<title>%s</title>"%getText(title.childNodes))

def handleSlideTitle(title):
    print("<h2>%s</h2>"%getText(title.childNodes))

def handlePoints(points):
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")

def handlePoint(point):
    print("<li>%s</li>"%getText(point.childNodes))

def handleToc(slides):
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("<p>%s</p>"%getText(title.childNodes))

handleSlideshow(dom)
\end{verbatim}
				538
				539	\subsection{minidom and the DOM standard \label{minidom-and-dom}}
				540
				541	Minidom is basically a DOM 1.0-compatible DOM with some DOM 2 features
				542	(primarily namespace features).
				543
Usage of the other DOM interfaces in Python is straightforward. The
following mapping rules apply:
				546
				547	\begin{itemize}
				548
\item Interfaces are accessed through instance objects. Applications
should not instantiate the classes themselves; they should use the
creator functions. Derived interfaces support all operations (and
attributes) from the base interfaces, plus any new operations.

\item Operations are used as methods. Since the DOM uses only
\code{in} parameters, the arguments are passed in normal order (from
left to right). There are no optional arguments. \code{void}
operations return \code{None}.

\item IDL attributes map to instance attributes. For compatibility
with the OMG IDL language mapping for Python, an attribute \code{foo}
can also be accessed through accessor functions \code{_get_foo} and
\code{_set_foo}. \code{readonly} attributes must not be changed.

\item The types \code{short int}, \code{unsigned int}, \code{unsigned
long long}, and \code{boolean} all map to Python integer objects.

\item The type \code{DOMString} maps to Python strings. \code{minidom}
supports either byte or Unicode strings, but will normally produce
Unicode strings. Attributes of type \code{DOMString} may also be
\code{None}.

\item \code{const} declarations map to variables in their respective
scope (e.g. \code{xml.dom.minidom.Node.PROCESSING_INSTRUCTION_NODE});
they must not be changed.
				582
\item \code{DOMException} is currently not supported in
\module{minidom}. Instead, \module{minidom} raises standard Python
exceptions such as \exception{TypeError} and
\exception{AttributeError}.
				586
				587	\end{itemize}
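
Several of the mapping rules above can be observed directly. The
following is a sketch against current Python releases; the sample
document is illustrative, and the comments name the rule each line is
meant to show.

```python
from xml.dom.minidom import Node, parseString

dom = parseString("<greeting>hello</greeting>")
root = dom.documentElement

# const declaration -> variable in the scope of the Node class
is_element = root.nodeType == Node.ELEMENT_NODE

# IDL attribute -> instance attribute; DOMString -> Python string
tag = root.tagName

# operation -> method; the boolean result is usable as an integer
has_children = root.hasChildNodes()

# DOMString attributes may be None, here for an element
# that is not in any namespace
namespace = root.namespaceURI

# void operation -> None
void_result = root.normalize()

print(is_element, tag, has_children, namespace, void_result)
```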
				588
				589	The following interfaces have no equivalent in minidom:
				590
				591	\begin{itemize}
				592
				593	\item DOMTimeStamp
				594
				595	\item DocumentType
				596
				597	\item DOMImplementation
				598
				599	\item CharacterData
				600
				601	\item CDATASection
				602
				603	\item Notation
				604
				605	\item Entity
				606
				607	\item EntityReference
				608
				609	\item DocumentFragment
				610
				611	\end{itemize}
				612
Most of these reflect information in the XML document that is of
little use to most DOM users.