:mod:`xml.dom.pulldom` --- Support for building partial DOM trees
=================================================================

.. module:: xml.dom.pulldom
   :synopsis: Support for building partial DOM trees from SAX events.

.. moduleauthor:: Paul Prescod <paul@prescod.net>

**Source code:** :source:`Lib/xml/dom/pulldom.py`

--------------

The :mod:`xml.dom.pulldom` module provides a "pull parser" which can also be
asked to produce DOM-accessible fragments of the document where necessary. The
basic concept involves pulling "events" from a stream of incoming XML and
processing them. In contrast to SAX, which also employs an event-driven
processing model together with callbacks, the user of a pull parser is
responsible for explicitly pulling events from the stream, looping over those
events until either processing is finished or an error condition occurs.


.. warning::

   The :mod:`xml.dom.pulldom` module is not secure against
   maliciously constructed data. If you need to parse untrusted or
   unauthenticated data, see :ref:`xml-vulnerabilities`.

.. versionchanged:: 3.7.1

   To increase security, the SAX parser no longer processes general external
   entities by default. To enable processing of external entities,
   pass a custom parser instance in::

      from xml.dom.pulldom import parse
      from xml.sax import make_parser
      from xml.sax.handler import feature_external_ges

      parser = make_parser()
      parser.setFeature(feature_external_ges, True)
      parse(filename, parser=parser)


Example::

   from xml.dom import pulldom

   doc = pulldom.parse('sales_items.xml')
   for event, node in doc:
       if event == pulldom.START_ELEMENT and node.tagName == 'item':
           if int(node.getAttribute('price')) > 50:
               doc.expandNode(node)
               print(node.toxml())

``event`` is a constant and can be one of:

* :data:`START_ELEMENT`
* :data:`END_ELEMENT`
* :data:`COMMENT`
* :data:`START_DOCUMENT`
* :data:`END_DOCUMENT`
* :data:`CHARACTERS`
* :data:`PROCESSING_INSTRUCTION`
* :data:`IGNORABLE_WHITESPACE`

``node`` is an object of type :class:`xml.dom.minidom.Document`,
:class:`xml.dom.minidom.Element` or :class:`xml.dom.minidom.Text`.

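To see which events and node types a small document produces, the stream can be dumped as in the following sketch. The sample XML and the ``pairs`` name are illustrative assumptions. The exact sequence can vary (for example, text may arrive as more than one :data:`CHARACTERS` event), so no fixed output is shown.

```python
from xml.dom import pulldom

# Illustrative sample document (not part of the module).
XML = "<doc><title>Hello</title></doc>"

# Pair each event constant with the class name of the node delivered with it.
pairs = [(event, type(node).__name__)
         for event, node in pulldom.parseString(XML)]

for event, kind in pairs:
    print(f"{event:15} {kind}")
```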
Since the document is treated as a "flat" stream of events, the document "tree"
is implicitly traversed and the desired elements are found regardless of their
depth in the tree. In other words, one does not need to consider hierarchical
issues such as recursive searching of the document nodes. If the context of
elements is important, however, one needs either to maintain some
context-related state (i.e., remember where one is in the document at any
given point) or to make use of the :meth:`DOMEventStream.expandNode` method
and switch to DOM-related processing.


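One way to maintain context-related state is to keep a stack of the currently open element names. The sketch below is only an illustration: the sample XML and the ``item_names`` helper are hypothetical. It collects the text of ``<name>`` elements only when they sit directly inside an ``<item>``.

```python
from xml.dom import pulldom

# Illustrative sample: one <name> belongs to the order, one to an item.
XML = "<order><name>Order 1</name><item><name>Widget</name></item></order>"

def item_names(xml_text):
    """Return the text of every <name> that is a direct child of an <item>."""
    path = []    # names of the elements that are currently open
    names = []   # collected text, one entry per matching <name>
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT:
            path.append(node.tagName)
            if path[-2:] == ["item", "name"]:
                names.append("")
        elif event == pulldom.END_ELEMENT:
            path.pop()
        elif event == pulldom.CHARACTERS and path[-2:] == ["item", "name"]:
            # Text may be delivered in several chunks, so accumulate it.
            names[-1] += node.data
    return names

print(item_names(XML))
```

The stack is updated on every start and end event, so the last two entries always describe the current element and its parent.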
.. class:: PullDOM(documentFactory=None)

   Subclass of :class:`xml.sax.handler.ContentHandler`.


.. class:: SAX2DOM(documentFactory=None)

   Subclass of :class:`xml.sax.handler.ContentHandler`.


.. function:: parse(stream_or_string, parser=None, bufsize=None)

   Return a :class:`DOMEventStream` from the given input. *stream_or_string* may be
   either a file name or a file-like object. *parser*, if given, must be an
   :class:`~xml.sax.xmlreader.XMLReader` object. This function will change the
   document handler of the parser and activate namespace support; other parser
   configuration (like setting an entity resolver) must have been done in
   advance.

   If you have XML in a string, you can use the :func:`parseString` function instead:

.. function:: parseString(string, parser=None)

   Return a :class:`DOMEventStream` that represents the (Unicode) *string*.

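As a brief sketch of the two entry points, the same document can be read from a file-like object with :func:`parse` or from a string with :func:`parseString`. The sample XML and the ``prices`` helper are illustrative assumptions, and an in-memory :class:`io.BytesIO` stands in for a real file.

```python
import io
from xml.dom import pulldom

# Illustrative sample document.
XML = "<items><item price='10'/><item price='75'/></items>"

def prices(stream):
    """Collect the price attribute of every <item> start event."""
    return [int(node.getAttribute("price"))
            for event, node in stream
            if event == pulldom.START_ELEMENT and node.tagName == "item"]

# parseString() takes the document text directly.
from_string = prices(pulldom.parseString(XML))

# parse() takes a file name or a file-like object; here a bytes buffer.
from_file = prices(pulldom.parse(io.BytesIO(XML.encode("utf-8"))))

print(from_string, from_file)
```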
.. data:: default_bufsize

   Default value for the *bufsize* parameter to :func:`parse`.

   The value of this variable can be changed before calling :func:`parse` and
   the new value will take effect.

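The *bufsize* argument can also be passed explicitly. The following sketch assumes that *bufsize* controls how much data is read from the stream at a time, and checks that a deliberately tiny buffer yields the same elements as the default. The sample XML and the ``item_ids`` helper are hypothetical.

```python
import io
from xml.dom import pulldom

# Illustrative sample document, as bytes so it can back a binary stream.
XML = b"<items><item id='a'/><item id='b'/></items>"

def item_ids(bufsize=None):
    """Parse XML with the given bufsize (None means default_bufsize)."""
    stream = pulldom.parse(io.BytesIO(XML), bufsize=bufsize)
    return [node.getAttribute("id")
            for event, node in stream
            if event == pulldom.START_ELEMENT and node.tagName == "item"]

default_ids = item_ids()          # uses pulldom.default_bufsize
small_ids = item_ids(bufsize=8)   # forces many small reads

print(default_ids, small_ids)
```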
.. _domeventstream-objects:

DOMEventStream Objects
----------------------

.. class:: DOMEventStream(stream, parser, bufsize)

   .. deprecated:: 3.8
      Support for :meth:`sequence protocol <__getitem__>` is deprecated.

   .. method:: getEvent()

      Return a tuple containing *event* and the current *node*. The node is an
      :class:`xml.dom.minidom.Document` if event equals :data:`START_DOCUMENT`,
      an :class:`xml.dom.minidom.Element` if event equals :data:`START_ELEMENT` or
      :data:`END_ELEMENT`, or an :class:`xml.dom.minidom.Text` if event equals
      :data:`CHARACTERS`.
      The current node does not contain information about its children, unless
      :meth:`expandNode` is called.

   .. method:: expandNode(node)

      Expands all children of *node* into *node*. Example::

          from xml.dom import pulldom

          xml = '<html><title>Foo</title> <p>Some text <div>and more</div></p> </html>'
          doc = pulldom.parseString(xml)
          for event, node in doc:
              if event == pulldom.START_ELEMENT and node.tagName == 'p':
                  # Following statement only prints '<p/>'
                  print(node.toxml())
                  doc.expandNode(node)
                  # Following statement prints node with all its children '<p>Some text <div>and more</div></p>'
                  print(node.toxml())

   .. method:: DOMEventStream.reset()

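The sketch below drives a stream by calling :meth:`getEvent` directly instead of iterating, and expands every ``<item>`` element. It relies on two behaviours of the CPython implementation that are not stated above, so treat them as assumptions: :meth:`getEvent` returns ``None`` once the input is exhausted, and :meth:`expandNode` consumes the events of the expanded subtree, so its children are not delivered again. The sample XML and the ``expanded_items`` helper are hypothetical.

```python
from xml.dom import pulldom

# Illustrative sample document.
XML = "<list><item><sub>a</sub></item><item/></list>"

def expanded_items(xml_text):
    stream = pulldom.parseString(xml_text)
    seen = []    # tag names delivered as START_ELEMENT events
    items = []   # serialized <item> subtrees
    while True:
        pair = stream.getEvent()
        if pair is None:   # assumption: None signals the end of the input
            break
        event, node = pair
        if event == pulldom.START_ELEMENT:
            seen.append(node.tagName)
            if node.tagName == "item":
                # Pull the rest of this element's events into the node.
                stream.expandNode(node)
                items.append(node.toxml())
    return seen, items

seen, items = expanded_items(XML)
print(seen)
print(items)
```

Because the ``<sub>`` element is consumed while its parent is expanded, it never shows up as a separate start event in ``seen``.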