:mod:`sgmllib` --- Simple SGML parser
=====================================

.. module:: sgmllib
   :synopsis: Only as much of an SGML parser as needed to parse HTML.
   :deprecated:

.. deprecated:: 2.6
   The :mod:`sgmllib` module has been removed in Python 3.0.

.. index:: single: SGML

This module defines a class :class:`SGMLParser` which serves as the basis for
parsing text files formatted in SGML (Standard Generalized Mark-up Language).
In fact, it does not provide a full SGML parser --- it only parses SGML insofar
as it is used by HTML, and the module only exists as a base for the
:mod:`htmllib` module.  Another HTML parser which supports XHTML and offers a
somewhat different interface is available in the :mod:`HTMLParser` module.


.. class:: SGMLParser()

   The :class:`SGMLParser` class is instantiated without arguments. The parser is
   hardcoded to recognize the following constructs:

   * Opening and closing tags of the form ``<tag attr="value" ...>`` and
     ``</tag>``, respectively.

   * Numeric character references of the form ``&#name;``.

   * Entity references of the form ``&name;``.

   * SGML comments of the form ``<!--text-->``.  Note that spaces, tabs, and
     newlines are allowed between the trailing ``>`` and the immediately preceding
     ``--``.

A single exception is defined as well:


.. exception:: SGMLParseError

   Exception raised by the :class:`SGMLParser` class when it encounters an error
   while parsing.

   .. versionadded:: 2.1

:class:`SGMLParser` instances have the following methods:


.. method:: SGMLParser.reset()

   Reset the instance.  All unprocessed data is lost.  This is called implicitly
   at instantiation time.


.. method:: SGMLParser.setnomoretags()

   Stop processing tags.  Treat all following input as literal input (CDATA).
   (This is only provided so the HTML tag ``<PLAINTEXT>`` can be implemented.)


.. method:: SGMLParser.setliteral()

   Enter literal mode (CDATA mode).


.. method:: SGMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.

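The buffering described for :meth:`feed` is easiest to see by feeding a tag in
two pieces.  Because :mod:`sgmllib` does not exist in Python 3, the following
sketch assumes Python 3 and swaps in the standard library's
``html.parser.HTMLParser``, which has a similarly named ``feed()`` method, as a
stand-in.  The class name ``TagLogger`` and the ``seen`` list are made up for
this illustration, and what it demonstrates is the behaviour of ``html.parser``,
not a guarantee about :mod:`sgmllib` itself.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper that records every start tag it is told about."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


parser = TagLogger()
parser.feed('<a hre')            # incomplete tag: buffered, no callback yet
after_first = list(parser.seen)
parser.feed('f="x">')            # the tag is now complete and is processed
after_second = list(parser.seen)
parser.close()                   # flush anything still buffered
```

After the first call nothing has been reported; the start tag only reaches the
handler once the second call supplies the rest of it.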

.. method:: SGMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always
   call the base class :meth:`close` method.


.. method:: SGMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).


.. method:: SGMLParser.handle_starttag(tag, method, attributes)

   This method is called to handle start tags for which either a :meth:`start_tag`
   or :meth:`do_tag` method has been defined.  The *tag* argument is the name of
   the tag converted to lower case, and the *method* argument is the bound method
   which should be used to support semantic interpretation of the start tag.  The
   *attributes* argument is a list of ``(name, value)`` pairs containing the
   attributes found inside the tag's ``<>`` brackets.

   The *name* has been translated to lower case.  Double quotes and backslashes in
   the *value* have been interpreted, as well as known character references and
   known entity references terminated by a semicolon (normally, entity references
   can be terminated by any non-alphanumeric character, but this would break the
   very common case of ``<A HREF="url?spam=1&eggs=2">`` when ``eggs`` is a valid
   entity name).

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, *tag* is ``'a'``
   and *attributes* is ``[('href', 'http://www.cwi.nl/')]``.  If neither a
   :meth:`start_a` nor a :meth:`do_a` method is defined, the parser instead calls
   ``unknown_starttag('a', [('href', 'http://www.cwi.nl/')])``.  The base
   implementation simply calls *method* with *attributes* as the only argument.

   .. versionadded:: 2.5
      Handling of entity and character references within attribute values.


.. method:: SGMLParser.handle_endtag(tag, method)

   This method is called to handle end tags for which an :meth:`end_tag` method has
   been defined.  The *tag* argument is the name of the tag converted to lower
   case, and the *method* argument is the bound method which should be used to
   support semantic interpretation of the end tag.  If no :meth:`end_tag` method is
   defined for the closing element, this handler is not called.  The base
   implementation simply calls *method*.


.. method:: SGMLParser.handle_data(data)

   This method is called to process arbitrary data.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.


.. method:: SGMLParser.handle_charref(ref)

   This method is called to process a character reference of the form ``&#ref;``.
   The base implementation uses :meth:`convert_charref` to convert the reference to
   a string.  If that method returns a string, it is passed to :meth:`handle_data`,
   otherwise ``unknown_charref(ref)`` is called to handle the error.

   .. versionchanged:: 2.5
      Use :meth:`convert_charref` instead of hard-coding the conversion.


.. method:: SGMLParser.convert_charref(ref)

   Convert a character reference to a string, or ``None``.  *ref* is the reference
   passed in as a string.  In the base implementation, *ref* must be a decimal
   number in the range 0-255.  The code point found is converted using the
   :meth:`convert_codepoint` method.  If *ref* is invalid or out of range, this
   method returns ``None``.  This method is called by the default
   :meth:`handle_charref` implementation and by the attribute value parser.

   .. versionadded:: 2.5


.. method:: SGMLParser.convert_codepoint(codepoint)

   Convert a codepoint to a :class:`str` value.  Encodings can be handled here if
   appropriate, though the rest of :mod:`sgmllib` is oblivious to this matter.

   .. versionadded:: 2.5

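The conversion rule described for :meth:`convert_charref` and
:meth:`convert_codepoint` can be sketched as two standalone functions.  This is
a minimal illustration written for Python 3, where :mod:`sgmllib` is not
available, so the functions below are stand-ins that follow the documented
behaviour; they are not the module's own code, and the use of ``chr()`` for the
code point is an assumption.

```python
import re


def convert_codepoint(codepoint):
    # Stand-in: turn an integer code point into a one-character string.
    return chr(codepoint)


def convert_charref(ref):
    # Stand-in for the documented base behaviour: only a decimal number
    # in the range 0-255 is accepted; anything else yields None.
    if not re.fullmatch(r"[0-9]+", ref):
        return None
    n = int(ref)
    if not 0 <= n <= 255:
        return None
    return convert_codepoint(n)


in_range = convert_charref("65")      # decimal, within 0-255
too_large = convert_charref("256")    # out of range
not_decimal = convert_charref("x41")  # not a decimal number
```

A derived class that needs a wider range or a specific encoding would override
one of the two methods rather than the handler that calls them.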

.. method:: SGMLParser.handle_entityref(ref)

   This method is called to process a general entity reference of the form
   ``&ref;`` where *ref* is the name of a general entity.  It converts *ref* by
   passing it to :meth:`convert_entityref`.  If a translation is returned, it calls
   the method :meth:`handle_data` with the translation; otherwise, it calls the
   method ``unknown_entityref(ref)``.  The default :attr:`entitydefs` defines
   translations for ``&amp;``, ``&apos;``, ``&gt;``, ``&lt;``, and ``&quot;``.

   .. versionchanged:: 2.5
      Use :meth:`convert_entityref` instead of hard-coding the conversion.


.. method:: SGMLParser.convert_entityref(ref)

   Convert a named entity reference to a :class:`str` value, or ``None``.  The
   resulting value will not be parsed.  *ref* will be only the name of the entity.
   The default implementation looks for *ref* in the instance (or class) variable
   :attr:`entitydefs` which should be a mapping from entity names to corresponding
   translations.  If no translation is available for *ref*, this method returns
   ``None``.  This method is called by the default :meth:`handle_entityref`
   implementation and by the attribute value parser.

   .. versionadded:: 2.5

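The interplay between :meth:`handle_entityref`, :meth:`convert_entityref` and
:attr:`entitydefs` can be imitated with a small stand-in class.  This is a
sketch for Python 3, where :mod:`sgmllib` cannot be imported; the class name
``EntityHandler`` and the ``data`` and ``unknown`` lists are hypothetical and
exist only to make the control flow visible.

```python
class EntityHandler:
    """Hypothetical stand-in following the documented entity handling."""

    # The five default translations named in the text above.
    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    def __init__(self):
        self.data = []      # text passed to handle_data
        self.unknown = []   # names passed to unknown_entityref

    def convert_entityref(self, ref):
        # Look the bare entity name up; None means "no translation".
        return self.entitydefs.get(ref)

    def handle_entityref(self, ref):
        replacement = self.convert_entityref(ref)
        if replacement is None:
            self.unknown_entityref(ref)
        else:
            self.handle_data(replacement)

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)


handler = EntityHandler()
handler.handle_entityref("amp")    # known: translated and passed on as data
handler.handle_entityref("nbsp")   # not in entitydefs: reported as unknown
```

Supporting more entities then only requires a larger :attr:`entitydefs` mapping
on the subclass or instance.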

.. method:: SGMLParser.handle_comment(comment)

   This method is called when a comment is encountered.  The *comment* argument is
   a string containing the text between the ``<!--`` and ``-->`` delimiters, but
   not the delimiters themselves.  For example, the comment ``<!--text-->`` will
   cause this method to be called with the argument ``'text'``.  The default method
   does nothing.


.. method:: SGMLParser.handle_decl(data)

   Method called when an SGML declaration is read by the parser.  In practice, the
   ``DOCTYPE`` declaration is the only thing observed in HTML, but the parser does
   not discriminate among different (or broken) declarations.  Internal subsets in
   a ``DOCTYPE`` declaration are not supported.  The *data* parameter will be the
   entire contents of the declaration inside the ``<!``...\ ``>`` markup.  The
   default implementation does nothing.


.. method:: SGMLParser.report_unbalanced(tag)

   This method is called when an end tag is found which does not correspond to any
   open element.


.. method:: SGMLParser.unknown_starttag(tag, attributes)

   This method is called to process an unknown start tag.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.


.. method:: SGMLParser.unknown_endtag(tag)

   This method is called to process an unknown end tag.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.


.. method:: SGMLParser.unknown_charref(ref)

   This method is called to process unresolvable numeric character references.
   Refer to :meth:`handle_charref` to determine what is handled by default.  It is
   intended to be overridden by a derived class; the base class implementation does
   nothing.


.. method:: SGMLParser.unknown_entityref(ref)

   This method is called to process an unknown entity reference.  It is intended to
   be overridden by a derived class; the base class implementation does nothing.

Apart from overriding or extending the methods listed above, derived classes may
also define methods of the following form to define processing of specific tags.
Tag names in the input stream are case independent; the *tag* occurring in
method names must be in lower case:


.. method:: SGMLParser.start_tag(attributes)
   :noindex:

   This method is called to process an opening tag *tag*.  It takes precedence
   over :meth:`do_tag`.  The *attributes* argument has the same meaning as
   described for :meth:`handle_starttag` above.


.. method:: SGMLParser.do_tag(attributes)
   :noindex:

   This method is called to process an opening tag *tag* for which no
   :meth:`start_tag` method is defined.  The *attributes* argument has the same
   meaning as described for :meth:`handle_starttag` above.


.. method:: SGMLParser.end_tag()
   :noindex:

   This method is called to process a closing tag *tag*.

Note that the parser maintains a stack of open elements for which no end tag has
been found yet.  Only tags processed by :meth:`start_tag` are pushed on this
stack.  Definition of an :meth:`end_tag` method is optional for these tags.  For
tags processed by :meth:`do_tag` or by :meth:`unknown_starttag`, no
:meth:`end_tag` method needs to be defined; if one is defined, it will not be
used.  If both :meth:`start_tag` and :meth:`do_tag` methods exist for a tag, the
:meth:`start_tag` method takes precedence.

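The dispatch rules above can be imitated in a few lines.  The following is a
minimal Python 3 sketch, not the module's code (which cannot be imported in
Python 3): ``TagDispatcher``, ``dispatch_starttag``, ``Demo`` and the ``log``
list are hypothetical names chosen for the illustration, and only start tag
handling is modelled.

```python
class TagDispatcher:
    """Hypothetical stand-in for the start tag dispatch described above."""

    def __init__(self):
        self.stack = []   # open elements that were handled by start_<tag>
        self.log = []     # record of which handler ran, for inspection

    def dispatch_starttag(self, tag, attributes):
        tag = tag.lower()                       # tag names are case independent
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            self.stack.append(tag)              # only start_<tag> pushes
        else:
            method = getattr(self, "do_" + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)

    def handle_starttag(self, tag, method, attributes):
        method(attributes)                      # base behaviour: just call it

    def unknown_starttag(self, tag, attributes):
        self.log.append(("unknown", tag))


class Demo(TagDispatcher):
    def start_a(self, attributes):
        self.log.append(("start_a", attributes))

    def do_a(self, attributes):                 # shadowed: start_a wins
        self.log.append(("do_a", attributes))

    def do_br(self, attributes):
        self.log.append(("do_br", attributes))


demo = Demo()
demo.dispatch_starttag("A", [("href", "x")])    # handled by start_a, pushed
demo.dispatch_starttag("BR", [])                # handled by do_br, not pushed
demo.dispatch_starttag("P", [])                 # no handler: unknown_starttag
```

In this sketch only ``a`` ends up on the stack, which mirrors the statement that
tags handled by a ``do_`` method or reported as unknown are never treated as
open elements.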