.. Georg Brandl | 8ec7f65 | 2007-08-15 14:28:01 +0000

:mod:`sgmllib` --- Simple SGML parser
=====================================

.. module:: sgmllib
   :synopsis: Only as much of an SGML parser as needed to parse HTML.


.. index:: single: SGML

This module defines a class :class:`SGMLParser` which serves as the basis for
parsing text files formatted in SGML (Standard Generalized Markup Language).
In fact, it does not provide a full SGML parser --- it only parses SGML insofar
as it is used by HTML, and the module only exists as a base for the
:mod:`htmllib` module.  Another HTML parser which supports XHTML and offers a
somewhat different interface is available in the :mod:`HTMLParser` module.


.. class:: SGMLParser()

   The :class:`SGMLParser` class is instantiated without arguments. The parser is
   hardcoded to recognize the following constructs:

   * Opening and closing tags of the form ``<tag attr="value" ...>`` and
     ``</tag>``, respectively.

   * Numeric character references of the form ``&#name;``.

   * Entity references of the form ``&name;``.

   * SGML comments of the form ``<!--text-->``.  Note that spaces, tabs, and
     newlines are allowed between the trailing ``>`` and the immediately preceding
     ``--``.

A single exception is defined as well:


.. exception:: SGMLParseError

   Exception raised by the :class:`SGMLParser` class when it encounters an error
   while parsing.

   .. versionadded:: 2.1

:class:`SGMLParser` instances have the following methods:


.. method:: SGMLParser.reset()

   Reset the instance.  Loses all unprocessed data.  This is called implicitly at
   instantiation time.


.. method:: SGMLParser.setnomoretags()

   Stop processing tags.  Treat all following input as literal input (CDATA).
   (This is only provided so the HTML tag ``<PLAINTEXT>`` can be implemented.)


.. method:: SGMLParser.setliteral()

   Enter literal mode (CDATA mode).


.. method:: SGMLParser.feed(data)

   Feed some text to the parser.  It is processed insofar as it consists of
   complete elements; incomplete data is buffered until more data is fed or
   :meth:`close` is called.


.. method:: SGMLParser.close()

   Force processing of all buffered data as if it were followed by an end-of-file
   mark.  This method may be redefined by a derived class to define additional
   processing at the end of the input, but the redefined version should always call
   the base class :meth:`close` method.


.. method:: SGMLParser.get_starttag_text()

   Return the text of the most recently opened start tag.  This should not normally
   be needed for structured processing, but may be useful in dealing with HTML "as
   deployed" or for re-generating input with minimal changes (whitespace between
   attributes can be preserved, etc.).


.. method:: SGMLParser.handle_starttag(tag, method, attributes)

   This method is called to handle start tags for which either a :meth:`start_tag`
   or :meth:`do_tag` method has been defined.  The *tag* argument is the name of
   the tag converted to lower case, and the *method* argument is the bound method
   which should be used to support semantic interpretation of the start tag. The
   *attributes* argument is a list of ``(name, value)`` pairs containing the
   attributes found inside the tag's ``<>`` brackets.

   The *name* has been translated to lower case. Double quotes and backslashes in
   the *value* have been interpreted, as well as known character references and
   known entity references terminated by a semicolon (normally, entity references
   can be terminated by any non-alphanumerical character, but this would break the
   very common case of ``<A HREF="url?spam=1&eggs=2">`` when ``eggs`` is a valid
   entity name).

   For instance, for the tag ``<A HREF="http://www.cwi.nl/">``, the *tag* argument
   would be ``'a'`` and the *attributes* argument would be
   ``[('href', 'http://www.cwi.nl/')]``.  The base implementation simply calls
   *method* with *attributes* as the only argument.

   .. versionadded:: 2.5
      Handling of entity and character references within attribute values.


.. method:: SGMLParser.handle_endtag(tag, method)

   This method is called to handle end tags for which an :meth:`end_tag` method has
   been defined.  The *tag* argument is the name of the tag converted to lower
   case, and the *method* argument is the bound method which should be used to
   support semantic interpretation of the end tag.  If no :meth:`end_tag` method is
   defined for the closing element, this handler is not called.  The base
   implementation simply calls *method*.


.. method:: SGMLParser.handle_data(data)

   This method is called to process arbitrary data.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.


.. method:: SGMLParser.handle_charref(ref)

   This method is called to process a character reference of the form ``&#ref;``.
   The base implementation uses :meth:`convert_charref` to convert the reference to
   a string.  If that method returns a string, it is passed to :meth:`handle_data`,
   otherwise ``unknown_charref(ref)`` is called to handle the error.

   .. versionchanged:: 2.5
      Use :meth:`convert_charref` instead of hard-coding the conversion.


.. method:: SGMLParser.convert_charref(ref)

   Convert a character reference to a string, or ``None``.  *ref* is the reference
   passed in as a string.  In the base implementation, *ref* must be a decimal
   number in the range 0-255.  It converts the code point found using the
   :meth:`convert_codepoint` method. If *ref* is invalid or out of range, this
   method returns ``None``.  This method is called by the default
   :meth:`handle_charref` implementation and by the attribute value parser.

   .. versionadded:: 2.5


.. method:: SGMLParser.convert_codepoint(codepoint)

   Convert a codepoint to a :class:`str` value.  Encodings can be handled here if
   appropriate, though the rest of :mod:`sgmllib` is oblivious to this matter.

   .. versionadded:: 2.5

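The character reference handling described above can be sketched as plain
functions.  This is a minimal standalone illustration of the documented
behaviour, not the module's source: the functions below only borrow the method
names, the 0-255 range is taken from the text, and the ``handle_data`` and
``unknown_charref`` callbacks are passed in explicitly as an assumption made to
keep the example self-contained.

```python
def convert_codepoint(codepoint):
    """Turn an integer code point into a one-character string."""
    return chr(codepoint)


def convert_charref(ref):
    """Return the character for a decimal reference such as '65', or None."""
    try:
        n = int(ref)
    except ValueError:
        return None              # not a decimal number
    if not 0 <= n <= 255:
        return None              # outside the range given in the text
    return convert_codepoint(n)


def handle_charref(ref, handle_data, unknown_charref):
    """Pass the converted text to handle_data, or report the failure."""
    replacement = convert_charref(ref)
    if replacement is None:
        unknown_charref(ref)
    else:
        handle_data(replacement)


# Collect results in two lists standing in for the parser callbacks.
data, unknown = [], []
for ref in ("65", "9999", "x41"):
    handle_charref(ref, data.append, unknown.append)
```

After the loop, ``data`` holds the text produced for resolvable references and
``unknown`` holds the references that could not be converted.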

.. method:: SGMLParser.handle_entityref(ref)

   This method is called to process a general entity reference of the form
   ``&ref;`` where *ref* is the name of a general entity.  It converts *ref* by
   passing it to :meth:`convert_entityref`.  If a translation is returned, it calls
   the method :meth:`handle_data` with the translation; otherwise, it calls the
   method ``unknown_entityref(ref)``. The default :attr:`entitydefs` defines
   translations for ``&amp;``, ``&apos;``, ``&gt;``, ``&lt;``, and ``&quot;``.

   .. versionchanged:: 2.5
      Use :meth:`convert_entityref` instead of hard-coding the conversion.


.. method:: SGMLParser.convert_entityref(ref)

   Convert a named entity reference to a :class:`str` value, or ``None``.  The
   resulting value will not be parsed.  *ref* will be only the name of the entity.
   The default implementation looks for *ref* in the instance (or class) variable
   :attr:`entitydefs` which should be a mapping from entity names to corresponding
   translations.  If no translation is available for *ref*, this method returns
   ``None``.  This method is called by the default :meth:`handle_entityref`
   implementation and by the attribute value parser.

   .. versionadded:: 2.5

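The entity lookup described above amounts to a dictionary lookup that returns
``None`` for unknown names.  The following is a minimal standalone sketch rather
than the module's code: the table mirrors the five default translations, and
``extended`` is a hypothetical mapping showing how a derived class might supply
additional entities through :attr:`entitydefs`.

```python
# Default translations, keyed by entity name without '&' and ';'.
entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}


def convert_entityref(ref, table=entitydefs):
    """Return the translation for the entity name *ref*, or None."""
    return table.get(ref)


# Hypothetical extended table, as a subclass might define it.
extended = dict(entitydefs)
extended["nbsp"] = "\xa0"

known = convert_entityref("amp")               # found in the default table
missing = convert_entityref("nbsp")            # None: not a default entity
added = convert_entityref("nbsp", extended)    # found in the extended table
```

A ``None`` result is what leads the default :meth:`handle_entityref` to call
``unknown_entityref(ref)`` instead of :meth:`handle_data`.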

.. method:: SGMLParser.handle_comment(comment)

   This method is called when a comment is encountered.  The *comment* argument is
   a string containing the text between the ``<!--`` and ``-->`` delimiters, but
   not the delimiters themselves.  For example, the comment ``<!--text-->`` will
   cause this method to be called with the argument ``'text'``.  The default method
   does nothing.


.. method:: SGMLParser.handle_decl(data)

   Method called when an SGML declaration is read by the parser.  In practice, the
   ``DOCTYPE`` declaration is the only thing observed in HTML, but the parser does
   not discriminate among different (or broken) declarations.  Internal subsets in
   a ``DOCTYPE`` declaration are not supported.  The *data* parameter will be the
   entire contents of the declaration inside the ``<!``...\ ``>`` markup.  The
   default implementation does nothing.


.. method:: SGMLParser.report_unbalanced(tag)

   This method is called when an end tag is found which does not correspond to any
   open element.


.. method:: SGMLParser.unknown_starttag(tag, attributes)

   This method is called to process an unknown start tag.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.


.. method:: SGMLParser.unknown_endtag(tag)

   This method is called to process an unknown end tag.  It is intended to be
   overridden by a derived class; the base class implementation does nothing.


.. method:: SGMLParser.unknown_charref(ref)

   This method is called to process unresolvable numeric character references.
   Refer to :meth:`handle_charref` to determine what is handled by default.  It is
   intended to be overridden by a derived class; the base class implementation does
   nothing.


.. method:: SGMLParser.unknown_entityref(ref)

   This method is called to process an unknown entity reference.  It is intended to
   be overridden by a derived class; the base class implementation does nothing.

Apart from overriding or extending the methods listed above, derived classes may
also define methods of the following form to define processing of specific tags.
Tag names in the input stream are case independent; the *tag* occurring in
method names must be in lower case:


.. method:: SGMLParser.start_tag(attributes)
   :noindex:

   This method is called to process an opening tag *tag*.  It takes precedence over
   :meth:`do_tag`.  The *attributes* argument has the same meaning as described for
   :meth:`handle_starttag` above.


.. method:: SGMLParser.do_tag(attributes)
   :noindex:

   This method is called to process an opening tag *tag* for which no
   :meth:`start_tag` method is defined.  The *attributes* argument has the same
   meaning as described for :meth:`handle_starttag` above.


.. method:: SGMLParser.end_tag()
   :noindex:

   This method is called to process a closing tag *tag*.

Note that the parser maintains a stack of open elements for which no end tag has
been found yet.  Only tags processed by :meth:`start_tag` are pushed on this
stack.  Definition of an :meth:`end_tag` method is optional for these tags.  For
tags processed by :meth:`do_tag` or by :meth:`unknown_starttag`, an
:meth:`end_tag` method need not be defined; if one is defined, it will not be
used.  If both :meth:`start_tag` and :meth:`do_tag` methods exist for a tag, the
:meth:`start_tag` method takes precedence.

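The precedence and stack rules above can be illustrated with a small standalone
dispatcher.  This is a sketch of the documented rules only, not the parser's
implementation: ``TagDispatcher``, ``dispatch_starttag``, ``Demo`` and the
``log`` list are hypothetical names introduced for the example.

```python
class TagDispatcher(object):
    """Hypothetical helper that mimics the documented start tag dispatch."""

    def __init__(self):
        self.stack = []   # open elements still waiting for an end tag
        self.log = []     # records which handler ran, for demonstration

    def dispatch_starttag(self, tag, attributes):
        tag = tag.lower()                     # method names use lower case
        start = getattr(self, "start_" + tag, None)
        if start is not None:
            self.stack.append(tag)            # only start_tag tags are pushed
            start(attributes)
            return
        do = getattr(self, "do_" + tag, None)
        if do is not None:
            do(attributes)                    # handled, but not pushed
            return
        self.unknown_starttag(tag, attributes)

    def unknown_starttag(self, tag, attributes):
        self.log.append(("unknown", tag))


class Demo(TagDispatcher):
    def start_a(self, attributes):
        self.log.append(("start_a", attributes))

    def do_a(self, attributes):
        # Never called here: start_a takes precedence over do_a.
        self.log.append(("do_a", attributes))

    def do_br(self, attributes):
        self.log.append(("do_br", attributes))


demo = Demo()
demo.dispatch_starttag("A", [("href", "http://www.cwi.nl/")])
demo.dispatch_starttag("BR", [])
demo.dispatch_starttag("blink", [])
```

After these calls, ``demo.stack`` contains only ``'a'``, since the ``br`` tag
was handled by a ``do_`` method and ``blink`` fell through to
``unknown_starttag``.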