:mod:`htmllib` --- A parser for HTML documents
==============================================

.. module:: htmllib
   :synopsis: A parser for HTML documents.
   :deprecated:

.. deprecated:: 2.6
   The :mod:`htmllib` module has been removed in Python 3.


.. index::
   single: HTML
   single: hypertext

.. index::
   module: sgmllib
   module: formatter
   single: SGMLParser (in module sgmllib)

This module defines a class which can serve as a base for parsing text files
formatted in the HyperText Mark-up Language (HTML). The class is not directly
concerned with I/O --- it must be provided with input in string form via a
method, and makes calls to methods of a "formatter" object in order to produce
output. The :class:`HTMLParser` class is designed to be used as a base class
for other classes in order to add functionality, and allows most of its methods
to be extended or overridden. In turn, this class is derived from and extends
the :class:`SGMLParser` class defined in module :mod:`sgmllib`. The
:class:`HTMLParser` implementation supports the HTML 2.0 language as described
in :rfc:`1866`. Two implementations of formatter objects are provided in the
:mod:`formatter` module; refer to the documentation for that module for
information on the formatter interface.

The following is a summary of the interface defined by
:class:`sgmllib.SGMLParser`:

* The interface to feed data to an instance is through the :meth:`feed` method,
  which takes a string argument. This can be called with as little or as much
  text at a time as desired; ``p.feed(a); p.feed(b)`` has the same effect as
  ``p.feed(a+b)``. When the data contains complete HTML markup constructs, these
  are processed immediately; incomplete constructs are saved in a buffer. To
  force processing of all unprocessed data, call the :meth:`close` method.

  For example, to parse the entire contents of a file, use::

     with open('myfile.html') as f:
         parser.feed(f.read())
     parser.close()

* The interface to define semantics for HTML tags is very simple: derive a class
  and define methods called :meth:`start_tag`, :meth:`end_tag`, or :meth:`do_tag`.
  The parser will call these at appropriate moments: :meth:`start_tag` or
  :meth:`do_tag` is called when an opening tag of the form ``<tag ...>`` is
  encountered; :meth:`end_tag` is called when a closing tag of the form ``</tag>``
  is encountered. If an opening tag requires a corresponding closing tag, like
  ``<H1>`` ... ``</H1>``, the class should define the :meth:`start_tag` method; if
  a tag requires no closing tag, like ``<P>``, the class should define the
  :meth:`do_tag` method.
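The handler naming convention described above can be illustrated with a small
standalone sketch. It does not use :mod:`sgmllib` and is not its
implementation: the ``ToyDispatcher`` class, its ``open_tag`` and ``close_tag``
methods, and the ``events`` list are hypothetical names introduced only to show
how handlers could be looked up by tag name. The sketch also assumes that
opening-tag handlers receive a list of attributes and closing-tag handlers
receive no arguments.

```python
class ToyDispatcher(object):
    """Hypothetical stand-in that mimics the handler naming convention."""

    def __init__(self):
        self.events = []  # records which handlers ran, for illustration only

    def open_tag(self, tag, attrs=()):
        # Prefer start_<tag>; fall back to do_<tag> for tags that need
        # no closing tag.
        for prefix in ('start_', 'do_'):
            handler = getattr(self, prefix + tag, None)
            if handler is not None:
                handler(list(attrs))
                return True
        return False  # no handler is defined for this tag

    def close_tag(self, tag):
        handler = getattr(self, 'end_' + tag, None)
        if handler is not None:
            handler()
            return True
        return False


class HeadingLogger(ToyDispatcher):
    # <H1> ... </H1> needs a closing tag, so define start_h1 and end_h1.
    def start_h1(self, attrs):
        self.events.append('start_h1')

    def end_h1(self):
        self.events.append('end_h1')

    # <P> needs no closing tag, so define do_p only.
    def do_p(self, attrs):
        self.events.append('do_p')


logger = HeadingLogger()
logger.open_tag('h1')
logger.close_tag('h1')
logger.open_tag('p')
print(logger.events)
```

A subclass only defines methods for the tags it cares about; in this sketch,
tags without a matching method are simply reported as unhandled.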

The module defines a parser class and an exception:


.. class:: HTMLParser(formatter)

   This is the basic HTML parser class. It supports all entity names required by
   the XHTML 1.0 Recommendation (http://www.w3.org/TR/xhtml1). It also defines
   handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.


.. exception:: HTMLParseError

   Exception raised by the :class:`HTMLParser` class when it encounters an error
   while parsing.

   .. versionadded:: 2.4


.. seealso::

   Module :mod:`formatter`
      Interface definition for transforming an abstract flow of formatting events into
      specific output events on writer objects.

   Module :mod:`HTMLParser`
      Alternate HTML parser that offers a slightly lower-level view of the input, but
      is designed to work with XHTML, and does not implement some of the SGML syntax
      not used in "HTML as deployed" and which isn't legal for XHTML.

   Module :mod:`htmlentitydefs`
      Definition of replacement text for XHTML 1.0 entities.

   Module :mod:`sgmllib`
      Base class for :class:`HTMLParser`.


.. _html-parser-objects:

HTMLParser Objects
------------------

In addition to tag methods, the :class:`HTMLParser` class provides further
methods and instance variables for use within tag methods.


.. attribute:: HTMLParser.formatter

   This is the formatter instance associated with the parser.


.. attribute:: HTMLParser.nofill

   Boolean flag which should be true when whitespace should not be collapsed, or
   false when it should be. In general, this should only be true when character
   data is to be treated as "preformatted" text, as within a ``<PRE>`` element.
   The default value is false. This affects the operation of :meth:`handle_data`
   and :meth:`save_end`.


.. method:: HTMLParser.anchor_bgn(href, name, type)

   This method is called at the start of an anchor region. The arguments
   correspond to the attributes of the ``<A>`` tag with the same names. The
   default implementation maintains a list of hyperlinks (defined by the ``HREF``
   attribute for ``<A>`` tags) within the document. The list of hyperlinks is
   available as the data attribute :attr:`anchorlist`.


.. method:: HTMLParser.anchor_end()

   This method is called at the end of an anchor region. The default
   implementation adds a textual footnote marker using an index into the list of
   hyperlinks created by :meth:`anchor_bgn`.


.. method:: HTMLParser.handle_image(source, alt[, ismap[, align[, width[, height]]]])

   This method is called to handle images. The default implementation simply
   passes the *alt* value to the :meth:`handle_data` method.


.. method:: HTMLParser.save_bgn()

   Begins saving character data in a buffer instead of sending it to the formatter
   object. Retrieve the stored data via :meth:`save_end`. Calls to the
   :meth:`save_bgn` / :meth:`save_end` pair cannot be nested.


.. method:: HTMLParser.save_end()

   Ends buffering character data and returns all data saved since the preceding
   call to :meth:`save_bgn`. If the :attr:`nofill` flag is false, whitespace is
   collapsed to single spaces. A call to this method without a preceding call to
   :meth:`save_bgn` will raise a :exc:`TypeError` exception.
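The buffering contract of :meth:`save_bgn` and :meth:`save_end` can be modelled
without the parser. The following is a standalone sketch under stated
assumptions, not the :mod:`htmllib` implementation: the ``SaveBuffer`` class
and its ``savedata`` attribute are hypothetical, and whitespace collapsing is
approximated by splitting on whitespace and rejoining with single spaces.

```python
class SaveBuffer(object):
    """Hypothetical model of the save_bgn()/save_end() contract."""

    def __init__(self):
        self.nofill = False   # False: collapse whitespace in saved data
        self.savedata = None  # None means "not currently saving"

    def save_bgn(self):
        # Start a fresh buffer; in this sketch a nested call would
        # discard whatever was saved so far.
        self.savedata = ''

    def handle_data(self, data):
        # While saving, accumulate the data instead of passing it on.
        if self.savedata is not None:
            self.savedata += data

    def save_end(self):
        data = self.savedata
        self.savedata = None
        if not self.nofill:
            # Collapse every run of whitespace to a single space.
            data = ' '.join(data.split())
        return data


buf = SaveBuffer()
buf.save_bgn()
buf.handle_data('Hello,\n   ')
buf.handle_data('world')
title = buf.save_end()
print(repr(title))  # 'Hello, world'
```

With ``nofill`` set to a true value, the same calls would return the saved text
unchanged, line break and indentation included.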


:mod:`htmlentitydefs` --- Definitions of HTML general entities
==============================================================

.. module:: htmlentitydefs
   :synopsis: Definitions of HTML general entities.
.. sectionauthor:: Fred L. Drake, Jr. <fdrake@acm.org>

.. note::

   The :mod:`htmlentitydefs` module has been renamed to :mod:`html.entities` in
   Python 3. The :term:`2to3` tool will automatically adapt imports when
   converting your sources to Python 3.

**Source code:** :source:`Lib/htmlentitydefs.py`

--------------

This module defines three dictionaries, ``name2codepoint``, ``codepoint2name``,
and ``entitydefs``. ``entitydefs`` is used by the :mod:`htmllib` module to
provide the :attr:`entitydefs` attribute of the :class:`HTMLParser` class. The
definition provided here contains all the entities defined by XHTML 1.0 that
can be handled using simple textual substitution in the Latin-1 character set
(ISO-8859-1).


.. data:: entitydefs

   A dictionary mapping XHTML 1.0 entity definitions to their replacement text in
   ISO Latin-1.


.. data:: name2codepoint

   A dictionary that maps HTML entity names to the Unicode codepoints.

   .. versionadded:: 2.3


.. data:: codepoint2name

   A dictionary that maps Unicode codepoints to HTML entity names.

   .. versionadded:: 2.3
