\section{\module{htmllib} ---
         A parser for HTML documents}

\declaremodule{standard}{htmllib}
\modulesynopsis{A parser for HTML documents.}

\index{HTML}
\index{hypertext}


This module defines a class which can serve as a base for parsing text
files formatted in the HyperText Markup Language (HTML).  The class
is not directly concerned with I/O: it must be provided with input
in string form via a method, and makes calls to methods of a
``formatter'' object in order to produce output.  The
\class{HTMLParser} class is designed to be used as a base class for
other classes in order to add functionality, and allows most of its
methods to be extended or overridden.  In turn, this class is derived
from and extends the \class{SGMLParser} class defined in module
\refmodule{sgmllib}\refstmodindex{sgmllib}.  The \class{HTMLParser}
implementation supports the HTML 2.0 language as described in
\rfc{1866}.  Two implementations of formatter objects are provided in
the \refmodule{formatter}\refstmodindex{formatter} module; refer to the
documentation for that module for information on the formatter
interface.
\withsubitem{(in module sgmllib)}{\ttindex{SGMLParser}}

The following is a summary of the interface defined by
\class{sgmllib.SGMLParser}:

\begin{itemize}

\item
The interface to feed data to an instance is through the \method{feed()}
method, which takes a string argument.  This can be called with as
little or as much text at a time as desired; \samp{p.feed(a);
p.feed(b)} has the same effect as \samp{p.feed(a+b)}.  When the data
contains complete HTML tags, these are processed immediately;
incomplete elements are saved in a buffer.  To force processing of all
unprocessed data, call the \method{close()} method.

For example, to parse the entire contents of a file, use:
\begin{verbatim}
parser.feed(open('myfile.html').read())
parser.close()
\end{verbatim}

\item
The interface to define semantics for HTML tags is very simple: derive
a class and define methods called \method{start_\var{tag}()},
\method{end_\var{tag}()}, or \method{do_\var{tag}()}.  The parser will
call these at appropriate moments: \method{start_\var{tag}()} or
\method{do_\var{tag}()} is called when an opening tag of the form
\code{<\var{tag} ...>} is encountered; \method{end_\var{tag}()} is called
when a closing tag of the form \code{</\var{tag}>} is encountered.  If
an opening tag requires a corresponding closing tag, like \code{<H1>}
... \code{</H1>}, the class should define the \method{start_\var{tag}()}
method; if a tag requires no closing tag, like \code{<P>}, the class
should define the \method{do_\var{tag}()} method.

\end{itemize}
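
The two points above can be illustrated with a small, self-contained toy.
It is only a sketch of the conventions and not the \module{sgmllib}
implementation: the classes \class{ToyTagDispatcher} and
\class{HeadingLogger} are hypothetical, character data and attributes are
ignored, the handlers take no arguments, and tag names are assumed to be
matched in lower case.

```python
import re


class ToyTagDispatcher:
    """Toy stand-in for the dispatch convention; not sgmllib's real code."""

    # One complete opening or closing tag, such as <H1>, </H1> or <P>.
    TAG = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>')

    def __init__(self):
        self.buffer = ''  # input that has not been processed yet

    def feed(self, data):
        self.buffer += data
        end = 0
        for match in self.TAG.finditer(self.buffer):
            self.dispatch(match.group(2).lower(), match.group(1) == '/')
            end = match.end()
        # Keep the tail: it may hold a tag that the next feed() completes.
        self.buffer = self.buffer[end:]

    def dispatch(self, tag, closing):
        # Closing tags look for end_TAG; opening tags try start_TAG, then do_TAG.
        if closing:
            names = ['end_' + tag]
        else:
            names = ['start_' + tag, 'do_' + tag]
        for name in names:
            handler = getattr(self, name, None)
            if handler is not None:
                handler()
                return


class HeadingLogger(ToyTagDispatcher):
    """Hypothetical subclass that records which handlers were called."""

    def __init__(self):
        ToyTagDispatcher.__init__(self)
        self.events = []

    def start_h1(self):            # <H1> needs a matching </H1>
        self.events.append('start h1')

    def end_h1(self):
        self.events.append('end h1')

    def do_p(self):                # <P> needs no closing tag
        self.events.append('p')


parser = HeadingLogger()
parser.feed('<H1>Title</H')        # the closing tag is still incomplete
parser.feed('1><P>Body text')      # now </H1> and <P> can be dispatched
print(parser.events)
```

Feeding the same text in one call produces the same sequence of handler
calls, which is the property \samp{p.feed(a); p.feed(b)} is described as
having.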

The module defines a single class:

\begin{classdesc}{HTMLParser}{formatter}
This is the basic HTML parser class.  It supports all entity names
required by the HTML 2.0 specification (\rfc{1866}).  It also defines
handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
\end{classdesc}


\begin{seealso}
  \seemodule{HTMLParser}{Alternative HTML parser that offers a slightly
                         lower-level view of the input.  It is designed
                         to work with XHTML and omits some SGML syntax
                         that is neither used in ``HTML as deployed''
                         nor legal in XHTML.}
  \seemodule{htmlentitydefs}{Definition of replacement text for HTML
                             2.0 entities.}
  \seemodule{sgmllib}{Base class for \class{HTMLParser}.}
\end{seealso}


\subsection{HTMLParser Objects \label{html-parser-objects}}

In addition to tag methods, the \class{HTMLParser} class provides some
additional methods and instance variables for use within tag methods.

\begin{memberdesc}{formatter}
This is the formatter instance associated with the parser.
\end{memberdesc}

\begin{memberdesc}{nofill}
Boolean flag which should be true when whitespace should not be
collapsed, or false when it should be.  In general, this should only
be true when character data is to be treated as ``preformatted'' text,
as within a \code{<PRE>} element.  The default value is false.  This
affects the operation of \method{handle_data()} and \method{save_end()}.
\end{memberdesc}


\begin{methoddesc}{anchor_bgn}{href, name, type}
This method is called at the start of an anchor region.  The arguments
correspond to the attributes of the \code{<A>} tag with the same
names.  The default implementation maintains a list of hyperlinks
(defined by the \code{HREF} attribute for \code{<A>} tags) within the
document.  The list of hyperlinks is available as the data attribute
\member{anchorlist}.
\end{methoddesc}

\begin{methoddesc}{anchor_end}{}
This method is called at the end of an anchor region.  The default
implementation adds a textual footnote marker using an index into the
list of hyperlinks created by \method{anchor_bgn()}.
\end{methoddesc}
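
The interplay of these two methods might look roughly like the following
sketch.  The class \class{ToyAnchorTracker} is hypothetical and does not
use \module{htmllib}; the \samp{[n]} marker format and the list that
stands in for the formatter are assumptions made for illustration.

```python
class ToyAnchorTracker:
    """Hypothetical model of the documented anchor bookkeeping."""

    def __init__(self):
        self.anchorlist = []   # hyperlinks seen so far, in document order
        self.anchor = None     # HREF of the anchor currently open, if any
        self.output = []       # stands in for the formatter's output

    def handle_data(self, data):
        self.output.append(data)

    def anchor_bgn(self, href, name, type):
        # Only anchors that carry an HREF are recorded as hyperlinks.
        self.anchor = href
        if href:
            self.anchorlist.append(href)

    def anchor_end(self):
        # Assumed marker format: the 1-based index of the link, in brackets.
        if self.anchor:
            self.handle_data('[%d]' % len(self.anchorlist))
            self.anchor = None


tracker = ToyAnchorTracker()
tracker.anchor_bgn('http://example.com/', '', '')
tracker.handle_data('Example')
tracker.anchor_end()
print(tracker.output, tracker.anchorlist)
```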

\begin{methoddesc}{handle_image}{source, alt\optional{, ismap\optional{, align\optional{, width\optional{, height}}}}}
This method is called to handle images.  The default implementation
simply passes the \var{alt} value to the \method{handle_data()}
method.
\end{methoddesc}

\begin{methoddesc}{save_bgn}{}
Begins saving character data in a buffer instead of sending it to the
formatter object.  Retrieve the stored data via \method{save_end()}.
Calls to the \method{save_bgn()} / \method{save_end()} pair may not be
nested.
\end{methoddesc}

\begin{methoddesc}{save_end}{}
Ends buffering character data and returns all data saved since the
preceding call to \method{save_bgn()}.  If the \member{nofill} flag is
false, whitespace is collapsed to single spaces.  A call to this
method without a preceding call to \method{save_bgn()} will raise a
\exception{TypeError} exception.
\end{methoddesc}
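
The buffering behaviour described for \method{save_bgn()} and
\method{save_end()} can be modelled with a short sketch.  The class
\class{ToySaveBuffer} is hypothetical and independent of
\module{htmllib}; the explicit \exception{TypeError} and the use of
\code{split()} and \code{join()} to collapse whitespace are assumptions
chosen to match the description above.

```python
class ToySaveBuffer:
    """Hypothetical model of save_bgn() / save_end() and the nofill flag."""

    def __init__(self):
        self.nofill = False    # false: collapse whitespace in saved data
        self.savedata = None   # None means "not currently saving"
        self.output = []       # stands in for the formatter's output

    def handle_data(self, data):
        if self.savedata is not None:
            self.savedata += data      # divert data into the buffer
        else:
            self.output.append(data)   # normal path: pass data along

    def save_bgn(self):
        self.savedata = ''

    def save_end(self):
        if self.savedata is None:
            raise TypeError('save_end() called without save_bgn()')
        data = self.savedata
        self.savedata = None
        if not self.nofill:
            # Collapses runs of whitespace to one space; in this sketch it
            # also drops leading and trailing whitespace.
            data = ' '.join(data.split())
        return data


buf = ToySaveBuffer()
buf.handle_data('before')
buf.save_bgn()
buf.handle_data('Hello,\n   world')
title = buf.save_end()
print(repr(title), buf.output)
```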



\section{\module{htmlentitydefs} ---
         Definitions of HTML general entities}

\declaremodule{standard}{htmlentitydefs}
\modulesynopsis{Definitions of HTML general entities.}
\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}

This module defines a single dictionary, \code{entitydefs}, which is
used by the \refmodule{htmllib} module to provide the
\member{entitydefs} member of the \class{HTMLParser} class.  The
definition provided here contains all the entities defined by HTML 2.0
that can be handled using simple textual substitution in the Latin-1
character set (ISO-8859-1).


\begin{datadesc}{entitydefs}
A dictionary mapping HTML 2.0 entity definitions to their
replacement text in ISO Latin-1.
\end{datadesc}
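
As a minimal sketch of the ``simple textual substitution'' mentioned
above, the dictionary can be used to expand entity references in a
string.  The helper \function{expand_entities()} is hypothetical, and the
fallback import of \module{html.entities} is an assumption for
interpreters where \module{htmlentitydefs} is not available under that
name.

```python
import re

try:
    from htmlentitydefs import entitydefs   # the module described here
except ImportError:
    from html.entities import entitydefs    # assumed equivalent on Python 3


def expand_entities(text):
    """Replace known &name; references; leave unknown ones untouched."""
    def replace(match):
        # Dictionary keys are bare entity names, without '&' and ';'.
        return entitydefs.get(match.group(1), match.group(0))
    return re.sub(r'&([A-Za-z][A-Za-z0-9]*);', replace, text)


print(expand_entities('a &lt; b &amp;&amp; c &bogus;'))
```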