\section{Standard Module \sectcode{htmllib}}
\stmodindex{htmllib}
\index{HTML}
\index{hypertext}

\renewcommand{\indexsubitem}{(in module htmllib)}

This module defines a class that can serve as a base for parsing text
files formatted in the HyperText Mark-up Language (HTML). The class
is not directly concerned with I/O; it must be provided with input
in string form via a method, and it makes calls to methods of a
``formatter'' object in order to produce output. The
\code{HTMLParser} class is designed to be used as a base class for
other classes in order to add functionality, and allows most of its
methods to be extended or overridden. In turn, this class is derived
from and extends the \code{SGMLParser} class defined in module
\code{sgmllib}. Two implementations of formatter objects are
provided in the \code{formatter} module; refer to the documentation
for that module for information on the formatter interface.
\index{SGML}
\stmodindex{sgmllib}
\ttindex{SGMLParser}
\index{formatter}
\stmodindex{formatter}

The following is a summary of the interface defined by
\code{sgmllib.SGMLParser}:

\begin{itemize}

\item
The interface to feed data to an instance is through the \code{feed()}
method, which takes a string argument. This can be called with as
little or as much text at a time as desired; \code{p.feed(a);
p.feed(b)} has the same effect as \code{p.feed(a+b)}. When the data
contains complete HTML tags, these are processed immediately;
incomplete elements are saved in a buffer. To force processing of all
unprocessed data, call the \code{close()} method.

For example, to parse the entire contents of a file, use:
\begin{verbatim}
parser.feed(open('myfile.html').read())
parser.close()
\end{verbatim}

\item
The interface to define semantics for HTML tags is very simple: derive
a class and define methods called \code{start_\var{tag}()},
\code{end_\var{tag}()}, or \code{do_\var{tag}()}. The parser will
call these at appropriate moments: \code{start_\var{tag}} or
\code{do_\var{tag}} is called when an opening tag of the form
\code{<\var{tag} ...>} is encountered; \code{end_\var{tag}} is called
when a closing tag of the form \code{</\var{tag}>} is encountered. If
an opening tag requires a corresponding closing tag, like \code{<H1>}
... \code{</H1>}, the class should define the \code{start_\var{tag}}
method; if a tag requires no closing tag, like \code{<P>}, the class
should define the \code{do_\var{tag}} method.

\end{itemize}
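
The two rules summarized above (incremental feeding and dispatch by
method name) can be illustrated with a small stand-alone sketch. It
does not import \code{sgmllib} or \code{htmllib}: the class
\code{TinyTagParser}, its \code{events} and \code{text} attributes, and
the helper \code{parse_in_chunks()} are hypothetical names introduced
only for this illustration. The sketch also ignores attributes,
comments and entities, and calls tag methods without arguments, which
is a simplifying assumption.

```python
import re


class TinyTagParser:
    """Hypothetical stand-in for the dispatch rules; not sgmllib itself."""

    # Matches one complete opening or closing tag; attributes are skipped.
    _tag = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>')

    def __init__(self):
        self.rawdata = ''   # input received but not yet processed
        self.events = []    # names of the tag methods that were called
        self.text = []      # character data seen outside of tags

    def feed(self, data):
        self.rawdata = self.rawdata + data
        self._scan(final=False)

    def close(self):
        self._scan(final=True)

    def handle_data(self, data):
        self.text.append(data)

    def _scan(self, final):
        text, pos = self.rawdata, 0
        while pos < len(text):
            lt = text.find('<', pos)
            if lt < 0:
                lt = len(text)
            if lt > pos:
                # Plain character data up to the next '<' (or the end).
                self.handle_data(text[pos:lt])
                pos = lt
                continue
            match = self._tag.match(text, pos)
            if match is None:
                if final:
                    # Nothing more will arrive: flush the rest as data.
                    self.handle_data(text[pos:])
                    pos = len(text)
                break  # otherwise keep the incomplete tag buffered
            closing, name = match.group(1), match.group(2).lower()
            if closing:
                handler = getattr(self, 'end_' + name, None)
            else:
                handler = (getattr(self, 'start_' + name, None)
                           or getattr(self, 'do_' + name, None))
            if handler is not None:
                handler()
            pos = match.end()
        self.rawdata = text[pos:]


class OutlineParser(TinyTagParser):
    # <H1> needs a closing tag, so it gets start_ and end_ methods.
    def start_h1(self):
        self.events.append('start_h1')

    def end_h1(self):
        self.events.append('end_h1')

    # <P> needs no closing tag, so it gets a do_ method.
    def do_p(self):
        self.events.append('do_p')


def parse_in_chunks(chunks):
    parser = OutlineParser()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.events, ''.join(parser.text)


whole = parse_in_chunks(['<H1>Title</H1><P>Body text'])
split = parse_in_chunks(['<H', '1>Title</H1><', 'P>Body text'])
print(whole == split)
```

Splitting the input in the middle of a tag leaves the incomplete part
in the buffer until the rest arrives, so both calls report the same
sequence of tag methods and the same character data.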

The module defines a single class:

\begin{funcdesc}{HTMLParser}{formatter}
This is the basic HTML parser class. It supports all entity names
required by the HTML 2.0 specification (RFC 1866). It also defines
handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
\end{funcdesc}

In addition to tag methods, the \code{HTMLParser} class provides some
additional methods and instance variables for use within tag methods.

\begin{datadesc}{formatter}
This is the formatter instance associated with the parser.
\end{datadesc}

\begin{datadesc}{nofill}
Boolean flag which should be true when whitespace should not be
collapsed, or false when it should be. In general, this should only
be true when character data is to be treated as ``preformatted'' text,
as within a \code{<PRE>} element. The default value is false. This
affects the operation of \code{handle_data()} and \code{save_end()}.
\end{datadesc}

\begin{funcdesc}{anchor_bgn}{href\, name\, type}
This method is called at the start of an anchor region. The arguments
correspond to the attributes of the \code{<A>} tag with the same
names. The default implementation maintains a list of hyperlinks
(defined by the \code{href} argument) within the document. The list
of hyperlinks is available as the data attribute \code{anchorlist}.
\end{funcdesc}

\begin{funcdesc}{anchor_end}{}
This method is called at the end of an anchor region. The default
implementation adds a textual footnote marker using an index into the
list of hyperlinks created by \code{anchor_bgn()}.
\end{funcdesc}
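
The interplay of \code{anchor_bgn()} and \code{anchor_end()} can be
sketched as follows. This is a stand-alone illustration, not the
module's code: the class \code{AnchorTracker}, its \code{output} list,
the \code{[n]} marker format and the file name \code{intro.html} are
all assumptions made for the example.

```python
class AnchorTracker:
    """Hypothetical model of the default anchor handling."""

    def __init__(self):
        self.anchorlist = []   # hyperlinks seen so far, in document order
        self.anchor = None     # href of the anchor currently open, if any
        self.output = []       # stands in for data sent to the formatter

    def handle_data(self, data):
        self.output.append(data)

    def anchor_bgn(self, href, name, type):
        # Only anchors that carry an href are recorded as hyperlinks.
        self.anchor = href
        if href:
            self.anchorlist.append(href)

    def anchor_end(self):
        # Emit a footnote marker whose number indexes into anchorlist.
        if self.anchor:
            self.handle_data('[%d]' % len(self.anchorlist))
            self.anchor = None


tracker = AnchorTracker()
tracker.anchor_bgn('intro.html', '', '')
tracker.handle_data('Introduction')
tracker.anchor_end()
tracker.handle_data(' and ')
tracker.anchor_bgn('', 'top', '')      # named anchor without an href
tracker.handle_data('top of page')
tracker.anchor_end()

rendered = ''.join(tracker.output)
print(rendered)
print(tracker.anchorlist)
```

A subclass that wants a different presentation of links, for example
collecting them without printing markers, would override
\code{anchor_end()} and leave \code{anchor_bgn()} as it is.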

\begin{funcdesc}{handle_image}{source\, alt\optional{\, ismap\optional{\, align\optional{\, width\optional{\, height}}}}}
This method is called to handle images. The default implementation
simply passes the \code{alt} value to the \code{handle_data()}
method.
\end{funcdesc}

\begin{funcdesc}{save_bgn}{}
Begins saving character data in a buffer instead of sending it to the
formatter object. Retrieve the stored data via \code{save_end()}.
Calls to the \code{save_bgn()} / \code{save_end()} pair may not be
nested.
\end{funcdesc}

\begin{funcdesc}{save_end}{}
Ends buffering character data and returns all data saved since the
preceding call to \code{save_bgn()}. If the \code{nofill} flag is false,
whitespace is collapsed to single spaces. A call to this method
without a preceding call to \code{save_bgn()} will raise a
\code{TypeError} exception.
\end{funcdesc}
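
The buffering behaviour of \code{save_bgn()} and \code{save_end()},
including the effect of \code{nofill}, can be modelled with a short
stand-alone sketch. The class \code{SaveBuffer} and its \code{sent}
and \code{savedata} attributes are hypothetical names chosen for the
example, and the explicit \code{raise} is an assumption about how the
documented \code{TypeError} might be produced.

```python
class SaveBuffer:
    """Hypothetical model of the save_bgn() / save_end() pair."""

    def __init__(self):
        self.nofill = False    # false: collapse whitespace on save_end()
        self.savedata = None   # None means "not currently saving"
        self.sent = []         # stands in for data sent to the formatter

    def handle_data(self, data):
        if self.savedata is not None:
            self.savedata = self.savedata + data
        else:
            self.sent.append(data)

    def save_bgn(self):
        self.savedata = ''

    def save_end(self):
        if self.savedata is None:
            raise TypeError('save_end() called without save_bgn()')
        data, self.savedata = self.savedata, None
        if not self.nofill:
            # Collapse runs of whitespace to single spaces; this idiom
            # also drops leading and trailing whitespace.
            data = ' '.join(data.split())
        return data


buf = SaveBuffer()

# Collect a title that arrives in two pieces.
buf.save_bgn()
buf.handle_data('A  title\n')
buf.handle_data('  in two parts')
title = buf.save_end()

# Outside a save region, data goes straight through.
buf.handle_data('body')

# With nofill set, the saved text is returned unchanged.
buf.nofill = True
buf.save_bgn()
buf.handle_data('x  =  1\n')
raw = buf.save_end()

print(repr(title), repr(raw), buf.sent)
```

A typical use inside a tag method pair would call \code{save_bgn()} in
\code{start_\var{tag}()} and \code{save_end()} in
\code{end_\var{tag}()}, for instance to capture the text of a
\code{<TITLE>} element.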