\section{Standard Module \sectcode{htmllib}}
\label{module-htmllib}
\stmodindex{htmllib}
\index{HTML}
\index{hypertext}

This module defines a class which can serve as a base for parsing text
files formatted in the HyperText Mark-up Language (HTML).  The class
is not directly concerned with I/O; it must be provided with input
in string form via a method, and makes calls to methods of a
``formatter'' object in order to produce output.  The
\class{HTMLParser} class is designed to be used as a base class for
other classes in order to add functionality, and allows most of its
methods to be extended or overridden.  In turn, this class is derived
from and extends the \class{SGMLParser} class defined in module
\module{sgmllib}\refstmodindex{sgmllib}.  The \class{HTMLParser}
implementation supports the HTML 2.0 language as described in
\rfc{1866}.  Two implementations of formatter objects are provided in
the \module{formatter}\refstmodindex{formatter} module; refer to the
documentation for that module for information on the formatter
interface.
\index{SGML}
\withsubitem{(in module sgmllib)}{\ttindex{SGMLParser}}
\index{formatter}

The following is a summary of the interface defined by
\class{sgmllib.SGMLParser}:

\begin{itemize}

\item
Data is fed to an instance through the \method{feed()}
method, which takes a string argument.  This can be called with as
little or as much text at a time as desired; \samp{p.feed(a);
p.feed(b)} has the same effect as \samp{p.feed(a+b)}.  When the data
contains complete HTML tags, these are processed immediately;
incomplete elements are saved in a buffer.  To force processing of all
unprocessed data, call the \method{close()} method.

For example, to parse the entire contents of a file, use:
\begin{verbatim}
parser.feed(open('myfile.html').read())
parser.close()
\end{verbatim}

\item
The interface to define semantics for HTML tags is simple: derive
a class and define methods called \code{start_\var{tag}()},
\code{end_\var{tag}()}, or \code{do_\var{tag}()}.  The parser will
call these at appropriate moments: \code{start_\var{tag}()} or
\code{do_\var{tag}()} is called when an opening tag of the form
\code{<\var{tag} ...>} is encountered; \code{end_\var{tag}()} is called
when a closing tag of the form \code{</\var{tag}>} is encountered.  If
an opening tag requires a corresponding closing tag, like \code{<H1>}
... \code{</H1>}, the class should define the \code{start_\var{tag}()}
method; if a tag requires no closing tag, like \code{<P>}, the class
should define the \code{do_\var{tag}()} method.

\end{itemize}
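
The two conventions summarized above can be illustrated with a small,
self-contained sketch.  The classes \code{ToyTagParser} and
\code{HeadingParser} below are hypothetical and written for a current
Python interpreter; they are not the \module{sgmllib} implementation.
They ignore tag attributes, comments, and entities, and they only aim
to show how incomplete input might be buffered between \method{feed()}
calls and how handler methods might be looked up by name.

```python
import re

# Hypothetical, simplified stand-in for the protocol described above.
# It is NOT the sgmllib/htmllib implementation and ignores attributes.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")


class ToyTagParser:
    def __init__(self):
        self.rawdata = ""   # unprocessed input kept between feed() calls
        self.events = []    # record of handler calls, for demonstration

    def feed(self, data):
        """Accept more input and process everything that is complete."""
        self.rawdata += data
        self._goahead(final=False)

    def close(self):
        """Force processing of whatever is still buffered."""
        self._goahead(final=True)

    def handle_data(self, text):
        # Merge adjacent text so chunk boundaries do not show in the record.
        if self.events and self.events[-1][0] == "data":
            self.events[-1] = ("data", self.events[-1][1] + text)
        else:
            self.events.append(("data", text))

    def _goahead(self, final):
        data, pos = self.rawdata, 0
        while pos < len(data):
            lt = data.find("<", pos)
            if lt < 0:
                lt = len(data)          # no further tags in the buffer
            if lt > pos:
                self.handle_data(data[pos:lt])
                pos = lt
                continue
            match = TAG_RE.match(data, pos)
            if match is None:
                if not final:
                    break               # maybe an incomplete tag: wait
                self.handle_data(data[pos:])
                pos = len(data)
                break
            self._dispatch(match.group(1) == "/", match.group(2).lower())
            pos = match.end()
        self.rawdata = data[pos:]

    def _dispatch(self, closing, tag):
        if closing:
            method = getattr(self, "end_" + tag, None)
        else:
            # Try start_<tag> first, then do_<tag> for tags with no close.
            method = (getattr(self, "start_" + tag, None)
                      or getattr(self, "do_" + tag, None))
        if method is not None:
            method()


class HeadingParser(ToyTagParser):
    def start_h1(self):
        self.events.append(("start", "h1"))

    def end_h1(self):
        self.events.append(("end", "h1"))

    def do_p(self):
        self.events.append(("do", "p"))


def parse(chunks):
    parser = HeadingParser()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.events


whole = parse(["<H1>Title</H1><P>Body text"])
pieces = parse(["<H1>Ti", "tle</H", "1><P>Body", " text"])
print(whole)
print(whole == pieces)
```

In this sketch the input split across four \method{feed()} calls
produces the same sequence of handler calls as the input passed in one
piece, because the partial \code{</H} is held in the buffer until the
rest of the tag arrives.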

The module defines a single class:

\begin{classdesc}{HTMLParser}{formatter}
This is the basic HTML parser class.  It supports all entity names
required by the HTML 2.0 specification (\rfc{1866}).  It also defines
handlers for all HTML 2.0 and many HTML 3.0 and 3.2 elements.
\end{classdesc}

In addition to tag methods, the \class{HTMLParser} class provides some
additional methods and instance variables for use within tag methods.

\setindexsubitem{(HTMLParser attribute)}

\begin{datadesc}{formatter}
This is the formatter instance associated with the parser.
\end{datadesc}

\begin{datadesc}{nofill}
Boolean flag which should be true when whitespace should not be
collapsed, or false when it should be.  In general, this should only
be true when character data is to be treated as ``preformatted'' text,
as within a \code{<PRE>} element.  The default value is false.  This
affects the operation of \method{handle_data()} and \method{save_end()}.
\end{datadesc}
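
As a rough illustration of what the \code{nofill} flag controls, the
sketch below uses a hypothetical helper,
\code{collapse_unless_nofill()}, which is not part of the module.  It
assumes that ``collapsing'' means replacing every run of whitespace
with a single space and trimming the ends; the trimming in particular
is an assumption of this sketch.

```python
def collapse_unless_nofill(text, nofill):
    """Return text unchanged when nofill is true, else collapse whitespace."""
    if nofill:
        return text
    # str.split() with no argument splits on any run of whitespace and
    # drops leading and trailing whitespace; joining the pieces with one
    # space therefore collapses each run to a single space.
    return " ".join(text.split())


sample = "def  f():\n    return  1\n"
collapsed = collapse_unless_nofill(sample, nofill=False)
preformatted = collapse_unless_nofill(sample, nofill=True)
print(repr(collapsed))
print(repr(preformatted))
```

With \code{nofill} false the line structure of the sample is lost,
which is why the flag would normally be true inside a \code{<PRE>}
element.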

\setindexsubitem{(HTMLParser method)}

\begin{funcdesc}{anchor_bgn}{href\, name\, type}
This method is called at the start of an anchor region.  The arguments
correspond to the attributes of the \code{<A>} tag with the same
names.  The default implementation maintains a list of hyperlinks
(defined by the \code{href} attribute) within the document.  The list
of hyperlinks is available as the data attribute \code{anchorlist}.
\end{funcdesc}

\begin{funcdesc}{anchor_end}{}
This method is called at the end of an anchor region.  The default
implementation adds a textual footnote marker using an index into the
list of hyperlinks created by \method{anchor_bgn()}.
\end{funcdesc}
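
The bookkeeping described for \method{anchor_bgn()} and
\method{anchor_end()} might look roughly like the following sketch.
The class \code{AnchorSketch}, its \code{output} list (standing in for
the formatter), and the \code{[n]} marker format are assumptions made
for illustration; only the \code{anchorlist} attribute and the method
names come from the description above.

```python
class AnchorSketch:
    """Hypothetical stand-in for the anchor handling described above."""

    def __init__(self):
        self.anchorlist = []   # hrefs seen so far, in document order
        self.output = []       # stands in for data sent to the formatter
        self.anchor = None     # href of the anchor currently open, if any

    def handle_data(self, text):
        self.output.append(text)

    def anchor_bgn(self, href, name, type):
        # Only anchors that carry an href are recorded as hyperlinks.
        self.anchor = href
        if href:
            self.anchorlist.append(href)

    def anchor_end(self):
        if self.anchor:
            # The "[n]" marker format is an assumption of this sketch;
            # n is the 1-based position of the link in anchorlist.
            self.handle_data("[%d]" % len(self.anchorlist))
            self.anchor = None


doc = AnchorSketch()
doc.handle_data("See ")
doc.anchor_bgn("http://www.python.org/", "", "")
doc.handle_data("the Python site")
doc.anchor_end()
rendered = "".join(doc.output)
print(rendered)
print(doc.anchorlist)
```

A subclass that wants different link handling, such as collecting link
text, would override these two methods rather than the tag methods
that call them.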

\begin{funcdesc}{handle_image}{source\, alt\optional{\, ismap\optional{\, align\optional{\, width\optional{\, height}}}}}
This method is called to handle images.  The default implementation
simply passes the \var{alt} value to the \method{handle_data()}
method.
\end{funcdesc}

\begin{funcdesc}{save_bgn}{}
Begins saving character data in a buffer instead of sending it to the
formatter object.  Retrieve the stored data via \method{save_end()}.
Calls to the \method{save_bgn()} / \method{save_end()} pair may not be
nested.
\end{funcdesc}

\begin{funcdesc}{save_end}{}
Ends buffering character data and returns all data saved since the
preceding call to \method{save_bgn()}.  If the \code{nofill} flag is
false, whitespace is collapsed to single spaces.  A call to this
method without a preceding call to \method{save_bgn()} will raise a
\exception{TypeError} exception.
\end{funcdesc}
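
A typical use of the \method{save_bgn()} / \method{save_end()} pair is
to capture the text of an element, such as a document title, instead
of letting it reach the formatter.  The following sketch models that
behaviour with a hypothetical class, \code{SaveSketch}; the attribute
names \code{savedata} and \code{formatted} and the explicit
\code{raise} are assumptions of the sketch, not a description of the
module's source.

```python
class SaveSketch:
    """Hypothetical model of the save_bgn()/save_end() buffering."""

    def __init__(self):
        self.nofill = False
        self.savedata = None   # None means "not currently saving"
        self.formatted = []    # stands in for the formatter's output

    def handle_data(self, data):
        if self.savedata is not None:
            self.savedata += data        # divert text into the buffer
        else:
            self.formatted.append(data)  # normal path to the formatter

    def save_bgn(self):
        self.savedata = ""

    def save_end(self):
        if self.savedata is None:
            raise TypeError("save_end() called without save_bgn()")
        data, self.savedata = self.savedata, None
        if not self.nofill:
            # Collapse each run of whitespace to a single space.
            data = " ".join(data.split())
        return data


p = SaveSketch()
p.save_bgn()
p.handle_data("  My \n")
p.handle_data("  Title ")
title = p.save_end()
p.handle_data("body")
print(repr(title))
print(p.formatted)

try:
    p.save_end()
except TypeError as exc:
    print("TypeError:", exc)
```

Because a single buffer is used, a second \method{save_bgn()} before
\method{save_end()} would discard whatever had been saved so far in
this sketch, which is one way to see why the pair may not be nested.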