\section{Standard Module \sectcode{sgmllib}}
\stmodindex{sgmllib}
\index{SGML}

This module defines a class \code{SGMLParser} which serves as the
basis for parsing text files formatted in SGML (Standard Generalized
Markup Language).  It does not provide a full SGML parser: it only
parses SGML insofar as it is used by HTML, and the module exists only
as a base for the \code{htmllib} module.
\stmodindex{htmllib}

In particular, the parser is hardcoded to recognize the following
constructs:

\begin{itemize}

\item
Opening and closing tags of the form
``\code{<\var{tag} \var{attr}="\var{value}" ...>}'' and
``\code{</\var{tag}>}'', respectively.

\item
Numeric character references of the form ``\code{\&\#\var{name};}''.

\item
Entity references of the form ``\code{\&\var{name};}''.

\item
SGML comments of the form ``\code{<!--\var{text}-->}''.  Note that
spaces, tabs, and newlines are allowed between the trailing
``\code{>}'' and the immediately preceding ``\code{--}''.

\end{itemize}
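
As a rough illustration of the constructs listed above, the following
sketch reports which construct a string starts with.  The regular
expressions and the \code{classify()} helper are hypothetical and were
written for this example only; they are not taken from the module,
whose own patterns may well differ.

```python
import re

# Hypothetical patterns for illustration; not the module's own expressions.
PATTERNS = [
    # Whitespace is allowed between the closing "--" and ">".
    ("comment", re.compile(r"<!--(.*?)--\s*>", re.DOTALL)),
    ("endtag", re.compile(r"</([a-zA-Z][-.a-zA-Z0-9]*)\s*>")),
    ("starttag", re.compile(r"<([a-zA-Z][-.a-zA-Z0-9]*)[^>]*>")),
    ("charref", re.compile(r"&#([0-9]+);")),
    ("entityref", re.compile(r"&([a-zA-Z][-.a-zA-Z0-9]*);")),
]


def classify(text):
    """Return (kind, captured name or text) for the construct at the start of text."""
    for kind, pattern in PATTERNS:
        match = pattern.match(text)
        if match:
            return kind, match.group(1)
    # Anything else is treated as ordinary character data.
    return "data", text


if __name__ == "__main__":
    for sample in ('<A HREF="http://www.cwi.nl/">', "</A>", "&#65;",
                   "&amp;", "<!--text-- >", "plain"):
        print(sample, classify(sample))
```

Note that this sketch returns tag names as written; converting them to
lower case is left out to keep the example short.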

The \code{SGMLParser} class must be instantiated without arguments.
It has the following interface methods:

\renewcommand{\indexsubitem}{({\tt SGMLParser} method)}

\begin{funcdesc}{reset}{}
Reset the instance.  All unprocessed data is lost.  This method is
called implicitly at instantiation time.
\end{funcdesc}

\begin{funcdesc}{setnomoretags}{}
Stop processing tags.  Treat all following input as literal input
(CDATA).  (This is only provided so the HTML tag \code{<PLAINTEXT>}
can be implemented.)
\end{funcdesc}

\begin{funcdesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{funcdesc}

\begin{funcdesc}{feed}{data}
Feed some text to the parser.  It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \code{close()} is called.
\end{funcdesc}

\begin{funcdesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark.  This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call \code{SGMLParser.close()}.
\end{funcdesc}

\begin{funcdesc}{handle_starttag}{tag\, method\, attributes}
This method is called to handle start tags for which either a
\code{start_\var{tag}()} or \code{do_\var{tag}()} method has been
defined.  The \var{tag} argument is the name of the tag converted to
lower case, and the \var{method} argument is the bound method which
should be used to support semantic interpretation of the start tag.
The \var{attributes} argument is a list of (\var{name}, \var{value})
pairs containing the attributes found inside the tag's \code{<>}
brackets.  The \var{name} has been translated to lower case and double
quotes and backslashes in the \var{value} have been interpreted.  For
instance, for the tag \code{<A HREF="http://www.cwi.nl/">}, the
\var{tag} argument would be \code{'a'} and the \var{attributes}
argument would be \code{[('href', 'http://www.cwi.nl/')]}.  The base
implementation simply calls \var{method} with \var{attributes} as the
only argument.
\end{funcdesc}

\begin{funcdesc}{handle_endtag}{tag\, method}
This method is called to handle end tags for which an
\code{end_\var{tag}()} method has been defined.  The \var{tag}
argument is the name of the tag converted to lower case, and the
\var{method} argument is the bound method which should be used to
support semantic interpretation of the end tag.  If no
\code{end_\var{tag}()} method is defined for the closing element, this
handler is not called.  The base implementation simply calls
\var{method}.
\end{funcdesc}

\begin{funcdesc}{handle_data}{data}
This method is called to process arbitrary data.  It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{funcdesc}

\begin{funcdesc}{handle_charref}{ref}
This method is called to process a character reference of the form
``\code{\&\#\var{ref};}''.  In the base implementation, \var{ref} must
be a decimal number in the range 0-255.  It translates the reference
to the corresponding \ASCII{} character and calls the method
\code{handle_data()} with the character as argument.  If \var{ref} is
invalid or out of range, the method \code{unknown_charref(\var{ref})}
is called to handle the error.  A subclass must override this method
to provide support for named character entities.
\end{funcdesc}
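
The behaviour described for \code{handle_charref()} can be sketched
roughly as follows.  The class \code{CharRefDemo} is a hypothetical
stand-in written for this example; it is not the module's code and
only mirrors the rules stated above (decimal value, range 0-255,
otherwise \code{unknown_charref()}).

```python
class CharRefDemo:
    """Hypothetical stand-in for the relevant part of a parser."""

    def __init__(self):
        self.data = []      # text passed to handle_data()
        self.unknown = []   # references passed to unknown_charref()

    def handle_data(self, data):
        self.data.append(data)

    def unknown_charref(self, ref):
        self.unknown.append(ref)

    def handle_charref(self, ref):
        # ref is the text between "&#" and ";".
        try:
            number = int(ref)
        except ValueError:
            # Not a decimal number at all.
            self.unknown_charref(ref)
            return
        if not 0 <= number <= 255:
            # Outside the range accepted by the base implementation.
            self.unknown_charref(ref)
            return
        self.handle_data(chr(number))


if __name__ == "__main__":
    demo = CharRefDemo()
    for ref in ("65", "300", "x41"):
        demo.handle_charref(ref)
    print(demo.data, demo.unknown)
```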

\begin{funcdesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the form
``\code{\&\var{ref};}'' where \var{ref} is the name of a general
entity.  It looks for \var{ref} in the instance (or class)
variable \code{entitydefs} which should be a mapping from entity names
to corresponding translations.
If a translation is found, it calls the method \code{handle_data()}
with the translation; otherwise, it calls the method
\code{unknown_entityref(\var{ref})}.  The default \code{entitydefs}
defines translations for \code{\&amp;}, \code{\&apos;}, \code{\&gt;},
\code{\&lt;}, and \code{\&quot;}.
\end{funcdesc}
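
The lookup performed by \code{handle_entityref()} amounts to a
dictionary access with a fallback, as in the sketch below.  The class
names \code{EntityRefDemo} and \code{ExtendedDemo} and the extra
\code{nbsp} entry are assumptions made for this example; only the
\code{entitydefs} attribute and the method names come from the
description above.

```python
class EntityRefDemo:
    """Hypothetical stand-in showing the entitydefs lookup."""

    # Class-level mapping from entity names to their translations.
    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    def __init__(self):
        self.data = []      # text passed to handle_data()
        self.unknown = []   # names passed to unknown_entityref()

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)

    def handle_entityref(self, ref):
        # self.entitydefs finds an instance attribute first, then the class one.
        table = self.entitydefs
        if ref in table:
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)


class ExtendedDemo(EntityRefDemo):
    """A derived class may supply a larger table (nbsp is an example entry)."""

    entitydefs = dict(EntityRefDemo.entitydefs, nbsp=" ")


if __name__ == "__main__":
    demo = ExtendedDemo()
    for name in ("amp", "nbsp", "copy"):
        demo.handle_entityref(name)
    print(demo.data, demo.unknown)
```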

\begin{funcdesc}{handle_comment}{comment}
This method is called when a comment is encountered.  The
\var{comment} argument is a string containing the text between the
``\code{<!--}'' and ``\code{-->}'' delimiters, but not the delimiters
themselves.  For example, the comment ``\code{<!--text-->}'' will
cause this method to be called with the argument \code{'text'}.  The
default method does nothing.
\end{funcdesc}

\begin{funcdesc}{report_unbalanced}{tag}
This method is called when an end tag is found which does not
correspond to any open element.
\end{funcdesc}

\begin{funcdesc}{unknown_starttag}{tag\, attributes}
This method is called to process an unknown start tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{funcdesc}

\begin{funcdesc}{unknown_endtag}{tag}
This method is called to process an unknown end tag.  It is intended
to be overridden by a derived class; the base class implementation
does nothing.
\end{funcdesc}

\begin{funcdesc}{unknown_charref}{ref}
This method is called to process unresolvable numeric character
references.  It is intended to be overridden by a derived class; the
base class implementation does nothing.
\end{funcdesc}

\begin{funcdesc}{unknown_entityref}{ref}
This method is called to process an unknown entity reference.  It is
intended to be overridden by a derived class; the base class
implementation does nothing.
\end{funcdesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods of the following form to define
processing of specific tags.  Tag names in the input stream are case
independent; the \var{tag} occurring in method names must be in lower
case:

\begin{funcdesc}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}.  It takes
precedence over \code{do_\var{tag}()}.  The \var{attributes} argument
has the same meaning as described for \code{handle_starttag()} above.
\end{funcdesc}

\begin{funcdesc}{do_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag} that does
not come with a matching closing tag.  The \var{attributes} argument
has the same meaning as described for \code{handle_starttag()} above.
\end{funcdesc}

\begin{funcdesc}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{funcdesc}

Note that the parser maintains a stack of open elements for which no
end tag has been found yet.  Only tags processed by
\code{start_\var{tag}()} are pushed on this stack.  Definition of an
\code{end_\var{tag}()} method is optional for these tags.  For tags
processed by \code{do_\var{tag}()} or by \code{unknown_starttag()}, no
\code{end_\var{tag}()} method needs to be defined; if defined, it will
not be used.  If both \code{start_\var{tag}()} and
\code{do_\var{tag}()} methods exist for a tag, the
\code{start_\var{tag}()} method takes precedence.
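
The dispatch and stack rules above can be sketched with a small toy
model.  The names \code{DispatchDemo}, \code{Demo}, \code{open_tag()}
and \code{close_tag()} are hypothetical and belong to this example
only; the sketch follows the rules as described here and makes no
claim about how the module implements them.  In particular, closing
any elements opened after the one being closed is an assumption.

```python
class DispatchDemo:
    """Toy model of the documented dispatch rules; not the module's code."""

    def __init__(self):
        self.stack = []   # open elements started through start_<tag>()
        self.log = []     # record of the calls made, for inspection

    def handle_starttag(self, tag, method, attributes):
        method(attributes)

    def handle_endtag(self, tag, method):
        method()

    def unknown_starttag(self, tag, attributes):
        self.log.append(("unknown_start", tag))

    def report_unbalanced(self, tag):
        self.log.append(("unbalanced", tag))

    def open_tag(self, tag, attributes):
        tag = tag.lower()
        # start_<tag>() takes precedence over do_<tag>().
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            self.stack.append(tag)   # only start_<tag>() tags are stacked
        else:
            method = getattr(self, "do_" + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)

    def close_tag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            self.report_unbalanced(tag)
            return
        # Assumption: closing a tag also closes elements opened after it.
        while self.stack:
            current = self.stack.pop()
            method = getattr(self, "end_" + current, None)
            if method is not None:   # end_<tag>() is optional
                self.handle_endtag(current, method)
            if current == tag:
                break


class Demo(DispatchDemo):
    def start_b(self, attributes):
        self.log.append(("start_b", attributes))

    def end_b(self):
        self.log.append(("end_b",))

    def do_br(self, attributes):
        self.log.append(("do_br", attributes))

    def end_br(self):
        # Defined on purpose: it is never used, since BR is not stacked.
        self.log.append(("end_br",))


if __name__ == "__main__":
    demo = Demo()
    demo.open_tag("B", [])
    demo.open_tag("BR", [])
    demo.open_tag("I", [])
    demo.close_tag("BR")
    demo.close_tag("B")
    print(demo.log)
```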