| \section{\module{sgmllib} --- | 
 |          Simple SGML parser} | 
 |  | 
 | \declaremodule{standard}{sgmllib} | 
 | \modulesynopsis{Only as much of an SGML parser as needed to parse HTML.} | 
 |  | 
 | \index{SGML} | 
 |  | 
 | This module defines a class \class{SGMLParser} which serves as the | 
 | basis for parsing text files formatted in SGML (Standard Generalized | 
 | Mark-up Language).  In fact, it does not provide a full SGML parser | 
 | --- it only parses SGML insofar as it is used by HTML, and the module | 
 | only exists as a base for the \refmodule{htmllib} module.  Another | 
 | HTML parser which supports XHTML and offers a somewhat different | 
 | interface is available in the \refmodule{HTMLParser} module. | 
 |  | 
 | \begin{classdesc}{SGMLParser}{} | 
 | The \class{SGMLParser} class is instantiated without arguments. | 
 | The parser is hardcoded to recognize the following | 
 | constructs: | 
 |  | 
 | \begin{itemize} | 
 | \item | 
 | Opening and closing tags of the form | 
 | \samp{<\var{tag} \var{attr}="\var{value}" ...>} and | 
 | \samp{</\var{tag}>}, respectively. | 
 |  | 
 | \item | 
 | Numeric character references of the form \samp{\&\#\var{name};}. | 
 |  | 
 | \item | 
 | Entity references of the form \samp{\&\var{name};}. | 
 |  | 
 | \item | 
 | SGML comments of the form \samp{<!--\var{text}-->}.  Note that | 
 | spaces, tabs, and newlines are allowed between the trailing | 
 | \samp{>} and the immediately preceding \samp{--}. | 
 |  | 
 | \end{itemize} | 
 | \end{classdesc} | 
 |  | 
 | A single exception is defined as well: | 
 |  | 
 | \begin{excdesc}{SGMLParseError} | 
 | Exception raised by the \class{SGMLParser} class when it encounters an | 
 | error while parsing. | 
 | \versionadded{2.1} | 
 | \end{excdesc} | 
 |  | 
 |  | 
 | \class{SGMLParser} instances have the following methods: | 
 |  | 
 |  | 
 | \begin{methoddesc}{reset}{} | 
 | Reset the instance.  Loses all unprocessed data.  This is called | 
 | implicitly at instantiation time. | 
 | \end{methoddesc} | 
 |  | 
 | \begin{methoddesc}{setnomoretags}{} | 
 | Stop processing tags.  Treat all following input as literal input | 
 | (CDATA).  (This is only provided so the HTML tag | 
 | \code{<PLAINTEXT>} can be implemented.) | 
 | \end{methoddesc} | 
 |  | 
 | \begin{methoddesc}{setliteral}{} | 
 | Enter literal mode (CDATA mode). | 
 | \end{methoddesc} | 
 |  | 
 | \begin{methoddesc}{feed}{data} | 
 | Feed some text to the parser.  It is processed insofar as it consists | 
 | of complete elements; incomplete data is buffered until more data is | 
 | fed or \method{close()} is called. | 
 | \end{methoddesc} | 
 |  | 
 | \begin{methoddesc}{close}{} | 
 | Force processing of all buffered data as if it were followed by an | 
 | end-of-file mark.  This method may be redefined by a derived class to | 
 | define additional processing at the end of the input, but the | 
 | redefined version should always call the base class \method{close()}. | 
 | \end{methoddesc} | 
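
The buffering contract of \method{feed()} and \method{close()} can be
illustrated with a toy model.  The sketch below is not the code of
\class{SGMLParser}: the class \class{BufferingFeeder}, its \member{events}
list and its private helper are hypothetical names, and it assumes that
anything between \code{<} and \code{>} is one complete element.

```python
class BufferingFeeder:
    """Toy model of feed()/close() buffering; not sgmllib's implementation."""

    def __init__(self):
        self.rawdata = ""   # input that has not been processed yet
        self.events = []    # stands in for calls to the handler methods

    def feed(self, data):
        self.rawdata += data
        self._process(end=False)

    def close(self):
        # Treat the buffered data as if it were followed by end-of-file.
        self._process(end=True)

    def _process(self, end):
        while self.rawdata:
            lt = self.rawdata.find("<")
            if lt == -1:
                # No markup pending: everything left is plain text.
                self.events.append(("data", self.rawdata))
                self.rawdata = ""
            elif lt > 0:
                # Emit the text that precedes the next tag.
                self.events.append(("data", self.rawdata[:lt]))
                self.rawdata = self.rawdata[lt:]
            else:
                gt = self.rawdata.find(">")
                if gt == -1:
                    if end:
                        # Assumption of this sketch: at end of input an
                        # unfinished tag is flushed as plain text.
                        self.events.append(("data", self.rawdata))
                        self.rawdata = ""
                    return  # incomplete element: wait for more data
                self.events.append(("tag", self.rawdata[:gt + 1]))
                self.rawdata = self.rawdata[gt + 1:]
```

Feeding \code{'<a hr'} produces no events in this model; the start tag is
reported only once the remainder of the tag has been fed.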
 |  | 
 | \begin{methoddesc}{get_starttag_text}{} | 
 | Return the text of the most recently opened start tag.  This should | 
 | not normally be needed for structured processing, but may be useful in | 
 | dealing with HTML ``as deployed'' or for re-generating input with | 
 | minimal changes (whitespace between attributes can be preserved, | 
 | etc.). | 
 | \end{methoddesc} | 
 |  | 
 | \begin{methoddesc}{handle_starttag}{tag, method, attributes} | 
 | This method is called to handle start tags for which either a | 
 | \method{start_\var{tag}()} or \method{do_\var{tag}()} method has been | 
 | defined.  The \var{tag} argument is the name of the tag converted to | 
 | lower case, and the \var{method} argument is the bound method which | 
 | should be used to support semantic interpretation of the start tag. | 
 | The \var{attributes} argument is a list of \code{(\var{name}, | 
 | \var{value})} pairs containing the attributes found inside the tag's | 
 | \code{<>} brackets.  The \var{name} has been translated to lower case | 
 | and double quotes and backslashes in the \var{value} have been interpreted. | 
 | For instance, for the tag \code{<A HREF="http://www.cwi.nl/">}, this | 
 | method would be called as \samp{handle_starttag('a', \var{method}, | 
 | [('href', 'http://www.cwi.nl/')])}, where \var{method} is the bound | 
 | \method{start_a()} or \method{do_a()} method.  The base implementation | 
 | simply calls \var{method} with \var{attributes} as the only argument. | 
 | \end{methoddesc} | 
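
The following sketch models how \method{handle_starttag()} fits between the
tag lookup and the tag-specific method.  It is a simplified stand-in, not
the source of \class{SGMLParser}: the classes \class{StartTagModel} and
\class{LinkCollector}, the \method{dispatch()} helper and the \member{seen}
list are hypothetical names introduced for illustration.

```python
class StartTagModel:
    """Simplified model of start-tag handling; not sgmllib's source."""

    def __init__(self):
        self.seen = []

    def handle_starttag(self, tag, method, attributes):
        # Base behaviour as described: call method with attributes only.
        method(attributes)

    def unknown_starttag(self, tag, attributes):
        self.seen.append(("unknown", tag))

    def dispatch(self, tag, attributes):
        # Hypothetical helper standing in for the parser's internal lookup.
        tag = tag.lower()
        method = getattr(self, "start_" + tag, None)
        if method is None:
            method = getattr(self, "do_" + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)


class LinkCollector(StartTagModel):
    def handle_starttag(self, tag, method, attributes):
        # Extend the hook: record the tag, then defer to the base behaviour.
        self.seen.append(("handled", tag))
        StartTagModel.handle_starttag(self, tag, method, attributes)

    def start_a(self, attributes):
        # attributes is a list of (name, value) pairs.
        self.seen.append(("href", dict(attributes).get("href")))
```

In this model, dispatching the tag \code{A} with an \code{href} attribute
reaches \method{start_a()} through the overridden hook, while a tag with no
handler goes to \method{unknown_starttag()} instead.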
 |  | 
 | \begin{methoddesc}{handle_endtag}{tag, method} | 
 | This method is called to handle end tags for which an | 
 | \method{end_\var{tag}()} method has been defined.  The | 
 | \var{tag} argument is the name of the tag converted to lower case, and | 
 | the \var{method} argument is the bound method which should be used to | 
 | support semantic interpretation of the end tag.  If no | 
 | \method{end_\var{tag}()} method is defined for the closing element, | 
 | this handler is not called.  The base implementation simply calls | 
 | \var{method}. | 
 | \end{methoddesc} | 
 |  | 
 | \begin{methoddesc}{handle_data}{data} | 
 | This method is called to process arbitrary data.  It is intended to be | 
 | overridden by a derived class; the base class implementation does | 
 | nothing. | 
 | \end{methoddesc} | 
 |  | 
 | \begin{methoddesc}{handle_charref}{ref} | 
 | This method is called to process a character reference of the form | 
 | \samp{\&\#\var{ref};}.  In the base implementation, \var{ref} must | 
 | be a decimal number in the range 0--255; the method converts it to | 
 | the corresponding character and calls the | 
 | method \method{handle_data()} with the character as argument.  If | 
 | \var{ref} is invalid or out of range, the method | 
 | \code{unknown_charref(\var{ref})} is called to handle the error.  A | 
 | subclass must override this method to provide support for named | 
 | character entities. | 
 | \end{methoddesc} | 
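
The behaviour described for the base \method{handle_charref()} can be
sketched as follows.  This is a minimal model under the assumptions stated
in the text (decimal references in the range 0 to 255), not the module's
source; the class \class{CharRefModel} and its two lists are hypothetical.

```python
class CharRefModel:
    """Sketch of the documented handle_charref() logic; names are illustrative."""

    def __init__(self):
        self.data = []      # characters passed to handle_data()
        self.unknown = []   # references passed to unknown_charref()

    def handle_data(self, data):
        self.data.append(data)

    def unknown_charref(self, ref):
        self.unknown.append(ref)

    def handle_charref(self, ref):
        try:
            n = int(ref)            # only decimal numbers are accepted
        except ValueError:
            self.unknown_charref(ref)
            return
        if not 0 <= n <= 255:       # outside the supported range
            self.unknown_charref(ref)
            return
        self.handle_data(chr(n))
```

A subclass that wants to accept other reference forms would replace
\method{handle_charref()} and fall back to this logic for plain decimal
references.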
 |  | 
 | \begin{methoddesc}{handle_entityref}{ref} | 
 | This method is called to process a general entity reference of the | 
 | form \samp{\&\var{ref};} where \var{ref} is the name of a general | 
 | entity.  It looks for \var{ref} in the instance (or class) | 
 | variable \member{entitydefs} which should be a mapping from entity | 
 | names to corresponding translations.  If a translation is found, it | 
 | calls the method \method{handle_data()} with the translation; | 
 | otherwise, it calls the method \code{unknown_entityref(\var{ref})}. | 
 | The default \member{entitydefs} defines translations for | 
 | \code{\&amp;}, \code{\&apos;}, \code{\&gt;}, \code{\&lt;}, and | 
 | \code{\&quot;}. | 
 | \end{methoddesc} | 
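
The lookup in \member{entitydefs} can be modelled in a few lines.  The
sketch below is illustrative rather than the module's source: the classes
\class{EntityRefModel} and \class{WithNbsp} are hypothetical, and the
\code{nbsp} entry is an example addition, not a default.

```python
class EntityRefModel:
    """Sketch of the documented handle_entityref() lookup."""

    # Mapping from entity names to their translations.
    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    def __init__(self):
        self.data = []
        self.unknown = []

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)

    def handle_entityref(self, ref):
        table = self.entitydefs     # instance attribute wins over class attribute
        if ref in table:
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)


class WithNbsp(EntityRefModel):
    # Copy the defaults and add one entry (hypothetical example).
    entitydefs = dict(EntityRefModel.entitydefs, nbsp=" ")
```

Copying the mapping in the subclass, instead of mutating the inherited one,
keeps the defaults of the base class unchanged.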
 |  | 
 | \begin{methoddesc}{handle_comment}{comment} | 
 | This method is called when a comment is encountered.  The | 
 | \var{comment} argument is a string containing the text between the | 
 | \samp{<!--} and \samp{-->} delimiters, but not the delimiters | 
 | themselves.  For example, the comment \samp{<!--text-->} will | 
 | cause this method to be called with the argument \code{'text'}.  The | 
 | default method does nothing. | 
 | \end{methoddesc} | 
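
As an illustration of what \method{handle_comment()} receives, the sketch
below extracts comment text with a regular expression that tolerates
whitespace between the closing \samp{--} and \samp{>}.  The function
\function{find_comments()} and the pattern are assumptions made for this
example; they are not taken from the module.

```python
import re

# "<!--", the shortest possible text, then "--", optional whitespace, ">".
COMMENT = re.compile(r"<!--(.*?)--\s*>", re.DOTALL)


def find_comments(text):
    """Return the text of each comment, without the delimiters."""
    return COMMENT.findall(text)
```

Each returned string corresponds to the \var{comment} argument that would be
passed for that comment.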
 |  | 
 | \begin{methoddesc}{handle_decl}{data} | 
 | Method called when an SGML declaration is read by the parser.  In | 
 | practice, the \code{DOCTYPE} declaration is the only thing observed in | 
 | HTML, but the parser does not discriminate among different (or broken) | 
 | declarations.  Internal subsets in a \code{DOCTYPE} declaration are | 
 | not supported.  The \var{data} parameter will be the entire contents | 
 | of the declaration inside the \code{<!}...\code{>} markup.  The | 
 | default implementation does nothing. | 
 | \end{methoddesc} | 
 |  | 
 | \begin{methoddesc}{report_unbalanced}{tag} | 
 | This method is called when an end tag is found which does not | 
 | correspond to any open element. | 
 | \end{methoddesc} | 
 |  | 
 | \begin{methoddesc}{unknown_starttag}{tag, attributes} | 
 | This method is called to process an unknown start tag.  It is intended | 
 | to be overridden by a derived class; the base class implementation | 
 | does nothing. | 
 | \end{methoddesc} | 
 |  | 
 | \begin{methoddesc}{unknown_endtag}{tag} | 
 | This method is called to process an unknown end tag.  It is intended | 
 | to be overridden by a derived class; the base class implementation | 
 | does nothing. | 
 | \end{methoddesc} | 
 |  | 
 | \begin{methoddesc}{unknown_charref}{ref} | 
 | This method is called to process unresolvable numeric character | 
 | references.  Refer to \method{handle_charref()} to determine what is | 
 | handled by default.  It is intended to be overridden by a derived | 
 | class; the base class implementation does nothing. | 
 | \end{methoddesc} | 
 |  | 
 | \begin{methoddesc}{unknown_entityref}{ref} | 
 | This method is called to process an unknown entity reference.  It is | 
 | intended to be overridden by a derived class; the base class | 
 | implementation does nothing. | 
 | \end{methoddesc} | 
 |  | 
 | Apart from overriding or extending the methods listed above, derived | 
 | classes may also define methods of the following form to define | 
 | processing of specific tags.  Tag names in the input stream are case | 
 | independent; the \var{tag} occurring in method names must be in lower | 
 | case: | 
 |  | 
 | \begin{methoddescni}{start_\var{tag}}{attributes} | 
 | This method is called to process an opening tag \var{tag}.  It takes | 
 | precedence over \method{do_\var{tag}()}.  The | 
 | \var{attributes} argument has the same meaning as described for | 
 | \method{handle_starttag()} above. | 
 | \end{methoddescni} | 
 |  | 
 | \begin{methoddescni}{do_\var{tag}}{attributes} | 
 | This method is called to process an opening tag \var{tag} that does | 
 | not come with a matching closing tag.  The \var{attributes} argument | 
 | has the same meaning as described for \method{handle_starttag()} above. | 
 | \end{methoddescni} | 
 |  | 
 | \begin{methoddescni}{end_\var{tag}}{} | 
 | This method is called to process a closing tag \var{tag}. | 
 | \end{methoddescni} | 
 |  | 
 | Note that the parser maintains a stack of open elements for which no | 
 | end tag has been found yet.  Only tags processed by | 
 | \method{start_\var{tag}()} are pushed on this stack.  Definition of an | 
 | \method{end_\var{tag}()} method is optional for these tags.  For tags | 
 | processed by \method{do_\var{tag}()} or by \method{unknown_starttag()}, no | 
 | \method{end_\var{tag}()} method needs to be defined; if one is defined, it | 
 | will not be used.  If both \method{start_\var{tag}()} and | 
 | \method{do_\var{tag}()} methods exist for a tag, the | 
 | \method{start_\var{tag}()} method takes precedence. |
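
The stack rules above can be modelled with a short sketch.  It is a
simplified stand-in for the parser, not its source: the classes
\class{TagStackModel} and \class{Demo}, the helpers \method{open_tag()} and
\method{close_tag()}, and the \member{events} list are hypothetical.
Closing any elements opened after the matching one is an assumption of this
sketch.

```python
class TagStackModel:
    """Simplified model of the open-element stack; names are illustrative."""

    def __init__(self):
        self.stack = []     # tags opened through start_<tag>() only
        self.events = []

    def handle_starttag(self, tag, method, attributes):
        method(attributes)

    def handle_endtag(self, tag, method):
        method()

    def unknown_starttag(self, tag, attributes):
        self.events.append("unknown_starttag:" + tag)

    def unknown_endtag(self, tag):
        self.events.append("unknown_endtag:" + tag)

    def report_unbalanced(self, tag):
        self.events.append("unbalanced:" + tag)

    def open_tag(self, tag, attributes=()):
        tag = tag.lower()
        start = getattr(self, "start_" + tag, None)
        do = getattr(self, "do_" + tag, None)
        if start is not None:           # start_<tag>() takes precedence
            self.stack.append(tag)
            self.handle_starttag(tag, start, list(attributes))
        elif do is not None:            # do_<tag>() tags are never stacked
            self.handle_starttag(tag, do, list(attributes))
        else:
            self.unknown_starttag(tag, list(attributes))

    def close_tag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            self.report_unbalanced(tag)
            return
        # Index of the most recently opened matching element.
        found = len(self.stack) - 1 - self.stack[::-1].index(tag)
        # Assumption: elements opened after it are closed as well.
        while len(self.stack) > found:
            top = self.stack.pop()
            end = getattr(self, "end_" + top, None)
            if end is not None:
                self.handle_endtag(top, end)
            else:
                self.unknown_endtag(top)


class Demo(TagStackModel):
    def start_p(self, attributes):
        self.events.append("start_p")

    def end_p(self):
        self.events.append("end_p")

    def do_br(self, attributes):
        self.events.append("do_br")

    def end_br(self):
        self.events.append("end_br")    # never reached: BR is not stacked
```

In this model a closing tag for \code{BR} is reported as unbalanced even
though \method{end_br()} exists, because only tags handled by a
\method{start_\var{tag}()} method are pushed on the stack.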