\section{Standard Module \sectcode{sgmllib}}
\stmodindex{sgmllib}
\index{SGML}

This module defines a class \code{SGMLParser} which serves as the
basis for parsing text files formatted in SGML (Standard Generalized
Mark-up Language). In fact, it does not provide a full SGML parser;
it only parses SGML insofar as it is used by HTML, and the module
exists only as a base for the \code{htmllib} module.
\stmodindex{htmllib}

In particular, the parser is hardcoded to recognize the following
constructs:

\begin{itemize}

\item
Opening and closing tags of the form
``\code{<\var{tag} \var{attr}="\var{value}" ...>}'' and
``\code{</\var{tag}>}'', respectively.

\item
Numeric character references of the form ``\code{\&\#\var{name};}''.

\item
Entity references of the form ``\code{\&\var{name};}''.

\item
SGML comments of the form ``\code{<!--\var{text}-->}''. Note that
spaces, tabs, and newlines are allowed between the trailing
``\code{>}'' and the immediately preceding ``\code{--}''.

\end{itemize}
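
The constructs listed above can be illustrated with a few regular
expressions. The following is only a sketch: it does not import
\code{sgmllib}, and the names \code{CONSTRUCTS} and \code{classify()}
as well as the patterns themselves are assumptions made for this
example; the module's own patterns may differ.

```python
import re

# Hypothetical patterns, written for illustration only.
CONSTRUCTS = [
    # Whitespace is allowed between the closing "--" and ">".
    ("comment", re.compile(r"<!--(.*?)--\s*>", re.DOTALL)),
    ("endtag", re.compile(r"</([A-Za-z][-.A-Za-z0-9]*)\s*>")),
    ("starttag", re.compile(r"<([A-Za-z][-.A-Za-z0-9]*)([^>]*)>")),
    ("charref", re.compile(r"&#([0-9]+);")),
    ("entityref", re.compile(r"&([A-Za-z][A-Za-z0-9]*);")),
]


def classify(text):
    """Return (kind, first captured group) for the construct that
    starts at the beginning of text, or None if nothing matches."""
    for kind, pattern in CONSTRUCTS:
        match = pattern.match(text)
        if match:
            return kind, match.group(1)
    return None
```

For example, \code{classify('</A>')} reports an end tag named
\code{'A'}; converting the name to lower case is left to the caller in
this sketch.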

The \code{SGMLParser} class must be instantiated without arguments.
It has the following interface methods:

\renewcommand{\indexsubitem}{({\tt SGMLParser} method)}

\begin{funcdesc}{reset}{}
Reset the instance. Loses all unprocessed data. This is called
implicitly at instantiation time.
\end{funcdesc}

\begin{funcdesc}{setnomoretags}{}
Stop processing tags. Treat all following input as literal input
(CDATA). (This is only provided so the HTML tag \code{<PLAINTEXT>}
can be implemented.)
\end{funcdesc}

\begin{funcdesc}{setliteral}{}
Enter literal mode (CDATA mode).
\end{funcdesc}

\begin{funcdesc}{feed}{data}
Feed some text to the parser. It is processed insofar as it consists
of complete elements; incomplete data is buffered until more data is
fed or \code{close()} is called.
\end{funcdesc}

\begin{funcdesc}{close}{}
Force processing of all buffered data as if it were followed by an
end-of-file mark. This method may be redefined by a derived class to
define additional processing at the end of the input, but the
redefined version should always call \code{SGMLParser.close()}.
\end{funcdesc}
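
The interplay of \code{feed()} and \code{close()} can be sketched
with a small standalone class. \code{BufferingFeeder} and
\code{CountingFeeder} are hypothetical names used only here; they do
not import \code{sgmllib} and merely imitate the buffering behaviour
described above, treating anything between ``\code{<}'' and
``\code{>}'' as one piece of markup.

```python
class BufferingFeeder:
    """Hypothetical stand-in for the feed()/close() protocol."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Discards any unprocessed data.
        self.rawdata = ""
        self.events = []

    def feed(self, data):
        self.rawdata += data
        self._process(final=False)

    def close(self):
        # Treat the buffer as if it were followed by end-of-file.
        self._process(final=True)

    def _process(self, final):
        text = self.rawdata
        pos = 0
        while pos < len(text):
            if text[pos] == "<":
                end = text.find(">", pos)
                if end < 0:
                    if not final:
                        break  # incomplete markup: wait for more data
                    # At end of input, pass the remainder on as data.
                    self.events.append(("data", text[pos:]))
                    pos = len(text)
                    continue
                self.events.append(("markup", text[pos:end + 1]))
                pos = end + 1
            else:
                nxt = text.find("<", pos)
                if nxt < 0:
                    nxt = len(text)
                self.events.append(("data", text[pos:nxt]))
                pos = nxt
        self.rawdata = text[pos:]


class CountingFeeder(BufferingFeeder):
    def close(self):
        # A redefined close() should always call the base version.
        BufferingFeeder.close(self)
        self.events.append(("eof", len(self.events)))
```

Feeding \code{'Hello <b'} processes only the text before the
unfinished tag; the remainder stays buffered until the next
\code{feed()} or \code{close()} call.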
67
Fred Drake42439ad1996-10-08 21:51:49 +000068\begin{funcdesc}{handle_starttag}{tag\, method\, attributes}
69This method is called to handle start tags for which either a
70\code{start_\var{tag}()} or \code{do_\var{tag}()} method has been
71defined. The \code{tag} argument is the name of the tag converted to
72lower case, and the \code{method} argument is the bound method which
73should be used to support semantic interpretation of the start tag.
74The \var{attributes} argument is a list of (\var{name}, \var{value})
75pairs containing the attributes found inside the tag's \code{<>}
76brackets. The \var{name} has been translated to lower case and double
77quotes and backslashes in the \var{value} have been interpreted. For
78instance, for the tag \code{<A HREF="http://www.cwi.nl/">}, this
79method would be called as \code{unknown_starttag('a', [('href',
80'http://www.cwi.nl/')])}. The base implementation simply calls
81\code{method} with \code{attributes} as the only argument.
Guido van Rossum86751151995-02-28 17:14:32 +000082\end{funcdesc}

\begin{funcdesc}{handle_endtag}{tag\, method}
This method is called to handle end tags for which an
\code{end_\var{tag}()} method has been defined. The \code{tag}
argument is the name of the tag converted to lower case, and the
\code{method} argument is the bound method which should be used to
support semantic interpretation of the end tag. If no
\code{end_\var{tag}()} method is defined for the closing element, this
handler is not called. The base implementation simply calls
\code{method}.
\end{funcdesc}

\begin{funcdesc}{handle_data}{data}
This method is called to process arbitrary data. It is intended to be
overridden by a derived class; the base class implementation does
nothing.
\end{funcdesc}

\begin{funcdesc}{handle_charref}{ref}
This method is called to process a character reference of the form
``\code{\&\#\var{ref};}''. In the base implementation, \var{ref} must
be a decimal number in the range 0--255. It translates the character
to \ASCII{} and calls the method \code{handle_data()} with the
character as argument. If \var{ref} is invalid or out of range, the
method \code{unknown_charref(\var{ref})} is called to handle the
error. A subclass must override this method to provide support for
named character entities.
\end{funcdesc}
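
The behaviour described for \code{handle_charref()} can be sketched
as follows. \code{CharRefSketch} is a hypothetical standalone class
written for this example; it does not import \code{sgmllib}, and the
lists \code{data} and \code{unknown} exist only to make the calls
observable.

```python
class CharRefSketch:
    """Hypothetical stand-in mirroring the description above."""

    def __init__(self):
        self.data = []      # arguments passed to handle_data()
        self.unknown = []   # arguments passed to unknown_charref()

    def handle_data(self, data):
        self.data.append(data)

    def unknown_charref(self, ref):
        self.unknown.append(ref)

    def handle_charref(self, ref):
        # Accept only plain decimal digits.
        if not (ref.isascii() and ref.isdigit()):
            self.unknown_charref(ref)
            return
        n = int(ref)
        if not 0 <= n <= 255:
            self.unknown_charref(ref)
            return
        self.handle_data(chr(n))
```

With this sketch, \code{handle_charref('65')} passes \code{'A'} to
\code{handle_data()}, while \code{'256'} and \code{'x41'} are passed
to \code{unknown_charref()}.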

\begin{funcdesc}{handle_entityref}{ref}
This method is called to process a general entity reference of the form
``\code{\&\var{ref};}'' where \var{ref} is the name of a general
entity. It looks for \var{ref} in the instance (or class)
variable \code{entitydefs} which should be a mapping from entity names
to corresponding translations.
If a translation is found, it calls the method \code{handle_data()}
with the translation; otherwise, it calls the method
\code{unknown_entityref(\var{ref})}. The default \code{entitydefs}
defines translations for \code{\&amp;}, \code{\&apos;}, \code{\&gt;},
\code{\&lt;}, and \code{\&quot;}.
\end{funcdesc}
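
The lookup in \code{entitydefs} can be sketched in the same way.
\code{EntityRefSketch} is again a hypothetical standalone class, and
the contents of its \code{entitydefs} mapping are an assumption based
on the five entities named above.

```python
class EntityRefSketch:
    """Hypothetical stand-in mirroring the description above."""

    # Assumed default table: entity name -> translation.
    entitydefs = {"lt": "<", "gt": ">", "amp": "&",
                  "quot": '"', "apos": "'"}

    def __init__(self):
        self.data = []
        self.unknown = []

    def handle_data(self, data):
        self.data.append(data)

    def unknown_entityref(self, ref):
        self.unknown.append(ref)

    def handle_entityref(self, ref):
        # Attribute lookup finds an instance variable first,
        # then falls back to the class variable.
        table = self.entitydefs
        if ref in table:
            self.handle_data(table[ref])
        else:
            self.unknown_entityref(ref)
```

Because the table is found through ordinary attribute lookup, assigning
a new mapping to \code{entitydefs} on an instance extends the
translations for that instance only.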

\begin{funcdesc}{handle_comment}{comment}
This method is called when a comment is encountered. The
\code{comment} argument is a string containing the text between the
``\code{<!--}'' and ``\code{-->}'' delimiters, but not the delimiters
themselves. For example, the comment ``\code{<!--text-->}'' will
cause this method to be called with the argument \code{'text'}. The
default method does nothing.
\end{funcdesc}

\begin{funcdesc}{report_unbalanced}{tag}
This method is called when an end tag is found which does not
correspond to any open element.
\end{funcdesc}
140
Guido van Rossum86751151995-02-28 17:14:32 +0000141\begin{funcdesc}{unknown_starttag}{tag\, attributes}
142This method is called to process an unknown start tag. It is intended
143to be overridden by a derived class; the base class implementation
Fred Drake42439ad1996-10-08 21:51:49 +0000144does nothing.
Guido van Rossum86751151995-02-28 17:14:32 +0000145\end{funcdesc}
146
147\begin{funcdesc}{unknown_endtag}{tag}
148This method is called to process an unknown end tag. It is intended
149to be overridden by a derived class; the base class implementation
150does nothing.
151\end{funcdesc}
152
153\begin{funcdesc}{unknown_charref}{ref}
Fred Drake42439ad1996-10-08 21:51:49 +0000154This method is called to process unresolvable numeric character
155references. It is intended to be overridden by a derived class; the
156base class implementation does nothing.
Guido van Rossum86751151995-02-28 17:14:32 +0000157\end{funcdesc}
158
159\begin{funcdesc}{unknown_entityref}{ref}
160This method is called to process an unknown entity reference. It is
161intended to be overridden by a derived class; the base class
162implementation does nothing.
163\end{funcdesc}

Apart from overriding or extending the methods listed above, derived
classes may also define methods of the following form to define
processing of specific tags. Tag names in the input stream are case
independent; the \var{tag} occurring in method names must be in lower
case:

\begin{funcdesc}{start_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag}. It takes
precedence over \code{do_\var{tag}()}. The \var{attributes} argument
has the same meaning as described for \code{handle_starttag()} above.
\end{funcdesc}

\begin{funcdesc}{do_\var{tag}}{attributes}
This method is called to process an opening tag \var{tag} that does
not come with a matching closing tag. The \var{attributes} argument
has the same meaning as described for \code{handle_starttag()} above.
\end{funcdesc}

\begin{funcdesc}{end_\var{tag}}{}
This method is called to process a closing tag \var{tag}.
\end{funcdesc}
186
Fred Drake42439ad1996-10-08 21:51:49 +0000187Note that the parser maintains a stack of open elements for which no
188end tag has been found yet. Only tags processed by
189\code{start_\var{tag}()} are pushed on this stack. Definition of an
Guido van Rossum86751151995-02-28 17:14:32 +0000190\code{end_\var{tag}()} method is optional for these tags. For tags
191processed by \code{do_\var{tag}()} or by \code{unknown_tag()}, no
Fred Drake42439ad1996-10-08 21:51:49 +0000192\code{end_\var{tag}()} method must be defined; if defined, it will not
193be used. If both \code{start_\var{tag}()} and \code{do_\var{tag}()}
194methods exist for a tag, the \code{start_\var{tag}()} method takes
195precedence.
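
The dispatch rules and the stack of open elements can be sketched as
follows. \code{DispatchSketch}, \code{Demo}, \code{finish_starttag()}
and \code{finish_endtag()} are hypothetical names chosen for this
example, and nothing here imports \code{sgmllib}. One assumption goes
beyond the text above: an end tag that is not open is reported through
\code{report_unbalanced()} when a \code{start_\var{tag}()} method
exists for it, and through \code{unknown_endtag()} otherwise.

```python
class DispatchSketch:
    """Hypothetical dispatcher illustrating the rules above."""

    def __init__(self):
        self.stack = []  # open elements still awaiting an end tag

    def finish_starttag(self, tag, attributes):
        tag = tag.lower()
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            self.stack.append(tag)  # only start_ tags are pushed
        else:
            method = getattr(self, "do_" + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)

    def finish_endtag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            # Assumption of this sketch, see the lead-in.
            if hasattr(self, "start_" + tag):
                self.report_unbalanced(tag)
            else:
                self.unknown_endtag(tag)
            return
        # Close the element and anything still open inside it.
        while True:
            open_tag = self.stack.pop()
            method = getattr(self, "end_" + open_tag, None)
            if method is not None:  # end_ methods are optional
                self.handle_endtag(open_tag, method)
            if open_tag == tag:
                break

    def handle_starttag(self, tag, method, attributes):
        method(attributes)

    def handle_endtag(self, tag, method):
        method()

    def unknown_starttag(self, tag, attributes):
        pass

    def unknown_endtag(self, tag):
        pass

    def report_unbalanced(self, tag):
        pass


class Demo(DispatchSketch):
    def __init__(self):
        DispatchSketch.__init__(self)
        self.log = []

    def start_a(self, attributes):
        self.log.append(("start_a", attributes))

    def do_a(self, attributes):  # never used: start_a takes precedence
        self.log.append(("do_a", attributes))

    def end_a(self):
        self.log.append(("end_a",))

    def do_br(self, attributes):
        self.log.append(("do_br", attributes))

    def end_br(self):  # never used: br is handled by do_br
        self.log.append(("end_br",))

    def unknown_endtag(self, tag):
        self.log.append(("unknown_endtag", tag))

    def report_unbalanced(self, tag):
        self.log.append(("report_unbalanced", tag))
```

In \code{Demo}, a \code{BR} start tag reaches \code{do_br()} without
touching the stack, so a later \code{</BR>} never calls
\code{end_br()}.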