blob: 15b46ae57ddf29d6c0d4647a2271f482fc9979a2 [file] [log] [blame]
Fred Drake38e5d272000-04-03 20:13:55 +00001\section{\module{parser} ---
2 Access Python parse trees}
3
Guido van Rossum4b73a061995-10-11 17:30:04 +00004% Copyright 1995 Virginia Polytechnic Institute and State University
5% and Fred L. Drake, Jr. This copyright notice must be distributed on
6% all copies, but this document otherwise may be distributed as part
7% of the Python distribution. No fee may be charged for this document
8% in any representation, either on paper or electronically. This
9% restriction does not affect other elements in a distributed package
10% in any way.
Fred Drake9f033801999-02-19 22:56:08 +000011
Fred Drakeb91e9341998-07-23 17:59:49 +000012\declaremodule{builtin}{parser}
Fred Drake9f033801999-02-19 22:56:08 +000013\modulesynopsis{Access parse trees for Python source code.}
Fred Drake295da241998-08-10 19:42:37 +000014\moduleauthor{Fred L. Drake, Jr.}{fdrake@acm.org}
15\sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}
Fred Drakeb91e9341998-07-23 17:59:49 +000016
Fred Drakeb91e9341998-07-23 17:59:49 +000017
Fred Drake88223901998-02-09 20:52:48 +000018\index{parsing!Python source code}
Guido van Rossum4b73a061995-10-11 17:30:04 +000019
Fred Drake88223901998-02-09 20:52:48 +000020The \module{parser} module provides an interface to Python's internal
Guido van Rossum4b73a061995-10-11 17:30:04 +000021parser and byte-code compiler. The primary purpose for this interface
22is to allow Python code to edit the parse tree of a Python expression
Fred Drake4b7d5a41996-09-11 21:57:40 +000023and create executable code from this. This is better than trying
24to parse and modify an arbitrary Python code fragment as a string
25because parsing is performed in a manner identical to the code
26forming the application. It is also faster.
Guido van Rossum4b73a061995-10-11 17:30:04 +000027
28There are a few things to note about this module which are important
29to making use of the data structures created. This is not a tutorial
Fred Drake4b7d5a41996-09-11 21:57:40 +000030on editing the parse trees for Python code, but some examples of using
Fred Drake88223901998-02-09 20:52:48 +000031the \module{parser} module are presented.
Guido van Rossum4b73a061995-10-11 17:30:04 +000032
33Most importantly, a good understanding of the Python grammar processed
34by the internal parser is required. For full information on the
Fred Drake37f15741999-11-10 16:21:37 +000035language syntax, refer to the \citetitle[../ref/ref.html]{Python
36Language Reference}. The parser itself is created from a grammar
37specification defined in the file \file{Grammar/Grammar} in the
38standard Python distribution. The parse trees stored in the AST
39objects created by this module are the actual output from the internal
40parser when created by the \function{expr()} or \function{suite()}
41functions, described below. The AST objects created by
42\function{sequence2ast()} faithfully simulate those structures. Be
43aware that the values of the sequences which are considered
44``correct'' will vary from one version of Python to another as the
45formal grammar for the language is revised. However, transporting
46code from one Python version to another as source text will always
47allow correct parse trees to be created in the target version, with
48the only restriction being that migrating to an older version of the
49interpreter will not support more recent language constructs. The
50parse trees are not typically compatible from one version to another,
51whereas source code has always been forward-compatible.
Guido van Rossum4b73a061995-10-11 17:30:04 +000052
Fred Drake88223901998-02-09 20:52:48 +000053Each element of the sequences returned by \function{ast2list()} or
54\function{ast2tuple()} has a simple form. Sequences representing
Guido van Rossum47478871996-08-21 14:32:37 +000055non-terminal elements in the grammar always have a length greater than
56one. The first element is an integer which identifies a production in
57the grammar. These integers are given symbolic names in the C header
Fred Drake4b7d5a41996-09-11 21:57:40 +000058file \file{Include/graminit.h} and the Python module
Fred Drakeffbe6871999-04-22 21:23:22 +000059\refmodule{symbol}. Each additional element of the sequence represents
Guido van Rossum47478871996-08-21 14:32:37 +000060a component of the production as recognized in the input string: these
61are always sequences which have the same form as the parent. An
62important aspect of this structure which should be noted is that
63keywords used to identify the parent node type, such as the keyword
Fred Drake88223901998-02-09 20:52:48 +000064\keyword{if} in an \constant{if_stmt}, are included in the node tree without
65any special treatment. For example, the \keyword{if} keyword is
Guido van Rossum4b73a061995-10-11 17:30:04 +000066represented by the tuple \code{(1, 'if')}, where \code{1} is the
Fred Drakec71b8021999-08-02 14:30:52 +000067numeric value associated with all \constant{NAME} tokens, including
Guido van Rossum47478871996-08-21 14:32:37 +000068variable and function names defined by the user. In an alternate form
69returned when line number information is requested, the same token
70might be represented as \code{(1, 'if', 12)}, where the \code{12}
71represents the line number at which the terminal symbol was found.
Guido van Rossum4b73a061995-10-11 17:30:04 +000072
73Terminal elements are represented in much the same way, but without
74any child elements and the addition of the source text which was
Fred Drake88223901998-02-09 20:52:48 +000075identified. The example of the \keyword{if} keyword above is
Guido van Rossum4b73a061995-10-11 17:30:04 +000076representative. The various types of terminal symbols are defined in
Fred Drake4b7d5a41996-09-11 21:57:40 +000077the C header file \file{Include/token.h} and the Python module
Fred Drakeffbe6871999-04-22 21:23:22 +000078\refmodule{token}.
Guido van Rossum4b73a061995-10-11 17:30:04 +000079
Fred Drake4b7d5a41996-09-11 21:57:40 +000080The AST objects are not required to support the functionality of this
81module, but are provided for three purposes: to allow an application
82to amortize the cost of processing complex parse trees, to provide a
83parse tree representation which conserves memory space when compared
84to the Python list or tuple representation, and to ease the creation
85of additional modules in C which manipulate parse trees. A simple
86``wrapper'' class may be created in Python to hide the use of AST
Fred Drake88223901998-02-09 20:52:48 +000087objects.
Guido van Rossum4b73a061995-10-11 17:30:04 +000088
Fred Drake88223901998-02-09 20:52:48 +000089The \module{parser} module defines functions for a few distinct
Fred Drake4b7d5a41996-09-11 21:57:40 +000090purposes. The most important purposes are to create AST objects and
91to convert AST objects to other representations such as parse trees
92and compiled code objects, but there are also functions which serve to
93query the type of parse tree represented by an AST object.
Guido van Rossum4b73a061995-10-11 17:30:04 +000094
Fred Drake4b7d5a41996-09-11 21:57:40 +000095
Fred Drake03c05a51999-05-11 15:15:54 +000096\begin{seealso}
97 \seemodule{symbol}{Useful constants representing internal nodes of
98 the parse tree.}
99 \seemodule{token}{Useful constants representing leaf nodes of the
100 parse tree and functions for testing node values.}
101\end{seealso}
102
103
Fred Drakec71b8021999-08-02 14:30:52 +0000104\subsection{Creating AST Objects \label{Creating ASTs}}
Fred Drake4b7d5a41996-09-11 21:57:40 +0000105
106AST objects may be created from source code or from a parse tree.
107When creating an AST object from source, different functions are used
108to create the \code{'eval'} and \code{'exec'} forms.
109
Fred Drake244ad3c1999-09-09 14:16:36 +0000110\begin{funcdesc}{expr}{source}
111The \function{expr()} function parses the parameter \var{source}
Fred Drake625d70a2000-05-09 17:10:23 +0000112as if it were an input to \samp{compile(\var{source}, 'file.py',
113'eval')}. If the parse succeeds, an AST object is created to hold the
114internal parse tree representation, otherwise an appropriate exception
115is thrown.
Fred Drake4b7d5a41996-09-11 21:57:40 +0000116\end{funcdesc}
117
Fred Drake244ad3c1999-09-09 14:16:36 +0000118\begin{funcdesc}{suite}{source}
119The \function{suite()} function parses the parameter \var{source}
Fred Drake625d70a2000-05-09 17:10:23 +0000120as if it were an input to \samp{compile(\var{source}, 'file.py',
121'exec')}. If the parse succeeds, an AST object is created to hold the
122internal parse tree representation, otherwise an appropriate exception
123is thrown.
Fred Drake4b7d5a41996-09-11 21:57:40 +0000124\end{funcdesc}
125
126\begin{funcdesc}{sequence2ast}{sequence}
127This function accepts a parse tree represented as a sequence and
128builds an internal representation if possible. If it can validate
129that the tree conforms to the Python grammar and all nodes are valid
130node types in the host version of Python, an AST object is created
131from the internal representation and returned to the called. If there
132is a problem creating the internal representation, or if the tree
Fred Drake88223901998-02-09 20:52:48 +0000133cannot be validated, a \exception{ParserError} exception is thrown. An AST
Fred Drake4b7d5a41996-09-11 21:57:40 +0000134object created this way should not be assumed to compile correctly;
135normal exceptions thrown by compilation may still be initiated when
Fred Drake88223901998-02-09 20:52:48 +0000136the AST object is passed to \function{compileast()}. This may indicate
137problems not related to syntax (such as a \exception{MemoryError}
Fred Drake4b7d5a41996-09-11 21:57:40 +0000138exception), but may also be due to constructs such as the result of
139parsing \code{del f(0)}, which escapes the Python parser but is
140checked by the bytecode compiler.
141
142Sequences representing terminal tokens may be represented as either
143two-element lists of the form \code{(1, 'name')} or as three-element
144lists of the form \code{(1, 'name', 56)}. If the third element is
145present, it is assumed to be a valid line number. The line number
146may be specified for any subset of the terminal symbols in the input
147tree.
148\end{funcdesc}
149
150\begin{funcdesc}{tuple2ast}{sequence}
Fred Drake88223901998-02-09 20:52:48 +0000151This is the same function as \function{sequence2ast()}. This entry point
Fred Drake4b7d5a41996-09-11 21:57:40 +0000152is maintained for backward compatibility.
153\end{funcdesc}
154
155
Fred Drakec71b8021999-08-02 14:30:52 +0000156\subsection{Converting AST Objects \label{Converting ASTs}}
Fred Drake4b7d5a41996-09-11 21:57:40 +0000157
158AST objects, regardless of the input used to create them, may be
159converted to parse trees represented as list- or tuple- trees, or may
160be compiled into executable code objects. Parse trees may be
161extracted with or without line numbering information.
162
Fred Drake5bd7fcc1998-04-03 05:31:45 +0000163\begin{funcdesc}{ast2list}{ast\optional{, line_info}}
Guido van Rossum4b73a061995-10-11 17:30:04 +0000164This function accepts an AST object from the caller in
Fred Drakec71b8021999-08-02 14:30:52 +0000165\var{ast} and returns a Python list representing the
Fred Drake38e5d272000-04-03 20:13:55 +0000166equivalent parse tree. The resulting list representation can be used
Fred Drake4b7d5a41996-09-11 21:57:40 +0000167for inspection or the creation of a new parse tree in list form. This
168function does not fail so long as memory is available to build the
169list representation. If the parse tree will only be used for
Fred Drake88223901998-02-09 20:52:48 +0000170inspection, \function{ast2tuple()} should be used instead to reduce memory
Fred Drake4b7d5a41996-09-11 21:57:40 +0000171consumption and fragmentation. When the list representation is
172required, this function is significantly faster than retrieving a
173tuple representation and converting that to nested lists.
Guido van Rossum47478871996-08-21 14:32:37 +0000174
Fred Drakec71b8021999-08-02 14:30:52 +0000175If \var{line_info} is true, line number information will be
Fred Drake4b7d5a41996-09-11 21:57:40 +0000176included for all terminal tokens as a third element of the list
Fred Drake9abe64a1996-12-05 22:28:43 +0000177representing the token. Note that the line number provided specifies
Fred Drake88223901998-02-09 20:52:48 +0000178the line on which the token \emph{ends}. This information is
Fred Drake9abe64a1996-12-05 22:28:43 +0000179omitted if the flag is false or omitted.
Guido van Rossum4b73a061995-10-11 17:30:04 +0000180\end{funcdesc}
181
Fred Drake5bd7fcc1998-04-03 05:31:45 +0000182\begin{funcdesc}{ast2tuple}{ast\optional{, line_info}}
Guido van Rossum47478871996-08-21 14:32:37 +0000183This function accepts an AST object from the caller in
Fred Drakec71b8021999-08-02 14:30:52 +0000184\var{ast} and returns a Python tuple representing the
Fred Drake38e5d272000-04-03 20:13:55 +0000185equivalent parse tree. Other than returning a tuple instead of a
Fred Drake88223901998-02-09 20:52:48 +0000186list, this function is identical to \function{ast2list()}.
Guido van Rossum4b73a061995-10-11 17:30:04 +0000187
Fred Drakec71b8021999-08-02 14:30:52 +0000188If \var{line_info} is true, line number information will be
Fred Drake4b7d5a41996-09-11 21:57:40 +0000189included for all terminal tokens as a third element of the list
190representing the token. This information is omitted if the flag is
191false or omitted.
Guido van Rossum47478871996-08-21 14:32:37 +0000192\end{funcdesc}
193
Fred Drakecce10901998-03-17 06:33:25 +0000194\begin{funcdesc}{compileast}{ast\optional{, filename\code{ = '<ast>'}}}
Guido van Rossum4b73a061995-10-11 17:30:04 +0000195The Python byte compiler can be invoked on an AST object to produce
Fred Drakec71b8021999-08-02 14:30:52 +0000196code objects which can be used as part of an \keyword{exec} statement or
Fred Drake88223901998-02-09 20:52:48 +0000197a call to the built-in \function{eval()}\bifuncindex{eval} function.
198This function provides the interface to the compiler, passing the
Fred Drakec71b8021999-08-02 14:30:52 +0000199internal parse tree from \var{ast} to the parser, using the
200source file name specified by the \var{filename} parameter.
201The default value supplied for \var{filename} indicates that
Fred Drake88223901998-02-09 20:52:48 +0000202the source was an AST object.
Guido van Rossum47478871996-08-21 14:32:37 +0000203
204Compiling an AST object may result in exceptions related to
Fred Drake88223901998-02-09 20:52:48 +0000205compilation; an example would be a \exception{SyntaxError} caused by the
Fred Drake4b7d5a41996-09-11 21:57:40 +0000206parse tree for \code{del f(0)}: this statement is considered legal
Guido van Rossum47478871996-08-21 14:32:37 +0000207within the formal grammar for Python but is not a legal language
Fred Drake88223901998-02-09 20:52:48 +0000208construct. The \exception{SyntaxError} raised for this condition is
Guido van Rossum47478871996-08-21 14:32:37 +0000209actually generated by the Python byte-compiler normally, which is why
Fred Drake88223901998-02-09 20:52:48 +0000210it can be raised at this point by the \module{parser} module. Most
Guido van Rossum47478871996-08-21 14:32:37 +0000211causes of compilation failure can be diagnosed programmatically by
212inspection of the parse tree.
Guido van Rossum4b73a061995-10-11 17:30:04 +0000213\end{funcdesc}
214
215
Fred Drakec71b8021999-08-02 14:30:52 +0000216\subsection{Queries on AST Objects \label{Querying ASTs}}
Guido van Rossum4b73a061995-10-11 17:30:04 +0000217
Fred Drake4b7d5a41996-09-11 21:57:40 +0000218Two functions are provided which allow an application to determine if
Fred Drake5bd7fcc1998-04-03 05:31:45 +0000219an AST was created as an expression or a suite. Neither of these
Fred Drake4b7d5a41996-09-11 21:57:40 +0000220functions can be used to determine if an AST was created from source
Fred Drake88223901998-02-09 20:52:48 +0000221code via \function{expr()} or \function{suite()} or from a parse tree
222via \function{sequence2ast()}.
Guido van Rossum4b73a061995-10-11 17:30:04 +0000223
224\begin{funcdesc}{isexpr}{ast}
Fred Drakec71b8021999-08-02 14:30:52 +0000225When \var{ast} represents an \code{'eval'} form, this function
Fred Drake88223901998-02-09 20:52:48 +0000226returns true, otherwise it returns false. This is useful, since code
227objects normally cannot be queried for this information using existing
228built-in functions. Note that the code objects created by
229\function{compileast()} cannot be queried like this either, and are
230identical to those created by the built-in
231\function{compile()}\bifuncindex{compile} function.
Guido van Rossum4b73a061995-10-11 17:30:04 +0000232\end{funcdesc}
233
234
235\begin{funcdesc}{issuite}{ast}
Fred Drake88223901998-02-09 20:52:48 +0000236This function mirrors \function{isexpr()} in that it reports whether an
Fred Drake4b7d5a41996-09-11 21:57:40 +0000237AST object represents an \code{'exec'} form, commonly known as a
Fred Drake38e5d272000-04-03 20:13:55 +0000238``suite.'' It is not safe to assume that this function is equivalent
Fred Drake88223901998-02-09 20:52:48 +0000239to \samp{not isexpr(\var{ast})}, as additional syntactic fragments may
Fred Drake4b7d5a41996-09-11 21:57:40 +0000240be supported in the future.
Guido van Rossum4b73a061995-10-11 17:30:04 +0000241\end{funcdesc}
242
243
Fred Drakec71b8021999-08-02 14:30:52 +0000244\subsection{Exceptions and Error Handling \label{AST Errors}}
Guido van Rossum4b73a061995-10-11 17:30:04 +0000245
246The parser module defines a single exception, but may also pass other
247built-in exceptions from other portions of the Python runtime
248environment. See each function for information about the exceptions
249it can raise.
250
251\begin{excdesc}{ParserError}
252Exception raised when a failure occurs within the parser module. This
253is generally produced for validation failures rather than the built in
Fred Drake88223901998-02-09 20:52:48 +0000254\exception{SyntaxError} thrown during normal parsing.
Guido van Rossum4b73a061995-10-11 17:30:04 +0000255The exception argument is either a string describing the reason of the
Guido van Rossum47478871996-08-21 14:32:37 +0000256failure or a tuple containing a sequence causing the failure from a parse
Fred Drake88223901998-02-09 20:52:48 +0000257tree passed to \function{sequence2ast()} and an explanatory string. Calls to
258\function{sequence2ast()} need to be able to handle either type of exception,
Guido van Rossum4b73a061995-10-11 17:30:04 +0000259while calls to other functions in the module will only need to be
260aware of the simple string values.
261\end{excdesc}
262
Fred Drake88223901998-02-09 20:52:48 +0000263Note that the functions \function{compileast()}, \function{expr()}, and
264\function{suite()} may throw exceptions which are normally thrown by the
Guido van Rossum4b73a061995-10-11 17:30:04 +0000265parsing and compilation process. These include the built in
Fred Drake88223901998-02-09 20:52:48 +0000266exceptions \exception{MemoryError}, \exception{OverflowError},
267\exception{SyntaxError}, and \exception{SystemError}. In these cases, these
Guido van Rossum4b73a061995-10-11 17:30:04 +0000268exceptions carry all the meaning normally associated with them. Refer
269to the descriptions of each function for detailed information.
270
Guido van Rossum4b73a061995-10-11 17:30:04 +0000271
Fred Drakec71b8021999-08-02 14:30:52 +0000272\subsection{AST Objects \label{AST Objects}}
Guido van Rossum47478871996-08-21 14:32:37 +0000273
Fred Drakeaf370ea1998-04-05 20:23:02 +0000274Ordered and equality comparisons are supported between AST objects.
Fred Drakeffbe6871999-04-22 21:23:22 +0000275Pickling of AST objects (using the \refmodule{pickle} module) is also
Fred Drakec4f1ca11998-04-13 16:27:27 +0000276supported.
Fred Drakeaf370ea1998-04-05 20:23:02 +0000277
Fred Drakecc444e31998-03-08 06:47:24 +0000278\begin{datadesc}{ASTType}
279The type of the objects returned by \function{expr()},
280\function{suite()} and \function{sequence2ast()}.
Fred Drakecc444e31998-03-08 06:47:24 +0000281\end{datadesc}
Guido van Rossum47478871996-08-21 14:32:37 +0000282
283
Fred Drake916d8f81998-04-13 18:46:16 +0000284AST objects have the following methods:
285
286
287\begin{methoddesc}[AST]{compile}{\optional{filename}}
288Same as \code{compileast(\var{ast}, \var{filename})}.
289\end{methoddesc}
290
291\begin{methoddesc}[AST]{isexpr}{}
292Same as \code{isexpr(\var{ast})}.
293\end{methoddesc}
294
295\begin{methoddesc}[AST]{issuite}{}
296Same as \code{issuite(\var{ast})}.
297\end{methoddesc}
298
299\begin{methoddesc}[AST]{tolist}{\optional{line_info}}
300Same as \code{ast2list(\var{ast}, \var{line_info})}.
301\end{methoddesc}
302
303\begin{methoddesc}[AST]{totuple}{\optional{line_info}}
304Same as \code{ast2tuple(\var{ast}, \var{line_info})}.
305\end{methoddesc}
306
307
Fred Drakec71b8021999-08-02 14:30:52 +0000308\subsection{Examples \label{AST Examples}}
Guido van Rossum4b73a061995-10-11 17:30:04 +0000309
Guido van Rossum47478871996-08-21 14:32:37 +0000310The parser modules allows operations to be performed on the parse tree
311of Python source code before the bytecode is generated, and provides
Fred Drake4b7d5a41996-09-11 21:57:40 +0000312for inspection of the parse tree for information gathering purposes.
313Two examples are presented. The simple example demonstrates emulation
Fred Drake88223901998-02-09 20:52:48 +0000314of the \function{compile()}\bifuncindex{compile} built-in function and
315the complex example shows the use of a parse tree for information
316discovery.
Guido van Rossum8206fb91996-08-26 00:33:29 +0000317
Fred Drakeaf370ea1998-04-05 20:23:02 +0000318\subsubsection{Emulation of \function{compile()}}
Guido van Rossum8206fb91996-08-26 00:33:29 +0000319
320While many useful operations may take place between parsing and
Guido van Rossum47478871996-08-21 14:32:37 +0000321bytecode generation, the simplest operation is to do nothing. For
Fred Drake88223901998-02-09 20:52:48 +0000322this purpose, using the \module{parser} module to produce an
Fred Drake38e5d272000-04-03 20:13:55 +0000323intermediate data structure is equivalent to the code
Guido van Rossum47478871996-08-21 14:32:37 +0000324
Fred Drake19479911998-02-13 06:58:54 +0000325\begin{verbatim}
Fred Drake625d70a2000-05-09 17:10:23 +0000326>>> code = compile('a + 5', 'file.py', 'eval')
Guido van Rossum47478871996-08-21 14:32:37 +0000327>>> a = 5
328>>> eval(code)
32910
Fred Drake19479911998-02-13 06:58:54 +0000330\end{verbatim}
Fred Drake5bd7fcc1998-04-03 05:31:45 +0000331
Fred Drake38e5d272000-04-03 20:13:55 +0000332The equivalent operation using the \module{parser} module is somewhat
Guido van Rossum47478871996-08-21 14:32:37 +0000333longer, and allows the intermediate internal parse tree to be retained
334as an AST object:
Guido van Rossum4b73a061995-10-11 17:30:04 +0000335
Fred Drake19479911998-02-13 06:58:54 +0000336\begin{verbatim}
Guido van Rossum4b73a061995-10-11 17:30:04 +0000337>>> import parser
338>>> ast = parser.expr('a + 5')
Fred Drake625d70a2000-05-09 17:10:23 +0000339>>> code = ast.compile('file.py')
Guido van Rossum4b73a061995-10-11 17:30:04 +0000340>>> a = 5
341>>> eval(code)
34210
Fred Drake19479911998-02-13 06:58:54 +0000343\end{verbatim}
Fred Drake5bd7fcc1998-04-03 05:31:45 +0000344
Guido van Rossum8206fb91996-08-26 00:33:29 +0000345An application which needs both AST and code objects can package this
346code into readily available functions:
347
Fred Drake19479911998-02-13 06:58:54 +0000348\begin{verbatim}
Guido van Rossum8206fb91996-08-26 00:33:29 +0000349import parser
350
351def load_suite(source_string):
352 ast = parser.suite(source_string)
Fred Drakec71b8021999-08-02 14:30:52 +0000353 return ast, ast.compile()
Guido van Rossum8206fb91996-08-26 00:33:29 +0000354
355def load_expression(source_string):
356 ast = parser.expr(source_string)
Fred Drakec71b8021999-08-02 14:30:52 +0000357 return ast, ast.compile()
Fred Drake19479911998-02-13 06:58:54 +0000358\end{verbatim}
Fred Drake5bd7fcc1998-04-03 05:31:45 +0000359
Guido van Rossum8206fb91996-08-26 00:33:29 +0000360\subsubsection{Information Discovery}
361
Fred Drake4b7d5a41996-09-11 21:57:40 +0000362Some applications benefit from direct access to the parse tree. The
363remainder of this section demonstrates how the parse tree provides
Fred Drakec71b8021999-08-02 14:30:52 +0000364access to module documentation defined in
365docstrings\index{string!documentation}\index{docstrings} without
366requiring that the code being examined be loaded into a running
367interpreter via \keyword{import}. This can be very useful for
368performing analyses of untrusted code.
Guido van Rossum4b73a061995-10-11 17:30:04 +0000369
Guido van Rossum47478871996-08-21 14:32:37 +0000370Generally, the example will demonstrate how the parse tree may be
371traversed to distill interesting information. Two functions and a set
Fred Drake4b7d5a41996-09-11 21:57:40 +0000372of classes are developed which provide programmatic access to high
Guido van Rossum47478871996-08-21 14:32:37 +0000373level function and class definitions provided by a module. The
374classes extract information from the parse tree and provide access to
375the information at a useful semantic level, one function provides a
376simple low-level pattern matching capability, and the other function
377defines a high-level interface to the classes by handling file
378operations on behalf of the caller. All source files mentioned here
379which are not part of the Python installation are located in the
Fred Drake4b7d5a41996-09-11 21:57:40 +0000380\file{Demo/parser/} directory of the distribution.
Guido van Rossum4b73a061995-10-11 17:30:04 +0000381
Guido van Rossum8206fb91996-08-26 00:33:29 +0000382The dynamic nature of Python allows the programmer a great deal of
383flexibility, but most modules need only a limited measure of this when
384defining classes, functions, and methods. In this example, the only
385definitions that will be considered are those which are defined in the
Fred Drake88223901998-02-09 20:52:48 +0000386top level of their context, e.g., a function defined by a \keyword{def}
Guido van Rossum8206fb91996-08-26 00:33:29 +0000387statement at column zero of a module, but not a function defined
Fred Drakec71b8021999-08-02 14:30:52 +0000388within a branch of an \keyword{if} ... \keyword{else} construct, though
Guido van Rossum8206fb91996-08-26 00:33:29 +0000389there are some good reasons for doing so in some situations. Nesting
390of definitions will be handled by the code developed in the example.
391
Guido van Rossum47478871996-08-21 14:32:37 +0000392To construct the upper-level extraction methods, we need to know what
393the parse tree structure looks like and how much of it we actually
Fred Drake4b7d5a41996-09-11 21:57:40 +0000394need to be concerned about. Python uses a moderately deep parse tree
Guido van Rossum47478871996-08-21 14:32:37 +0000395so there are a large number of intermediate nodes. It is important to
396read and understand the formal grammar used by Python. This is
397specified in the file \file{Grammar/Grammar} in the distribution.
398Consider the simplest case of interest when searching for docstrings:
Guido van Rossum8206fb91996-08-26 00:33:29 +0000399a module consisting of a docstring and nothing else. (See file
400\file{docstring.py}.)
Guido van Rossum4b73a061995-10-11 17:30:04 +0000401
Fred Drake19479911998-02-13 06:58:54 +0000402\begin{verbatim}
Guido van Rossum47478871996-08-21 14:32:37 +0000403"""Some documentation.
404"""
Fred Drake19479911998-02-13 06:58:54 +0000405\end{verbatim}
Fred Drake5bd7fcc1998-04-03 05:31:45 +0000406
Guido van Rossum47478871996-08-21 14:32:37 +0000407Using the interpreter to take a look at the parse tree, we find a
408bewildering mass of numbers and parentheses, with the documentation
Fred Drake4b7d5a41996-09-11 21:57:40 +0000409buried deep in nested tuples.
Guido van Rossum4b73a061995-10-11 17:30:04 +0000410
Fred Drake19479911998-02-13 06:58:54 +0000411\begin{verbatim}
Guido van Rossum47478871996-08-21 14:32:37 +0000412>>> import parser
413>>> import pprint
414>>> ast = parser.suite(open('docstring.py').read())
Fred Drakec71b8021999-08-02 14:30:52 +0000415>>> tup = ast.totuple()
Guido van Rossum47478871996-08-21 14:32:37 +0000416>>> pprint.pprint(tup)
417(257,
418 (264,
419 (265,
420 (266,
421 (267,
422 (307,
423 (287,
424 (288,
425 (289,
426 (290,
427 (292,
428 (293,
429 (294,
430 (295,
431 (296,
432 (297,
433 (298,
434 (299,
Ka-Ping Yeefa004ad2001-01-24 17:19:08 +0000435 (300, (3, '"""Some documentation.\n"""'))))))))))))))))),
Guido van Rossum47478871996-08-21 14:32:37 +0000436 (4, ''))),
437 (4, ''),
438 (0, ''))
Fred Drake19479911998-02-13 06:58:54 +0000439\end{verbatim}
Fred Drake5bd7fcc1998-04-03 05:31:45 +0000440
Guido van Rossum47478871996-08-21 14:32:37 +0000441The numbers at the first element of each node in the tree are the node
442types; they map directly to terminal and non-terminal symbols in the
443grammar. Unfortunately, they are represented as integers in the
444internal representation, and the Python structures generated do not
Fred Drakeffbe6871999-04-22 21:23:22 +0000445change that. However, the \refmodule{symbol} and \refmodule{token} modules
Guido van Rossum47478871996-08-21 14:32:37 +0000446provide symbolic names for the node types and dictionaries which map
447from the integers to the symbolic names for the node types.
448
449In the output presented above, the outermost tuple contains four
450elements: the integer \code{257} and three additional tuples. Node
Fred Drake88223901998-02-09 20:52:48 +0000451type \code{257} has the symbolic name \constant{file_input}. Each of
Guido van Rossum47478871996-08-21 14:32:37 +0000452these inner tuples contains an integer as the first element; these
453integers, \code{264}, \code{4}, and \code{0}, represent the node types
Fred Drake88223901998-02-09 20:52:48 +0000454\constant{stmt}, \constant{NEWLINE}, and \constant{ENDMARKER},
455respectively.
Guido van Rossum47478871996-08-21 14:32:37 +0000456Note that these values may change depending on the version of Python
457you are using; consult \file{symbol.py} and \file{token.py} for
458details of the mapping. It should be fairly clear that the outermost
459node is related primarily to the input source rather than the contents
Fred Drake88223901998-02-09 20:52:48 +0000460of the file, and may be disregarded for the moment. The \constant{stmt}
Guido van Rossum47478871996-08-21 14:32:37 +0000461node is much more interesting. In particular, all docstrings are
462found in subtrees which are formed exactly as this node is formed,
463with the only difference being the string itself. The association
464between the docstring in a similar tree and the defined entity (class,
465function, or module) which it describes is given by the position of
466the docstring subtree within the tree defining the described
467structure.
468
469By replacing the actual docstring with something to signify a variable
Fred Drake4b7d5a41996-09-11 21:57:40 +0000470component of the tree, we allow a simple pattern matching approach to
Fred Drake38e5d272000-04-03 20:13:55 +0000471check any given subtree for equivalence to the general pattern for
Fred Drake4b7d5a41996-09-11 21:57:40 +0000472docstrings. Since the example demonstrates information extraction, we
473can safely require that the tree be in tuple form rather than list
474form, allowing a simple variable representation to be
475\code{['variable_name']}. A simple recursive function can implement
Fred Drake6c81e2a2001-10-01 17:04:10 +0000476the pattern matching, returning a Boolean and a dictionary of variable
Guido van Rossum8206fb91996-08-26 00:33:29 +0000477name to value mappings. (See file \file{example.py}.)
Guido van Rossum47478871996-08-21 14:32:37 +0000478
Fred Drake19479911998-02-13 06:58:54 +0000479\begin{verbatim}
Guido van Rossum47478871996-08-21 14:32:37 +0000480from types import ListType, TupleType
481
482def match(pattern, data, vars=None):
483 if vars is None:
484 vars = {}
485 if type(pattern) is ListType:
486 vars[pattern[0]] = data
487 return 1, vars
488 if type(pattern) is not TupleType:
489 return (pattern == data), vars
490 if len(data) != len(pattern):
491 return 0, vars
492 for pattern, data in map(None, pattern, data):
493 same, vars = match(pattern, data, vars)
494 if not same:
495 break
496 return same, vars
Fred Drake19479911998-02-13 06:58:54 +0000497\end{verbatim}
Fred Drake5bd7fcc1998-04-03 05:31:45 +0000498
Fred Drake4b7d5a41996-09-11 21:57:40 +0000499Using this simple representation for syntactic variables and the symbolic
Guido van Rossum8206fb91996-08-26 00:33:29 +0000500node types, the pattern for the candidate docstring subtrees becomes
501fairly readable. (See file \file{example.py}.)
Guido van Rossum47478871996-08-21 14:32:37 +0000502
Fred Drake19479911998-02-13 06:58:54 +0000503\begin{verbatim}
Guido van Rossum8206fb91996-08-26 00:33:29 +0000504import symbol
505import token
506
507DOCSTRING_STMT_PATTERN = (
508 symbol.stmt,
509 (symbol.simple_stmt,
510 (symbol.small_stmt,
511 (symbol.expr_stmt,
512 (symbol.testlist,
513 (symbol.test,
514 (symbol.and_test,
515 (symbol.not_test,
516 (symbol.comparison,
517 (symbol.expr,
518 (symbol.xor_expr,
519 (symbol.and_expr,
520 (symbol.shift_expr,
521 (symbol.arith_expr,
522 (symbol.term,
523 (symbol.factor,
524 (symbol.power,
525 (symbol.atom,
526 (token.STRING, ['docstring'])
527 )))))))))))))))),
528 (token.NEWLINE, '')
529 ))
Fred Drake19479911998-02-13 06:58:54 +0000530\end{verbatim}
Fred Drake5bd7fcc1998-04-03 05:31:45 +0000531
Fred Drake88223901998-02-09 20:52:48 +0000532Using the \function{match()} function with this pattern, extracting the
Guido van Rossum47478871996-08-21 14:32:37 +0000533module docstring from the parse tree created previously is easy:
534
Fred Drake19479911998-02-13 06:58:54 +0000535\begin{verbatim}
Guido van Rossum47478871996-08-21 14:32:37 +0000536>>> found, vars = match(DOCSTRING_STMT_PATTERN, tup[1])
537>>> found
5381
539>>> vars
Ka-Ping Yeefa004ad2001-01-24 17:19:08 +0000540{'docstring': '"""Some documentation.\n"""'}
Fred Drake19479911998-02-13 06:58:54 +0000541\end{verbatim}
Fred Drake5bd7fcc1998-04-03 05:31:45 +0000542
Guido van Rossum47478871996-08-21 14:32:37 +0000543Once specific data can be extracted from a location where it is
544expected, the question of where information can be expected
545needs to be answered. When dealing with docstrings, the answer is
Fred Drake88223901998-02-09 20:52:48 +0000546fairly simple: the docstring is the first \constant{stmt} node in a code
547block (\constant{file_input} or \constant{suite} node types). A module
548consists of a single \constant{file_input} node, and class and function
549definitions each contain exactly one \constant{suite} node. Classes and
Guido van Rossum47478871996-08-21 14:32:37 +0000550functions are readily identified as subtrees of code block nodes which
551start with \code{(stmt, (compound_stmt, (classdef, ...} or
552\code{(stmt, (compound_stmt, (funcdef, ...}. Note that these subtrees
Fred Drake88223901998-02-09 20:52:48 +0000553cannot be matched by \function{match()} since it does not support multiple
Guido van Rossum47478871996-08-21 14:32:37 +0000554sibling nodes to match without regard to number. A more elaborate
555matching function could be used to overcome this limitation, but this
556is sufficient for the example.
557
Guido van Rossum8206fb91996-08-26 00:33:29 +0000558Given the ability to determine whether a statement might be a
559docstring and extract the actual string from the statement, some work
560needs to be performed to walk the parse tree for an entire module and
561extract information about the names defined in each context of the
562module and associate any docstrings with the names. The code to
563perform this work is not complicated, but bears some explanation.
564
565The public interface to the classes is straightforward and should
566probably be somewhat more flexible. Each ``major'' block of the
567module is described by an object providing several methods for inquiry
568and a constructor which accepts at least the subtree of the complete
Fred Drakeb0df5671998-02-18 15:59:13 +0000569parse tree which it represents. The \class{ModuleInfo} constructor
570accepts an optional \var{name} parameter since it cannot
Guido van Rossum8206fb91996-08-26 00:33:29 +0000571otherwise determine the name of the module.
572
Fred Drake88223901998-02-09 20:52:48 +0000573The public classes include \class{ClassInfo}, \class{FunctionInfo},
574and \class{ModuleInfo}. All objects provide the
575methods \method{get_name()}, \method{get_docstring()},
576\method{get_class_names()}, and \method{get_class_info()}. The
577\class{ClassInfo} objects support \method{get_method_names()} and
578\method{get_method_info()} while the other classes provide
579\method{get_function_names()} and \method{get_function_info()}.
Guido van Rossum8206fb91996-08-26 00:33:29 +0000580
581Within each of the forms of code block that the public classes
582represent, most of the required information is in the same form and is
Fred Drake4b7d5a41996-09-11 21:57:40 +0000583accessed in the same way, with classes having the distinction that
Guido van Rossum8206fb91996-08-26 00:33:29 +0000584functions defined at the top level are referred to as ``methods.''
585Since the difference in nomenclature reflects a real semantic
Fred Drake4b7d5a41996-09-11 21:57:40 +0000586distinction from functions defined outside of a class, the
587implementation needs to maintain the distinction.
Guido van Rossum8206fb91996-08-26 00:33:29 +0000588Hence, most of the functionality of the public classes can be
Fred Drake88223901998-02-09 20:52:48 +0000589implemented in a common base class, \class{SuiteInfoBase}, with the
Guido van Rossum8206fb91996-08-26 00:33:29 +0000590accessors for function and method information provided elsewhere.
591Note that there is only one class which represents function and method
Fred Drake88223901998-02-09 20:52:48 +0000592information; this parallels the use of the \keyword{def} statement to
Fred Drake4b7d5a41996-09-11 21:57:40 +0000593define both types of elements.
Guido van Rossum8206fb91996-08-26 00:33:29 +0000594
Fred Drake88223901998-02-09 20:52:48 +0000595Most of the accessor functions are declared in \class{SuiteInfoBase}
Thomas Woutersf8316632000-07-16 19:01:10 +0000596and do not need to be overridden by subclasses. More importantly, the
Guido van Rossum8206fb91996-08-26 00:33:29 +0000597extraction of most information from a parse tree is handled through a
Fred Drake88223901998-02-09 20:52:48 +0000598method called by the \class{SuiteInfoBase} constructor. The example
Guido van Rossum8206fb91996-08-26 00:33:29 +0000599code for most of the classes is clear when read alongside the formal
600grammar, but the method which recursively creates new information
601objects requires further examination. Here is the relevant part of
Fred Drake88223901998-02-09 20:52:48 +0000602the \class{SuiteInfoBase} definition from \file{example.py}:
Guido van Rossum8206fb91996-08-26 00:33:29 +0000603
Fred Drake19479911998-02-13 06:58:54 +0000604\begin{verbatim}
Guido van Rossum8206fb91996-08-26 00:33:29 +0000605class SuiteInfoBase:
606 _docstring = ''
607 _name = ''
608
609 def __init__(self, tree = None):
610 self._class_info = {}
611 self._function_info = {}
612 if tree:
613 self._extract_info(tree)
614
615 def _extract_info(self, tree):
616 # extract docstring
617 if len(tree) == 2:
618 found, vars = match(DOCSTRING_STMT_PATTERN[1], tree[1])
619 else:
620 found, vars = match(DOCSTRING_STMT_PATTERN, tree[3])
621 if found:
622 self._docstring = eval(vars['docstring'])
623 # discover inner definitions
624 for node in tree[1:]:
625 found, vars = match(COMPOUND_STMT_PATTERN, node)
626 if found:
627 cstmt = vars['compound']
628 if cstmt[0] == symbol.funcdef:
629 name = cstmt[2][1]
630 self._function_info[name] = FunctionInfo(cstmt)
631 elif cstmt[0] == symbol.classdef:
632 name = cstmt[2][1]
633 self._class_info[name] = ClassInfo(cstmt)
Fred Drake19479911998-02-13 06:58:54 +0000634\end{verbatim}
Fred Drake5bd7fcc1998-04-03 05:31:45 +0000635
Guido van Rossum8206fb91996-08-26 00:33:29 +0000636After initializing some internal state, the constructor calls the
Fred Drake88223901998-02-09 20:52:48 +0000637\method{_extract_info()} method. This method performs the bulk of the
Guido van Rossum8206fb91996-08-26 00:33:29 +0000638information extraction which takes place in the entire example. The
639extraction has two distinct phases: the location of the docstring for
640the parse tree passed in, and the discovery of additional definitions
641within the code block represented by the parse tree.
642
Fred Drake88223901998-02-09 20:52:48 +0000643The initial \keyword{if} test determines whether the nested suite is of
Guido van Rossum8206fb91996-08-26 00:33:29 +0000644the ``short form'' or the ``long form.'' The short form is used when
645the code block is on the same line as the definition of the code
646block, as in
647
Fred Drakebbe60681998-01-09 22:24:14 +0000648\begin{verbatim}
Guido van Rossum8206fb91996-08-26 00:33:29 +0000649def square(x): "Square an argument."; return x ** 2
Fred Drakebbe60681998-01-09 22:24:14 +0000650\end{verbatim}
Fred Drake5bd7fcc1998-04-03 05:31:45 +0000651
Guido van Rossum8206fb91996-08-26 00:33:29 +0000652while the long form uses an indented block and allows nested
653definitions:
654
Fred Drake19479911998-02-13 06:58:54 +0000655\begin{verbatim}
Guido van Rossum8206fb91996-08-26 00:33:29 +0000656def make_power(exp):
657 "Make a function that raises an argument to the exponent `exp'."
658 def raiser(x, y=exp):
659 return x ** y
660 return raiser
Fred Drake19479911998-02-13 06:58:54 +0000661\end{verbatim}
Fred Drake5bd7fcc1998-04-03 05:31:45 +0000662
Guido van Rossum8206fb91996-08-26 00:33:29 +0000663When the short form is used, the code block may contain a docstring as
Fred Drake88223901998-02-09 20:52:48 +0000664the first, and possibly only, \constant{small_stmt} element. The
Guido van Rossum8206fb91996-08-26 00:33:29 +0000665extraction of such a docstring is slightly different and requires only
666a portion of the complete pattern used in the more common case. As
Fred Drake4b7d5a41996-09-11 21:57:40 +0000667implemented, the docstring will only be found if there is only
Fred Drake88223901998-02-09 20:52:48 +0000668one \constant{small_stmt} node in the \constant{simple_stmt} node.
669Since most functions and methods which use the short form do not
670provide a docstring, this may be considered sufficient. The
671extraction of the docstring proceeds using the \function{match()} function
672as described above, and the value of the docstring is stored as an
673attribute of the \class{SuiteInfoBase} object.
Guido van Rossum8206fb91996-08-26 00:33:29 +0000674
Fred Drake4b7d5a41996-09-11 21:57:40 +0000675After docstring extraction, a simple definition discovery
Fred Drake88223901998-02-09 20:52:48 +0000676algorithm operates on the \constant{stmt} nodes of the
677\constant{suite} node. The special case of the short form is not
678tested; since there are no \constant{stmt} nodes in the short form,
679the algorithm will silently skip the single \constant{simple_stmt}
680node and correctly not discover any nested definitions.
Guido van Rossum8206fb91996-08-26 00:33:29 +0000681
Fred Drake4b7d5a41996-09-11 21:57:40 +0000682Each statement in the code block is categorized as
683a class definition, function or method definition, or
Guido van Rossum8206fb91996-08-26 00:33:29 +0000684something else. For the definition statements, the name of the
Fred Drake4b7d5a41996-09-11 21:57:40 +0000685element defined is extracted and a representation object
Guido van Rossum8206fb91996-08-26 00:33:29 +0000686appropriate to the definition is created with the defining subtree
Thomas Woutersf8316632000-07-16 19:01:10 +0000687passed as an argument to the constructor. The representation objects
Guido van Rossum8206fb91996-08-26 00:33:29 +0000688are stored in instance variables and may be retrieved by name using
689the appropriate accessor methods.
690
691The public classes provide any accessors required which are more
Fred Drake88223901998-02-09 20:52:48 +0000692specific than those provided by the \class{SuiteInfoBase} class, but
Guido van Rossum8206fb91996-08-26 00:33:29 +0000693the real extraction algorithm remains common to all forms of code
694blocks. A high-level function can be used to extract the complete set
Fred Drake4b7d5a41996-09-11 21:57:40 +0000695of information from a source file. (See file \file{example.py}.)
Guido van Rossum8206fb91996-08-26 00:33:29 +0000696
Fred Drake19479911998-02-13 06:58:54 +0000697\begin{verbatim}
Guido van Rossum8206fb91996-08-26 00:33:29 +0000698def get_docs(fileName):
Guido van Rossum8206fb91996-08-26 00:33:29 +0000699 import os
Guido van Rossum8206fb91996-08-26 00:33:29 +0000700 import parser
Fred Drakec71b8021999-08-02 14:30:52 +0000701
702 source = open(fileName).read()
703 basename = os.path.basename(os.path.splitext(fileName)[0])
Guido van Rossum8206fb91996-08-26 00:33:29 +0000704 ast = parser.suite(source)
Fred Drakec71b8021999-08-02 14:30:52 +0000705 return ModuleInfo(ast.totuple(), basename)
Fred Drake19479911998-02-13 06:58:54 +0000706\end{verbatim}
Fred Drake5bd7fcc1998-04-03 05:31:45 +0000707
Guido van Rossum8206fb91996-08-26 00:33:29 +0000708This provides an easy-to-use interface to the documentation of a
709module. If information is required which is not extracted by the code
710of this example, the code may be extended at clearly defined points to
711provide additional capabilities.