blob: 4398177759a090c25db3b36b2b67af053ff6e882 [file] [log] [blame]
Guido van Rossum4b73a061995-10-11 17:30:04 +00001% libparser.tex
2%
3% Introductory documentation for the new parser built-in module.
4%
5% Copyright 1995 Virginia Polytechnic Institute and State University
6% and Fred L. Drake, Jr. This copyright notice must be distributed on
7% all copies, but this document otherwise may be distributed as part
8% of the Python distribution. No fee may be charged for this document
9% in any representation, either on paper or electronically. This
10% restriction does not affect other elements in a distributed package
11% in any way.
12%
13
14\section{Built-in Module \sectcode{parser}}
15\bimodindex{parser}
16
Guido van Rossum4b73a061995-10-11 17:30:04 +000017The \code{parser} module provides an interface to Python's internal
18parser and byte-code compiler. The primary purpose for this interface
19is to allow Python code to edit the parse tree of a Python expression
20and create executable code from this. This can be better than trying
21to parse and modify an arbitrary Python code fragment as a string, and
22ensures that parsing is performed in a manner identical to the code
23forming the application. It's also faster.
24
25There are a few things to note about this module which are important
26to making use of the data structures created. This is not a tutorial
27on editing the parse trees for Python code.
28
29Most importantly, a good understanding of the Python grammar processed
30by the internal parser is required. For full information on the
31language syntax, refer to the Language Reference. The parser itself
32is created from a grammar specification defined in the file
33\code{Grammar/Grammar} in the standard Python distribution. The parse
34trees stored in the ``AST objects'' created by this module are the
35actual output from the internal parser when created by the
36\code{expr()} or \code{suite()} functions, described below. The AST
Guido van Rossum47478871996-08-21 14:32:37 +000037objects created by \code{sequence2ast()} faithfully simulate those
38structures. Be aware that the values of the sequences which are
39considered ``correct'' will vary from one version of Python to another
40as the formal grammar for the language is revised. However,
41transporting code from one Python version to another as source text
42will always allow correct parse trees to be created in the target
43version, with the only restriction being that migrating to an older
44version of the interpreter will not support more recent language
45constructs. The parse trees are not typically compatible from one
46version to another, whereas source code has always been
47forward-compatible.
Guido van Rossum4b73a061995-10-11 17:30:04 +000048
Guido van Rossum47478871996-08-21 14:32:37 +000049Each element of the sequences returned by \code{ast2list} or
50\code{ast2tuple()} has a simple form. Sequences representing
51non-terminal elements in the grammar always have a length greater than
52one. The first element is an integer which identifies a production in
53the grammar. These integers are given symbolic names in the C header
54file \code{Include/graminit.h} and the Python module
55\code{Lib/symbol.py}. Each additional element of the sequence represents
56a component of the production as recognized in the input string: these
57are always sequences which have the same form as the parent. An
58important aspect of this structure which should be noted is that
59keywords used to identify the parent node type, such as the keyword
60\code{if} in an \emph{if\_stmt}, are included in the node tree without
61any special treatment. For example, the \code{if} keyword is
Guido van Rossum4b73a061995-10-11 17:30:04 +000062represented by the tuple \code{(1, 'if')}, where \code{1} is the
63numeric value associated with all \code{NAME} elements, including
Guido van Rossum47478871996-08-21 14:32:37 +000064variable and function names defined by the user. In an alternate form
65returned when line number information is requested, the same token
66might be represented as \code{(1, 'if', 12)}, where the \code{12}
67represents the line number at which the terminal symbol was found.
Guido van Rossum4b73a061995-10-11 17:30:04 +000068
69Terminal elements are represented in much the same way, but without
70any child elements and the addition of the source text which was
71identified. The example of the \code{if} keyword above is
72representative. The various types of terminal symbols are defined in
73the C header file \code{Include/token.h} and the Python module
74\code{Lib/token.py}.
75
76The AST objects are not actually required to support the functionality
77of this module, but are provided for three purposes: to allow an
78application to amortize the cost of processing complex parse trees, to
79provide a parse tree representation which conserves memory space when
Guido van Rossum47478871996-08-21 14:32:37 +000080compared to the Python list or tuple representation, and to ease the
81creation of additional modules in C which manipulate parse trees. A
82simple ``wrapper'' module may be created in Python to hide the use of
83AST objects.
Guido van Rossum4b73a061995-10-11 17:30:04 +000084
85
Guido van Rossum4b73a061995-10-11 17:30:04 +000086The \code{parser} module defines the following functions:
87
Guido van Rossum4b73a061995-10-11 17:30:04 +000088\renewcommand{\indexsubitem}{(in module parser)}
89
Guido van Rossum47478871996-08-21 14:32:37 +000090\begin{funcdesc}{ast2list}{ast\optional{\, line\_info\code{ = 0}}}
Guido van Rossum4b73a061995-10-11 17:30:04 +000091This function accepts an AST object from the caller in
Guido van Rossum47478871996-08-21 14:32:37 +000092\code{\var{ast}} and returns a Python list representing the
93equivelent parse tree. The resulting list representation can be used
94for inspection or the creation of a new parse tree in list form.
Guido van Rossum4b73a061995-10-11 17:30:04 +000095This function does not fail so long as memory is available to build
Guido van Rossum47478871996-08-21 14:32:37 +000096the list representation. If a parse tree will only be used for
97inspection, \code{ast2tuple()} should be used instead to reduce memory
98consumption and fragmentation. When modifications are to be made to
99the parse tree, this function is significantly faster than retrieving
100a tuple representation and converting that to nested lists.
101
102If the \code{line\_info} flag is given true value, line number
103information will be included for all terminal tokens as a third
104element of the list representing the token. This information is
105omitted if the flag is false or omitted.
Guido van Rossum4b73a061995-10-11 17:30:04 +0000106\end{funcdesc}
107
Guido van Rossum47478871996-08-21 14:32:37 +0000108\begin{funcdesc}{ast2tuple}{ast\optional{\, line\_info\code{ = 0}}}
109This function accepts an AST object from the caller in
110\code{\var{ast}} and returns a Python tuple representing the
111equivelent parse tree. Other than returning a tuple instead of a
112list, this function is identical to \code{ast2list()}.
Guido van Rossum4b73a061995-10-11 17:30:04 +0000113
Guido van Rossum47478871996-08-21 14:32:37 +0000114If the \code{line\_info} flag is given true value, line number
115information will be included for all terminal tokens as a third
116element of the list representing the token. This information is
117omitted if the flag is false or omitted.
118\end{funcdesc}
119
120\begin{funcdesc}{compileast}{ast\optional{\, filename\code{ = '<ast>'}}}
Guido van Rossum4b73a061995-10-11 17:30:04 +0000121The Python byte compiler can be invoked on an AST object to produce
122code objects which can be used as part of an \code{exec} statement or
123a call to the built-in \code{eval()} function. This function provides
124the interface to the compiler, passing the internal parse tree from
125\code{\var{ast}} to the parser, using the source file name specified
126by the \code{\var{filename}} parameter. The default value supplied
127for \code{\var{filename}} indicates that the source was an AST object.
Guido van Rossum47478871996-08-21 14:32:37 +0000128
129Compiling an AST object may result in exceptions related to
130compilation; an example would be a \code{SyntaxError} caused by the
131parse tree for \code{del f(0)}; this statement is considered legal
132within the formal grammar for Python but is not a legal language
133construct. The \code{SyntaxError} raised for this condition is
134actually generated by the Python byte-compiler normally, which is why
135it can be raised at this point by the \code{parser} module. Most
136causes of compilation failure can be diagnosed programmatically by
137inspection of the parse tree.
Guido van Rossum4b73a061995-10-11 17:30:04 +0000138\end{funcdesc}
139
140
141\begin{funcdesc}{expr}{string}
142The \code{expr()} function parses the parameter \code{\var{string}}
143as if it were an input to \code{compile(\var{string}, 'eval')}. If
144the parse succeeds, an AST object is created to hold the internal
145parse tree representation, otherwise an appropriate exception is
146thrown.
147\end{funcdesc}
148
149
150\begin{funcdesc}{isexpr}{ast}
151When \code{\var{ast}} represents an \code{'eval'} form, this function
152returns a true value (\code{1}), otherwise it returns false
153(\code{0}). This is useful, since code objects normally cannot be
154queried for this information using existing built-in functions. Note
155that the code objects created by \code{compileast()} cannot be queried
156like this either, and are identical to those created by the built-in
157\code{compile()} function.
158\end{funcdesc}
159
160
161\begin{funcdesc}{issuite}{ast}
162This function mirrors \code{isexpr()} in that it reports whether an
163AST object represents a suite of statements. It is not safe to assume
164that this function is equivelent to \code{not isexpr(\var{ast})}, as
165additional syntactic fragments may be supported in the future.
166\end{funcdesc}
167
168
169\begin{funcdesc}{suite}{string}
170The \code{suite()} function parses the parameter \code{\var{string}}
171as if it were an input to \code{compile(\var{string}, 'exec')}. If
172the parse succeeds, an AST object is created to hold the internal
173parse tree representation, otherwise an appropriate exception is
174thrown.
175\end{funcdesc}
176
177
Guido van Rossum47478871996-08-21 14:32:37 +0000178\begin{funcdesc}{sequence2ast}{sequence}
179This function accepts a parse tree represented as a sequence and
180builds an internal representation if possible. If it can validate
181that the tree conforms to the Python grammar and all nodes are valid
182node types in the host version of Python, an AST object is created
183from the internal representation and returned to the called. If there
184is a problem creating the internal representation, or if the tree
185cannot be validated, a \code{ParserError} exception is thrown. An AST
186object created this way should not be assumed to compile correctly;
187normal exceptions thrown by compilation may still be initiated when
188the AST object is passed to \code{compileast()}. This will normally
189indicate problems not related to syntax (such as a \code{MemoryError}
190exception), but may also be due to constructs such as the result of
191parsing \code{del f(0)}, which escapes the Python parser but is
192checked by the bytecode compiler.
193
194Sequences representing terminal tokens may be represented as either
195two-element lists of the form \code{(1, 'name')} or as three-element
196lists of the form \code{(1, 'name', 56)}. If the third element is
197present, it is assumed to be a valid line number. The line number
198may be specified for any subset of the terminal symbols in the input
199tree.
200\end{funcdesc}
201
202\begin{funcdesc}{tuple2ast}{sequence}
203This is the same function as \code{sequence2ast}. This entry point is
204maintained for backward compatibility.
Guido van Rossum4b73a061995-10-11 17:30:04 +0000205\end{funcdesc}
206
207
Guido van Rossum4b73a061995-10-11 17:30:04 +0000208\subsection{Exceptions and Error Handling}
209
210The parser module defines a single exception, but may also pass other
211built-in exceptions from other portions of the Python runtime
212environment. See each function for information about the exceptions
213it can raise.
214
215\begin{excdesc}{ParserError}
216Exception raised when a failure occurs within the parser module. This
217is generally produced for validation failures rather than the built in
218\code{SyntaxError} thrown during normal parsing.
219The exception argument is either a string describing the reason of the
Guido van Rossum47478871996-08-21 14:32:37 +0000220failure or a tuple containing a sequence causing the failure from a parse
221tree passed to \code{sequence2ast()} and an explanatory string. Calls to
222\code{sequence2ast()} need to be able to handle either type of exception,
Guido van Rossum4b73a061995-10-11 17:30:04 +0000223while calls to other functions in the module will only need to be
224aware of the simple string values.
225\end{excdesc}
226
227Note that the functions \code{compileast()}, \code{expr()}, and
228\code{suite()} may throw exceptions which are normally thrown by the
229parsing and compilation process. These include the built in
230exceptions \code{MemoryError}, \code{OverflowError},
231\code{SyntaxError}, and \code{SystemError}. In these cases, these
232exceptions carry all the meaning normally associated with them. Refer
233to the descriptions of each function for detailed information.
234
Guido van Rossum4b73a061995-10-11 17:30:04 +0000235
Guido van Rossum47478871996-08-21 14:32:37 +0000236\subsection{AST Objects}
237
238AST objects (returned by \code{expr()}, \code{suite()}, and
239\code{tuple2ast()}, described above) have no methods of their own.
240Some of the functions defined which accept an AST object as their
241first argument may change to object methods in the future.
242
243Ordered and equality comparisons are supported between AST objects.
244
245
Guido van Rossum4b73a061995-10-11 17:30:04 +0000246\subsection{Example}
247
Guido van Rossum47478871996-08-21 14:32:37 +0000248The parser modules allows operations to be performed on the parse tree
249of Python source code before the bytecode is generated, and provides
250for inspection of the parse tree for information gathering purposes as
251well. While many useful operations may take place between parsing and
252bytecode generation, the simplest operation is to do nothing. For
253this purpose, using the \code{parser} module to produce an
254intermediate data structure is equivelent to the code
255
256\begin{verbatim}
257>>> code = compile('a + 5', 'eval')
258>>> a = 5
259>>> eval(code)
26010
261\end{verbatim}
262
263The equivelent operation using the \code{parser} module is somewhat
264longer, and allows the intermediate internal parse tree to be retained
265as an AST object:
Guido van Rossum4b73a061995-10-11 17:30:04 +0000266
267\begin{verbatim}
268>>> import parser
269>>> ast = parser.expr('a + 5')
270>>> code = parser.compileast(ast)
271>>> a = 5
272>>> eval(code)
27310
274\end{verbatim}
275
Guido van Rossum47478871996-08-21 14:32:37 +0000276Some applications can benfit from access to the parse tree itself, and
277can take advantage of the intermediate data structure provided by the
278\code{parser} module. The remainder of this section of examples will
279demonstrate how the intermediate data structure can provide access to
280module documentation defined in docstrings without requiring that the
281code being examined be imported into a running interpreter. This can
282be very useful for performing analyses of untrusted code.
Guido van Rossum4b73a061995-10-11 17:30:04 +0000283
Guido van Rossum47478871996-08-21 14:32:37 +0000284Generally, the example will demonstrate how the parse tree may be
285traversed to distill interesting information. Two functions and a set
286of classes is developed which provide programmatic access to high
287level function and class definitions provided by a module. The
288classes extract information from the parse tree and provide access to
289the information at a useful semantic level, one function provides a
290simple low-level pattern matching capability, and the other function
291defines a high-level interface to the classes by handling file
292operations on behalf of the caller. All source files mentioned here
293which are not part of the Python installation are located in the
294\file{Demo/parser} directory of the distribution.
Guido van Rossum4b73a061995-10-11 17:30:04 +0000295
Guido van Rossum47478871996-08-21 14:32:37 +0000296To construct the upper-level extraction methods, we need to know what
297the parse tree structure looks like and how much of it we actually
298need to be concerned about. Python uses a moderately deep parse tree,
299so there are a large number of intermediate nodes. It is important to
300read and understand the formal grammar used by Python. This is
301specified in the file \file{Grammar/Grammar} in the distribution.
302Consider the simplest case of interest when searching for docstrings:
303a module consisting of a docstring and nothing else:
Guido van Rossum4b73a061995-10-11 17:30:04 +0000304
Guido van Rossum47478871996-08-21 14:32:37 +0000305\begin{verbatim}
306"""Some documentation.
307"""
308\end{verbatim}
Guido van Rossum4b73a061995-10-11 17:30:04 +0000309
Guido van Rossum47478871996-08-21 14:32:37 +0000310Using the interpreter to take a look at the parse tree, we find a
311bewildering mass of numbers and parentheses, with the documentation
312buried deep in the nested tuples:
Guido van Rossum4b73a061995-10-11 17:30:04 +0000313
Guido van Rossum47478871996-08-21 14:32:37 +0000314\begin{verbatim}
315>>> import parser
316>>> import pprint
317>>> ast = parser.suite(open('docstring.py').read())
318>>> tup = parser.ast2tuple(ast)
319>>> pprint.pprint(tup)
320(257,
321 (264,
322 (265,
323 (266,
324 (267,
325 (307,
326 (287,
327 (288,
328 (289,
329 (290,
330 (292,
331 (293,
332 (294,
333 (295,
334 (296,
335 (297,
336 (298,
337 (299,
338 (300, (3, '"""Some documentation.\012"""'))))))))))))))))),
339 (4, ''))),
340 (4, ''),
341 (0, ''))
342\end{verbatim}
343
344The numbers at the first element of each node in the tree are the node
345types; they map directly to terminal and non-terminal symbols in the
346grammar. Unfortunately, they are represented as integers in the
347internal representation, and the Python structures generated do not
348change that. However, the \code{symbol} and \code{token} modules
349provide symbolic names for the node types and dictionaries which map
350from the integers to the symbolic names for the node types.
351
352In the output presented above, the outermost tuple contains four
353elements: the integer \code{257} and three additional tuples. Node
354type \code{257} has the symbolic name \code{file_input}. Each of
355these inner tuples contains an integer as the first element; these
356integers, \code{264}, \code{4}, and \code{0}, represent the node types
357\code{stmt}, \code{NEWLINE}, and \code{ENDMARKER}, respectively.
358Note that these values may change depending on the version of Python
359you are using; consult \file{symbol.py} and \file{token.py} for
360details of the mapping. It should be fairly clear that the outermost
361node is related primarily to the input source rather than the contents
362of the file, and may be disregarded for the moment. The \code{stmt}
363node is much more interesting. In particular, all docstrings are
364found in subtrees which are formed exactly as this node is formed,
365with the only difference being the string itself. The association
366between the docstring in a similar tree and the defined entity (class,
367function, or module) which it describes is given by the position of
368the docstring subtree within the tree defining the described
369structure.
370
371By replacing the actual docstring with something to signify a variable
372component of the tree, we allow a simple pattern matching approach may
373be taken to checking any given subtree for equivelence to the general
374pattern for docstrings. Since the example demonstrates information
375extraction, we can safely require that the tree be in tuple form
376rather than list form, allowing a simple variable representation to be
377\code{['variable\_name']}. A simple recursive function can implement
378the pattern matching, returning a boolean and a dictionary of variable
379name to value mappings.
380
381\begin{verbatim}
382from types import ListType, TupleType
383
384def match(pattern, data, vars=None):
385 if vars is None:
386 vars = {}
387 if type(pattern) is ListType:
388 vars[pattern[0]] = data
389 return 1, vars
390 if type(pattern) is not TupleType:
391 return (pattern == data), vars
392 if len(data) != len(pattern):
393 return 0, vars
394 for pattern, data in map(None, pattern, data):
395 same, vars = match(pattern, data, vars)
396 if not same:
397 break
398 return same, vars
399\end{verbatim}
400
401Using this simple recursive pattern matching function and the symbolic
402node types, the pattern for the candidate docstring subtrees becomes:
403
404\begin{verbatim}
405>>> DOCSTRING_STMT_PATTERN = (
406... symbol.stmt,
407... (symbol.simple_stmt,
408... (symbol.small_stmt,
409... (symbol.expr_stmt,
410... (symbol.testlist,
411... (symbol.test,
412... (symbol.and_test,
413... (symbol.not_test,
414... (symbol.comparison,
415... (symbol.expr,
416... (symbol.xor_expr,
417... (symbol.and_expr,
418... (symbol.shift_expr,
419... (symbol.arith_expr,
420... (symbol.term,
421... (symbol.factor,
422... (symbol.power,
423... (symbol.atom,
424... (token.STRING, ['docstring'])
425... )))))))))))))))),
426... (token.NEWLINE, '')
427... ))
428\end{verbatim}
429
430Using the \code{match()} function with this pattern, extracting the
431module docstring from the parse tree created previously is easy:
432
433\begin{verbatim}
434>>> found, vars = match(DOCSTRING_STMT_PATTERN, tup[1])
435>>> found
4361
437>>> vars
438{'docstring': '"""Some documentation.\012"""'}
439\end{verbatim}
440
441Once specific data can be extracted from a location where it is
442expected, the question of where information can be expected
443needs to be answered. When dealing with docstrings, the answer is
444fairly simple: the docstring is the first \code{stmt} node in a code
445block (\code{file_input} or \code{suite} node types). A module
446consists of a single \code{file_input} node, and class and function
447definitions each contain exactly one \code{suite} node. Classes and
448functions are readily identified as subtrees of code block nodes which
449start with \code{(stmt, (compound_stmt, (classdef, ...} or
450\code{(stmt, (compound_stmt, (funcdef, ...}. Note that these subtrees
451cannot be matched by \code{match()} since it does not support multiple
452sibling nodes to match without regard to number. A more elaborate
453matching function could be used to overcome this limitation, but this
454is sufficient for the example.
455
456
457
458%%
459%% end of file